Matt Peperell's Blog

The difficulty of debugging without logs

Written: 04 Dec 2021 (Index by date)

Tags: tech  fault-finding  (Index by tag)

According to a colleague I mentored around 10 years ago, “What do the logs say?” is one of my mantras. I guess that’s fair. It probably still applies now. [Current colleagues, feel free to pass comment!] Logs are designed to be a useful source of diagnostic information and they’re often the first thing I check, after any message immediately output in stderr or stdout.

But what happens when logs don’t contain useful info? This story covers the investigation I took when trying to track down the cause of a problem at a previous client.

Context

One of the functions my employer provides at this client is the building of Ubuntu laptops used by developers and civil servants. This is not at the hardware level; we use a hardware reseller for that. Rather, we perform software installation/updates and machine hardening such that they are allowed on production networks and to access customer data.

The build environment takes the form of a network segregated from the main one used by workstations on a day-to-day basis. We’d relax the boot options of laptops to permit a network build, peform that network build, and than re-restrict the boot options to prevent unauthorised boot media from being used. The build server has several components, the primary ones being: DHCP, TFTP, Cobbler, and Apache.

The problem

I had updated the build server with a newer kernel. But attempting to boot from this newer kernel image was unsuccessful. Normally the visible effect on the laptop that was being built was a series of dots, which act as a progress bar, except that there was no completion figure given; instead it would uncompress and execute the kernal. (I think it’s probably 1 dot per kilobyte, but that detail isn’t important). The effect when using this problematic boot menu option is that no progress dots were shown.

Previous boot menu options remained working.

Investigation and findings

Checking the system logs on the boot sever showed that of the two files required for each of our boot menu options, for this problematic menu option only the first file was being requested by TFTP. Other boot menu options continued to work unimpeded. We knew that the network cabling was functional and correct because the to-be-built machine was able to perform the network boot to acquire an IP address and load the menu of available options.

A tcpdump showed the TFTP request coming in and (at least a partial) response going out. On the cobbler server, an md5sum comparison between the new image and an older one showed that the kernels were identical, and that the initial ramdisk images were also identical. We could see the first file being requested (for the problematic option) via TFTP in the logs but two things stood out: one that the response was short (and there was no followup traffic), and two, that the second file was not requested.

Eventually I had the idea to run tcpdump in verbose mode and then debugged the individual bytes in the response. It was here I determined that the short response was a denial. (There was no easy way to get the client to log more verbosely because it was still in the pre-execution phase; no OS had yet been started by the client). I then looked more closely at the file on the boot server and found that root was the owner and the filemode was 0640. The TFTP service ran as a different user (I think nobody).

Summary of cause, along with highlighted suggestions for improvements

The kernel (and initial ramdisk) images had been given too restrictive permissions. Relaxing the file mode to 644 was enough to resolve the problem.

The “simple” diagnostic approach would have been to run tftpd in verbose mode, to make it be more chatty about permission errors. But there’s also the so-called principe of least surprise: one would normally expect an application to log when it cannot fullfil a request, at the default log level. Apache and nginx do, for example. The “complex” answer would be to submit a patch to tftp such that it logs inability to accede a request at the base log level. This approach also requires that patch to be accepted which may or may not be a simple battle, depending on the software in question.

RSS RSS feed