According to a colleague I mentored around 10 years ago, “What do the logs say?” is one of my mantras. I guess that’s fair. It probably still applies now. [Current colleagues, feel free to pass comment!] Logs are designed to be a useful source of diagnostic information and they’re often the first thing I check, after any message immediately output in stderr or stdout.
But what happens when logs don’t contain useful info? This story covers the investigation I took when trying to track down the cause of a problem at a previous client.
Context
One of the functions my employer provides at this client is the building of Ubuntu laptops used by developers and civil servants. This is not at the hardware level; we use a hardware reseller for that. Rather, we perform software installation/updates and machine hardening such that they are allowed on production networks and to access customer data.
The build environment takes the form of a network segregated from the main one used by workstations on a day-to-day basis. We’d relax the boot options of laptops to permit a network build, peform that network build, and than re-restrict the boot options to prevent unauthorised boot media from being used. The build server has several components, the primary ones being: DHCP, TFTP, Cobbler, and Apache.
The problem
I had updated the build server with a newer kernel. But attempting to boot from this newer kernel image was unsuccessful. Normally the visible effect on the laptop that was being built was a series of dots, which act as a progress bar, except that there was no completion figure given; instead it would uncompress and execute the kernal. (I think it’s probably 1 dot per kilobyte, but that detail isn’t important). The effect when using this problematic boot menu option is that no progress dots were shown.
Previous boot menu options remained working.
Investigation and findings
Checking the system logs on the boot sever showed that of the two files required for each of our boot menu options, for this problematic menu option only the first file was being requested by TFTP. Other boot menu options continued to work unimpeded. We knew that the network cabling was functional and correct because the to-be-built machine was able to perform the network boot to acquire an IP address and load the menu of available options.
A tcpdump
showed the TFTP request coming in and (at least a partial) response
going out. On the cobbler server, an md5sum
comparison between the new image
and an older one showed that the kernels were identical, and that the initial
ramdisk images were also identical. We could see the first file being requested
(for the problematic option) via TFTP in the logs but two things stood out: one
that the response was short (and there was no followup traffic), and two, that
the second file was not requested.
Eventually I had the idea to run tcpdump
in verbose mode and then debugged
the individual bytes in the response. It was here I determined that the short
response was a denial. (There was no easy way to get the client to log more
verbosely because it was still in the pre-execution phase; no OS had yet been
started by the client). I then looked more closely at the file on the boot
server and found that root was the owner and the filemode was 0640. The TFTP
service ran as a different user (I think nobody
).
Summary of cause, along with highlighted suggestions for improvements
The kernel (and initial ramdisk) images had been given too restrictive permissions. Relaxing the file mode to 644 was enough to resolve the problem.
The “simple” diagnostic approach would have been to run tftpd
in verbose
mode, to make it be more chatty about permission errors. But there’s also the
so-called principe of least surprise: one would normally expect an application
to log when it cannot fullfil a request, at the default log level. Apache and
nginx do, for example. The “complex” answer would be to submit a patch to tftp
such that it logs inability to accede a request at the base log level. This
approach also requires that patch to be accepted which may or may not be a
simple battle, depending on the software in question.