Kernel bugs manifest in several different ways. Some trigger a panic, while others result in a hang or a partial loss of function. For example, deadlocked threads may not hang the entire machine but only impair certain operations. These differing consequences require different strategies for finding the bug.
Kernel crashes can often be investigated with a very straightforward approach: frequently the panic message itself points to the problem. For those crashes, the context of the panic in the source is sufficient to determine the cause.
Some crashes are an indirect result of a bug, however. For example, a corrupted data structure will usually trigger a memory protection exception such as a page fault. For these crashes, examining the source line where the crash occurred generally leads to the data structure that is in an invalid state. Inspecting that structure more closely, along with the code around the crash point, is often sufficient to determine the cause of the bug.
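As a concrete sketch, a post-mortem kgdb session against a crashdump might look like the following; the kernel and vmcore paths, the frame number, and the variable name m are illustrative rather than taken from a real crash:

    % kgdb /boot/kernel/kernel /var/crash/vmcore.0
    (kgdb) bt            # find the faulting frame in the stack trace
    (kgdb) frame 7       # select the frame where the fault occurred
    (kgdb) list          # show the source around the crash point
    (kgdb) print *m      # dump the suspect data structure

A structure whose fields contain wild pointers or obviously out-of-range values is a strong hint that it is the corrupted state behind the fault.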
Another secondary effect is a crash due to exhausting the space in the ``kmem'' virtual memory map, which provides virtual address space for memory allocated via malloc(9) or uma(9) in the kernel. On architectures with a direct map such as amd64, ``kmem'' is only used for allocations larger than a page; on other architectures it is used for all allocations. If the virtual address space in the ``kmem'' map is exhausted, the kernel panics. This can be the result of resource exhaustion: for example, if kern.ipc.nmbclusters is set to a high value and an m_getcl(M_WAIT) invocation exhausts the ``kmem'' map before the nmbclusters limit is reached, the kernel will panic.
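A few standard commands can show whether the ``kmem'' map is under pressure before (or after) such a panic; this is a sketch, and the exact sysctl names vary somewhat between FreeBSD releases:

    % sysctl vm.kmem_size            # size of the kmem map
    % vmstat -m                      # malloc(9) usage by type
    % vmstat -z                      # uma(9) zone usage, including mbuf clusters
    % sysctl kern.ipc.nmbclusters    # configured mbuf cluster limit

Comparing the allocations reported by vmstat against vm.kmem_size shows how close the map is to exhaustion.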
Sometimes the ``bug'' can actually be faulty hardware. For example, a bit error in a pointer can result in a page fault at a nonsensical address, such as one that differs from NULL or from a valid pointer by a single bit. One way to verify whether a crash on an x86 machine was the result of a hardware error is to check the system event log, usually from the BIOS setup. For systems with a BMC, the ipmitool [11] utility can be used to examine the system event log at runtime. The lack of a corresponding entry in the system event log does not necessarily disprove a hardware failure, but if an entry is present it can confirm failing hardware as the cause of the panic.
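For example, assuming the BMC is reachable (via the ipmi(4) driver locally, or over the network), the event log can be dumped with:

    % ipmitool sel list     # one line per logged event
    % ipmitool sel elist    # extended listing with decoded sensor names

Entries reporting ECC or machine-check errors around the time of the crash point strongly toward hardware.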
Kernel hangs tend to require more sleuthing. One reason is that it can take some investigating just to determine the true extent of the hang. Here are a few ways to start investigating a hang.
First, check for resource starvation. For example, check for messages on the console about the kern.maxfiles or maxproc limits being exceeded. A machine that is overloaded can appear to be hung simply because it is unable to fork a new process for a remote login. If possible, log in on the console and check for other resource exhaustion issues using commands like netstat(1) and vmstat(1), as sketched below.
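A minimal sketch of such a check from the console follows; the FAIL column of vmstat -z in particular flags allocations that are hitting a zone limit:

    % sysctl kern.openfiles kern.maxfiles    # open files vs. the global limit
    % ps ax | wc -l                          # rough process count vs. kern.maxproc
    % netstat -m                             # mbuf and cluster usage
    % vmstat -z                              # per-zone usage and failure counts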
The next step is generally to break into DDB. The ps command in DDB gives a very useful overview of the system. For example, if all of the CPUs are idle, there may be a deadlock; ps can be used to look for suspect threads, which can then be investigated further. On the other hand, if all of the CPUs are busy, that may indicate a livelock condition (or simply an overloaded machine).
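A representative DDB session for this step might look like the following; the pid is illustrative, and show alllocks requires a kernel built with WITNESS:

    db> ps                  # one-line summary of every thread
    db> show allchains      # chains of threads blocked on locks, with owners
    db> show alllocks       # locks currently held by each thread
    db> trace 42            # stack trace of a suspect process

Threads that ps shows stuck on a lock-related wait channel are natural candidates to chase with show allchains.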
If the hang's cause is still unknown, the panic command can be used from DDB to panic the machine explicitly. If the machine is configured for crashdumps, it will write out a dump, and after the machine has rebooted the crashdump can be used to examine the hang further. For example, if logging into the box to run netstat was not possible, netstat can be run against the crashdump instead.
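A sketch of the whole sequence, assuming the standard savecore(8) paths; the -M and -N options tell netstat to read from a core file and kernel image rather than the running system:

    db> panic         # force a panic from the debugger
    db> continue      # if DDB is re-entered, continue to write the dump

    (after reboot)
    % netstat -M /var/crash/vmcore.0 -N /boot/kernel/kernel

Several other kvm-aware utilities accept similar options and can be pointed at the crashdump in the same way.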