Kernel Debugging

When things go wrong

The Linux kernel is not perfect, but it is getting better all the time. Occasionally bugs even creep into the stable kernel series as "improvements" are made. What course of action is appropriate if a problem is discovered? If you are using an unmodified kernel, try going down this checklist before posting a bug report to the kernel mailing list:

  • Is the problem reproducible in the latest stable kernel?

  • Has the problem always existed? If so, report it as a bug in the latest kernel. If not, test successive kernels until you have found the version that introduces the problem.

  • Search the kernel mailing list archives for similar reports. If any come to light, try to concentrate your efforts on providing additional information to that already supplied.

  • Check the changelog for clues. If anything in the changelog looks suspicious, examine the patch for that kernel version and find out what relevant code changes occur in that version.
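The search for the version that introduces a problem does not have to test every release in turn; a binary search needs far fewer build-and-boot cycles. The following is a minimal sketch, not an established tool: `first_bad_version` and the `is_bad` callback are hypothetical names, and it assumes that once the bug appears it stays present in every later version.

```python
from typing import Callable, Optional, Sequence


def first_bad_version(versions: Sequence[str],
                      is_bad: Callable[[str], bool]) -> Optional[str]:
    """Return the earliest version for which is_bad() is true.

    Assumes `versions` is ordered oldest to newest and that every version
    after the first bad one is also bad. Returns None if no version is bad.
    """
    lo, hi = 0, len(versions)          # search window is versions[lo:hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(versions[mid]):
            hi = mid                   # first bad version is at mid or earlier
        else:
            lo = mid + 1               # first bad version is after mid
    return versions[lo] if lo < len(versions) else None


if __name__ == "__main__":
    # Hypothetical release list and test results, for illustration only.
    # In practice is_bad() would mean "build, boot and try to reproduce".
    releases = ["2.2.14", "2.2.15", "2.2.16", "2.2.17", "2.2.18"]
    broken = {"2.2.17", "2.2.18"}
    print(first_bad_version(releases, lambda v: v in broken))
```

With five candidate kernels this needs at most three test boots instead of five; the saving grows quickly with longer version lists.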

Some (but not all) problems give rise to a screen dump of cryptic debugging information, also known as an "oops".

Oops Analysis

What is an oops?

When the kernel detects that a serious anomalous condition exists, an "oops" is triggered. An oops has two main functions:

  • To dump useful debugging information that can be used to diagnose the cause of the problem.

  • To try and prevent the kernel from going out of control and causing data corruption, or worse, damage to hardware (although this is very rare).

To the uninitiated, an oops appears completely incomprehensible: lines of hex values and seemingly cryptic, even amusing, error messages:

CPU:    0
EIP:    0010:<c011933c>    Tainted: P
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010002
eax: 00000ce0   ebx: 00001000   ecx: c778a510   edx: 00000610
esi: 00000002   edi: 00000000   ebp: c02165c0   esp: c6663f58
ds: 0018   es: 0018   ss: 0018
Process pcmcia (pid: 1003, stackpage=c6663000)
Stack: 00000000 c02165a0 00000000 c02165c0 c6663fc4 c01193cf c010ac96 c0116406
       c0116340 00000000 00000001 c02165c0 fffffffe c011616a c02165c0 00000000
       c0214900 00000000 c6663fbc 00000046 c010817d 00000000 080caa18 00000000
Call Trace: <c01193cf><c010ac96><c0116406><c0116340><c011616a><c010817d><c0109f48>
Code: 89 42 04 89 10 c7 41 04 00 00 00 00 c7 01 00 00 00 00 fb 53
>>EIP; c011933c <timer_bh+228/27c>   <=====
Trace; c01193cf <do_timer+3f/70>
Trace; c010ac96 <timer_interrupt+62/110>
Trace; c0116406 <bh_action+1a/48>
Trace; c0116340 <tasklet_hi_action+40/60>
Trace; c011616a <do_softirq+5a/ac>
Trace; c010817d <do_IRQ+a1/b4>
Trace; c0109f48 <call_do_IRQ+5/d>
Code;  c011933c <timer_bh+228/27c>
00000000 <_EIP>:
Code;  c011933c <timer_bh+228/27c>   <=====
   0:   89 42 04                  mov    %eax,0x4(%edx)   <=====
Code;  c011933f <timer_bh+22b/27c>
   3:   89 10                     mov    %edx,(%eax)
Code;  c0119341 <timer_bh+22d/27c>
   5:   c7 41 04 00 00 00 00      movl   $0x0,0x4(%ecx)
Code;  c0119348 <timer_bh+234/27c>
   c:   c7 01 00 00 00 00         movl   $0x0,(%ecx)
Code;  c011934e <timer_bh+23a/27c>
  12:   fb                        sti
Code;  c011934f <timer_bh+23b/27c>
  13:   53                        push   %ebx
<0>Kernel panic: Aiee, killing interrupt handler!
3 warnings issued.  Results may not be reliable.

Anatomy of an oops

TODO.

Decoding an oops

The information provided by an oops is in a very "raw" form, some of which is specific to the kernel image that generated it. Therefore, some post-processing needs to be carried out to obtain useful information on where to start with the debugging process. This section will contain a step-by-step guide to decoding an oops. A good place to start is OOPS! An Introduction to Linux Kernel Debugging.
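The heart of that post-processing is turning raw addresses into "symbol+offset" form using the symbol table of the exact kernel image that produced the oops. The sketch below illustrates the idea only; it is not how ksymoops itself is implemented. `parse_system_map` and `resolve` are hypothetical names, the input is assumed to be in the System.map layout of "address type name" per line, and the two symbol addresses are back-computed from the decoded oops shown earlier (for example c011933c minus 0x228 gives c0119114 for timer_bh).

```python
import bisect
from typing import List, Tuple

Symbol = Tuple[int, str]  # (start address, symbol name)


def parse_system_map(text: str) -> List[Symbol]:
    """Parse System.map-style lines of the form '<hex address> <type> <name>'."""
    symbols = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 3:
            address, _type, name = fields
            symbols.append((int(address, 16), name))
    return sorted(symbols)


def resolve(address: int, symbols: List[Symbol]) -> str:
    """Return 'name+offset' for the last symbol starting at or before address."""
    starts = [start for start, _name in symbols]
    index = bisect.bisect_right(starts, address) - 1
    if index < 0:
        return "<unknown>"
    start, name = symbols[index]
    return f"{name}+{address - start:x}"


# Illustrative two-entry table; a real System.map has thousands of entries
# and must come from the same build as the kernel that oopsed.
EXAMPLE_MAP = """\
c0119114 T timer_bh
c0119390 T do_timer
"""

if __name__ == "__main__":
    table = parse_system_map(EXAMPLE_MAP)
    for raw in ("c011933c", "c01193cf"):
        print(raw, resolve(int(raw, 16), table))
```

This is also why an oops decoded against the wrong System.map is misleading: the same address lands in a different function, or at a different offset, in another build.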

Using a Debugger

A Word about debuggers

Use of a debugger is generally looked down on by the likes of Linus. Consider these quotes of his from the Linux kernel mailing list:

 

'Use the Source, Luke, use the Source. Be one with the code.'. Think of Luke Skywalker discarding the automatic firing system when closing on the deathstar, and firing the proton torpedo (or whatever) manually. _Then_ do you have the right mindset for fixing kernel bugs.

 
--Linus  

Also:

 

I'm afraid that I've seen too many people fix bugs by looking at debugger output, and that almost inevitably leads to fixing the symptoms rather than the underlying problems.

 
--Linus  

So are there any good reasons why you should use a debugger? Stop for a moment and consider how the top kernel programmers go about locating and fixing a problem; how do they do it? The answer is that they have many years of programming experience to bring to bear on the situation; chances are they have seen something like this before. They have those "hunches" that semi-automatically lead them to the right place; to the real root of the problem.

So how does the "up and coming" kernel hacker nurture skills like these to maturity, especially when time pressure demands a quick solution? The answer comes in the form of intelligent use of a debugger.

  • Use the debugger to collect the evidence surrounding the problem area(s).

  • Study the code and think hard about what is going on.

  • Try to concentrate on thinking about possible causes of the symptoms you are seeing in the debugger. Then think about the causes of the causes all the way down to the real root of the problem. Write a list of the possibilities, placing them in order of perceived likelihood and rule them out in turn, one by one. The process of clarifying thoughts to write them down can be valuable.

  • Until you have some experience, you may need to use the debugger to try some ideas out on the fly by changing variable values etc.

Note that we are using the debugger here as a tool to stimulate rational, logical thought on what is going on in the code. As you get more experienced at tracking bugs down, you will be able to use the debugger less; you won't need as many clues before you see the problem. If you use the debugger wisely, you'll gain the expertise of the "hardcore" kernel hackers, but in less time.

In summary, some do's and don'ts:

Do:

  • Study the code before you set to with the debugger; you will be more productive if you have thought about the code first.

  • Use the debugger to test your assumptions; bugs often come about as a result of incorrect assumptions (have you ever seen those "we should never get here" debugging messages?).

  • If your assumptions are proved wrong, make it your business to get to the bottom of why you got it wrong and make a mental note for next time.

  • Discard the debugger as soon as possible; think of Luke Skywalker again here!

Don't:

  • Ignore the real cause of problems that come to light. When you find something amiss, don't just shrug your shoulders and fix things up so that everything looks OK. A classic example is adding another section to a switch statement to cover an eventuality you hadn't thought of. Do you understand why that scenario is occurring? Consider the possibility that your approach and/or assumptions are flawed. Don't go "wall-papering over the cracks"; you'll only have to fix it properly later on (or worse, someone else will).

  • Blindly use the debugger to narrow a problem down; you could well come to the wrong conclusion and you won't learn as much in the process.

Now that we have some idea of what debuggers are good for (and not so good for), some of the options available to you will be presented in the next section.

Debugging Techniques

There are four different approaches to kernel debugging (apart from the printk() method which doesn't really count):

  • Local kernel debugging (done by running gdb with /proc/kcore as the core file). This approach is of very limited use; amongst other limitations, no breakpoints can be set.

  • User Mode Linux; a way to run Linux inside Linux. This is good for general hacking as no extra hardware is required but not so good for troubleshooting a hardware specific problem. For more information on User Mode Linux, see: the User Mode Linux Homepage

  • Local kernel debugging using kdb.

  • Remote kernel debugging using gdb. Here the kernel under test runs on a physically separate machine and communicates with the controlling machine running the debugger, via a serial link. This is the approach described in this document.

If you have never used the gdb debugger, I strongly suggest you familiarise yourself with it in userspace before delving into kernel debugging. This gdb reference card is very useful. While there are some graphical front ends for gdb (including xxgdb and ddd), it is a good idea to get used to using plain gdb on the console. It is better not to be running X while kernel hacking/debugging; you may miss crucial console output or, worse, an oops. Kernel compiles can also be faster on a machine without X running, especially on machines with less than 64MB RAM, as X uses a significant amount of RAM.

Hardware Setup

The two machines in this example will be referred to as "kernighan" (workstation/development machine) and "ritchie" (testing machine that actually runs the development kernels).

  • First of all, you need a machine to be your testing box. Note that it does not have to have a monitor, keyboard and mouse. My testing machine was cobbled together from spare parts! A keyboard can be useful though, in order to be able to use the magic SysRq key.

  • Set up ethernet between the two machines. Check you can ping "ritchie" by using its hostname.

  • Build two serial cables using the pin out given below. One cable will be used by gdb; the other will be used to give access to the console of the testing machine. This saves having another monitor on the desk and makes capturing Oopsen a lot easier. The ethernet connection is useful for multiple remote logins and for quickly transferring kernel images etc. Here is the serial cable pin out, 9-pin D-sub (DB9) to 9-pin D-sub (DB9):

         Solder-side pins:
    
         \-----------------------/
          \  1   2   3   4   5  /
           \   6   7   8   9   /
            \-----------------/
    
         Wiring: (use 7 or 10 wire foil screened cable)
    
         1
         |
         6---------------4
    
         2---------------3
    
         3---------------2
    
         4---------------6
                         |
                         1
    
         5---------------5
    
         7---------------8
    
         8---------------7
    

    Connect the cable screen to the chassis of one of the connectors; this will help prevent an earth loop between the two machines. This pin out was taken from the Text-Terminal-HOWTO; I decided to reproduce it here to remove the confusion over which of the many pin outs to use. A standard null-modem cable may work if you don't want to build the cable(s) yourself; however, I had to modify mine to the above configuration to make it work. If you go out and buy a null-modem cable especially, I'd advise getting one with connectors that can be disassembled (i.e. not moulded-on connectors) to make modification possible.

    If you have a Radio Shack store nearby, here are some part numbers:

    • Serial cable: 26-152B (Female DB9 - Female DB9)

    • Null Modem adapter: 26-264B (Female DB9 - Male DB9)

    Thanks to Pragnesh Sampat for providing this information.

  • Connect the two machines together using the two serial cables: com1 to com1 and com2 to com2.
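Before soldering, or when checking a finished cable with a continuity tester, it can help to write the pin out above down as data. The sketch below is only an illustration: the cable ends are arbitrarily called "A" and "B", all function names are hypothetical, and the signal names are the standard RS-232 assignments for a 9-pin connector. It checks that the wiring is symmetric (so the cable works either way round) and lists the pin pairs that should show continuity; pin 9 is left unconnected, as in the diagram.

```python
from itertools import product
from typing import List, Set, Tuple

Pin = Tuple[str, int]  # (cable end "A" or "B", pin number 1-9)

# Each set is one net: pins that are electrically joined by the wiring above.
NETS: List[Set[Pin]] = [
    {("A", 1), ("A", 6), ("B", 4)},
    {("A", 2), ("B", 3)},
    {("A", 3), ("B", 2)},
    {("A", 4), ("B", 6), ("B", 1)},
    {("A", 5), ("B", 5)},
    {("A", 7), ("B", 8)},
    {("A", 8), ("B", 7)},
]

# Standard RS-232 signal names for a 9-pin connector.
SIGNALS = {1: "DCD", 2: "RXD", 3: "TXD", 4: "DTR", 5: "GND",
           6: "DSR", 7: "RTS", 8: "CTS", 9: "RI"}


def mirrored(nets):
    """Return the nets with the two cable ends swapped."""
    swap = {"A": "B", "B": "A"}
    return {frozenset((swap[end], pin) for end, pin in net) for net in nets}


def is_symmetric(nets) -> bool:
    """True if swapping the cable ends gives exactly the same wiring."""
    return mirrored(nets) == {frozenset(net) for net in nets}


def continuity_pairs(nets) -> List[Tuple[int, int]]:
    """(end A pin, end B pin) pairs that should show continuity end to end."""
    pairs = []
    for net in nets:
        a_pins = sorted(pin for end, pin in net if end == "A")
        b_pins = sorted(pin for end, pin in net if end == "B")
        pairs.extend(product(a_pins, b_pins))
    return sorted(pairs)


if __name__ == "__main__":
    print("symmetric:", is_symmetric(NETS))
    for a, b in continuity_pairs(NETS):
        print(f"A{a} ({SIGNALS[a]}) <-> B{b} ({SIGNALS[b]})")
```

Each transmit-type signal ends up on its receive-type counterpart at the far end (TXD to RXD, RTS to CTS, DTR to DSR and DCD), which is what makes this a null-modem cable.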

Software Setup

  • Install ssh on "kernighan".

  • Install sshd on "ritchie" (this is part of the ssh package for some distributions, including Debian).

  • Check you can log in via ssh to "ritchie" (note: you won't be able to log in as root, but you can su to root once logged in as a user, should you need to).

  • Give yourself read/write access to /dev/ttyS0 and /dev/ttyS1 on both machines.

  • Install minicom on both machines.

Compilation

  • Download and unpack the kernel source to your home directory on "kernighan".

  • Download the kgdb patch from the kgdb home page to the top level kernel source directory.

  • Apply the patch: bash-2.03$ cat kgdb_2.2.18.diff | patch -p2

  • make menuconfig (or copy your standard .config file and make oldconfig).

  • Select the usual configuration options, adding:

    o support for console on serial port under character devices
    o kernel support for gdb (new) under kernel hacking.
    

    I would advise compiling all additional options directly into the kernel, rather than as modules to start with.

  • If you have a keyboard directly attached to your testing machine, you may also want to add magic SysRq key support. Be sure to read (and maybe print out) Documentation/sysrq.txt.

  • Proceed with a normal compilation: make dep bzImage.

  • Copy the image over to "ritchie": scp arch/i386/boot/bzImage ritchie:bzImage-2.2.18-kgdb (note: you will need to have the same user name on "kernighan" and "ritchie" for this command to work). Alternatively, you could set up and use FTP (less secure).

  • ssh into "ritchie", become root and move the image into /boot.

  • Create a new entry in /etc/lilo.conf (on "ritchie"):

    image=/boot/bzImage-2.2.18-kgdb
    	label=kgdb
    	root=/dev/hda1
    	read-only
    	append=" gdb gdbttyS=0 console=ttyS1"
    

    The extra command line options tell the gdb debugging stub to listen on /dev/ttyS0 and tell the kernel to use /dev/ttyS1 as a serial console. You can also control lilo (i.e. choose which image to boot) from the serial console if you add:

    serial = 1,9600n8
    

    to the top of your lilo configuration file.

  • Run lilo (on "ritchie").

  • If you decided not to make the debugging kernel the default kernel, run lilo -R kgdb to make the new image boot following the next reboot only. (That way if the debugging kernel fails to boot for whatever reason, the machine will boot a working kernel next time to enable you to resolve the problem).

  • Create a file named .gdbinit in the top level directory of the kernel source tree on "kernighan", containing the following:

    define rmt
    set remotebaud 38400
    target remote /dev/ttyS0
    end
    
  • Run (as root) minicom -s on "kernighan", go to serial port setup and select these options:

    Serial Device          : /dev/ttyS1
    Lockfile Location      : /var/lock
    Callin Program         :
    Callout Program        :
    Bps/Par/Bits           : 9600 8N1
    Hardware Flow Control  : No
    Software Flow Control  : No
    
  • Go to "Save setup as dfl", and save settings as default before going to Exit, leaving minicom awaiting input.

The Debugging Session

  • Shutdown and reboot "ritchie".

  • After the BIOS Power On Self Test, the debug kernel should load on "ritchie", giving the following output in minicom on "kernighan":

    Linux version 2.2.18serialgdb (jfreak@kernighan) (gcc version 2.95.2 20000220 (Debian GNU/Linux)) #6 Fri Jun 15 17:02:55 BST 2001
    Detected 167046 kHz processor.
    Console: colour VGA+ 80x25
    Calibrating delay loop... 333.41 BogoMIPS
    Memory: 63556k/65536k available (704k kernel code, 408k reserved, 824k data, 44k init)
    Dentry hash table entries: 8192 (order 4, 64k)
    Buffer cache hash table entries: 65536 (order 6, 256k)
    Page cache hash table entries: 16384 (order 4, 64k)
    CPU: Intel Pentium 75 - 200 stepping 0c
    Checking 386/387 coupling... OK, FPU using exception 16 error reporting.
    Checking 'hlt' instruction... OK.
    Intel Pentium with F0 0F bug - workaround enabled.
    POSIX conformance testing by UNIFIX
    Trying to free free IRQ4
    Waitng for connection from remote gdb on ttyS0
    
  • On "kernighan", type gdb vmlinux (in the top level kernel source directory). gdb will start and you should see a license notice followed by a (gdb) prompt.

  • Type rmt at the gdb prompt (and press enter). This runs the rmt command defined in your .gdbinit file. If the serial link is working correctly, gdb should give the following output:

    (gdb) rmt
    0xc010da29 in breakpoint () at gdb.c:701
    701       if (initialized) BREAKPOINT();
    (gdb)
    

    Note: the hex address will be different for each kernel image. At this point, gdb is paused awaiting user input. You may set breakpoints, watch expressions etc. here before giving the continue command c.

  • The debug kernel will continue to boot (giving further output in minicom on "kernighan").

  • Once booting is finished, check that you can ssh into "ritchie".

  • You may then set up a test case to cause the debug kernel to run the code to be debugged.

  • You may add further breakpoints at any time by pressing Ctrl-C in gdb to get a (gdb) prompt.

Troubleshooting

If the serial link does not appear to be working, try going down this check list:

  • Check that the serial cables are connected to the right ports.

  • Check the serial ports have correct baud rate, parity settings etc.

  • Double check cable wiring against pin out diagram.

  • If possible, check the continuity of the serial cable with a multi-meter or continuity tester.

  • ssh into "ritchie" and set up a minicom to minicom session between the two machines. Typing in one minicom session should produce output on the other.

If you have a Palm Pilot, it can be used as a serial console in conjunction with the cradle (a very useful tool for troubleshooting). You'll need a copy of ptelnet, available from the ptelnet home page. Use the following options in Options->Terminal:

  • Mode: Serial

  • Return: CR

And in Options->Serial:

  • Baud: 9600

  • Parity: N

  • Word: 8

  • StopBits: 1

  • Xon/Xoff: not checked

  • RTS/CTS: not checked

If you are unable to ssh into "ritchie" when running the debug kernel, check the following:

  • You compiled in support for your ethernet card (not as a module!).

  • You compiled in sufficient network support to enable ethernet use.

  • Make sure you aren't root (or use ssh -l username to stop ssh from trying to log in as root).

Misc. Notes

  • The gdb interface defaults to 38400bps. This can be increased right up to 115200bps, which may speed up debugging sessions at the expense of stability.

  • The serial console defaults to 9600bps. This too can be increased up to 115200bps, which is useful if you want to run curses based programs on the console, as redraws are painfully slow at 9600bps!
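To put numbers on "painfully slow", here is a rough back-of-the-envelope sketch. It assumes 8N1 framing as configured in minicom above (one start bit, eight data bits, one stop bit, so ten bits on the wire per byte) and one byte per character cell on an 80x25 screen; real curses redraws also send escape sequences, so treat the figures as lower bounds. The function names are hypothetical.

```python
def bytes_per_second(baud: int, data_bits: int = 8, stop_bits: int = 1) -> float:
    """Payload bytes per second on an asynchronous serial line with no parity.

    Each byte is framed by one start bit plus the configured stop bits.
    """
    bits_per_byte = 1 + data_bits + stop_bits
    return baud / bits_per_byte


def redraw_seconds(baud: int, columns: int = 80, rows: int = 25) -> float:
    """Lower bound on the time to repaint a full text screen (one byte per cell)."""
    return (columns * rows) / bytes_per_second(baud)


if __name__ == "__main__":
    for baud in (9600, 38400, 115200):
        print(f"{baud:>6} bps: {bytes_per_second(baud):7.0f} bytes/s, "
              f"full redraw >= {redraw_seconds(baud):.2f} s")
```

At 9600bps a full-screen repaint takes a little over two seconds; at 115200bps it drops to under a fifth of a second.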

Using the logs

Of course, it is possible to debug the kernel without a debugger, simply by printing out lots of tracing to the log files and carrying out a post-mortem afterwards. One problem with this approach is that if your code is called a lot, you can end up flooding the log files to the extent that the machine becomes unusable and/or you fill your disk up! One way to avoid this problem is to put your printk statements inside conditional code that is only executed by a "special" user (e.g. a user with a UID of 9999), then call/exercise that code as that special user:

...
	if (current->uid == 9999)	/* only trace for the "special" debug user */
		printk(KERN_DEBUG "your debug info\n");
...

Like most techniques, this one is good in some situations and not so good in others; try it out and find how/when it works best for you. Many thanks to Andrew Morton for posting this tip to the Linux Kernel Mailing list.
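The same gating idea can be tried out in userspace before touching kernel code. The sketch below is a userspace analogue in Python, not kernel code: `debug_trace` and `DEBUG_UID` are hypothetical names, and the process's real UID (via os.getuid(), available on Unix-like systems) stands in for current->uid.

```python
import os
from typing import Callable, Optional

DEBUG_UID = 9999  # the "special" user from the text; any otherwise unused UID works


def debug_trace(message: str,
                uid: Optional[int] = None,
                sink: Callable[[str], None] = print) -> bool:
    """Emit `message` only when the caller is the designated debug user.

    `uid` defaults to the real UID of the current process; passing it
    explicitly makes the gate easy to exercise without switching users.
    Returns True if the message was emitted.
    """
    if uid is None:
        uid = os.getuid()
    if uid != DEBUG_UID:
        return False        # everyone else pays only for one comparison
    sink(message)
    return True


if __name__ == "__main__":
    debug_trace("entered hot path")                 # silent unless run as UID 9999
    debug_trace("entered hot path", uid=DEBUG_UID)  # always printed
```

The point of the design is that the check is cheap and the trace volume is driven by one test user's activity rather than by every process on the machine.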