How to use the Pintos GDB macros

Written by Godmar Back <gback@cs.vt.edu>, March 2006
for CS 3204 - Operating Systems.

This page contains examples of how to use the Pintos gdb macros I wrote. These macros add user-defined commands to gdb that can help you debug your Pintos kernel. Before you can use them, you need to make the commands known to gdb with the source command. Once they are loaded, you can type help user-defined to see help on these commands. To pick up the latest version on the Virginia Tech machines, use:

source /home/courses/cs3204/gb/pintos-gdb-macros
after you start gdb.

The following table gives an overview of the commands this file provides:

debugpintos
Attaches the debugger to a waiting pintos process on the same machine. Short for target remote localhost:1234.

dumplist
Prints the elements of a Pintos list. Takes three parameters: a struct list, the declared type of the list elements (without the word "struct"!), and the name of the list_elem field in that struct used to link the list elements.

Example: dumplist all_list thread all_elem prints all elements of type "struct thread" that are linked in "struct list all_list" using the "struct list_elem all_elem" which is part of "struct thread".
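
The three dumplist parameters correspond to the three pieces of information needed to walk an intrusive list: the list itself, the type of the enclosing struct, and the name of the embedded link field. The following is a minimal sketch of that idea in C. The struct layouts, the ELEM_TO_STRUCT macro, and the sketch_ helper names are simplified stand-ins invented for illustration; they are not the real Pintos definitions, nor the macro's actual implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for the Pintos list types (assumed layout). */
struct list_elem { struct list_elem *prev, *next; };
struct list { struct list_elem head, tail; };

/* Hypothetical, cut-down struct thread with an embedded link field. */
struct thread {
    int tid;
    char name[16];
    struct list_elem all_elem;
};

/* Convert a pointer to the embedded list_elem back into a pointer to the
   enclosing struct by subtracting the offset of the link field. */
#define ELEM_TO_STRUCT(ELEM, STRUCT, MEMBER) \
    ((STRUCT *) ((uint8_t *) (ELEM) - offsetof(STRUCT, MEMBER)))

/* Set up an empty list: head and tail sentinels point at each other. */
static void sketch_list_init(struct list *l) {
    l->head.prev = NULL;
    l->head.next = &l->tail;
    l->tail.prev = &l->head;
    l->tail.next = NULL;
}

/* Link element e in just before the tail sentinel. */
static void sketch_list_push_back(struct list *l, struct list_elem *e) {
    e->prev = l->tail.prev;
    e->next = &l->tail;
    l->tail.prev->next = e;
    l->tail.prev = e;
}

/* Walk the list from head to tail, recover each struct thread from its
   all_elem field, and sum the thread ids. */
static int sum_tids(struct list *l) {
    int sum = 0;
    for (struct list_elem *e = l->head.next; e != &l->tail; e = e->next) {
        struct thread *t = ELEM_TO_STRUCT(e, struct thread, all_elem);
        sum += t->tid;
    }
    return sum;
}
```

In this sketch, "all_list", "thread", and "all_elem" play the same roles as the three arguments in the dumplist example above.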

btthread
Shows the backtrace of a thread. Takes one parameter, which is a pointer to the "struct thread" of the thread whose backtrace it should show. For the current thread, this is identical to the "bt" or backtrace command. However, it also works for threads that are suspended in schedule(), provided you know where their kernel stack page is located.

btthreadlist
Shows the backtraces of all threads in a list. Takes two parameters: the struct list in which the threads are kept, and the list_elem field used inside struct thread to link the threads together.

Example: btthreadlist all_list all_elem shows the backtraces of all threads contained in "struct list all_list", linked together by "all_elem". This command is useful to determine where your threads are stuck when a deadlock occurs. Please see the example scenario below.

btpagefault
Prints a backtrace of the current thread after a page fault exception. Takes no parameters. Normally, when a page fault exception occurs, gdb will stop with a message that might say:
Program received signal 0, Signal 0.
0xc0102320 in intr0e_stub ()
In that case, the "bt" command might not give a useful backtrace. Use btpagefault instead.

btpagefault gives you a usable backtrace for page faults that occur in kernel code or in user code. Usually, you will need to track down exceptions occurring in kernel code. Should you need to track down exceptions occurring in a user process, you may also wish to load its symbol table using add-symbol-file, as outlined in section E.5 (GDB) of the Pintos documentation.

hook-stop
hook-stop is not a command you invoke; gdb invokes it every time your kernel stops because a page fault exception occurred. Page fault exceptions occur whenever a process accesses memory to which it does not have permissions according to the currently active page directory. There are multiple cases, but they break down into two groups: your process was executing user code when it accessed an invalid address, or your process was executing kernel code.

The stop hook will show you which is the case. If the exception occurred from user code, it will say:

pintos-debug: a page fault exception occurred in user mode
pintos-debug: hit 'c' to continue, or 's' to step to intr_handler
Page faults in user mode require the following actions. In Project 2, they lead to the termination of the process. You should expect those page faults to occur in the robustness tests, where we test that your kernel properly terminates processes that try to access invalid addresses. To debug those, set a breakpoint in page_fault in exception.c, which you will need to modify accordingly.

In Project 3, such page faults no longer automatically lead to the termination of a process. Rather, you may have to page in the page containing the address the process was trying to access, either because it was swapped out or because this is the first time it is accessed. In either case, you will reach page_fault and need to program the appropriate action there.

If the page fault did not occur in user mode while executing user code, then it occurred in kernel mode while executing kernel code. In this case, the stop hook will print this message:

pintos-debug: a page fault occurred in kernel mode
followed by the output of the btpagefault command.

In Project 2, page fault exceptions in kernel code are always bugs in your kernel, because your kernel should never crash. In Project 3, the same applies, unless you chose the get_user/put_user strategy to verify user memory accesses as outlined in 4.1.5 Accessing User Memory, which, as a reminder, we recommend only for the more daring among you. Please see the example scenario below.
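
When you reach page_fault, one way to tell the two groups apart is the error code the CPU pushes for a page fault. The sketch below decodes its three low bits, which are defined by the x86 architecture. The PF_ names follow the constants used in the Pintos sources; struct fault_info and decode_page_fault are hypothetical names introduced only for this illustration, and this is not a description of how the stop hook itself is implemented.

```c
#include <stdbool.h>
#include <stdint.h>

/* Low bits of the x86 page fault error code. */
#define PF_P 0x1   /* 0: not-present page, 1: access rights violation */
#define PF_W 0x2   /* 0: read access,      1: write access */
#define PF_U 0x4   /* 0: kernel mode,      1: user mode */

/* Hypothetical helper type: the decoded cause of a page fault. */
struct fault_info {
    bool not_present;  /* true if the page was not present */
    bool write;        /* true if the faulting access was a write */
    bool user;         /* true if the access came from user mode */
};

/* Decode the error code pushed by the CPU for a page fault. */
static struct fault_info decode_page_fault(uint32_t error_code) {
    struct fault_info info;
    info.not_present = (error_code & PF_P) == 0;
    info.write = (error_code & PF_W) != 0;
    info.user = (error_code & PF_U) != 0;
    return info;
}
```

An error code of 0x4, for example, describes a user-mode read of a page that is not present, whereas 0x2 describes a kernel-mode write to a page that is not present.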

Example scenarios follow. Input you'd type into gdb appears after the (gdb) prompt. The remaining output comes either from Bochs/Pintos or from gdb, as indicated by the surrounding text.

Using btthreadlist to detect where Pintos is stuck

In this example, I introduced a bug in my solution to Project 1 where I would occasionally forget to wake up threads that called timer_sleep(). Consequently, tests such as mlfqs-load-1 get stuck.

To debug this, I start pintos with the --gdb option:

gback@rambutan [3](~/pintos/pintos-solution2/src/threads/build) > pintos -v --gdb -- -q -mlfqs run  mlfqs-load-1
Writing command line to /tmp/gDAlqTB5Uf.dsk...
bochs -q
========================================================================
                       Bochs x86 Emulator 2.2.5
             Build from CVS snapshot on December 30, 2005
========================================================================
00000000000i[     ] reading configuration from bochsrc.txt
00000000000i[     ] Enabled gdbstub
00000000000i[     ] installing nogui module as the Bochs GUI
00000000000i[     ] using log file bochsout.txt
Waiting for gdb connection on localhost:1234
Then I open a second window on the same machine (here "rambutan") and start gdb:
gback@rambutan [5](~/pintos/pintos-solution2/src/threads/build) > gdb kernel.o
GNU gdb Red Hat Linux (6.3.0.0-1.84rh)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i386-redhat-linux-gnu"...Using host libthread_db library "/lib/libthread_db.so.1".
Next, I include the gdb macros:
(gdb) source /home/courses/cs3204/public_html/spring2006/gback/pintos-gdb-macros
I tell gdb to attach to the waiting Pintos emulator:
(gdb) debugpintos
Remote debugging using localhost:1234
0x0000fff0 in ?? ()
Reply contains invalid hex digit 78
Now I tell Pintos to run by hitting 'c' (for continue) twice:
(gdb) c
Continuing.
Reply contains invalid hex digit 78
(gdb) c
Continuing.
Now Pintos will continue and output:
Pintos booting with 4,096 kB RAM...
Kernel command line: -q -mlfqs run mlfqs-load-1
374 pages available in kernel pool.
373 pages available in user pool.
Calibrating timer...  102,400 loops/s.
Boot complete.
Executing 'mlfqs-load-1':
(mlfqs-load-1) begin
(mlfqs-load-1) spinning for up to 45 seconds, please wait...
(mlfqs-load-1) load average rose to 0.5 after 42 seconds
(mlfqs-load-1) sleeping for another 10 seconds, please wait...
... until it gets stuck because of the bug I had introduced. I hit "Ctrl-C" in the debugger window:
Program received signal 0, Signal 0.
0xc010168c in next_thread_to_run () at ../../threads/thread.c:649
649	  while (i <= PRI_MAX && list_empty (&ready_list[i]))
(gdb) 
The thread that was running when I interrupted Pintos was the idle thread. If I run "backtrace", it shows this backtrace:
(gdb) bt
#0  0xc010168c in next_thread_to_run () at ../../threads/thread.c:649
#1  0xc0101778 in schedule () at ../../threads/thread.c:714
#2  0xc0100f8f in thread_block () at ../../threads/thread.c:324
#3  0xc0101419 in idle (aux=0x0) at ../../threads/thread.c:551
#4  0xc010145a in kernel_thread (function=0xc01013ff <idle>, aux=0x0)
    at ../../threads/thread.c:575
#5  0x00000000 in ?? ()
Not terribly useful. What I would really like to know is what's up with the other thread (or threads). Since I keep all threads in a linked list called all_list, linked together by a struct list_elem all_elem, I can use the btthreadlist command, which is only available in gdb after sourcing the gdb macros I wrote. btthreadlist iterates through the list of threads and prints the backtrace for each thread:
(gdb) btthreadlist all_list all_elem
pintos-debug: dumping backtrace of thread 'main' @0xc002f000
#0  0xc0101820 in schedule () at ../../threads/thread.c:722
#1  0xc0100f8f in thread_block () at ../../threads/thread.c:324
#2  0xc0104755 in timer_sleep (ticks=1000) at ../../devices/timer.c:141
#3  0xc010bf7c in test_mlfqs_load_1 () at ../../tests/threads/mlfqs-load-1.c:49
#4  0xc010aabb in run_test (name=0xc0007d8c "mlfqs-load-1")
    at ../../tests/threads/tests.c:50
#5  0xc0100647 in run_task (argv=0xc0110d28) at ../../threads/init.c:281
#6  0xc0100721 in run_actions (argv=0xc0110d28) at ../../threads/init.c:331
#7  0xc01000c7 in main () at ../../threads/init.c:140

pintos-debug: dumping backtrace of thread 'idle' @0xc0116000
#0  0xc010168c in next_thread_to_run () at ../../threads/thread.c:649
#1  0xc0101778 in schedule () at ../../threads/thread.c:714
#2  0xc0100f8f in thread_block () at ../../threads/thread.c:324
#3  0xc0101419 in idle (aux=0x0) at ../../threads/thread.c:551
#4  0xc010145a in kernel_thread (function=0xc01013ff <idle>, aux=0x0)
    at ../../threads/thread.c:575
#5  0x00000000 in ?? ()
In this case, there are only two threads, the idle thread and the main thread. The kernel stack pages (to which the struct thread points) are at 0xc0116000 and 0xc002f000, respectively. The main thread is stuck in timer_sleep(), called from test_mlfqs_load_1.
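
The addresses printed after "@" are page-aligned because Pintos places each struct thread at the start of its own 4 kB page, with the thread's kernel stack growing downward from the end of that same page. Any address on a thread's kernel stack can therefore be mapped back to its struct thread by rounding down to a page boundary. A minimal sketch of that arithmetic follows; stack_page_base is a hypothetical helper name, and 32-bit addresses are assumed.

```c
#include <stdint.h>

#define PGBITS 12                 /* 4 kB pages: 12 offset bits */
#define PGSIZE (1u << PGBITS)     /* bytes per page */

/* Round a kernel stack address down to the start of its page, which is
   where the owning struct thread lives. */
static uint32_t stack_page_base(uint32_t kernel_stack_addr) {
    return kernel_stack_addr & ~(uint32_t) (PGSIZE - 1);
}
```

For instance, a stack address such as 0xc002fe18 rounds down to 0xc002f000, the address shown for the main thread above.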

Knowing where threads are stuck can be tremendously useful, for instance when diagnosing deadlocks or unexplained hangups.

The full output of this session is here: what I typed in gdb and what I typed to run Pintos. (Note that this output was captured with a slightly older version of those macros, so it looks slightly different than it would now.)

Using hook-stop to detect the cause of a page fault exception

In this example, I (re-)introduced a bug in my solution to Project 4 where Pintos would crash early on while moving a file from the scratch disk onto its newly formatted file system. You may encounter bugs like this in your own implementation in projects 2 to 4.

I start pintos with the --gdb switch. (As an aside, if you're interested in debugging a particular test case, you can copy and paste the command from the output of "make check", replacing "-T 60" with "--gdb", to start pintos the way the test scripts would. Also, make sure you use bochs, so don't specify --qemu.)

gback@nectarine [1](~/pintos/pintos-solution2/src/filesys/build) > pintos --fs-disk=2 --swap-disk=4 -v --gdb -p tests/userprog/args-none -a args-none -- -f -q run args-none
Copying tests/userprog/args-none into /tmp/Ol0QMvme0N.dsk...
Writing command line to /tmp/9gwksyasZB.dsk...
bochs -q
========================================================================
                       Bochs x86 Emulator 2.2.5
             Build from CVS snapshot on December 30, 2005
========================================================================
00000000000i[     ] reading configuration from bochsrc.txt
00000000000i[     ] Enabled gdbstub
00000000000i[     ] installing nogui module as the Bochs GUI
00000000000i[     ] using log file bochsout.txt
Waiting for gdb connection on localhost:1234
I then start gdb in a separate window on the same machine and attach to the bochs emulator:
gback@nectarine [1](~/pintos/pintos-solution2/src/filesys/build) > gdb kernel.o
GNU gdb Red Hat Linux (6.3.0.0-1.84rh)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i386-redhat-linux-gnu"...Using host libthread_db library "/lib/libthread_db.so.1".

(gdb) source /home/courses/cs3204/public_html/spring2006/gback/pintos-gdb-macros
(gdb) debugpintos
0x0000fff0 in ?? ()
Reply contains invalid hex digit 78
(gdb) c
Continuing.
Reply contains invalid hex digit 78
(gdb) c
Continuing.
The pintos process continues:
Pintos booting with 4,096 kB RAM...
Kernel command line: -f -q put args-none run args-none
366 pages available in kernel pool.
365 pages available in user pool.
Calibrating timer...  102,400 loops/s.
hd0:0: detected 1,008 sector (504 kB) disk, model "Generic 1234", serial "BXHD00011"
hd0:1: detected 4,032 sector (1 MB) disk, model "Generic 1234", serial "BXHD00012"
hd1:0: detected 1,008 sector (504 kB) disk, model "Generic 1234", serial "BXHD00021"
hd1:1: detected 8,064 sector (3 MB) disk, model "Generic 1234", serial "BXHD00022"
Initialized buffer cache for 64 sectors
Formatting filesystem...done.
user pages 365
using swap space for 1008 pages
Boot complete.
Putting 'args-none' into the file system...
at which point it gets stuck, because a page fault exception has occurred. In the gdb window, hook-stop is invoked automatically, which shows this useful backtrace:
Program received signal 0, Signal 0.
Current language:  auto; currently asm
pintos-debug: a page fault occurred in kernel mode
Current language:  auto; currently c
#0  0xc010368e in lock_held_by_current_thread (lock=0x10)
    at ../../threads/synch.c:359
#1  0xc01032e3 in lock_acquire (lock=0x10) at ../../threads/synch.c:254
#2  0xc010e92b in dir_reopen (dirp=0x0) at ../../filesys/directory.c:97
#3  0xc010d8d2 in filesys_resolve_directory (name=0xc0007d88 "args-none", 
    pdir=0xc002ff20, basename=0xc002ff1c) at ../../filesys/filesys.c:76
#4  0xc010da04 in filesys_create (name=0xc0007d88 "args-none", 
    initial_size=38846, type=FILE) at ../../filesys/filesys.c:119
#5  0xc01107d5 in fsutil_put (argv=0xc0116a08) at ../../filesys/fsutil.c:105
#6  0xc01007f0 in run_actions (argv=0xc0116a08) at ../../threads/init.c:331
#7  0xc01000fe in main () at ../../threads/init.c:140
Current language:  auto; currently asm
0xc0102320 in intr0e_stub ()
(gdb)
I can see that I was calling "dir_reopen" with a NULL pointer (see frame #2), which leads to an attempt to acquire a lock located at address 0x10 (which happens to be the offset of an embedded struct lock inside struct dir in my implementation of Project 4). Address 0x10 is an invalid address, leading to a page fault exception.
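
The step from a NULL struct pointer to the small address 0x10 is plain offset arithmetic: the address of an embedded member is the base address plus the member's offset. The sketch below illustrates this with a made-up layout. struct dir_sketch, struct lock_sketch, and their fields are hypothetical and chosen only so that the lock lands at offset 0x10; the real struct dir in that solution is not shown in this document.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical lock type, just large enough to embed. */
struct lock_sketch {
    uint32_t holder;
    uint32_t value;
};

/* Hypothetical directory struct: four 4-byte fields (16 bytes in total)
   precede the embedded lock, so the lock sits at offset 0x10. */
struct dir_sketch {
    uint32_t inode;
    uint32_t pos;
    uint32_t open_cnt;
    uint32_t flags;
    struct lock_sketch lock;
};

/* Address of the embedded lock for a directory located at dir_address.
   With dir_address == 0 (a NULL pointer), this yields the offset itself. */
static uintptr_t lock_address(uintptr_t dir_address) {
    return dir_address + offsetof(struct dir_sketch, lock);
}
```

A small nonzero argument such as lock=0x10 in a backtrace is therefore a strong hint that a member of a NULL struct pointer is being accessed.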

The bug turned out to occur because this code is invoked before process_init(), and at this point in time the running thread did not yet have a current directory. However, the path name "args-none" is a relative path. The solution is to treat relative paths like absolute paths when the filesystem is used before processes have been initialized. A simple check whether the current directory of the current thread was initialized fixed that problem.
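
A check of this kind might look like the following sketch. It is an assumption-laden illustration, not the actual fix: struct dir_handle, start_dir_for, and the convention that a NULL current directory means "not yet initialized" are all hypothetical.

```c
#include <stddef.h>

/* Hypothetical stand-in for an open directory handle. */
struct dir_handle {
    int sector;
};

/* Choose the directory in which path resolution should start.
   Absolute paths start at the root. Relative paths start at the current
   directory, unless none has been set up yet (cwd == NULL), in which
   case they are treated like absolute paths. */
static struct dir_handle *start_dir_for(const char *path,
                                        struct dir_handle *cwd,
                                        struct dir_handle *root) {
    if (path[0] == '/' || cwd == NULL)
        return root;
    return cwd;
}
```

With such a guard, resolving "args-none" before any process exists starts at the root directory instead of passing a NULL pointer further down.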

Diagnosing double and triple faults

Note that in many cases, when running your kernel without gdb, you will see a kernel panic and a backtrace when a page fault occurs in kernel mode. This happens because unless you modify page_fault() in exception.c, page_fault() will call kill() in exception.c. This in turn will output the "Kernel bug - unexpected interrupt in kernel" message with which you are probably familiar by now. After that, the kernel will halt and bochs will stop the simulation.

However, in Project 2 you will have to modify page_fault(), if only to identify page faults that occur in user mode because of misbehaving processes. (As an aside, you can look at kill() to see how it establishes that the exception occurred in user code.) In Project 3, you may even have to continue after resolving page faults that occur in user mode. [ In addition, as pointed out above, if you choose the put_user()/get_user() option to verify user memory accesses, you may have to continue even after page faults that occurred in kernel code --- if the address was a user virtual address accessed inside get_user()/put_user(), which are both part of kernel code. ]
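
On x86, the origin of an exception can be read from the code segment selector saved in the interrupt frame: the low two bits of the selector hold the privilege level, which is 0 for kernel code and 3 for user code. The sketch below shows that test in isolation. The function names are hypothetical, and the selector values mentioned afterwards (0x08 for the kernel code segment, 0x1b for the user code segment) are assumptions about a typical Pintos configuration that you should check against your own sources.

```c
#include <stdbool.h>
#include <stdint.h>

/* The low two bits of a segment selector hold its privilege level. */
static unsigned selector_privilege(uint16_t cs) {
    return cs & 0x3u;
}

/* True if the saved code segment selector belongs to user code (ring 3). */
static bool fault_came_from_user(uint16_t saved_cs) {
    return selector_privilege(saved_cs) == 3;
}
```

Under the assumed selector values, fault_came_from_user(0x1b) is true and fault_came_from_user(0x08) is false.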

But what if your page fault exception handling implementation is itself buggy? If such a bug causes an invalid memory access, you will experience a so-called double fault. The resulting page fault exception is delivered to the same interrupt handler, which means that intr0e_stub() will be called, which may then reenter page_fault() in exception.c. Note that in this case the exception frame will say that the exception came from kernel code, so page_fault(), unless you changed it, will call kill(), which will hopefully lead to a backtrace and halt.

However, if you are not prepared for that to happen and your code causes yet another page fault exception while handling the double fault, a triple fault occurs. Triple faults are the bane of all OS developers, though they are known to happen even in some widely used operating systems. They are dreaded because they lead to a shutdown of the CPU, which typically causes the machine to reset and reboot.

In my implementation of Pintos, I had chosen the get_user/put_user approach in projects 2 and up. When I ran my Pintos kernel without gdb, it would spontaneously --- and continuously --- reboot, without printing any kernel panic or backtrace.

Fortunately, the hook-stop implementation helped me debug what was causing the triple-fault. After the first page fault (which was caused by my accessing NULL in dir_reopen(), see above), I instructed the kernel to continue:

(gdb) c
Continuing.
which triggers the double-fault. The stop hook executed and gave this useful backtrace:
Program received signal 0, Signal 0.
pintos-debug: a page fault occurred in kernel mode
Current language:  auto; currently c
#0  0xc010a720 in find_bucket (h=0x0, e=0xc002fd60)
    at ../../lib/kernel/hash.c:301
#1  0xc010a37d in hash_find (h=0x0, e=0xc002fd60)
    at ../../lib/kernel/hash.c:127
#2  0xc011130a in page_handle_fault (irqf=0xc002fe18, faultaddr=0x10)
    at ../../vm/page.c:242
#3  0xc010c61b in page_fault (f=0xc002fe18) at ../../userprog/exception.c:157
#4  0xc01020e5 in intr_handler (frame=0xc002fe18)
    at ../../threads/interrupt.c:356
#5  0xc0102297 in intr_entry () at ../../threads/intr-stubs.S:37
#6  0xc002fe18 in ?? ()
#7  0xc0115ed9 in __func__.1624 ()
#8  0xc012a210 in ?? ()
#9  0xc002fe68 in ?? ()
#10 0xc002fe38 in ?? ()
#11 0x000000b9 in ?? ()
#12 0xc0007d65 in ?? ()
#13 0x00000009 in ?? ()
#14 0x00000010 in ?? ()
#15 0xc0020010 in ?? ()
#16 0xc0100010 in main () at ../../threads/init.c:77
Current language:  auto; currently asm
0xc0102320 in intr0e_stub ()
(gdb) 
Continuing now would lead to a triple-fault and reboot.

It turned out that I called page_handle_fault() in vm/page.c from page_fault() in userprog/exception.c. page_handle_fault() is a function I had written as part of my Project 3 solution, in which I determine if the address that the code attempted to access before incurring the page fault is a valid address for the currently executing process or not. (The weird name `page_handle_fault' comes about because I named all exported functions in page.c with a page_ prefix.)

To make that determination, I have to consult the page table of the current process. It turns out that because this exception occurred before the first process was started, the currently running thread did not yet have a page table. I ignored that case, which meant I called hash_find() with NULL as its first argument, which led to the page fault in find_bucket(). (Note that I followed the recommended implementation strategy in Project 3 in that I keep my per-process page table data structure entirely separate from the actual per-process page directory that the CPU's MMU hardware uses, as is recommended in 5.2.2 Page Table Management.)
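
The missing case can be handled with an early check before the lookup. The following sketch shows the shape of such a guard using a deliberately tiny stand-in for the supplemental page table. struct page_table_sketch and page_is_mapped are hypothetical; the real solution uses the Pintos hash table, which is not reproduced here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, minimal per-process page table: a short array of the
   page-aligned user addresses the process may legitimately access. */
struct page_table_sketch {
    size_t count;
    uint32_t pages[8];
};

/* Decide whether fault_addr lies in a page known to the current process.
   A thread that has no page table yet (pt == NULL) has no valid user
   pages, so the lookup is skipped instead of dereferencing NULL. */
static bool page_is_mapped(const struct page_table_sketch *pt,
                           uint32_t fault_addr) {
    if (pt == NULL)
        return false;

    uint32_t page = fault_addr & ~(uint32_t) 0xfff;  /* round down to page */
    for (size_t i = 0; i < pt->count; i++)
        if (pt->pages[i] == page)
            return true;
    return false;
}
```

With the guard in place, a fault at address 0x10 taken before the first process starts is reported as invalid rather than causing a second fault inside the handler.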

The full output of this session is here: what I typed in gdb and what I typed to run Pintos.

This page was last modified: Tuesday, 07-Mar-2006 22:34:42 EST