Interpreting Pintos Backtraces

When writing programs in C or C++, you may occassionally make a mistake and dereference a pointer to memory that is not mapped or to which you do not have access permission. If this happens in an ordinary user program, you would receive a so-called "segmentation fault" or "general protection fault", which would ordinarily result in the termination of the program.

The handling of such faults is implemented as a combination of hardware and software. The hardware (specifically the CPU's memory management unit or MMU) knows which addresses are valid and which are not. When an instruction attempts to read from or write to an address that is not valid, the CPU aborts this instruction and jumps to a fault handler at a particular address ("intr_entry" in Pintos).

This fault handler must handle this situation. If the fault occurred in user code (which is possible in Pintos Project 2 and up), then the user process must be terminated. If the fault occurred in kernel code, then this is indicative of a bug (the kernel should never access invalid addresses). In this case, Pintos attempts to output as much useful information as it can to the console, before calling PANIC() which will halt the machine.

Let's look at an example. When running test_priority_change, you might see:

gback@cantaloupe [21](~/pintos/pintos-solution2/src/threads/build) > pintos  -v -T 60  -- -q run priority-change
Writing command line to /tmp/Wh1QpnjYEa.dsk...
bochs -q
========================================================================
                       Bochs x86 Emulator 2.2.5
             Build from CVS snapshot on December 30, 2005
========================================================================
00000000000i[     ] reading configuration from bochsrc.txt
00000000000i[     ] installing nogui module as the Bochs GUI
00000000000i[     ] using log file bochsout.txt
Pintos booting with 4,096 kB RAM...
Kernel command line: -q run priority-change
374 pages available in kernel pool.
373 pages available in user pool.
Interrupt 0x0e (#PF Page-Fault Exception) at eip=0xc0101647
 cr2=00000000 error=00000002
 eax=00000000 ebx=00000089 ecx=00000004 edx=00000002
 esi=00007369 edi=c0111000 esp=c0116000 ebp=c002ff58
 cs=0008 ds=0010 es=0010 ss=d581
Kernel PANIC at ../../threads/interrupt.c:354 in intr_handler(): Unexpected interrupt
Call stack: 0xc01088c4 0xc0101fce 0xc010218b 0xc0101647 0xc0100e9f 0xc01008d8 0xc01000a2.
The `backtrace' program can make call stacks useful.
Read "Backtraces" in the "Debugging Tools" chapter
of the Pintos documentation for more information.
Timer: 0 ticks
Thread: 0 idle ticks, 0 kernel ticks, 0 user ticks
Console: 774 characters output
Keyboard: 0 keys pressed
Powering off...
========================================================================
Bochs is exiting with the following message:
[UNMP ] Shutdown port: shutdown requested
========================================================================
Your first step (the automated grading scripts even do that for you if they detect a test fails) is to run backtrace on the list of addresses listed after Call stack:
gback@cantaloupe [22](~/pintos/pintos-solution2/src/threads/build) > backtrace 0xc01088c4 0xc0101fce 0xc010218b 0xc0101647 0xc0100e9f 0xc01008d8 0xc01000a2
0xc01088c4: debug_panic (../../lib/kernel/debug.c:32)
0xc0101fce: intr_handler (..../threads/interrupt.c:356)
0xc010218b: intr_entry (..../threads/intr-stubs.S:38)
0xc0101647: alloc_frame (../../threads/thread.c:635)
0xc0100e9f: thread_create (../../threads/thread.c:291)
0xc01008d8: thread_start (../../threads/thread.c:105)
0xc01000a2: main (../../threads/init.c:123)
This backtrace is read backwards. It says that main() called thread_start() on line 123 in file threads/init.c, that thread_start() called thread_create() on line 105 in thread.c, and so on.

You should first examine the backtrace if it makes sense (meaning whether indeed in your code, in thread.c, on line 105, main() calls thread_start(). See Section E.4 Backtraces in the Pintos documentation for more information on when not to trust the information displayed.) In this case, you'll notice that everything checks out, except for intr_entry(), which is not called in alloc_frame() on thread.c:635. Instead, this is where the abrupt control transfer happened as a result of the page fault exception. You need to take a closer look. You should first look at line 635 in thread.c and see whether it is obvious what's causing the fault.

If you can't see it already, you'll have to dig deeper. The state dump in the interrupt handler gives you two more important pieces of information: it tells you the instruction that caused the fault, and it tells you the address the instruction tried to access. Look at eip for the address of the faulting instruction (it's eip because that's where the instruction pointer (aka program counter) was when the fault happened. In this case, the fault happened at eip=0xc0101647.)

You can use the objdump program to disassemble your kernel and see what that instruction is (alternatively, gdb will tell you as well, see the "disass" command.) objdump works like so:

gback@cantaloupe [24](~/pintos/pintos-solution2/src/threads/build) > objdump -d kernel.o | less

 ... Now type  /^c0101647
 and 'b' to go back a page until you see all of alloc_frame:

c01015de <alloc_frame>:
c01015de:       55                      push   %ebp
c01015df:       89 e5                   mov    %esp,%ebp
c01015e1:       83 ec 08                sub    $0x8,%esp
c01015e4:       ff 75 08                pushl  0x8(%ebp)
c01015e7:       e8 9c fe ff ff          call   c0101488 <is_thread>
c01015ec:       83 c4 04                add    $0x4,%esp
c01015ef:       0f b6 c0                movzbl %al,%eax
c01015f2:       85 c0                   test   %eax,%eax
c01015f4:       75 21                   jne    c0101617 <alloc_frame+0x39>
c01015f6:       83 ec 0c                sub    $0xc,%esp
c01015f9:       68 71 d6 10 c0          push   $0xc010d671
c01015fe:       68 4e d5 10 c0          push   $0xc010d54e
c0101603:       68 94 d7 10 c0          push   $0xc010d794
c0101608:       68 78 02 00 00          push   $0x278
c010160d:       68 65 d5 10 c0          push   $0xc010d565
c0101612:       e8 4d 72 00 00          call   c0108864 <debug_panic>
c0101617:       8b 45 0c                mov    0xc(%ebp),%eax
c010161a:       83 e0 03                and    $0x3,%eax
c010161d:       85 c0                   test   %eax,%eax
c010161f:       74 21                   je     c0101642 <alloc_frame+0x64>
c0101621:       83 ec 0c                sub    $0xc,%esp
c0101624:       68 a0 d7 10 c0          push   $0xc010d7a0
c0101629:       68 4e d5 10 c0          push   $0xc010d54e
c010162e:       68 94 d7 10 c0          push   $0xc010d794
c0101633:       68 79 02 00 00          push   $0x279
c0101638:       68 65 d5 10 c0          push   $0xc010d565
c010163d:       e8 22 72 00 00          call   c0108864 <debug_panic>
c0101642:       b8 00 00 00 00          mov    $0x0,%eax
c0101647:       c7 00 2a 00 00 00       movl   $0x2a,(%eax)
c010164d:       8b 45 08                mov    0x8(%ebp),%eax
c0101650:       8b 50 18                mov    0x18(%eax),%edx
c0101653:       8b 45 0c                mov    0xc(%ebp),%eax
c0101656:       29 c2                   sub    %eax,%edx
c0101658:       8b 45 08                mov    0x8(%ebp),%eax
c010165b:       89 50 18                mov    %edx,0x18(%eax)
c010165e:       8b 45 08                mov    0x8(%ebp),%eax
c0101661:       8b 40 18                mov    0x18(%eax),%eax
c0101664:       c9                      leave  
c0101665:       c3                      ret    
The instruction at 0xc0101647 is shown in red. It attempts to write the value 0x2a to memory at the address contained in %eax. The previous instruction, shown in blue, set %eax to 0, however. Consequently, this was an attempt to dereference a NULL pointer.

You can confirm this by looking at the value of %eax in the dump above (eax=00000000). You also see that 0 was the faulting address, which is stored in cr2 (cr2=00000000). Let's look at thread.c, line 635 and up:

0626 /* Allocates a SIZE-byte frame at the top of thread T's stack and
0627    returns a pointer to the frame's base. */
0628 static void *
0629 alloc_frame (struct thread *t, size_t size) 
0630 {
0631   /* Stack data is always allocated in word-size units. */
0632   ASSERT (is_thread (t));
0633   ASSERT (size % sizeof (uint32_t) == 0);
0634 
0635   *(void**)NULL = 42;
0636   t->stack -= size;
0637   return t->stack;
0638 }
The offending line is shown in red. Indeed, I am trying to dereference a NULL pointer.

Note that if you dereference a NULL pointer, the offending address may not always be 00000000. For instance, if 'p' is a NULL pointer of type struct P *, then an access to p->x will attempt to access memory at offset(struct P, x), which might be 00000004, 00000008, 0000000c etc.

Conversely, if a list_elem pointer is 0, then list_entry might give you a pointer that is below zero (recall that -1 is FFFFFFFF, -2 is FFFFFFFE, and so on.) For instance, if you have:

struct X {
	int x;
	struct list_elem elem;
}
and you use struct X * xp = list_entry(e, struct X, elem); and access xp->x, then the offending address will be fffffffc if e was NULL:
	struct list_elem * e = NULL;	// a NULL pointer
	struct X * xp = list_entry(e, struct X, elem);	// no memory access here, just pointer arithmetic
	xp->x;	// accesses memory at -4 which is FFFFFFFC

Note that there's nothing special about NULL. It's just an invalid address because of how Pintos maps its memory - most OS make sure that address 00000000 and the addresses around it (typically at least -4096 ... + 4095) are invalid, because it helps detect accesses to NULL pointers. The same scenario would apply to any invalid address. In Pintos Project 1, only addresses between 0xC0000000 (PHYS_BASE) and 0xC0400000 (PHYS_BASE plus 4 MB) are valid, assuming you run Pintos with 4MB of physical memory. All other addresses are invalid and cause page fault exceptions.