Project 3 Debugging

Written by Scott Schneider, 2010.

This document explains the first level of systematic debugging that we expect you will have performed before asking for help from us. Of course, we will help you debug your shell. But if you cannot answer our questions such as "What does valgrind tell you?" and "Where does your shell break?" we will refer you to this document to figure that out first.

You have access to two powerful debugging tools: gdb and valgrind. The following sections explain how to use them to gain a basic understanding of what your shell is doing and what may be going wrong:

  1. gdb
    1. Program Crashing
    2. Infinite Loops
  2. valgrind
    1. Accessing Out of Bounds Memory
    2. Accessing an Unitialized Variable

gdb

The most common bugs in the shell come in two flavors: program crashes and infinite loops. The following will help you diagnose where your program first exhibits bad behavior depending on which kind of bug you have.

Program Crashing

The most common cause of a program crash is a segmentation fault (segfault): your program accessed an address in memory that the operating system has not given your program permission to access. The most basic information you need to debug a crash is to know where, exactly, in your code that the program breaks. Assume the following C program:

  #include <stdlib.h>

  void foo()
  {
      char* ptr = NULL;
      *ptr = 0;
  }

  int main()
  {
      foo();

      return 0;
  }

As expected, when we execute this program we receive a segfault:

  [scschnei@cid ~]$ gcc -ggdb -Wall -o break break.c
  [scschnei@cid ~]$ ./break
  Segmentation fault

We know from inspection where the program crashed, but we will use gdb to confirm this. You should already be familiar with starting a program with gdb after Projects 1 and 2, so we'll jump straight to running the program:

  [scschnei@cid ~]$ gdb ./break
  GNU gdb Fedora (6.8-27.el5)
  Copyright (C) 2008 Free Software Foundation, Inc.
  License GPLv3+: GNU GPL version 3 or later 
  This is free software: you are free to change and redistribute it.
  There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
  and "show warranty" for details.
  This GDB was configured as "i386-redhat-linux-gnu"...
  (gdb) run
  Starting program: /home/grads/scschnei/break 

  Program received signal SIGSEGV, Segmentation fault.
  0x08048364 in foo () at break.c:6
  6		*ptr = 0;

Sure enough, the program generates a segfault on the line that tries to access a NULL pointer. We are also told what line number this occurs on. However, in non-trivial programs, functions are often called from many different places. Knowing the line the segfault occurs on is sometimes not enough; we also need to know how the program reached that function. We can learn this through the backtrace (short form: bt) command:

  (gdb) bt
  #0  0x08048364 in foo () at break.c:6
  #1  0x0804837c in main () at break.c:11

Now we know that our program caused a segfault on line 6 in the function foo, which was called at line 11 in main. Most programs are not as easy to diagnose as this trivial example (we can easily tell that it dereferences a NULL pointer), but obtaining a stacktrace is the first step in debugging such a problem.

Infinite Loops

Malformed lists often cause infinite loops. Common causes of malformed lists are putting an item onto one list without removing it from another; passing a list by value instead of passing a pointer to a list (as explained in #9 in the FAQ); and destructively modifying a list while iterating over it. But, before you can determine what the cause of the infinite loop is, you need to determine where, exactly, the program is stuck. Assume the following C program:
  #include <stdlib.h>
  #include <math.h>

  double square_root(double n)
  {
      return sqrt(n);
  }

  double square(double n)
  {
      return pow(n, 2.0);
  }

  int main()
  {
      double num = 3.14;
      while (1) {
          num = square(num);
          num = square_root(num);
      }

      return 0;
  }

We can tell by inspection that the loop in main will never terminate. But, as before, we will use gdb to verify what is going on in the program. When we execute our program in gdb, it will continue until we interrupt its execution by issuing the interrupt signal with control+c:

  [scschnei@cid ~]$ gdb ./inf
  GNU gdb Fedora (6.8-27.el5)
  Copyright (C) 2008 Free Software Foundation, Inc.
  License GPLv3+: GNU GPL version 3 or later 
  This is free software: you are free to change and redistribute it.
  There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
  and "show warranty" for details.
  This GDB was configured as "i386-redhat-linux-gnu"...
  (gdb) run
  Starting program: /home/grads/scschnei/inf 

  Program received signal SIGINT, Interrupt.
  square_root (n=3.1400000000000001) at inf.c:5
  5	{

We are told that the program was interrupted at line 5 of our program, and a stacktrace tells us what the calling sequence was:

  (gdb) bt
  #0  square_root (n=3.1400000000000001) at inf.c:5
  #1  0x08048459 in main () at inf.c:19

Line 19 is inside of our while loop. We tell gdb to continue (short form: c), interrupt it again, and request a backtrace:

  (gdb) c
  Continuing.

  Program received signal SIGINT, Interrupt.
  square (n=3.1400000000000001) at inf.c:12
  12	}
  (gdb) bt
  #0  square (n=3.1400000000000001) at inf.c:12
  #1  0x0804844b in main () at inf.c:18

While the program was interrupted at a different location (line 12 in our program), and is on a different line in main, we can recognize that the program is still stuck inside the same loop. Determining where a program is stuck is more involved than determining where a program crashes. For non-trivial programs with nested loops, you may need to interrupt the program several times to determine which loop the program is stuck in.

valgrind

Running your shell in valgrind can tell you if you have errors related to memory, such as accessing data you did not request, or using an unitialized value. These errors are subtle, because they often do not cause the program to crash at the location of the error. Rather, the program continues execution with bad data, and problems caused by the error can manifest much later. The following two errors are common in student shells, and can be detected by valgrind.

Accessing Out of Bounds Memory

When your program requests memory from malloc, the memory allocation policies behind malloc will probably request much more memory from the operating system than just the amount of memory your program requested. Hence, accessing a little past what you requested will probably not cause a segfault - but other data your program relies on could be there. Consider the following C program:

  #include <stdio.h>
  #include <stdlib.h>

  int main()
  {
      char* mem = malloc(4 * sizeof(char));
      int* num = malloc(sizeof(int));

      *num = 256;

      printf("before: %d\n", *num);

      mem[17] = 0;

      printf("after: %d\n", *num);
      return 0;
  }

Compiling and executing this program:

  [scschnei@cid ~]$ gcc -ggdb -Wall -o oob oob.c
  [scschnei@cid ~]$ ./oob
  before: 256
  after: 0

Note that the access mem[17] is outside of the amount of memory that we allocated for mem, but we own that memory. And, not only do we own that memory, but the integer pointed to by num is stored there. So not only does our program not crash, a value in our program mysteriously changes. For this program, detecting this error through inspection is easy. Detecting this kind of error in any interesting program is hard. For interesting programs, we can use valgrind. Running the above program with valgrind yields the following:

  [scschnei@cid ~]$ valgrind ./oob
  ==20460== Memcheck, a memory error detector.
  ==20460== Copyright (C) 2002-2006, and GNU GPL'd, by Julian Seward et al.
  ==20460== Using LibVEX rev 1658, a library for dynamic binary translation.
  ==20460== Copyright (C) 2004-2006, and GNU GPL'd, by OpenWorks LLP.
  ==20460== Using valgrind-3.2.1, a dynamic binary instrumentation framework.
  ==20460== Copyright (C) 2000-2006, and GNU GPL'd, by Julian Seward et al.
  ==20460== For more details, rerun with: -v
  ==20460== 
  before: 256
  ==20460== Invalid write of size 1
  ==20460==    at 0x8048407: main (oob.c:13)
  ==20460==  Address 0x4023039 is 13 bytes after a block of size 4 alloc'd
  ==20460==    at 0x40053C0: malloc (vg_replace_malloc.c:149)
  ==20460==    by 0x80483D0: main (oob.c:6)
  after: 256
  ==20460== 
  ==20460== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 13 from 1)
  ==20460== malloc/free: in use at exit: 8 bytes in 2 blocks.
  ==20460== malloc/free: 2 allocs, 0 frees, 8 bytes allocated.
  ==20460== For counts of detected errors, rerun with: -v
  ==20460== searching for pointers to 2 not-freed blocks.
  ==20460== checked 46,956 bytes.

Notice that valgrind recognizes the invalid write at line 13 of our program, and it further recognizes that this memory was allocated at line 6. In your shell, correct all instances of invalid writes that valgrind detects.

Accessing an Unitialized Variable

Another common cause of subtle errors is using a variable before it has been given any value. Consider the following C program:
  #include <stdlib.h>
  #include <stdio.h>

  int main()
  {
      int a;
      int b = a;
      printf("%d\n", b);
      return 0;
  }

We compile and run this program, with the following result:

  [scschnei@cid ~]$ gcc -ggdb -Wall -o init init.c
  [scschnei@cid ~]$ ./init
  2363380

Our program's result, 2363380, depends entirely on whatever was at a's location in memory before our program executed. It very well could have been any other value. On our systems it is likely to be that value again because the same program on the same system will often exhibit the same behavior - but not always. Using unitialized values can lead to subtle errors not only because an unknown value can enter your program, but also because that value can change over subsequent executions. Luckily, valgrind can catch these errors:

  [scschnei@edgar ~]$ valgrind ./init
  ==24674== Memcheck, a memory error detector.
  ==24674== Copyright (C) 2002-2006, and GNU GPL'd, by Julian Seward et al.
  ==24674== Using LibVEX rev 1658, a library for dynamic binary translation.
  ==24674== Copyright (C) 2004-2006, and GNU GPL'd, by OpenWorks LLP.
  ==24674== Using valgrind-3.2.1, a dynamic binary instrumentation framework.
  ==24674== Copyright (C) 2000-2006, and GNU GPL'd, by Julian Seward et al.
  ==24674== For more details, rerun with: -v
  ==24674== 
  ==24674== Use of uninitialised value of size 4
  ==24674==    at 0x13BBFB: _itoa_word (in /lib/libc-2.5.so)
  ==24674==    by 0x13F390: vfprintf (in /lib/libc-2.5.so)
  ==24674==    by 0x146E42: printf (in /lib/libc-2.5.so)
  ==24674==    by 0x80483AD: main (init.c:8)
  ==24674== 
  ==24674== Conditional jump or move depends on uninitialised value(s)
  ==24674==    at 0x13BC03: _itoa_word (in /lib/libc-2.5.so)
  ==24674==    by 0x13F390: vfprintf (in /lib/libc-2.5.so)
  ==24674==    by 0x146E42: printf (in /lib/libc-2.5.so)
  ==24674==    by 0x80483AD: main (init.c:8)
  ==24674== 
  ==24674== Conditional jump or move depends on uninitialised value(s)
  ==24674==    at 0x13D760: vfprintf (in /lib/libc-2.5.so)
  ==24674==    by 0x146E42: printf (in /lib/libc-2.5.so)
  ==24674==    by 0x80483AD: main (init.c:8)
  ==24674== 
  ==24674== Conditional jump or move depends on uninitialised value(s)
  ==24674==    at 0x13F9D4: vfprintf (in /lib/libc-2.5.so)
  ==24674==    by 0x146E42: printf (in /lib/libc-2.5.so)
  ==24674==    by 0x80483AD: main (init.c:8)
  ==24674== 
  ==24674== Conditional jump or move depends on uninitialised value(s)
  ==24674==    at 0x13D80B: vfprintf (in /lib/libc-2.5.so)
  ==24674==    by 0x146E42: printf (in /lib/libc-2.5.so)
  ==24674==    by 0x80483AD: main (init.c:8)
  2363380
  ==24674== 
  ==24674== ERROR SUMMARY: 17 errors from 5 contexts (suppressed: 13 from 1)
  ==24674== malloc/free: in use at exit: 0 bytes in 0 blocks.
  ==24674== malloc/free: 0 allocs, 0 frees, 0 bytes allocated.
  ==24674== For counts of detected errors, rerun with: -v
  ==24674== All heap blocks were freed -- no leaks are possible.

Notice that there are many errors listed by valgrind, but they are all from line 8 in our program. Line 8 is the call to printf. With the knowledge that this line is using an unitialized value, it is our job to determine which value that is.