Project 1

Due date: Oct 2 Oct 3, 11:59pm

Project 2007-10-03 23:59:59 GMT-04:00.

Frequently, when doing systems research, you have to make minor modifications to an existing piece of systems software, such as a kernel. In this project, you will add a simple per-process page fault tracing to Linux. This facility will allow a process to record a trace of its page faults. This trace will be accessible through the /proc file system.

Examing a page fault trace can be useful for a number of purposes: for instance, it can create an access sequence that can help optimize page replacement algorithms or cache prefetch algorithms.

In the course of this project, you will need to read some background about the Linux kernel. A recommended book is Understanding the Linux Kernel by Daniel P. Bovet and Marco Cesati, which is available as an electronic resource (use LibX in Firefox for off-campus proxy access.) In addition, I will point out which parts of the Linux kernel you need to examine.

Setting up User Mode Linux

We will be using User Mode Linux for our experiments. Please read the HOWTO for details on how to set it up. Put briefly, user mode linux provides a virtual machine in which to run the kernel. The following steps should get you started on the lab machines - if you work elsewhere, you'll have to adapt those steps.
/* in some directory of your own choosing, let's say ~/cs5204proj1 */
cd ~/cs5204proj1
mkdir linux
cd linux
lndir /home/courses/cs5204/uml/linux-2.6.22.5 .
make defconfig ARCH=um
make ARCH=um
cd ..
mkdir user
linux/linux ubda=cow,/home/courses/cs5204/uml/FedoraCore5-x86-root_fs
Then, UML should boot. Type "root" at the login prompt, no password. Once you have a command prompt inside the VM, type:
mkdir /host
mount none /host -t hostfs -o <insert path to your homedir here>/cs5204proj1/user
Now the user directory should appear in the guest at /host. You can place statically linked binaries into this directory and execute them on the host. To link a binary statically, use the -static option to gcc. Use halt to shut down the VM.

The lndir command created a link to my copy of the kernel. To make changes to individual files, you must create a copy. For instance, to start making changes to fs/proc/base.c, create a copy of it. The shell script ~cs5204/bin/localCopy will help. After you made a change, type make ARCH=um to recompile your kernel.

Should you experience a kernel crash that leads to file system corruption, delete the file 'cow' and start over.

Starting Page Fault Tracing

Page fault tracing should start if you write a '1' to /proc/pid/pgfault_history. It should stop if a process exits or if you write a '0' to /proc/pid/pgfault_history.

Retrieving The Page Fault Trace

The page fault trace should be retrieved by reading from /proc/pid/pgfault_history. Reading from that file should return EOF only when the process with pid has been reaped. The page fault trace format is up to you, but you need to provide a user tool to interpret it.

Recording Virtual Memory Regions

In addition to recording page faults, you should record the regions of virtual memory as they are created and destroyed during a process's lifetime. Type
perl -e 'printf "cat /proc/%d/maps", getppid()' | sh
to see the current shell's set of virtual memory regions.

Strategy

One possible strategy is as follows. Introduce a flag that records if a process is tracing its page faults or not. Look for PF_* in include/linux/sched.h and add your own constant, choosing an unused value. Check the flag on the page fault path, for instance, in __handle_mm_fault in mm/memory.c. If it is set, record the current time (via CURRENT_TIME) and the address.

You should maintain a trace buffer that you allocate on demand. You could allocate that buffer with __get_free_pages(GFP_KERNEL, order). Start with order = 1, which creates a buffer of 4KB. If there is not enough room in the buffer, increase it, up to a maximum (such as order = 4, which is 64 KB.) At the beginning of the buffer you allocate, store some meta information. You should probably store the following pieces of meta information:

Store a pointer to that buffer in the PCB, see struct task_struct in linux/sched.h. The PCB in Linux will persist until the process is reaped.

Caveat: Linux will blindly clone the PCB, creating an alias for anything in it when creating a new thread or process from an existing one. To avoid cloning this pointer, zero it out in dup_task_struct in kernel/fork.c.

(For extra credit, come up with a solution that can trace multi-threaded processes.)

The trace is accessed by a reading process in a producer/consumer fashion. A read of an event removes it from the trace. The occurrence of an event results in the addition of the event to the trace, and a notification of any readers/consumers waiting for it. (Hint: use wake_up() for this purpose.)

If a process exits, readers must be notified of that fact as well. Consider changing do_exit() in exit.c for that purpose. If a reader was waiting for trace events from a process and wakes up to find the process has exited, it should return 0 from the read(), signaling that EOF has occurred.

The behavior is undefined if there are multiple readers.

Implementing pgfault_history. Look at fs/proc/base.c to see how entries are added to the /proc/pid filesystem. You will probably want to add a REG() style entry and implement a read and write function. You can model your write() function after oom_adjust_write(). Your read() function should copy as many event records from the trace buffer as fit into the user-provided buffer. If the target process isn't tracing page faults, it should return -ESRCH. If the target process's trace buffer is empty and the process has not exited, the reader should wait until one of these two conditions is no longer true. If the process has exited and there are no items in the trace buffer, return 0 for end of file.

You should consider using the wait_event_interruptible() macro for this purpose. You should also make sure a process can be killed by checking for pending signals afterwards. Make sure that access to the shared buffer is protected by a lock. Make sure that the buffer is deallocated after a child has been reaped - consider placing code in release_task() in exit.c for that purpose.

Tracing Memory Regions

Linux keeps track of a process's address space in a red-black tree whose elements are of type struct vm_area_struct. In addition, all vma's are kept in a singly-linked list. Each VMA has a start and an end. VMAs are non-overlapping. As Linux implements VMA manipulations, VMAs may be split or merged. In the course of doing so, a VMA's start and end may change, and there be may - temporarily two VMAs whose (start, end) areas overlap.

Your goal is to be able to reconstruct from the trace the address space layout of a process. To accomplish this, you can record the initial state of the address space in the trace, then add "change" records to reflect changes. I recommend that you record the identity of a VMA in the trace so you can use this identity when recording change record. A possible identity is given by the address of the allocated vm_area_struct.

To obtain the initial layout, iterate over all VMAs using the current task's mm_struct (see get_task_mm). The mm_struct describes the address space in which a thread executes, which include its VMA regions. Make sure you hold the mmap_sem in read mode while iterating.

To monitor changes, you need to change mm/mmap.c. Note that the field mmap_count in mm_struct reflects the number of elements in the R/B tree as well as the number of elements in the list of VMAs. Therefore, you can report the addition of a new VMA where mmap_count is incremented, and you can report the removal of an VMA where mmap_count is decremented. The case in which an VMA is resized can be handled by observing where vm_start and/or vm_end are being changed (2 locations inside vma_adjust.)

User Tools

Provide a user tool that can run a single process and record its faults. The output of the tool should record which regions were added, which regions were removed, and where page faults occurred. See below for an example output.

The tool will work as described by the following pseudo-code.

/* tracepgfault.c */
int
main(int ac, char *av[]) {
    if (child = fork()) {
       /* parent - read the trace file */
       open /proc/*childpid*/pgfault_history
       read until eof, write data to file with name av[1]
       wait for child via wait4 or wait()
    } else {
       /* child */
       write a '1' to /proc/*getpid()*/pgfault_history
       execvp(av[2], av+2);
    }
}
In this case, you'd run the program with "tracepgfault outfile cmd arg0 arg1...." The page fault trace of executing "cmd arg0 arg1...." would be written to file 'outfile'. Read the man pages for fork, execvp, wait, as necessary.

Example Output

Your tools could produce output such as this, which shows the execution of the /bin/ps command.

Region: bfb99000 - bfbaf000 (  22 pages): |....XX............XXXX|
Region: 08048000 - 0805a000 (  18 pages): |XXXXXXXXXXXXXXXX..|
Region: 0805a000 - 0805b000 (   1 pages): |X|
Region: 0805b000 - 0809c000 (  65 pages): |............X................X.XX..X.............................|
Region: 40000000 - 40019000 (  25 pages): |XXXXXXXXXXXXXXXXX.XXXXX..|
Region: 40019000 - 4001a000 (   1 pages): |X|
Region: 4001a000 - 4001b000 (   1 pages): |X|
Region: 4001b000 - 4001c000 (   1 pages): |X|
Region: 4001e000 - 4002b000 (  13 pages): |XXXXXXXXXXXX.|
Region: 4002b000 - 4002c000 (   1 pages): |X|
Region: 4002c000 - 40040000 (  20 pages): |.XX...............XX|
Region: 40040000 - 40042000 (   2 pages): |XX|
Region: 40042000 - 40043000 (   1 pages): |X|
Region: 40043000 - 40044000 (   1 pages): |X|
Region: 40044000 - 40171000 ( 301 pages): |XXXXXXXXXXXXXXXXXXXXXX........XXXX.....XXXXXX...........XXXXXX.........XXXXXX....X.XXXX.....XXXXXXXXXXXX.XXX...............X..........X...XXX...........................................XXXX...XXX...X..X...........X......X...............................X..X.....XX............XXX..XXXXX.................|
Region: 40171000 - 40173000 (   2 pages): |XX|
Region: 40173000 - 40174000 (   1 pages): |X|
Region: 40174000 - 40199000 (  37 pages): |XXXXX................................|
Region: 40199000 - 4019a000 (   1 pages): |.|
which is part of this example produced by my sample solution.

Don't hesitate to ask questions in the forum.

Good Luck!