Project 2007-10-03 23:59:59 GMT-04:00.
Frequently, when doing systems research, you have to make minor modifications to an existing piece of systems software, such as a kernel. In this project, you will add a simple per-process page fault tracing to Linux. This facility will allow a process to record a trace of its page faults. This trace will be accessible through the /proc file system.
Examing a page fault trace can be useful for a number of purposes: for instance, it can create an access sequence that can help optimize page replacement algorithms or cache prefetch algorithms.
In the course of this project, you will need to read some background about the Linux kernel. A recommended book is Understanding the Linux Kernel by Daniel P. Bovet and Marco Cesati, which is available as an electronic resource (use LibX in Firefox for off-campus proxy access.) In addition, I will point out which parts of the Linux kernel you need to examine.
/* in some directory of your own choosing, let's say ~/cs5204proj1 */ cd ~/cs5204proj1 mkdir linux cd linux lndir /home/courses/cs5204/uml/linux-2.6.22.5 . make defconfig ARCH=um make ARCH=um cd .. mkdir user linux/linux ubda=cow,/home/courses/cs5204/uml/FedoraCore5-x86-root_fsThen, UML should boot. Type "root" at the login prompt, no password. Once you have a command prompt inside the VM, type:
mkdir /host mount none /host -t hostfs -o <insert path to your homedir here>/cs5204proj1/userNow the user directory should appear in the guest at /host. You can place statically linked binaries into this directory and execute them on the host. To link a binary statically, use the -static option to gcc. Use halt to shut down the VM.
The lndir command created a link to my copy of the kernel. To make changes to individual files, you must create a copy. For instance, to start making changes to fs/proc/base.c, create a copy of it. The shell script ~cs5204/bin/localCopy will help. After you made a change, type make ARCH=um to recompile your kernel.
Should you experience a kernel crash that leads to file system corruption, delete the file 'cow' and start over.
perl -e 'printf "cat /proc/%d/maps", getppid()' | shto see the current shell's set of virtual memory regions.
You should maintain a trace buffer that you allocate on demand. You could allocate that buffer with __get_free_pages(GFP_KERNEL, order). Start with order = 1, which creates a buffer of 4KB. If there is not enough room in the buffer, increase it, up to a maximum (such as order = 4, which is 64 KB.) At the beginning of the buffer you allocate, store some meta information. You should probably store the following pieces of meta information:
Caveat: Linux will blindly clone the PCB, creating an alias for anything in it when creating a new thread or process from an existing one. To avoid cloning this pointer, zero it out in dup_task_struct in kernel/fork.c.
(For extra credit, come up with a solution that can trace multi-threaded processes.)
The trace is accessed by a reading process in a producer/consumer fashion. A read of an event removes it from the trace. The occurrence of an event results in the addition of the event to the trace, and a notification of any readers/consumers waiting for it. (Hint: use wake_up() for this purpose.)
If a process exits, readers must be notified of that fact as well. Consider changing do_exit() in exit.c for that purpose. If a reader was waiting for trace events from a process and wakes up to find the process has exited, it should return 0 from the read(), signaling that EOF has occurred.
The behavior is undefined if there are multiple readers.
Implementing pgfault_history. Look at fs/proc/base.c to see how entries are added to the /proc/pid filesystem. You will probably want to add a REG() style entry and implement a read and write function. You can model your write() function after oom_adjust_write(). Your read() function should copy as many event records from the trace buffer as fit into the user-provided buffer. If the target process isn't tracing page faults, it should return -ESRCH. If the target process's trace buffer is empty and the process has not exited, the reader should wait until one of these two conditions is no longer true. If the process has exited and there are no items in the trace buffer, return 0 for end of file.
You should consider using the wait_event_interruptible() macro for this purpose. You should also make sure a process can be killed by checking for pending signals afterwards. Make sure that access to the shared buffer is protected by a lock. Make sure that the buffer is deallocated after a child has been reaped - consider placing code in release_task() in exit.c for that purpose.
Your goal is to be able to reconstruct from the trace the address space layout of a process. To accomplish this, you can record the initial state of the address space in the trace, then add "change" records to reflect changes. I recommend that you record the identity of a VMA in the trace so you can use this identity when recording change record. A possible identity is given by the address of the allocated vm_area_struct.
To obtain the initial layout, iterate over all VMAs using the current task's mm_struct (see get_task_mm). The mm_struct describes the address space in which a thread executes, which include its VMA regions. Make sure you hold the mmap_sem in read mode while iterating.
To monitor changes, you need to change mm/mmap.c. Note that the field mmap_count in mm_struct reflects the number of elements in the R/B tree as well as the number of elements in the list of VMAs. Therefore, you can report the addition of a new VMA where mmap_count is incremented, and you can report the removal of an VMA where mmap_count is decremented. The case in which an VMA is resized can be handled by observing where vm_start and/or vm_end are being changed (2 locations inside vma_adjust.)
The tool will work as described by the following pseudo-code.
/* tracepgfault.c */ int main(int ac, char *av[]) { if (child = fork()) { /* parent - read the trace file */ open /proc/*childpid*/pgfault_history read until eof, write data to file with name av[1] wait for child via wait4 or wait() } else { /* child */ write a '1' to /proc/*getpid()*/pgfault_history execvp(av[2], av+2); } }In this case, you'd run the program with "tracepgfault outfile cmd arg0 arg1...." The page fault trace of executing "cmd arg0 arg1...." would be written to file 'outfile'. Read the man pages for fork, execvp, wait, as necessary.
Your tools could produce output such as this, which shows the execution of the /bin/ps command.
Region: bfb99000 - bfbaf000 ( 22 pages): |....XX............XXXX| Region: 08048000 - 0805a000 ( 18 pages): |XXXXXXXXXXXXXXXX..| Region: 0805a000 - 0805b000 ( 1 pages): |X| Region: 0805b000 - 0809c000 ( 65 pages): |............X................X.XX..X.............................| Region: 40000000 - 40019000 ( 25 pages): |XXXXXXXXXXXXXXXXX.XXXXX..| Region: 40019000 - 4001a000 ( 1 pages): |X| Region: 4001a000 - 4001b000 ( 1 pages): |X| Region: 4001b000 - 4001c000 ( 1 pages): |X| Region: 4001e000 - 4002b000 ( 13 pages): |XXXXXXXXXXXX.| Region: 4002b000 - 4002c000 ( 1 pages): |X| Region: 4002c000 - 40040000 ( 20 pages): |.XX...............XX| Region: 40040000 - 40042000 ( 2 pages): |XX| Region: 40042000 - 40043000 ( 1 pages): |X| Region: 40043000 - 40044000 ( 1 pages): |X| Region: 40044000 - 40171000 ( 301 pages): |XXXXXXXXXXXXXXXXXXXXXX........XXXX.....XXXXXX...........XXXXXX.........XXXXXX....X.XXXX.....XXXXXXXXXXXX.XXX...............X..........X...XXX...........................................XXXX...XXX...X..X...........X......X...............................X..X.....XX............XXX..XXXXX.................| Region: 40171000 - 40173000 ( 2 pages): |XX| Region: 40173000 - 40174000 ( 1 pages): |X| Region: 40174000 - 40199000 ( 37 pages): |XXXXX................................| Region: 40199000 - 4019a000 ( 1 pages): |.|which is part of this example produced by my sample solution.
Don't hesitate to ask questions in the forum.
Good Luck!