CS 5984: Programming Assignment 6

CS 5984: Computation for the Life Sciences

Fall Semester, 2002
Programming Assignment 6

The point value (if any) of each step is given in square brackets [ ]. Like the previous programming assignment, this one requires you to mix Perl and database queries to solve a problem. The solutions (details below) should be turned in by email to naren@cs.vt.edu or heath@cs.vt.edu, by 5pm on Dec 17, 2002.

The goal of this assignment is three fold: (i) to populate a database using data that is given in files, (ii) to gain experience in writing queries, and (iii) getting the answers to queries out of the database and printing them. Slightly more complexity is added as compared to the previous assignment.

[5] You are given 384 clones (genes); information is available about their raw expression level from (only) one microarray experiment and also about their membership in functional categories. expr.txt is a file listing information in (clone, expression level) form. In typical microarray experiments, clones are replicated several times to improve confidence in data; likewise expr.txt shows that each clone has many replicates for the given experiment. For instance, clone 1 has seven replicates, clone 2 has four replicates, clone 3 has eight replicates, and so on.

members.txt is the same file as in the previous experiment (i.e., assigns clones to functional categories). As usual, this file is delimited with a "|" character; recall that one clone can be a member of many categories. For instance, clone 1 is a member of both "Lignin biosynthesis" and the "Phenylpropanoid pathway".

Familiarize yourself with this data; logon to the postgres system, and create two tables, one for storing the information in each of the above two files. You need to decide the names of the tables, the names of the columns, and choose the data types appropriately. Make sure the PRIMARY KEY is defined suitably for each table.
[5] If you have successfully created the tables, they will be empty at first. Do a "\d" and convince yourself that the tables are indeed there! Now we need to load the data from the files into the tables. To do this, you can write a Perl program that takes the data from each input file and prepares "INSERT INTO ..." commands that are written onto a new file. Then, you can load this file (and hence the commands in it) into the database system using the "psql" command line options (e.g., the "-f" option).

To gain full points for this step, you should write a correct Perl program, execute it, load the prepared file into the database, and go into the database and verify that the data is indeed there! Notice that depending on how you defined the table columns, the Perl program must suitably prepare the INSERT commands. If your Perl program is behaving funnily, notice that you may have to go in and delete any bad data and try again!

It is hence prudent to first try out "INSERT INTO" commands manually in the database, convince yourself that they work, and only then prepare the Perl program that spits out these "INSERT INTO" commands.
[10] At this stage, you have successfully gotten the data from the files into the database. Login to the database and write a database view to find the expression values (including replicates) of all clones in the "Carbon Metabolism" category.

Query this view and convince yourself that it is producing the desired results.
[5] Get these expression values out of the database into some file (suitably name it yourself). We have seen many psql command line options in class for doing this. So at the end of this step, you should have a file that contains several expression levels. We are back in file territory.
[25] Write a Perl program to read this last file, and find the median expression level, and print it out.

The "median" of a list of numbers is computed as follows. First arrange the numbers in ascending order. Pick the "middle" number. That is the median!

For instance, if we are given student ages such as:

[16, 23, 23, 22, 24, 25, 21, 23, 24, 29, 35]

the median student age is: 23. The median is a better measure of central tendency than, say, the average, because it is less sensitive to outliers. Thus, it is not swayed by the extreme values of "35" and "16" in the above list.

If the list has an even number of entries, such as:

[16, 23, 24, 22]

then the median is the average of the two middle entries (after sorting). So, the median here is 22.5.

The task here is to find the median expression level among the clones and their replicates in the Carbon Metabolism category.
[50] Repeat the median calculation for all categories. Thus, we are now interested in the median level of gene expression for every category present in the members.txt file. You will have to plan this step out and see how you can complete it without laboriously redoing the above steps, once for each category. If you have successfully completed this step, you must have a file listing all categories and their median expression level alongside.

How to turnin the assignment

Collect all the work you have done (data files, files containing database definition and data insertion commands, output files) in a directory under your home directory on arabidopsis. Write a README file (plain text, no Microsoft Word please) describing how you did each step, what assumptions you made, what troubles you ran into, how you solved the problems, and the names of the files/tables/views that are created in the process. Mention also the output file that contains the median expression level for all categories. Email this file to naren@cs.vt.edu or heath@cs.vt.edu so that we can use it as a guide to go into your directory and check these files and commands out. We will also login to the database and verify that your tables and views are working as they should.

Useful Information

How to Get Started (Logging in, passwords etc.)
An interactive session with Postgres, covered in class on Dec 4, 2002.

Please report any problems found in these pages by email to: Naren Ramakrishnan or Lenwood S. Heath.