CS 5984: Computation for the Life Sciences
Fall Semester, 2002
Programming Assignment 6
The point value (if any) of each step is given in square brackets [ ].
Like the previous programming assignment, this one
requires you to mix Perl
and database queries to solve a problem.
The solutions (details below)
should be turned in by email to naren@cs.vt.edu or heath@cs.vt.edu, by 5pm on Dec 17, 2002.
The goal of this assignment is three fold: (i) to populate a database
using data that is given in files, (ii) to gain experience in writing queries,
and (iii) getting the answers to queries out of the database and printing them.
Slightly more complexity is added as compared to the previous assignment.
- [5] You are given 384 clones (genes); information is available about
their raw expression level from (only) one microarray experiment and also about
their membership in functional categories. expr.txt
is a file listing information in (clone, expression level) form.
In typical microarray experiments, clones are replicated several times to
improve confidence in data; likewise expr.txt shows that each clone has
many replicates for the given experiment. For instance, clone 1 has
seven replicates, clone 2 has four replicates, clone 3 has eight replicates,
and so on.
members.txt is the same file as in the previous experiment (i.e.,
assigns clones to functional categories).
As usual, this file is delimited with a "|" character; recall
that one clone can be a member of many categories. For instance, clone 1
is a member of both "Lignin biosynthesis" and the "Phenylpropanoid pathway".
Familiarize yourself with this data; logon to the postgres system, and
create two tables, one for storing the information in each of the above
two files. You need to decide the names of the tables, the names of
the columns, and choose the data types appropriately. Make sure the PRIMARY
KEY is defined suitably for each table.
- [5] If you have successfully created the tables, they will be empty
at first. Do a "\d" and convince yourself that the tables are indeed
there! Now we need to load the data from the files into the tables. To
do this, you can write a Perl program that takes the data from each input
file and prepares "INSERT INTO ..." commands that are written onto
a new file. Then, you can load this file (and hence the commands in it)
into the database system using the
"psql" command line options (e.g., the "-f" option).
To gain full points for this step, you should write a correct Perl program,
execute it, load the prepared file into the database,
and go into the database and verify that the data is indeed
there! Notice that depending on how you defined the table columns, the
Perl program must suitably prepare the INSERT commands. If your Perl
program is behaving funnily, notice that you may have to go in and delete
any bad data and try again!
It is hence prudent to first try out "INSERT INTO" commands manually in
the database, convince yourself that they work, and only then prepare
the Perl program that spits out these "INSERT INTO" commands.
- [10] At this stage, you have successfully gotten the data from the files
into the database. Login to the database and write a database view to
find the expression values (including replicates) of all clones in the
"Carbon Metabolism" category.
Query this view and convince yourself that it is producing the
desired results.
- [5] Get these expression values out of the database into
some file (suitably name
it yourself). We have seen
many psql command line options in class for doing this. So at the end of
this step, you should have a file that contains several expression levels.
We are back in file territory.
- [25] Write a Perl program to read this last file, and find
the median expression level, and print it out.
The "median" of a list of numbers is computed as follows. First arrange
the numbers in ascending order. Pick the "middle" number. That is the median!
For instance, if we are given student ages such as:
[16, 23, 23, 22, 24, 25, 21, 23, 24, 29, 35]
the median student age is: 23. The median is a better measure of
central tendency than, say, the average, because it is less sensitive
to outliers. Thus, it is not swayed by the extreme values of "35" and
"16" in the above list.
If the list has an even number of entries, such as:
[16, 23, 24, 22]
then the median is the average of the two middle entries (after sorting).
So, the median here is 22.5.
The task here is to find the median expression level among the
clones and their replicates in the Carbon Metabolism category.
- [50] Repeat the median calculation for all categories.
Thus, we are now interested in the median level of gene expression
for every category
present in the members.txt file. You will have to plan this step out
and see how you can complete it without laboriously redoing the
above steps, once for each category. If you have successfully
completed this step, you must have a file listing all categories and
their median expression level alongside.
How to turnin the assignment
- Collect all the work you have done (data files, files containing
database definition and data insertion
commands, output files) in
a directory under your home directory on arabidopsis. Write a README file
(plain text, no Microsoft Word please)
describing how you did each step, what assumptions you made, what troubles
you ran into, how you solved the problems, and the names of
the files/tables/views that are
created in the process. Mention also the output
file that contains the median expression level for all categories.
Email this file to naren@cs.vt.edu or heath@cs.vt.edu
so that we can use it as a guide to go into your directory
and check these files and commands out. We will also login to the database
and verify that your tables and views are working as they should.
Useful Information
Please report any problems found in these pages by email to:
Naren Ramakrishnan
or Lenwood S. Heath.