CS 5984: Programming Assignment 3
CS 5984: Computation for the Life Sciences

Fall Semester, 2002
Programming Assignment 3

The point value of each script is given in []. Within each script, include a header comment block that looks something like this:

#!/usr/bin/perl -w
#
# Perl script: name_of_script
#
# Perl guru: your_name
#
# Description_of_script
#
# Date: date_created
#

Email each Perl script in a separate message to heath@vt.edu, with the name of the script as the subject line, by 5:00 PM on September 27, 2002. No late submissions will be accepted.


  1. [30] DNA sequences are typically communicated using FASTA format. For nucleic acid sequences, a number of codes are supported in FASTA format, including N for unknown and - for gap. However, a DNA sequence in strict FASTA format will include only A, C, G, and T codes. Here is an example of a DNA sequence in strict FASTA format taken from GenBank. Write and submit a Perl script named FASTA_check that works as follows.

    1. Requires one command line argument interpreted as the name of a FASTA file.

    2. Opens the FASTA file for reading, if possible; in case of error, dies with an appropriate error message.

    3. Examines the entire file to verify that it meets the syntax for a DNA sequence in strict FASTA format.

    4. If the file has correct syntax, then print
          File <filename> is in correct FASTA format.
      
      to standard output. Otherwise, print an error message to standard output that describes the syntax error. If there are multiple syntax errors, finding and reporting any one error is sufficient.

    The example FASTA file AY133756_SOD.fasta can be checked with this command:

        FASTA_check AY133756_SOD.fasta
    
    with this result:
        File AY133756_SOD.fasta is in correct FASTA format.
    
    You will want to generate your own test files that are not syntactically correct in one way or another. If you email up to three of your test files to heath@vt.edu, then I will post them here for everybody to use.
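
As a rough illustration of the checking step, here is a minimal sketch, not a reference solution. The helper names fasta_error and check_file are hypothetical, and the sketch assumes that "strict" means one header line beginning with ">" followed by lines containing only uppercase A, C, G, and T; the assignment does not spell this out. Command line handling is left out so that the sketch runs on its own.

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical helper: returns undef if the lines look like strict FASTA,
# otherwise a string describing the first problem found.
# Assumption (not stated in the assignment): one header line starting
# with ">", then lines holding only uppercase A, C, G, T.
# Empty lines after the header are tolerated.
sub fasta_error {
    my @lines = @_;
    return "file is empty" unless @lines;

    my $header = shift @lines;
    chomp $header;
    return "line 1: header does not start with '>'" unless $header =~ /^>/;

    my $line_no  = 1;
    my $residues = 0;
    foreach my $line (@lines) {
        $line_no++;
        chomp $line;
        if ($line =~ /([^ACGT])/) {
            return "line $line_no: unexpected character '$1'";
        }
        $residues += length $line;
    }
    return "no sequence data after the header" unless $residues;
    return undef;
}

# Hypothetical helper: opens the named file, dying on failure,
# and returns the result of fasta_error on its contents.
sub check_file {
    my ($filename) = @_;
    open(my $fh, '<', $filename) or die "Cannot open $filename: $!\n";
    my @lines = <$fh>;
    close $fh;
    return fasta_error(@lines);
}

# Small in-memory demonstration with made-up data.
my $problem = fasta_error(">demo\n", "ACNT\n");
print defined $problem ? "Error: $problem\n" : "Looks fine.\n";
```

A full script would call check_file on $ARGV[0] and print the message required in step 4 when the result is undefined.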


  2. [30] Write and submit a Perl script named GC_content with the following behavior.

    1. Requires one command line argument interpreted as the name of a FASTA file.

    2. Counts the total number of nucleotides in the file, as well as the number of G's and C's.

    3. Computes the percentage of GC content in the DNA sequence and reports it to standard output as follows:
          The DNA sequence in file <filename> has GC content <percentage>%.
      

    For example, the test file test_GC.fasta can be examined with this command:

        GC_content test_GC.fasta
    
    with this result:
        The DNA sequence in file test_GC.fasta has GC content 50%.
    
    You may use either substr as in Example 5.6 or regular expressions to "find" the G's and C's.
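
The counting step might look something like the following sketch, which takes the regular expression route. The helper name gc_percent, the sample data, and the file name sample.fasta are all hypothetical. The sketch assumes that lines beginning with ">" are headers, that every other character is a nucleotide, and that an empty sequence is reported as 0; the assignment does not say how to round, so the value is printed as Perl formats it.

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical helper: GC percentage of the sequence held in FASTA lines.
# Assumption: lines starting with ">" are headers and are skipped.
sub gc_percent {
    my @lines = @_;
    my ($total, $gc) = (0, 0);
    foreach my $line (@lines) {
        next if $line =~ /^>/;          # skip the header line
        chomp $line;
        $total += length $line;         # every remaining character counts
        $gc++ while $line =~ /[GC]/g;   # one increment per G or C found
    }
    return 0 unless $total;             # assumption: 0 for an empty sequence
    return 100 * $gc / $total;
}

# In-memory demonstration; a real script would read the lines from the
# file named on the command line.
my $filename   = "sample.fasta";        # hypothetical name
my @sample     = (">sample\n", "GGCC\n", "AATT\n");
my $percentage = gc_percent(@sample);
print "The DNA sequence in file $filename has GC content $percentage%.\n";
```

With the sample lines above, four of the eight nucleotides are G or C, so the sketch prints a GC content of 50%.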



Please report any problems found in these pages by email to: Lenwood S. Heath.