CS 2204: Homework #5 solutions

What to turn in: A legible paper copy giving your answers

A common medical condition is the so-called "Megaloheimer's syndrome". People with Megaloheimer's tend to forget everything about everybody else except themselves, and often remember good things about themselves that never actually happened to them but happened to other people. The intellectual ability of these patients is usually below average, although in the early stages of the disease they can pull the wool over everybody's eyes to effectively conceal this fact. Often, with the help of their influential and wealthy friends, people with this condition attain high power positions in the society.

Biomedical research reveals that the DNA of people with Megaloheimer's syndrome is likely to contain a relatively high proportion (more than 0.25% of the total DNA content, counted "by letters") of the following sub-sequence: ACCTT, spaced at regular intervals along the patient's DNA. In this homework you will write a series of short PERL programs and try to answer the following question: is the patient XX likely to develop Megaloheimer's? The patient's DNA fragment is here: Patient XX. DNA sequence from locus 17NI5 promoter region. Admitted for diagnostic tests April 1 2006.

To approach the problem, write two separate PERL programs to read the DNA sequence above and answer the questions below. Write your answer next to each question and provide your PERL code (on a separate sheet, if necessary). Note that it is possible that some of the ACCTT strings may be split between two lines in the DNA file you have just downloaded.

SOLUTION. Some preliminary thoughts. Ok, first, remove the first (title) line from the DNA sequence file. Easiest to do it by hand. Next, you want to stitch up the lines to form one large string. Recall how we did it with gawk in lecture 5 (or 6). Do the same in PERL, or just use gawk, that's Ok. Now, all that remains is to match the ACCTT pattern in this long string. You can try =~ /ACCTT/ , but here is the rub: it will match just the first occurrence of ACCTT, not all of them. Look up the PERL section of your text book for a solution: =~ /ACCTT/g where g means global search for every occurrence in the given string. All you have to do is to put the above (well, using the proper syntax) in a while loop with a counter, and it will count the number of occurrences of ACCTT. Below is a full code along these lines. The code below also automatically chops off the first line. Note the use of the pos() function in PERL to identify the sub-string's position in a string.
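
To see why the stitching step matters, here is a minimal sketch. The three-line fragment is made up for illustration (it is not the patient's data), and count_motif is a hypothetical helper name. One ACCTT sits entirely on the first line and another is split across the second and third lines, so counting line by line misses it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up fragment (an assumption, not the patient's DNA):
# the second ACCTT is split across lines 2 and 3.
my @lines = ("GGACCTTGA\n", "TTGCAACC\n", "TTGGA\n");

# Count the occurrences of ACCTT in one string using /g in a while loop.
sub count_motif {
    my ($seq) = @_;
    my $count = 0;
    $count++ while $seq =~ /ACCTT/g;
    return $count;
}

# Counting each line separately misses the split occurrence.
my $per_line = 0;
$per_line += count_motif($_) for @lines;

# Stitching the lines into one string first finds both.
my $joined = join '', @lines;
$joined =~ s/\n//g;
my $stitched = count_motif($joined);

print "line by line: $per_line, stitched: $stitched\n";
```

On this fragment the line-by-line count is 1 and the stitched count is 2.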

  1. (4 points) Read in the patient's DNA sequence and count the number of occurrences of the sub-sequence ACCTT in it.

    #!/usr/bin/perl

    open(IN,"XM_043865.fasta") || die "couldn't open file";
    # eat the first line, which doesn't have DNA in it
    <IN>;
    # read in the rest, appending them to $line as we go.
    while (<IN>) {
      $line = $line . $_;
    }
    # delete the newlines in the string
    $line =~ s/\n//g;

    $count = 0;
    # with /g, each search resumes where the previous match ended,
    # so the loop body runs exactly once per occurrence.
    while ($line =~ m/ACCTT/g) {
      $count++;
    }

    # and now print the final count.
    print "Got $count occurrences of ACCTT\n";


  2. (4 points) Read in the patient's DNA sequence and print the indexes (positions) for every occurrence of the sub-sequence ACCTT in it. For example, the sequence:
          ACCTTGCGATGAACCTTGATGCCG
    contains the sub-sequence ACCTT two times at indexes (positions) 0 and 12.

    #!/usr/bin/perl

    open(IN,"XM_043865.fasta") || die "couldn't open file";
    # eat the first line, which doesn't have DNA in it
    <IN>;
    # read in the rest, appending them to $line as we go.
    while (<IN>) {
      $line = $line . $_;
    }
    # delete the newlines in the string
    $line =~ s/\n//g;
    print "Got " . length($line) . " characters of DNA.\n";

    # Now go find all the instances.
    # (with /g, each search resumes where the previous match ended)
    $count = 0;
    while ($line =~ m/ACCTT/g) {
      # the 'pos' function gives the offset just past the end of the
      # last match, so the match itself starts 5 characters earlier
      print "Found at position " . (pos($line) - 5) . "\n";
      $count++;
    }

    # and now print the final count.
    print "Got $count occurrences of ACCTT\n";

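As a quick sanity check on the indexing, the sketch below runs the same kind of loop on the short example sequence from the question. It reads the start of each match from PERL's built-in @- array ($-[0] is the start offset of the last successful match) instead of subtracting the motif length from pos(). The helper name motif_positions is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the 0-based start index of every match of $motif in $seq.
sub motif_positions {
    my ($seq, $motif) = @_;
    my @starts;
    while ($seq =~ /\Q$motif\E/g) {
        push @starts, $-[0];   # start offset of the match just found
    }
    return @starts;
}

my @found = motif_positions("ACCTTGCGATGAACCTTGATGCCG", "ACCTT");
print "Found at positions: @found\n";
```

This prints positions 0 and 12, in agreement with the example given in the question.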

  3. (4 points) So, is this patient likely to end up at a high place in the society? Present your arguments based on your findings in 1) and 2) and the description of the disease given above.

    Output:

    Got 3371 characters of DNA.
    Found at position 1101
    Found at position 1230
    Found at position 1747
    Found at position 3068
    Got 4 occurrences of ACCTT

    That means we've got 4*5=20 characters of ACCTT, out of 3371. 20/3371 = 0.00593, which is about 0.59%, more than the 0.25% required. So on that criterion alone it looks like a "YES". Well, note that for a "YES" one also needs to have ACCTT spaced out at regular intervals. But this is not the case: the gaps between consecutive occurrences are 129, 517 and 1321 letters, so the first two instances are about 100 letters apart while the last two are more than a thousand letters apart. So, the answer is actually "NO".
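
The arithmetic above can be reproduced with a few lines of PERL. This is only a sketch: the positions and the sequence length are typed in by hand from the output shown above, and the variable names are illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Values copied from the program output above.
my @positions  = (1101, 1230, 1747, 3068);
my $dna_length = 3371;
my $motif_len  = 5;    # length of ACCTT

# Share of the sequence, counted by letters, taken up by ACCTT.
my $percent = 100 * scalar(@positions) * $motif_len / $dna_length;
printf "ACCTT letters: %.2f%% of the sequence\n", $percent;

# Distances between consecutive occurrences; regular spacing would
# show up as (nearly) equal gaps.
my @gaps = map { $positions[$_] - $positions[$_ - 1] } 1 .. $#positions;
print "Gaps between consecutive occurrences: @gaps\n";
```

It reports 0.59% and gaps of 129, 517 and 1321, which is the basis for the "NO": the proportion threshold is met, but the spacing is far from regular.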