CS 2204: Common Mistakes for Homework 5

What to turn in: A legible paper copy giving your answers

A common medical condition is the so-called "Megaloheimer's syndrome". People with Megaloheimer's tend to forget everything about everybody except themselves, and often remember good things about themselves that never actually happened to them but happened to other people. The intellectual ability of these patients is usually below average, although in the early stages of the disease they can pull the wool over everybody's eyes and effectively conceal this fact. Often, with the help of their influential and wealthy friends, people with this condition attain positions of high power in society.

Biomedical research reveals that the DNA of people with Megaloheimer's syndrome is likely to contain a relatively high proportion (more than 0.25% of the total DNA content, counted "by letters") of the following sub-sequence: ACCTT, spaced at regular intervals along the patient's DNA. In this homework you will write a series of short Perl programs and try to answer the following question: is the patient XX likely to develop Megaloheimer's? The patient's DNA fragment is here: Patient XX. DNA sequence from locus 17NI5 promoter region. Admitted for diagnostic tests April 1 2006.

To approach the problem, write two separate Perl programs to read the DNA sequence above and answer the questions below. Write your answer next to each question and provide your Perl code (on a separate sheet, if necessary). Note that some of the ACCTT strings may be split between two lines in the DNA file you have just downloaded.

SOLUTION. Some preliminary thoughts. OK, first, remove the first (title) line from the DNA sequence file. It is easiest to do this by hand. Next, you want to stitch the lines together to form one large string. Recall how we did it with gawk in lecture 5 (or 6). Do the same in Perl, or just use gawk, that's OK. Now all that remains is to match the ACCTT pattern in this long string. You can try =~ /ACCTT/ , but here is the rub: it will match just the first occurrence of ACCTT, not all of them. Look up the Perl section of your textbook for a solution: =~ /ACCTT/g , where g means a global search for every occurrence in the given string. All you have to do is put the above (well, using the proper syntax) in a while loop with a counter, and it will count the number of occurrences of ACCTT. Below is a full code along these lines. The code below also automatically chops off the first line. Note the use of the pos() function in Perl to identify the sub-string's position in a string.
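The steps just described can be sketched as follows. This is a minimal sketch, not necessarily the code handed out in class: the helper names read_sequence and motif_positions are hypothetical, and the input file name is whatever you pass on the command line. It assumes the first line of the file is the title and that everything after it is DNA.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the 0-based start index of every occurrence of $motif in $dna.
# (motif_positions is a hypothetical helper name.)
sub motif_positions {
    my ($dna, $motif) = @_;
    my @positions;
    while ($dna =~ /\Q$motif\E/g) {
        # pos() gives the offset just past the match,
        # so subtract the motif length to get where it starts.
        push @positions, pos($dna) - length($motif);
    }
    return @positions;
}

# Read a sequence file, drop the title line, and stitch the
# remaining lines into one long string.
# (read_sequence is a hypothetical helper name.)
sub read_sequence {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Cannot open $path: $!";
    my $title = <$fh>;          # first line is the title, not DNA
    my $dna = '';
    while (my $line = <$fh>) {
        $line =~ s/\s+//g;      # strip the newline and any stray whitespace
        $dna .= $line;
    }
    close($fh);
    return $dna;
}

# Run the report only when a file name is given on the command line.
if (@ARGV) {
    my $dna = read_sequence($ARGV[0]);
    my @positions = motif_positions($dna, 'ACCTT');
    print "Occurrences: ", scalar(@positions), "\n";
    print "Positions: @positions\n";
}
```

Because the lines are joined before matching, an ACCTT that wraps from one line to the next is counted like any other, and the reported positions are indexes into the DNA sequence itself rather than line numbers or positions within a line.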

  1. (4 points) Read in the patient's DNA sequence and count the number of occurrences of the sub-sequence ACCTT in it. There is one occurrence of ACCTT which wraps from one line to the next. If you only checked for occurrences within each single line, you did not catch this one.


  2. (4 points) Read in the patient's DNA sequence and print the indexes (positions) of every occurrence of the sub-sequence ACCTT in it. For example, the sequence:
          ACCTTGCGATGAACCTTGATGCCG
    contains the sub-sequence ACCTT two times, at indexes (positions) 0 and 12. The first line of the file does not contain DNA sequence, so including its characters in the position count is incorrect. Also, we did not want the line number and the position within that line. That is not the position within the DNA sequence: once the newlines are taken out, that indexing is no longer valid.


  3. (4 points) So, is this patient likely to end up in a high place in society? Present your arguments based on your findings in 1) and 2) and the description of the disease given above. The correct way to calculate the percentage is (4 occurrences * 5 bases per occurrence) / 3371 total bases * 100 = 0.59%. This is greater than 0.25%, so if you said the percentage was less than 0.25% you lost credit. Also, if you said that because it was 0.59% the answer was "Yes, the patient is likely to have the disease and therefore will be placed high in society", you lost credit. There is a second requirement that must be fulfilled: the occurrences must be equidistant from each other.
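
The two checks in question 3 can be sketched as below. This is only a sketch under stated assumptions: the helper names motif_percentage and evenly_spaced are hypothetical, "equidistant" is read as "all gaps between consecutive positions are exactly equal", and a list with fewer than three positions is treated as trivially evenly spaced. The numbers 4, 5 and 3371 are the ones quoted in the solution above; the positions themselves would come from your answer to question 2.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Percentage of the sequence, counted by letters, covered by the motif.
# (motif_percentage is a hypothetical helper name.)
sub motif_percentage {
    my ($count, $motif_length, $total_bases) = @_;
    return 100 * $count * $motif_length / $total_bases;
}

# Return 1 when every gap between consecutive positions is the same,
# 0 otherwise. Assumption: fewer than three positions means there is
# at most one gap, so there is nothing to compare and we return 1.
# (evenly_spaced is a hypothetical helper name.)
sub evenly_spaced {
    my @positions = @_;
    return 1 if @positions < 3;
    my $gap = $positions[1] - $positions[0];
    for my $i (2 .. $#positions) {
        return 0 if $positions[$i] - $positions[$i - 1] != $gap;
    }
    return 1;
}

# Figures from the solution: 4 occurrences, 5 bases each, 3371 bases total.
printf "%.2f%%\n", motif_percentage(4, 5, 3371);   # prints 0.59%
```

To finish the argument, pass the list of positions from question 2 to evenly_spaced; the patient meets the description only if the percentage exceeds 0.25 and the spacing check also succeeds.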