What to turn in: A legible paper copy giving your answers
A common medical condition is the
so-called "Megaloheimer's syndrome". People with
Megaloheimer's tend to forget everything about everybody else except
themselves, and often remember good things about themselves that have
actually never happened to them but happened to other people. The
intellectual ability of these patients is usually below average, although
in the early stages of the disease they can pull the wool over everybody's
eyes to effectively conceal this fact. Often, with the help
of their influential and wealthy friends, people with this condition attain positions of high power in society.
Biomedical research reveals that the DNA of people with
Megaloheimer's syndrome is likely to contain a relatively high proportion
(more than 0.25% of the total DNA content, counted "by letters") of the following sub-sequence: ACCTT, spaced at regular
intervals along the patient's DNA. In this homework you will write a series of short PERL programs
and try to answer the following question: is the patient XX
likely to develop Megaloheimer's?
The patient's DNA fragment is here: Patient XX. DNA sequence from locus 17NI5 promoter region. Admitted for diagnostic tests April 1 2006.
To approach the problem, write two separate PERL programs to read the DNA sequence above and answer the questions below. Write your answer next to each question and provide your PERL code (on a separate sheet, if necessary). Note that some of the ACCTT strings may be split between two lines in the DNA file you have just downloaded.
SOLUTION. Some preliminary thoughts. Ok, first, remove the first (title) line from the DNA sequence file. Easiest to do it by hand. Next, you want to stitch up the lines to form one large string. Recall how we did it with gawk in lecture 5 (or 6). Do the same in PERL, or just use gawk, that's Ok.
Now, all that remains is to match the ACCTT pattern in this long string. You can try =~ /ACCTT/, but here is the rub: it will match just the first occurrence of ACCTT, not all of them. Look up the PERL section of your textbook for a solution: =~ /ACCTT/g, where g means a global search for every occurrence in the given string. All you have to do is put the above (well, using the proper syntax) in a while loop with a counter, and it will count the number of occurrences of ACCTT. Below is a full code along these lines. The code below also automatically chops off the first line. Note the use of the pos() function in PERL to identify the sub-string's position in a string.
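A minimal sketch of this approach is given here. It is not necessarily the code handed out with the solution: the sub names stitch_sequence and motif_positions are made up for illustration, and the toy input is hypothetical data (not the patient's file), chosen so that the second ACCTT is split across two lines.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Join the lines of the DNA file into one long string.
# Assumption: the first line is a title and every later line is sequence.
sub stitch_sequence {
    my ($text) = @_;
    my @lines = split /\n/, $text;
    shift @lines;                 # chop off the title line
    my $dna = join '', @lines;    # stitch the remaining lines together
    $dna =~ s/\s+//g;             # drop any stray whitespace
    return uc $dna;
}

# Return the 1-based start position of every occurrence of $motif in $dna.
sub motif_positions {
    my ($dna, $motif) = @_;
    my @starts;
    while ($dna =~ /\Q$motif\E/g) {
        # pos() gives the offset just past the end of the current match
        push @starts, pos($dna) - length($motif) + 1;
    }
    return @starts;
}

# Hypothetical toy input: a title line, then two sequence lines.
my $toy    = "Toy title line\nGGACCTTGGAC\nCTTGG\n";
my $dna    = stitch_sequence($toy);
my @starts = motif_positions($dna, 'ACCTT');
print "Found ", scalar(@starts), " occurrences at: @starts\n";
```

To run this on the real data, read the downloaded file into one string and pass it to stitch_sequence in place of $toy. A line-by-line search of the toy input would find only the first ACCTT.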
There is one occurrence of ACCTT which wraps from one line to another. If you only checked for occurrences on each single line, you did not catch this one.
The first line of the file does not contain DNA sequence; counting those characters toward the position is thus incorrect. Also, we did not want the line number and the position within the line. That is not the position within the DNA sequence: if the newlines were taken out, that indexing would no longer be valid.
The correct way to calculate the percentage is (4 occurrences * 5 bases per occurrence) / 3371 total bases * 100 = 0.59%. This is greater than 0.25%, so if you said the percentage was less than 0.25% you lost credit. Also, if you said that because it was 0.59% the answer was "Yes, the patient
will likely have the disease and therefore will be placed high in
society" you lost credit.
There is a second requirement that must be fulfilled: the occurrences must be equidistant from each other.
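Both requirements can be checked in a few lines of PERL. The sketch below is only an illustration: the sub names are made up, the numbers 4, 5 and 3371 are the ones quoted above, and the start positions in the spacing demo are hypothetical (they are not the patient's actual positions). It also assumes that fewer than three occurrences cannot demonstrate regular spacing.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Percentage of the sequence covered by the motif, counted by letters.
sub motif_percentage {
    my ($count, $motif_len, $total_bases) = @_;
    return 100 * $count * $motif_len / $total_bases;
}

# True when every gap between consecutive start positions is the same.
# Assumption: fewer than three positions cannot show regular spacing.
sub evenly_spaced {
    my @starts = sort { $a <=> $b } @_;
    return 0 if @starts < 3;
    my $gap = $starts[1] - $starts[0];
    for my $i (2 .. $#starts) {
        return 0 if $starts[$i] - $starts[$i - 1] != $gap;
    }
    return 1;
}

# Both requirements must hold; 0.25 is the threshold from the problem statement.
sub likely_megaloheimers {
    my ($total_bases, $motif_len, @starts) = @_;
    my $pct = motif_percentage(scalar(@starts), $motif_len, $total_bases);
    return ($pct > 0.25 && evenly_spaced(@starts)) ? 1 : 0;
}

# Numbers from the solution text: 4 occurrences, 5 bases each, 3371 bases total.
printf "%.2f%%\n", motif_percentage(4, 5, 3371);    # prints 0.59%

# Hypothetical start positions, used only to exercise the spacing check.
my @made_up = (100, 900, 1700, 2500);
print evenly_spaced(@made_up) ? "regular spacing\n" : "irregular spacing\n";
```

For the real answer, feed likely_megaloheimers the start positions your own program finds in the stitched sequence.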