What to turn in: A legible paper copy giving your answers
A common medical condition is the
so-called "Megaloheimer's syndrome". People with
Megaloheimer's tend to forget everything about everybody else except
themselves, and often remember good things about themselves that have
actually never happened to them but happened to other people. The
intellectual ability of these patients is usually below average, although
in the early stages of the disease they can pull the wool over everybody's
eyes and effectively conceal this fact. Often, with the help
of their influential and wealthy friends, people with this condition
attain positions of great power in society.
Biomedical research reveals that the DNA of people with
Megaloheimer's syndrome is likely to contain a relatively high proportion
(more than 0.25% of the total DNA content, counted "by letters") of the following sub-sequence: ACCTT, spaced at regular
intervals along the patient's DNA. In this homework you will write a series of short Perl programs
and try to answer the following question: is the patient XX
likely to develop Megaloheimer's?
The patient's DNA fragment is here: Patient XX. DNA sequence from locus 17NI5 promoter region. Admitted for diagnostic tests April 1 2006.
To approach the problem, write two separate Perl programs that read the DNA sequence above and answer the questions below. Write your answer next to each question and provide your Perl code (on a separate sheet, if necessary). Note that some of the ACCTT strings may be split between two lines in the DNA file you have just downloaded.
SOLUTION. Some preliminary thoughts. First, remove the first (title) line from the DNA sequence file; it is easiest to do this by hand. Next, stitch the remaining lines together into one long string. Recall how we did it with gawk in lecture 5 (or 6). Do the same in Perl, or just use gawk, that's OK. Now all that remains is to match the ACCTT pattern in this long string. You can try =~ /ACCTT/ , but here is the rub: it will match just the first occurrence of ACCTT, not all of them. Look up the Perl section of your textbook for a solution: =~ /ACCTT/g , where g means a global search for every occurrence in the given string. All you have to do is put the above (using the proper syntax) in a while loop with a counter, and it will count the number of occurrences of ACCTT. Below is the full code along these lines. The code also chops off the first line automatically. Note the use of the pos() function in Perl to identify the sub-string's position in a string.
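Before the full programs, a minimal sketch of why the stitching step matters. It uses a made-up three-line fragment (not the patient's data) in which one ACCTT is split across a line break. The names @lines and count_motif are illustrative assumptions, not part of the assignment.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical toy fragment, NOT the patient's data: the second ACCTT
# is split across lines 2 and 3 ("ACC" at the end, "TT" at the start).
my @lines = ("GGACCTTGGA\n", "TTGACAGACC\n", "TTGGAGGCAT\n");

# Count non-overlapping occurrences of a motif in a string.
# In list context m//g returns all matches; assigning that list to "()"
# and then to a scalar yields the number of matches.
sub count_motif {
    my ($string, $motif) = @_;
    my $count = () = $string =~ /\Q$motif\E/g;
    return $count;
}

# Counting line by line misses the occurrence that is split in two.
my $per_line = 0;
$per_line += count_motif($_, "ACCTT") for @lines;

# Stitch the lines into one string, drop the newlines, then count.
my $joined = join("", @lines);
$joined =~ s/\n//g;
my $stitched = count_motif($joined, "ACCTT");

print "line by line: $per_line, stitched: $stitched\n";
# prints "line by line: 1, stitched: 2"
```

The list-context assignment is just a compact alternative to a while loop with a counter; both visit each non-overlapping match once.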
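#!/usr/bin/perl
# Program 1: count the occurrences of ACCTT.
open(IN, "XM_043865.fasta") || die "couldn't open file: $!";
# eat the first line, which doesn't have DNA in it
<IN>;
# read in the rest, appending each line to $line as we go.
$line = "";
while (<IN>) {
    $line = $line . $_;
}
close(IN);
# delete the newlines in the string
$line =~ s/\n//g;
$count = 0;
# with /g, each pass through the loop resumes the search just after the
# previous match, so every occurrence is visited exactly once. The string
# is left untouched: deleting a match would reset the search position and
# could splice the neighbouring letters into a spurious ACCTT.
while ($line =~ m/ACCTT/g) {
    $count++;
}
# and now print the final count.
print "Got $count occurrences of ACCTT\n";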
#!/usr/bin/perl
# Program 2: report the position of every ACCTT.
open(IN, "XM_043865.fasta") || die "couldn't open file: $!";
# eat the first line, which doesn't have DNA in it
<IN>;
# read in the rest, appending each line to $line as we go.
$line = "";
while (<IN>) {
    $line = $line . $_;
}
close(IN);
# delete the newlines in the string
$line =~ s/\n//g;
print "Got " . length($line) . " characters of DNA.\n";
# Now go find all the instances.
$count = 0;
while ($line =~ m/ACCTT/g) {
    # pos($line) is the offset just past the end of the last match, so
    # subtracting the motif length (5) gives the 0-based start of the match.
    print "Found at position " . (pos($line) - 5) . "\n";
    $count++;
}
# and now print the final count.
print "Got $count occurrences of ACCTT\n";
Output:
Got 3371 characters of DNA.
Found at position 1101
Found at position 1230
Found at position 1747
Found at position 3068
Got 4 occurrences of ACCTT
That means we've got 4*5 = 20 characters of ACCTT out of 3371. 20/3371 = 0.00593,
which is about 0.59%, more than the 0.25% required. So it looks like a "YES".
But note that for a "YES" the ACCTT copies must also be spaced at
regular intervals. This is not the case: the first two instances are
129 letters apart, the middle two are 517 apart, and the last two are 1321
apart. So the answer is actually "NO".
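The arithmetic and the spacing check can be reproduced with a short sketch. It hard-codes the four positions and the sequence length from the output listed above; the variable names are illustrative. The assignment gives no numeric tolerance for "regular", so the sketch only prints the gaps and leaves the judgment to the reader.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Values copied from the program output above.
my @positions  = (1101, 1230, 1747, 3068);
my $dna_length = 3371;
my $motif_len  = length("ACCTT");    # 5 letters

# Percentage of the DNA, counted by letters, covered by the motif.
my $percent = 100 * scalar(@positions) * $motif_len / $dna_length;
printf "ACCTT content: %.2f%%\n", $percent;
# prints "ACCTT content: 0.59%"

# Distances between consecutive occurrences.
my @gaps = map { $positions[$_] - $positions[$_ - 1] } 1 .. $#positions;
print "Gaps: @gaps\n";
# prints "Gaps: 129 517 1321"
```

The content threshold is met, but the gaps grow by roughly a factor of ten from first to last, which is what rules out regular spacing.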