Computational Biology
Scribe Notes for Class 6
May 30, 2000
Scribe: S. Oak
Today's Topics
The topic of Fragment Assembly was
discussed, along with a set of computational challenges.
-
Introduction to Fragment Assembly.
-
Sequencing --- Determining
the string of bases that makes up a DNA molecule.
-
Target (S) --- The long
sequence resulting from fragment assembly.
-
Fragment --- Sequencing
the whole DNA molecule at once is not possible because of its length. However,
a piece of the molecule can be sequenced in the canonical (5' ---> 3')
direction for a certain length. Each such sequence is called a fragment.
-
Fragment Assembly --- The
whole DNA molecule is then deduced from this collection of fragments;
this process is called fragment assembly.
-
Guidance to Fragment Assembly.
-
Overlaps --- Sometimes
the end of one fragment is similar to the beginning of another.
-
Approximate Size --- The
approximate length of the target DNA molecule is known.
-
Contigs --- Connected
pieces of the target DNA that we assemble.
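The overlap idea above can be sketched in a few lines of Python. This is only an illustration, assuming exact matches between the end of one fragment and the start of the next; real overlaps are approximate because of sequencing errors. The names overlap_length and merge are hypothetical helpers, not from the lecture or the textbooks.

```python
def overlap_length(a: str, b: str, min_length: int = 1) -> int:
    """Length of the longest suffix of `a` that exactly equals a prefix of `b`."""
    # Try the longest candidate first, so the first hit is the best one.
    for size in range(min(len(a), len(b)), min_length - 1, -1):
        if a[-size:] == b[:size]:
            return size
    return 0


def merge(a: str, b: str, min_length: int = 1) -> str:
    """Join `a` and `b` into one contig string that shares their overlap."""
    size = overlap_length(a, b, min_length)
    if size == 0:
        raise ValueError("fragments do not overlap")
    # Keep all of `a`, then append the part of `b` past the shared region.
    return a + b[size:]
```

For example, the fragments ACCGT and CGTGC share the three bases CGT, so merging them gives the contig ACCGTGC.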
-
Result of Fragment
Assembly.
-
Consensus sequence (consensus) ---
The consensus is obtained by taking a majority vote among the bases
in each column during assembly.
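A minimal sketch of the majority vote, assuming the fragments have already been laid out as rows of equal alignment, with leading spaces marking where a fragment starts and '-' marking an alignment gap. This row representation and the function name consensus are assumptions made for the example, not part of the lecture.

```python
from collections import Counter


def consensus(rows: list[str], gap: str = "-") -> str:
    """Majority-vote consensus over the columns of an aligned fragment layout."""
    width = max(len(row) for row in rows)
    result = []
    for col in range(width):
        # Collect the symbols of every fragment that covers this column.
        votes = Counter(
            row[col] for row in rows if col < len(row) and row[col] != " "
        )
        if not votes:
            continue  # no fragment covers this column
        # Ties go to the symbol seen first, which is an arbitrary choice.
        symbol, _ = votes.most_common(1)[0]
        if symbol != gap:
            result.append(symbol)
    return "".join(result)
```

With the rows "ACCGT", "  CGTGC" and " CAGTG", the third row disagrees in one column (A instead of C) but is outvoted, and the consensus is ACCGTGC.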
-
Complications
for Fragment Assembly.
-
Base call errors --- Base call errors
comprise base substitutions, insertions, and deletions in the fragments.
-
Chimeras --- Chimeras arise when two regular
fragments from distinct parts of the target molecule join end-to-end,
forming a fragment that is not a contiguous part of the target.
-
Contamination --- Some fragments have
nothing to do with the target molecule. This
is due to contamination from the host or vector DNA.
-
Unknown orientation --- Fragments can come
from either DNA strand. A fragment being a substring
of one strand is equivalent to its reverse complement being a substring
of the other.
-
Repeated regions --- Every overlap involving
a repeat may give an acceptable answer, which complicates the assembly
process.
-
Alignment --- Imperfect alignment can result
when a base is skipped.
-
Substrings --- One fragment can be a substring
of another, that is, entirely contained in it.
-
Lack of coverage --- Coverage at position
i of the target is the number of fragments that cover that position. Coverage
is difficult to compute because the actual positions of the fragments in
the target are not known.
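Two of the complications above, unknown orientation and coverage, lend themselves to a short sketch. The coverage function assumes that the fragment positions are known, as (start, length) pairs with 0-based starts, which is exactly what is missing in a real assembly; it is here only to make the definition concrete. The function names and the data layout are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse, giving the other strand 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def coverage(target_length: int, placements: list[tuple[int, int]]) -> list[int]:
    """Number of fragments covering each position of the target.

    `placements` holds (start, length) pairs with 0-based starts. Knowing
    them is an assumption of this toy example.
    """
    counts = [0] * target_length
    for start, length in placements:
        for i in range(start, min(start + length, target_length)):
            counts[i] += 1
    return counts
```

For instance, if CCG is a substring of one strand, then reverse_complement("CCG"), which is CGG, is a substring of that strand's reverse complement. A position whose coverage count is 0 marks a gap between contigs.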
-
The message arrival model was discussed
to explain:
-
Stochastic Process --- A counting process
for a set of events occurring in the time interval (0, ∞).
-
Poisson Process --- A counting process for
a set of events with constant rate.
-
The Poisson distribution formulae were applied to give the expression for the
estimated number of apparent contigs:
p = n e^{-n(l-t)/T}
-
The fraction of the target sequence covered by exactly
k fragments is given by:
r = e^{-c} c^k / k!
where
T denotes the length of target molecule;
l denotes the length of each fragment;
t denotes the overlap size;
n denotes the number of fragments;
c = nl/T.
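The two expressions can be evaluated directly. The sketch below uses the symbols defined above (n, l, t, T, c) and takes the coverage fraction in its standard Poisson form, e^{-c} c^k / k!, with no leading factor of n, so that the fractions over all k sum to 1. The function names and the numbers in the example are made up for illustration.

```python
import math


def expected_contigs(n: int, l: float, t: float, T: float) -> float:
    """Estimated number of apparent contigs: p = n * e^(-n(l-t)/T)."""
    return n * math.exp(-n * (l - t) / T)


def coverage_fraction(k: int, n: int, l: float, T: float) -> float:
    """Fraction of the target covered by exactly k fragments (Poisson, mean c)."""
    c = n * l / T  # average coverage
    return math.exp(-c) * c**k / math.factorial(k)


if __name__ == "__main__":
    # Hypothetical project: 1000 fragments of length 500, overlap size 50,
    # target of length 100000, so c = 5.
    n, l, t, T = 1000, 500, 50, 100_000
    print("apparent contigs:", expected_contigs(n, l, t, T))
    print("uncovered fraction:", coverage_fraction(0, n, l, T))
```

Setting k = 0 gives e^{-c}, the fraction of the target that no fragment covers, which is why raising the average coverage c shrinks the gaps so quickly.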
Today's Sources
-
Setubal and Meidanis:
Chapter 1 (Pages 15 - 21) and Chapter 4 (Pages 105 - 113).
-
Salzberg, Searls,
and Kasif: not used today.
-
Introduction to Computational Biology: Maps, Sequences
and Genomes, by Michael Waterman
(Pages 94 - 98).
-
Getting all fragments using the message arrival model.