Computational Biology
Scribe Notes for Class 6
May 30, 2000
Scribe: S. Oak
Today's Topics
The topic of Fragment Assembly was
discussed, along with a set of computational challenges.
Introduction to Fragment Assembly.
Sequencing ---
Determining the string of bases that makes up a DNA molecule.
Target (S) ---
The sequence resulting from fragment assembly.
Fragment ---
Sequencing the whole DNA molecule at once is not possible due to its length. However, a piece
of the molecule can be sequenced in the canonical (5' ---> 3')
direction for a certain length. Each such sequence is called a fragment.
Fragment Assembly ---
The sequence of the whole DNA molecule is then deduced from this collection of fragments;
this process is called fragment assembly.
Guidance to Fragment Assembly.
Overlaps --- Often
the end of one fragment is similar to the beginning of another.
Approximate Size --- The
approximate length of the target DNA molecule is known.
Contigs --- connected
pieces of the target DNA that we assemble.
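The overlap idea above can be sketched in a few lines of Python. This is only an illustration, assuming exact matching between the end of one fragment and the start of the next; real fragments overlap only approximately because of sequencing errors. The function names and the example fragments are hypothetical and were not part of the lecture.

```python
def suffix_prefix_overlap(a: str, b: str, min_len: int = 1) -> int:
    """Length of the longest suffix of `a` that exactly equals a prefix of `b`.

    Returns 0 when no overlap of at least `min_len` bases exists.
    """
    max_len = min(len(a), len(b))
    # Try the longest candidate first so the first hit is the best one.
    for k in range(max_len, max(min_len, 1) - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def merge(a: str, b: str, min_len: int = 1) -> str:
    """Join `a` and `b` into one contig, collapsing their exact overlap."""
    k = suffix_prefix_overlap(a, b, min_len)
    return a + b[k:]


# The suffix "CGT" of the first fragment equals the prefix of the second.
print(suffix_prefix_overlap("ACCGT", "CGTGC"))  # 3
print(merge("ACCGT", "CGTGC"))                  # ACCGTGC
```

An assembler that tolerates base call errors would replace the exact comparison with an approximate (semiglobal) alignment score.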
Result of Fragment Assembly.
Consensus sequence (consensus) ---
The consensus is obtained by taking the majority vote among all the bases
in each column during assembly.
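A minimal sketch of the majority vote, assuming the fragments have already been laid out in rows, with spaces padding the positions a fragment does not cover and '-' marking an alignment gap. The layout format, the function name, and the tie-breaking rule (the base seen first, top to bottom, wins) are assumptions made for this illustration.

```python
from collections import Counter
from typing import List


def consensus(layout: List[str], pad: str = " ", gap: str = "-") -> str:
    """Column-wise majority vote over an aligned layout of fragments."""
    width = max(len(row) for row in layout)
    result = []
    for col in range(width):
        # Collect the symbols of every fragment that covers this column.
        column = [row[col] for row in layout
                  if col < len(row) and row[col] != pad]
        if not column:
            continue  # no fragment covers this column
        winner, _ = Counter(column).most_common(1)[0]
        if winner != gap:  # a winning gap means "no base here"
            result.append(winner)
    return "".join(result)


layout = [
    "ACCGTGC  ",
    "  CGAGCTA",  # base call error: A instead of T in the fifth column
    "   GTGCTA",
]
print(consensus(layout))  # ACCGTGCTA
```

In the example, the erroneous A is outvoted two to one by the T in the other two fragments.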
Complications for Fragment Assembly.
Base call errors --- Base call errors
comprise base substitutions, insertions, and deletions in the fragments.
Chimeras --- Chimeras arise when two regular
fragments from distinct parts of the target molecule join end-to-end to
form a fragment that is not a contiguous part of the target.
Contamination --- Some fragments
have nothing to do with the target molecule. These
come from contamination by host or vector DNA.
Unknown orientation --- Fragments can come
from either DNA strand, so the orientation of a fragment is not known in advance.
A fragment being a substring of one strand is equivalent to its reverse complement
being a substring of the other.
Repeated regions --- When the target contains repeats, several different
overlaps involving a repeat may each look acceptable, which complicates the assembly.
Alignment --- An imperfect alignment can result
when a base is skipped.
Substrings --- One fragment may be
entirely contained in another.
Lack of coverage --- Coverage at position
i of the target is the number of fragments that cover that position. Coverage
is difficult to compute because the actual positions of the fragments in
the target are not known.
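Two of the complications above, unknown orientation and coverage, can be made concrete with a short Python sketch. It assumes an alphabet of A, C, G, T only, and, for the coverage part, that the fragment positions are known (as in a simulation or after a layout has been chosen), which is exactly what is not available in practice. All names and the example data are hypothetical.

```python
from typing import List, Tuple

# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the result."""
    return seq.translate(COMPLEMENT)[::-1]


def occurs_in_either_strand(fragment: str, strand: str) -> bool:
    """True if the fragment, in either orientation, is a substring of `strand`.

    A fragment read from the opposite strand appears in `strand`
    as its reverse complement.
    """
    return fragment in strand or reverse_complement(fragment) in strand


def coverage(target_len: int, placements: List[Tuple[int, int]]) -> List[int]:
    """Per-position coverage, given (start, length) pairs with 0-based starts."""
    cov = [0] * target_len
    for start, length in placements:
        for i in range(start, min(start + length, target_len)):
            cov[i] += 1
    return cov


print(reverse_complement("ACCGT"))                    # ACGGT
print(occurs_in_either_strand("ACGGT", "TTACCGTGC"))  # True
print(coverage(8, [(0, 4), (2, 4)]))                  # [1, 1, 2, 2, 1, 1, 0, 0]
```

Positions with coverage 0 in the last line correspond to parts of the target that no fragment reaches, which is the lack-of-coverage situation.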
The message arrival model was discussed
to explain:
Stochastic Process --- A counting process
for a set of events occurring in the time interval (0, ∞).
Poisson Process --- A counting process for
a set of events with constant rate.
The Poisson distribution formulae were applied to give the expression for the
number of apparent contigs as:
p = n e^{-n(l - t)/T}
The fraction of the target sequence covered by exactly
k fragments is given by:
r = e^{-c} c^k / k!
where
T denotes the length of the target molecule;
l denotes the length of each fragment (all fragments are taken to have the same length);
t denotes the overlap size;
n denotes the number of fragments;
c = nl/T is the average coverage.
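The two expressions can be evaluated numerically with a short Python sketch. The coverage fraction is taken here as the Poisson probability e^{-c} c^k / k! with mean c = nl/T, so that the fractions over all k sum to 1. The function names and the example values (T = 100000, l = 500, t = 50, n = 1000) are hypothetical and chosen only for illustration.

```python
import math


def expected_apparent_contigs(n: int, l: float, t: float, T: float) -> float:
    """p = n * exp(-n (l - t) / T), the expected number of apparent contigs."""
    return n * math.exp(-n * (l - t) / T)


def fraction_covered_exactly(k: int, n: int, l: float, T: float) -> float:
    """Poisson probability with mean c = n l / T of seeing exactly k fragments."""
    c = n * l / T
    return math.exp(-c) * c ** k / math.factorial(k)


# Hypothetical project: 1000 fragments of length 500 over a 100 kb target.
n, l, t, T = 1000, 500, 50, 100_000
print(expected_apparent_contigs(n, l, t, T))  # about 11.1
print(fraction_covered_exactly(0, n, l, T))   # about 0.0067, left uncovered
```

With these numbers the average coverage is c = 5, and roughly 0.7% of the target is expected to be covered by no fragment at all.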
Today's Sources
Setubal and Meidanis:
Chapter 1 (Pages 15 - 21) and Chapter 4 (Pages 105 - 113).
Salzberg, Searls,
and Kasif: not used today.
Michael Waterman:
Introduction to Computational Biology: Maps, Sequences
and Genomes (Pages 94 - 98).
Getting all fragments using the message arrival model.