Computational Biology

Scribe Notes for Class 6
May 30, 2000

Scribe: S. Oak

Today's Topics

The topic of Fragment Assembly was discussed, along with a set of computational challenges.

Introduction to Fragment Assembly.

Sequencing --- Determining the string of bases that makes up a DNA molecule.
Target (S) --- The long sequence resulting from fragment assembly.
Fragment --- Sequencing the whole DNA molecule at once is not possible because of its length. However, a piece of the molecule can be sequenced in the canonical (5' ---> 3') direction for a certain length. Each such sequence is called a fragment.
Fragment Assembly --- The whole DNA molecule is deduced from this collection of fragments; the process is therefore termed fragment assembly.

Guidance to Fragment Assembly.

Overlaps --- Sometimes the end of one fragment is similar to the beginning of another, which suggests that the two fragments overlap in the target.
Approximate Size --- The approximate length of the target DNA molecule is known.
Contigs --- Connected pieces of the target DNA that we assemble.
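
The overlap idea can be sketched in a few lines of Python. This is only an illustration: it uses exact matching, whereas real fragments contain errors and call for approximate matching, and the function names suffix_prefix_overlap and merge are hypothetical, not taken from the class.

```python
def suffix_prefix_overlap(a: str, b: str, min_len: int = 1) -> int:
    """Length of the longest suffix of `a` that exactly equals a prefix of `b`.

    Returns 0 when no overlap of at least `min_len` bases exists.
    """
    max_len = min(len(a), len(b))
    # Try the longest candidate first so the first hit is the longest overlap.
    for k in range(max_len, min_len - 1, -1):
        if k > 0 and a[-k:] == b[:k]:
            return k
    return 0


def merge(a: str, b: str, min_len: int = 1) -> str:
    """Join two fragments into one contig by collapsing their overlap."""
    k = suffix_prefix_overlap(a, b, min_len)
    return a + b[k:]


if __name__ == "__main__":
    print(suffix_prefix_overlap("ACCGT", "CGTGC"))  # 3 (the shared "CGT")
    print(merge("ACCGT", "CGTGC"))                  # ACCGTGC
```

If the two fragments share no overlap, merge simply concatenates them, so a caller would normally check the overlap length before merging.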

Result of Fragment Assembly.

Consensus sequence (consensus) --- The consensus is obtained by taking a majority vote among the bases in each column of the fragment layout produced during assembly.
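
A minimal sketch of the majority vote, assuming the fragments have already been laid out in columns. The conventions here are assumptions made for the example: a space marks positions a fragment does not cover, and "-" marks an alignment gap. The function name consensus is hypothetical.

```python
from collections import Counter


def consensus(layout: list[str], pad: str = " ", gap: str = "-") -> str:
    """Majority vote per column over a layout of aligned fragments.

    `pad` marks positions a fragment does not cover; `gap` marks a gap
    inside a fragment. Columns won by the gap symbol are left out.
    Ties go to the symbol seen first in the column (top row first).
    """
    width = max(len(row) for row in layout)
    result = []
    for col in range(width):
        column = [row[col] for row in layout
                  if col < len(row) and row[col] != pad]
        if not column:
            continue  # no fragment covers this column
        winner, _ = Counter(column).most_common(1)[0]
        if winner != gap:
            result.append(winner)
    return "".join(result)


if __name__ == "__main__":
    layout = [
        "ACCGT",
        "  CGTGC",
        "   TTGCAT",  # first base is a substitution error (T instead of G)
    ]
    print(consensus(layout))  # ACCGTGCAT
```

In the example, the erroneous T in the third fragment is outvoted by the two G's above it.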

Complications for Fragment Assembly.

Base call errors --- Base call errors comprise base substitutions, insertions and deletions in the fragments.
Chimeras --- Chimeras arise when two regular fragments from distinct parts of the target molecule join end-to-end to form a fragment that is not a contiguous part of the target.
Contamination --- Some fragments have nothing to do with the target molecule. They come from the host or vector DNA and contaminate the fragment set.
Unknown orientation --- A fragment can come from either DNA strand. Saying that a fragment is a substring of one strand is equivalent to saying that its reverse complement is a substring of the other.
Repeated regions --- Several different overlaps involving repeats may each give an acceptable answer, which complicates the assembly process.
Alignment --- An imperfect alignment can result when a base is skipped.
Substrings --- One fragment may be entirely contained in another.
Lack of coverage --- Coverage at position i of the target is the number of fragments that cover that position. Coverage is difficult to compute because the actual positions of the fragments in the target are not known.
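
Two of the items above, unknown orientation and coverage, can be illustrated with a short Python sketch. The coverage function assumes the fragment positions are known, which in practice they are not; it only makes the definition concrete. All function names are hypothetical.

```python
# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the result."""
    return seq.translate(COMPLEMENT)[::-1]


def occurs_in_either_strand(fragment: str, strand: str) -> bool:
    """True if the fragment matches `strand` or the opposite strand."""
    return fragment in strand or reverse_complement(fragment) in strand


def coverage(target_len: int, placements: list[tuple[int, int]]) -> list[int]:
    """Coverage at each position, given (start, length) for every fragment.

    Positions are 0-based. The placements are assumed known here purely
    for illustration.
    """
    cov = [0] * target_len
    for start, length in placements:
        for i in range(max(start, 0), min(start + length, target_len)):
            cov[i] += 1
    return cov


if __name__ == "__main__":
    print(reverse_complement("ACCGT"))                     # ACGGT
    print(occurs_in_either_strand("ACGGT", "ACCGTGCAT"))   # True
    print(coverage(9, [(0, 5), (2, 5), (3, 6)]))
    # [1, 1, 2, 3, 3, 2, 2, 1, 1]
```

A position with coverage 0 would be a gap between contigs; none occurs in this small example.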

The message arrival model was discussed to explain:

Stochastic Process --- A counting process for a set of events occurring in the time interval (0, ∞).
Poisson Process --- A counting process for a set of events with constant rate.
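
As a rough illustration of a constant-rate counting process, the sketch below simulates arrivals by drawing exponentially distributed gaps between successive events, a standard way to generate a Poisson process. The names simulate_arrivals and count_up_to, and the rate and horizon values, are assumptions made for the example.

```python
import bisect
import random


def simulate_arrivals(rate: float, horizon: float, seed: int = 0) -> list[float]:
    """Arrival times in (0, horizon) of a Poisson process with the given rate.

    Gaps between successive arrivals are independent exponential draws
    with mean 1/rate.
    """
    rng = random.Random(seed)
    t = 0.0
    arrivals = []
    while True:
        t += rng.expovariate(rate)
        if t >= horizon:
            return arrivals
        arrivals.append(t)


def count_up_to(arrivals: list[float], t: float) -> int:
    """N(t): the number of events that have occurred by time t."""
    return bisect.bisect_right(arrivals, t)


if __name__ == "__main__":
    arrivals = simulate_arrivals(rate=2.0, horizon=100.0, seed=1)
    # With rate 2 and horizon 100, about 200 arrivals are expected on average.
    print(len(arrivals), count_up_to(arrivals, 50.0))
```

In the fragment setting, one can read "time" as position along the target and each arrival as the start of a fragment.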

The Poisson distribution formulae were applied to give the expression for the estimated number of apparent contigs: p = n e^{-n(l - t)/T}

The fraction of the target sequence covered by exactly k fragments is given by: r = e^{-c} c^k / k!

            where T denotes the length of the target molecule;
                  l denotes the length of each fragment (all fragments are assumed to have the same length);
                  t denotes the overlap size;
                  n denotes the number of fragments;
                  c = nl/T.
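
The two expressions can be evaluated numerically with a small Python sketch. It takes the fraction covered by exactly k fragments to be the Poisson probability e^{-c} c^k / k! with c = nl/T. The function names and the sample values of n, l, t and T are made up for illustration.

```python
import math


def expected_apparent_contigs(n: int, l: float, t: float, T: float) -> float:
    """p = n * e^{-n(l - t)/T}."""
    return n * math.exp(-n * (l - t) / T)


def fraction_covered_exactly(k: int, n: int, l: float, T: float) -> float:
    """r = e^{-c} c^k / k!, with c = n*l/T."""
    c = n * l / T
    return math.exp(-c) * c ** k / math.factorial(k)


if __name__ == "__main__":
    # Hypothetical experiment: 1000 fragments of length 500, target of
    # length 100000, overlap size 50, so c = 5.
    n, l, t, T = 1000, 500, 50, 100_000
    print(expected_apparent_contigs(n, l, t, T))  # roughly 11 contigs
    print(fraction_covered_exactly(0, n, l, T))   # e^{-5}: fraction left uncovered
```

Summing fraction_covered_exactly over all k gives 1, as expected for a probability distribution; the k = 0 term is the fraction of the target that no fragment covers.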

Today's Sources

Setubal and Meidanis: Chapter 1 (Pages 15 - 21) and Chapter 4 (Pages 105 - 113).
Salzberg, Searls, and Kasif: not used today.
Introduction to Computational Biology: Maps, Sequences and Genomes, by Michael Waterman (Pages 94 - 98).

Getting all fragments using the message arrival model.

Please report any problems found in these pages to:

CS6104 Account (cs6104@courses.cs.vt.edu)
