Computational Biology
Scribe Notes for Class 7
May 31, 2000
Scribe: N. Allen
Today's Handouts and Announcements
- No handouts or announcements
Today's Topics
- Fragment Assembly Process
- Break DNA molecule into smaller fragments
- Sequence the smaller fragments
- Assemble smaller fragments to reconstruct original DNA sequence
- Assembly Model: Shortest Common Superstring (SM p. 114)
- Input: A collection of fragments F.
- Output: The shortest string S such that for every f in F, S is a superstring of f.
- This model ignores complemented sequences, sequencing errors, and repetitions.
- The corresponding decision problem is NP-hard.
- Assembly Model: Reconstruction (SM p.116)
- Input: A collection of fragments F and an error tolerance 0<=e<=1.
- Output: The shortest string S such that for every f in F, min{ds(f,S),ds(fc,S)} <= e*|f|.
- ds is the substring edit distance.
- ds(a,b)=min{d(a,s)} where s varies over the substrings of b.
- d is the classical edit distance.
- d(a,b) is the minimum total cost of the substitutions, insertions, and deletions needed to transform a into b.
- Example: Let x, y, and z be positive real numbers. Then d(a,b) is the minimum of x*(# substitutions)+y*(# insertions)+z*(# deletions) over all sequences of edits that transform a into b.
- Note that ds is not a metric (for instance, it is not symmetric: ds(a,b) and ds(b,a) generally differ), which can cause mathematical and algorithmic problems if not accounted for.
- fc is the reverse complement of f. SM uses an overbar to denote this.
- This model handles complemented sequences and sequencing errors, but it is still an NP-hard decision problem.
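The Reconstruction condition above can be checked mechanically. The following is a minimal Python sketch, assuming unit costs for substitutions, insertions, and deletions (x = y = z = 1) and a DNA alphabet of A, C, G, T; the function names are illustrative and do not come from SM. The substring edit distance is computed with the usual dynamic program, except that a match may start and end anywhere in b.

```python
def substring_edit_distance(a, b, sub=1, ins=1, dele=1):
    """ds(a, b): minimum edit distance between a and any substring of b."""
    # Row 0 is all zeros: the match may start at any position of b for free.
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [i * dele]  # deleting the first i characters of a
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else sub
            cur.append(min(prev[j - 1] + cost,  # match or substitution
                           prev[j] + dele,      # delete a[i-1]
                           cur[j - 1] + ins))   # insert b[j-1]
        prev = cur
    # The match may also end anywhere in b, so take the best final cell.
    return min(prev)


def reverse_complement(f):
    """fc: the reverse complement of a DNA fragment."""
    return f.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def is_reconstruction(s, fragments, e):
    """Check min{ds(f,S), ds(fc,S)} <= e*|f| for every fragment f."""
    return all(
        min(substring_edit_distance(f, s),
            substring_edit_distance(reverse_complement(f), s)) <= e * len(f)
        for f in fragments
    )
```

The asymmetry of ds shows up directly: substring_edit_distance("A", "AB") is 0, while substring_edit_distance("AB", "A") is 1. This sketch only verifies a candidate S; it does not search for the shortest one.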
- Assembly Model: Multicontig (Class Version) (SM p. 117)
- Input: A collection of fragments F.
- Output: A layout of F whose weakest link is as long as possible.
- A layout is a potential answer that contains, for every f in F, either f or its reverse complement, but not both. A layout is thus a multiple alignment of F in which every column contains only one kind of base.
- Example:
| G | G | T | A |   |   |   | A | T | T |   |   |   |   |
|   |   | T | A | C |   |   |   | T | T | C | A | G |   |
|   |   |   |   | C | T | C | A |   |   | C | A | G | C |
- A link is an overlap between two fragments in the layout that is not contained in any other fragment.
- The weakest link of a layout is its shortest link.
- This is another NP-hard decision problem.
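As an illustration of the layout and weakest-link definitions above, here is a small Python sketch. It assumes a layout is given as a list of (offset, fragment) pairs, that "contained in another fragment" means a third fragment covers every column of the overlap, and that complementation has already been applied; these representation choices and the function names are assumptions made for the example, not taken from SM.

```python
def is_consistent(layout):
    """True if every column of the layout holds only one kind of base."""
    columns = {}
    for offset, fragment in layout:
        for k, base in enumerate(fragment):
            columns.setdefault(offset + k, set()).add(base)
    return all(len(bases) == 1 for bases in columns.values())


def weakest_link(layout):
    """Size of the smallest link, or None if the layout has no links."""
    # Half-open column interval [start, end) occupied by each fragment.
    spans = [(off, off + len(f)) for off, f in layout]
    link_sizes = []
    for i in range(len(spans)):
        for j in range(i + 1, len(spans)):
            lo = max(spans[i][0], spans[j][0])
            hi = min(spans[i][1], spans[j][1])
            if hi <= lo:
                continue  # fragments i and j share no column
            # The overlap is not a link if a third fragment covers it.
            covered = any(
                k != i and k != j and spans[k][0] <= lo and hi <= spans[k][1]
                for k in range(len(spans))
            )
            if not covered:
                link_sizes.append(hi - lo)
    return min(link_sizes) if link_sizes else None
```

For the example layout, written as [(0, "GGTA"), (2, "TAC"), (4, "CTCA"), (7, "ATT"), (8, "TTCAG"), (10, "CAGC")], the layout is consistent and the weakest link has size 1 (for instance, the single column shared by TAC and CTCA).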
- Assembly Model: Multicontig (Book Version) (SM p. 117)
- Input: A collection of fragments F and an integer t.
- Output: A partition of F into a minimal number of subcollections such that every subcollection has weakest link of length at least t.
- Algorithmic model for overlaps (SM p. 119)
- The overlap multigraph OM(F) of F is a directed, weighted multigraph defined by:
- The vertex set of OM(F) is F.
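- For every a and b in F with a != b, and every t >= 0 such that (a κ t)b = a(t κ b), there is an edge of weight t from a to b. Here a κ t is a with its last t characters removed and t κ b is b with its first t characters removed, so the condition says that the last t characters of a equal the first t characters of b.
- A simple path in OM(F) corresponds to a common superstring of the fragments on that path.
- The length of the superstring is the sum of the lengths of the fragments on the path minus the sum of the weights of the edges in the path.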
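- Example (continued): The path TTCAG -> CAGC -> CTCA -> ATT -> TAC -> GGTA, with edge weights 3, 1, 1, 1, 0, gives the superstring TTCAGCTCATTACGGTA of length 23 - 6 = 17.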
- Note that, provided no fragment in F is a substring of another, the shortest common superstring of the fragments in F corresponds to a Hamiltonian path of maximal weight in OM(F).
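The overlap multigraph and the path-to-superstring correspondence can be sketched in a few lines of Python. The fragment set below is the one read off the layout example in these notes, and the function names are illustrative assumptions rather than anything defined in SM.

```python
def overlap_edges(fragments):
    """All edges (a, b, t) of OM(F): the last t characters of a equal
    the first t characters of b, for distinct a and b and t >= 0."""
    edges = []
    for a in fragments:
        for b in fragments:
            if a == b:
                continue
            for t in range(min(len(a), len(b)) + 1):
                # a[len(a) - t:] is the length-t suffix (empty when t == 0).
                if a[len(a) - t:] == b[:t]:
                    edges.append((a, b, t))
    return edges


def path_superstring(path, weights):
    """Spell the superstring for a path, given the weight of each edge."""
    s = path[0]
    for a, b, t in zip(path, path[1:], weights):
        if a[len(a) - t:] != b[:t]:
            raise ValueError(f"no edge of weight {t} from {a} to {b}")
        s += b[t:]  # append only the part of b not already covered by a
    return s


F = ["GGTA", "TAC", "CTCA", "ATT", "TTCAG", "CAGC"]
PATH = ["TTCAG", "CAGC", "CTCA", "ATT", "TAC", "GGTA"]
WEIGHTS = [3, 1, 1, 1, 0]
```

Here path_superstring(PATH, WEIGHTS) spells TTCAGCTCATTACGGTA, whose length 17 equals the total fragment length 23 minus the total edge weight 6. Note that the weight-0 edge from TAC to GGTA is a legitimate edge of the multigraph, since t = 0 always satisfies the overlap condition.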
- Approximation algorithm for shortest common superstring (SM p. 125)
- Sort the edges of OM(F) by decreasing weight.
- [Iteration] Repeatedly choose the heaviest remaining edge that satisfies all of the following properties, until no edge qualifies:
- The edge does not form a cycle with previously chosen edges.
- The edge does not originate from a vertex that a previously chosen edge originated from.
- The edge does not terminate at a vertex that a previously chosen edge terminated at.
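The greedy procedure above can be sketched as follows. This is a minimal Python illustration, assuming that the fragments are distinct, that no fragment is a substring of another, and that ties between edges of equal weight are broken by input order; the union-find cycle check and the function name are implementation choices made here, not taken from SM.

```python
def greedy_superstring(fragments):
    """Greedy approximation of the shortest common superstring."""
    frags = list(dict.fromkeys(fragments))  # drop duplicates, keep order
    n = len(frags)
    if n == 0:
        return ""

    # Edges (t, i, j): the last t characters of frags[i] equal the
    # first t characters of frags[j]. Weight-0 edges are included.
    edges = []
    for i, a in enumerate(frags):
        for j, b in enumerate(frags):
            if i != j:
                for t in range(min(len(a), len(b)) + 1):
                    if a[len(a) - t:] == b[:t]:
                        edges.append((t, i, j))
    edges.sort(key=lambda e: -e[0])  # heaviest first; stable for ties

    parent = list(range(n))  # union-find over partial paths

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    succ = {}         # i -> (j, t) for each chosen edge
    has_pred = set()  # vertices at which a chosen edge terminates
    for t, i, j in edges:
        if i in succ or j in has_pred:
            continue  # i already has an outgoing or j an incoming edge
        if find(i) == find(j):
            continue  # the edge would close a cycle
        succ[i] = (j, t)
        has_pred.add(j)
        parent[find(i)] = find(j)
        if len(succ) == n - 1:
            break  # the chosen edges now form a Hamiltonian path

    # Walk the path from its unique start vertex and spell the string.
    i = next(v for v in range(n) if v not in has_pred)
    s = frags[i]
    while i in succ:
        i, t = succ[i]
        s += frags[i][t:]
    return s
```

On the six fragments of the running example, given in the order GGTA, TAC, CTCA, ATT, TTCAG, CAGC, this sketch returns GGTACTCATTCAGC, which has length 14. Being an approximation algorithm, the greedy choice is not guaranteed to find the shortest superstring in general.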
Today's Sources