Bioinformatics - cse546, Autumn 2010

(Computational Biology)

Instructor: Dr. M. Sakalli, E-mail: WildSolarianTree at gmail com.

Books

Presentations

The aim of this project is to build a graph generator such that:
1)
De Bruijn Graph can be produced for a given (DNA or Protein) sequence..

2) and genetic sequences can be mapped into a Eulerian (st) path.

3) to achieve this you need to understand some tools start with implementation of suffix trees - suffix arrays, therefore I think border arrays may be a generic method!!.

You have to understand Burrows Wheeler Transforms, and as an application Ross Lippert provides his code, and paper, but below this page I included many other resources that will help you go through.

You need to understand Ukkonen's O(n) suffix tree algorithm and its utilization for pattern and profile search..

4) One question is how to merge suffix trees to construct De Bruijn graphs, or may be another approach.

5) and finally exploiting Eulerian or (Hamiltonian) path approach with De Bruijn graphs that can be loaded with meaningful outcomes for protein and/or peptide sequences.

This was the big picture of the project that I told you, which is the same, and I hope once you start you'll/may find yourself been engaged deeply inside with one of the concepts.

For full mark, your work must include your programming and presention of your results with supported papers.

There is a very similar work, that I am running for, from the Univ of Cincinnati, De Bruijn Graphs seems completed recently by 2010.

The things you must include in your work on.

1- Implementation of Burrows Wheeler Transforms. for DNA sequences you will be given.

2- Implement Ukkonen's algorithm and apply it to the sequences (..).

3- A program constructing De Bruijn graphs (should be simple), here is one (This is the same link that was given above) python code applied to SNM case.

You must choose one of the tasks addressed above and bring a presentation.

4) Data sequences and some publications soon.Matlab implementations will be prefered, but gcc, vc++ 6, python implementations are acceptable as well..

To be able to appreciate the project please go thou these links..

There are many sources as presented below including in/thru wikipedia, but to save your time go to Ross Lipper's Notes of MIT..

Suffix Trees: Burrows Wheeler Transform applied to DNA sequences from Ross Lipper notes. Following three links are for you to understand antisorting stage to recover the original sequence and source codes (Rememebr I am not asking encoding and decoding stages since our interest is in patterns.). Trie, Patricia Trie.

from fit.edu, M.Mahoney Source bbb compressor.

Reconstruction example with javascript,

another site with suffix trees and source code as well, and

More Suffix trees

1- Page by Prof. Sartaj Sahni of UFL.edu and the dead link at the bottom of the page is the sufixcode link located here Fast string matching wt suffix trees mark nelson's page. ***

2- Suffix trees in computational biology of Chris Lewis of usask.ca here. Many of the links may be the same..

3- Suffix trees by Lloyd Allison and so much more (of Monash University), Wolloomoloo example, (kangaroo)

4- More on suffix trees Christian Kreibich's libstree including depth-and breadth-first traversal, string update, and examples like Longest Common Sub String, and L Repeated Substring

5- ansi c implementation source code by D. Tsadoc and S. Yona,

6- Python bound applications of suffix trees seem to originally in C, therefore for now better just to stick by with those provided above.

7- Gusfield prgs

 

Piotr Dendek has pointed out another article on the topic.. and he also found a source code point in java for the suffix trees..

 

E.coli K-12..fasta, E.coli_full_sequence_ALLInf.gb,. E.coli_features.gb, E.coli.genes.gb, more information available at from ftp site.. and webpage of ncbi.nih.gov.

Links 2 more papers on De Bruijn included..

On Combinatorial DNA Word Design Word Design...

Sense from sequence reads: methods for alignment and assembly..

Generating graphs and De Bruijn Cincinnati, in python.

+++) Papers: Method of constructing De Bruijn sequences. By B. Arazi 1976.

+++) Lempil Ziv notes of Peter Shor.

+++) I think this will be helpful to reduce the dimension of the problems. You must look at the way codons are classified here, "anew classification of Codons" from the "Inst of Molecular Biology" by Wilhelm and Nikolajewa.

+++) Practical Suffix Tree Construction on large sequences from eecs.umich.edu/tdd ***.

+++) Suffx Tree Construction of rpi **** C code avlble

+++) Linear Time Algos finding Tandem Repeats ***

Structures of String Matching and Data Compression. 1999, NJ. Jesper Larsson

+++) Necklaces and Lyndon words

++) Analysis of tRNA Gene Sequences by Neural Network ** nih.

Codon reading patters in Drosophila melanogaster mitochonria based on their tRNA..

Regulation by RNA by M. Szymañski and J. Barciszewski,

DNA sequences that code for tRNA and rRNA

RNA-based regulation of genes by G. Yanofsky, *

RNA-Based Therapy..1 **, 2 **

tRNA chapters from Virginia state. Chapters 32, 33, 11

tRNA usage 1 2

An example of constructing graph: Clustering in Relational Biological Data by A Dayanik and CG Nevill-Manning

Statistical signals in bioinformatics, by S Karlin. **

tRNA databse.

Inferring Strings from Runs

diversity of tRNA in eukaryotes .

Sequences of tRNA-encoding genes and associated open reading

PCA analysis: (Specific Codon Usage Pattern of H1N1) ***

Characterization of tRNA genes in tRNA region

Partial Sequence Analysis of the 5s to 18s rRNA Gene Region

Ivet Bahar Vibrational dynamicsof tRNAs.