FriendFeed is the easiest way to share online. Learn more » |

PT43: Tobias Marschall - Efficient Exact Motif Discovery

Marcel Martin and Roland Krause liked this

Unsupervised motif discovery on a string with no previous knowledege in an automated fashion.
- Roland Krause

44 more comments

issues: How to measure over-representation and how to find them.
- Roland Krause

184 publications in pubmed for "motif discovery algorithm"
- Roland Krause

aim to establish an (almost) exact method based on a rigorous motif statistics
- Mikhail Spivakov

given: query text, IUPAC motifs, random text model (background) - for now, iid
- Mikhail Spivakov

Calculating a p-value of a given query text, a IUPAC motif and a random text model.
- Roland Krause

want: a p-value for a motif
- Mikhail Spivakov

use a novel device called probabilistic arithmetic automata - won't go into details
- Mikhail Spivakov

Need to compute the distribution of occurrences by chance. Not a straight forward task, recently proposed a new approach by building a probabilistic arithmetic automata.
- Roland Krause

an exact calculation
- Mikhail Spivakov

The problem is that computing p-values is infeasible due to large number of motifs.
- Roland Krause

matches occur in clumps. use compound possion approximation (almost exact) to calculate exact distribution of clump sizes. approximate number of clumps by Poisson distribution
- Mikhail Spivakov

Use of a Compound Poisson Approximation on a set of clumps (sets of overlapping motifs)
- Roland Krause

clump = overlapping occurences
- Marcel Martin

The clump size can be used the probabilistic atutomata.
- Roland Krause

nice: clumP SIze is abbreviated with an uppercase psi
- Marcel Martin

How to bound the p-value to prune the search space?
- Roland Krause

p-value: the probability of observing >k occurences (when found k in the real data)
- Mikhail Spivakov

bound no. of occurences using the no. of clumps
- Marcel Martin

Motifs with the same composition have the same expectation.
- Roland Krause

iterate over all possible compositions, not over the motifs themselves to take advantage of the same expectation
- Marcel Martin

but iid model for DNA isn't very appropriate
- Mikhail Spivakov

Use of a suffix tree of the sequence, iterate over the motifs, use the lower bound for pruning, walk the tree and identify overrepresented motifs.
- Roland Krause

so re-evaluate the motifs producing a good p-value with iid on a Markovian text model
- Mikhail Spivakov

designing a good benchmark set is hard
- Marcel Martin

other tools: Weeder, MEME
- Marcel Martin

Benchmark sets are not easy, used a set by Sandve et al. http://www.pubmedcentral.nih.gov/article...
- Roland Krause

Outperforms Weeder and MEME at the cost of higher running time of ~12 hours.
- Roland Krause

algorithm is not as fast as the other tools. is easily parallelizable
- Marcel Martin

with a 4.4 Mbp genome of M.tuberculosis, found motifs in ~250 CPU hrs in a parallelized setting
- Mikhail Spivakov

best motif: AGACSCARAA (or sth like that), found in literature
- Marcel Martin

computationally demanding, but possible with modern computers
- Mikhail Spivakov

List of models, the first is described, others are under investigation.
- Roland Krause

in the future: use modern hardware (eg GPUs)
- Mikhail Spivakov

optimise wrt Markovian models directly (rather than iid)
- Mikhail Spivakov

Future work could incorporate Markovian models directly or use phylogenetic information.
- Roland Krause

Q. Is the implementation available? A: Given in the paper.
- Roland Krause

question: is the tool available? yes, URL in the paper
- Marcel Martin

you have to take into account overlapping motifs for doing proper statistics
- Marcel Martin

Q. Are the data in Jasper or Transfac? A. Had an expert looking at it.
- Roland Krause

# Jasper and Transfac do not really cover Mycobacterium motifs
- Roland Krause

Q: Performance of the algorithm on short motifs. A. Length 10 is the upper bound for the algorithm which is quite dependent on the length.
- Roland Krause

Q: applying to protein models? A: problematic because alphabet is larger and indels would need to be modelled
- Marcel Martin

Q: how is the iid text model? how do you justify that the text fulfills the model? A: the iid model is estimated from the text. dependencies between characters are incorporated by using the Markovian model
- Marcel Martin

Q. (Marcel Schulz) Differences in Markov models of different orders. A. Shorter orders give spurious results.
- Roland Krause

Q: why only a part of the motif space? A: tried to come up with a plausible set that includes most motifs
- Marcel Martin