Sign in or Join FriendFeed
FriendFeed is the easiest way to share online. Learn more »

Marcel Martin › Likes

ISMB/ECCB
PT43: Tobias Marschall - Efficient Exact Motif Discovery
Unsupervised motif discovery on a string with no previous knowledege in an automated fashion. - Roland Krause
issues: How to measure over-representation and how to find them. - Roland Krause
184 publications in pubmed for "motif discovery algorithm" - Roland Krause
aim to establish an (almost) exact method based on a rigorous motif statistics - Mikhail Spivakov
given: query text, IUPAC motifs, random text model (background) - for now, iid - Mikhail Spivakov
Calculating a p-value of a given query text, a IUPAC motif and a random text model. - Roland Krause
want: a p-value for a motif - Mikhail Spivakov
use a novel device called probabilistic arithmetic automata - won't go into details - Mikhail Spivakov
Need to compute the distribution of occurrences by chance. Not a straight forward task, recently proposed a new approach by building a probabilistic arithmetic automata. - Roland Krause
an exact calculation - Mikhail Spivakov
The problem is that computing p-values is infeasible due to large number of motifs. - Roland Krause
matches occur in clumps. use compound possion approximation (almost exact) to calculate exact distribution of clump sizes. approximate number of clumps by Poisson distribution - Mikhail Spivakov
Use of a Compound Poisson Approximation on a set of clumps (sets of overlapping motifs) - Roland Krause
clump = overlapping occurences - Marcel Martin
The clump size can be used the probabilistic atutomata. - Roland Krause
nice: clumP SIze is abbreviated with an uppercase psi - Marcel Martin
How to bound the p-value to prune the search space? - Roland Krause
p-value: the probability of observing >k occurences (when found k in the real data) - Mikhail Spivakov
bound no. of occurences using the no. of clumps - Marcel Martin
Motifs with the same composition have the same expectation. - Roland Krause
iterate over all possible compositions, not over the motifs themselves to take advantage of the same expectation - Marcel Martin
but iid model for DNA isn't very appropriate - Mikhail Spivakov
Use of a suffix tree of the sequence, iterate over the motifs, use the lower bound for pruning, walk the tree and identify overrepresented motifs. - Roland Krause
so re-evaluate the motifs producing a good p-value with iid on a Markovian text model - Mikhail Spivakov
designing a good benchmark set is hard - Marcel Martin
other tools: Weeder, MEME - Marcel Martin
Benchmark sets are not easy, used a set by Sandve et al. http://www.pubmedcentral.nih.gov/article... - Roland Krause
Outperforms Weeder and MEME at the cost of higher running time of ~12 hours. - Roland Krause
algorithm is not as fast as the other tools. is easily parallelizable - Marcel Martin
with a 4.4 Mbp genome of M.tuberculosis, found motifs in ~250 CPU hrs in a parallelized setting - Mikhail Spivakov
best motif: AGACSCARAA (or sth like that), found in literature - Marcel Martin
computationally demanding, but possible with modern computers - Mikhail Spivakov
List of models, the first is described, others are under investigation. - Roland Krause
in the future: use modern hardware (eg GPUs) - Mikhail Spivakov
optimise wrt Markovian models directly (rather than iid) - Mikhail Spivakov
Future work could incorporate Markovian models directly or use phylogenetic information. - Roland Krause
Q. Is the implementation available? A: Given in the paper. - Roland Krause
question: is the tool available? yes, URL in the paper - Marcel Martin
you have to take into account overlapping motifs for doing proper statistics - Marcel Martin
Q. Are the data in Jasper or Transfac? A. Had an expert looking at it. - Roland Krause
# Jasper and Transfac do not really cover Mycobacterium motifs - Roland Krause
Q: Performance of the algorithm on short motifs. A. Length 10 is the upper bound for the algorithm which is quite dependent on the length. - Roland Krause
Q: applying to protein models? A: problematic because alphabet is larger and indels would need to be modelled - Marcel Martin
Q: how is the iid text model? how do you justify that the text fulfills the model? A: the iid model is estimated from the text. dependencies between characters are incorporated by using the Markovian model - Marcel Martin
Q. (Marcel Schulz) Differences in Markov models of different orders. A. Shorter orders give spurious results. - Roland Krause
Q: why only a part of the motif space? A: tried to come up with a plausible set that includes most motifs - Marcel Martin
Other ways to read this feed:Feed readerFacebook