STRING actually comes close to this concept of a multi-species database. I assume they can just add genomes to STRING and get all the annotations and predictions transferred to them (it could be a bit more open ;). One problem would be how to combine the current single-species databases (e.g. SGD) with the multi-species database so that information added to the model species by curators or from large datasets is accessible to other species. This multi-species database could also serve as a platform. Say, for example, that someone develops an excellent predictor for some feature: it would be great to be able to plug it into the database and let people use it. That would create a little market for predictors that could be used by anyone else. Something similar to what Cytoscape is doing for network-level analysis, but genome-centric. (I am ignoring computation time costs :)
- Pedro Beltrao
Pedro, that's the key. The database must become a service, otherwise you're just limiting yourself
- Deepak Singh
The issue of computation time is precisely the problem. To do proper transfer of predictions you need to detect orthology, which requires complete genome comparisons to be done. In other words, to have a web service that transfers annotation to one gene in one genome, that service would need as input the entire genome that the gene is a part of. After that, one would need to spend hours if not days of computation running BLAST searches. This is why we make heavy use of precomputed results in STRING.
- Lars Juhl Jensen
Concerning the possibility of plugging predictors into genome browsers, this is pretty much the idea of the Distributed Annotation System (DAS). It allows you to set up any predictor that you make with an interface that enables people to view its predictions within, for example, the Ensembl genome browser. Still, it obviously requires that your prediction service is fast enough to be able to handle the load.
- Lars Juhl Jensen
Lars, what are the bottlenecks outside of computational performance of the prediction service? Are there any you could point me to?
- Deepak Singh
It's clear at this point that there won't be funding for a database for every organism (or even for groups of closely related species). How can we recreate what MODs have done on a smaller scale? First, we need simple, out-of-the-box tools that allow even the smallest labs to easily parse, analyze, and present/host common data types. Bio DBs need to implement web services, as Deepak points out, and we need standard frameworks for presentation. And as Lars notes, we need precomputation to unify resources.
- Todd Harris
Deepak, do you mean bottlenecks with respect to making a web service or bottlenecks related to integration of multiple genomes in general?
- Lars Juhl Jensen
The bottlenecks depend largely on how the data are distributed. If all the data relevant to a given service are already gathered on a single server, then I see few bottlenecks. But if the data are distributed, then I would expect major trouble in terms of network latency and in some cases bandwidth. I will give a couple of examples as separate comments (due to FriendFeed length restrictions).
- Lars Juhl Jensen
Protein interaction data: Imagine that the interaction databases (BioGRID, IntAct, MINT, DIP, etc.) each made their interactions available via web services. To get the relevant set of interactions, I would now have to query each of these databases with the gene of interest as well as with the orthologous genes in every other genome. This can easily amount to thousands of web service requests just to fetch one kind of data.
- Lars Juhl Jensen
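To make the request count in the comment above concrete, here is a back-of-the-envelope sketch. The four database names come from the comment; the figure of 500 other genomes, the one-ortholog-per-genome simplification, and the function name `requests_needed` are assumptions chosen purely for illustration.

```python
# Rough count of web service requests needed to collect interactions
# for one gene and its orthologs. All numbers below are illustrative.
databases = ["BioGRID", "IntAct", "MINT", "DIP"]
other_genomes = 500  # assumption: one ortholog in each of 500 other genomes


def requests_needed(n_databases: int, n_orthologs: int) -> int:
    """One request per database for the gene itself, plus one per ortholog."""
    return n_databases * (1 + n_orthologs)


total = requests_needed(len(databases), other_genomes)
print(total)  # 4 * (1 + 500) = 2004
```

Under these assumptions a single gene already needs about two thousand requests, which is in line with the "thousands" estimate above.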
Text mining: Unless the corpus of text to be mined resides on the same server as the web service that performs the text mining, you would have to send all of Medline across the network to do a query. Even if it resides locally, you need precomputed results to get decent speed - at least in the form of an index.
- Lars Juhl Jensen
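The "index" mentioned in the comment above can be illustrated with a toy inverted index. This is only a sketch of the general idea, not how any particular text-mining service works; the three abstracts are made-up placeholder text, and `build_index` and `search` are hypothetical helper names.

```python
from collections import defaultdict


def build_index(docs):
    """Precompute a mapping from each token to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token.strip(".,;:()")].add(doc_id)
    return index


def search(index, *terms):
    """Return the documents that contain all of the given terms."""
    sets = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*sets) if sets else set()


# Toy stand-ins for abstracts (not real Medline records).
abstracts = {
    "doc1": "Cdc28 phosphorylates Sic1 in yeast.",
    "doc2": "Sic1 inhibits Cdc28 kinase activity.",
    "doc3": "Actin filaments in yeast.",
}

index = build_index(abstracts)
print(sorted(search(index, "Cdc28", "Sic1")))  # ['doc1', 'doc2']
print(sorted(search(index, "yeast")))          # ['doc1', 'doc3']
```

The index is built once, and each query then touches only the sets for its own terms rather than rescanning the corpus, which is the point about precomputation.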
Got it. That's a pretty clear example. But if you had the appropriate infrastructure and the ability to scale it, in theory, you could achieve that, correct? The datasets and algorithms are large, but not prohibitively large.
- Deepak Singh
I completely agree on the index bit. Would be nice to have something spidering the life science web
- Deepak Singh
Orthology detection: This requires that all genomes are stored (or at least cached) on a single server, as one would otherwise have to send entire genomes around for every web service request. Moreover, precomputation is clearly needed to get a response time that is not on the scale of hours or days. Just for the record, I have burned on the order of 60 CPU years on precomputing alignments for the upcoming version 8 of STRING.
- Lars Juhl Jensen
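One way to see why orthology detection needs whole genomes is the reciprocal-best-hit heuristic, sketched below. This is a common simple approach and is used here only as an illustration; the thread does not say which method STRING uses. The gene names and similarity scores are invented, and `best_hits` and `reciprocal_best_hits` are hypothetical helper names.

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring hit in the other genome."""
    best = {}
    for (query, hit), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (hit, score)
    return {query: hit for query, (hit, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return gene pairs that are each other's best hit in both directions."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Invented similarity scores standing in for all-against-all search results.
a_vs_b = {("a1", "b1"): 200.0, ("a1", "b2"): 80.0,
          ("a2", "b2"): 150.0, ("a3", "b2"): 90.0}
b_vs_a = {("b1", "a1"): 195.0, ("b2", "a2"): 150.0, ("b2", "a3"): 95.0}

print(reciprocal_best_hits(a_vs_b, b_vs_a))  # [('a1', 'b1'), ('a2', 'b2')]
```

Gene a3 is rejected only because b2 has a better match elsewhere in genome A, so the answer for a single gene depends on the scores for every gene in both genomes.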
Lars ... we should talk about this offline. I have a million questions
- Deepak Singh
I think the trick is to not try to make the services too "atomic". One would need some sort of caching/precomputing meta-services to make this scale. For example, one could imagine having an orthology metaserver that gathers the sequences from genome database web services and automatically updates the precomputed results when the genome annotation changes. Similarly, a meta-service could gather all the relevant interaction data, assign quality scores, and transfer them by orthology.
- Lars Juhl Jensen
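The caching meta-service idea above could look roughly like the following sketch: results are stored per genome together with the annotation version they were computed from, and recomputed only when that version changes. The class `OrthologyCache`, the version strings, and the stubbed `fake_orthology` function are all hypothetical; a real service would run the expensive genome comparison in place of the stub.

```python
class OrthologyCache:
    """Recompute a genome's results only when its annotation version changes."""

    def __init__(self, compute):
        self._compute = compute   # expensive step, e.g. all-against-all alignment
        self._results = {}        # genome -> (annotation_version, result)
        self.recomputations = 0   # counter, kept only to show the cache working

    def get(self, genome, annotation_version):
        cached = self._results.get(genome)
        if cached is not None and cached[0] == annotation_version:
            return cached[1]      # annotation unchanged: serve precomputed result
        result = self._compute(genome, annotation_version)
        self._results[genome] = (annotation_version, result)
        self.recomputations += 1
        return result


def fake_orthology(genome, version):
    """Stub standing in for hours or days of real computation."""
    return {"genome": genome, "version": version}


cache = OrthologyCache(fake_orthology)
cache.get("yeast", "v1")      # first request: computed
cache.get("yeast", "v1")      # same annotation: served from cache
cache.get("yeast", "v2")      # annotation changed: recomputed
print(cache.recomputations)   # 2
```

Keying the cache on the annotation version is one simple way to get the "automatically updates when the annotation changes" behaviour; a production service would more likely precompute in the background than on first request.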
Caching/pre-computation will be critical if you want performance. I can't see it working any other way, even if you have all the iron in the world
- Deepak Singh
Computation is handled locally. For example, a lab predicts ortholog/paralog assignments for their genome against genomes of interest. Unified IDs, web services, and generic presentation allow layering of data via mashup.
- Todd Harris