Nature, Vol. 462, No. 7273. (03 December 2009), pp. 656-659. Estimates of the total number of bacterial species1, 2, 3 indicate that existing DNA sequence databases carry only a tiny fraction of the total amount of DNA sequence space represented by this division of life. Indeed, environmental DNA samples have been shown to encode many previously unknown classes of proteins4 and RNAs5. Bioinformatics searches6, 7, 8, 9, 10 of genomic DNA from bacteria commonly identify new noncoding RNAs (ncRNAs)10, 11, 12 such as riboswitches13, 14. In rare instances, RNAs that exhibit more extensive sequence and structural conservation across a wide range of bacteria are encountered15, 16. Given that large structured RNAs are known to carry out complex biochemical functions such as protein synthesis and RNA processing reactions, identifying more RNAs of great size and intricate structure is likely to reveal additional biochemical functions that can be achieved by RNA. We applied an updated...
- Neil Saunders
A Global Proteomic Analysis of the Insoluble, Soluble and Supernatant Fractions of the Psychrophilic Archaeon Methanococcoides burtonii Part II: The Effect of Different Methylated Growth Substrates. - http://www.citeulike.org/user...
Journal of proteome research (1 December 2009) Methanococcoides burtonii is a cold-adapted methanogenic archaeon from Ace Lake in Antarctica. Methanol and methylamines are the only substrates it can use for carbon and energy. We carried out quantitative proteomics using iTRAQ of M. burtonii cells grown on different substrates (methanol in defined media or trimethylamine in complex media), using techniques that enriched for secreted and membrane proteins in addition to cytoplasmic proteins. By integrating proteomic data with the complete, manually annotated genome sequence of M. burtonii, we were able to gain new insight into methylotrophic metabolism, and the effects of methanol on the cell. Metabolic processing of methanol and methylamines is initiated by methyltransferases specific for each substrate, with multiple paralogs for each of the methyltransferases (similar to other members of the Methanosarcinaceae). In M. burtonii, most methyltransferases appear to have distinct roles in...
- Neil Saunders
@citeulike I don't see bioinformatics/computational biology under Biology/Mathematics or any of your profile research field categories?
Neil, great post. And you're right, we do make things too complicated sometimes, but do we do that at the level at which we ask questions, or at the software implementation level? My take is the latter, cause you need to ask questions the way you want to, but that doesn't mean what makes it all come together has to be one complex mess
- Deepak Singh
Glad you like it. One of those that bubbled up out of frustration at inability to achieve! I feel that science is the business of turning complex (real-world) things into simple models - and that we've moved away from that idea.
- Neil Saunders
I'm a sucker for this kind of ambitious thinking. Go Neil!
- Bill Hooker
I think it's a good sign that things like this are now obvious. Things start out as a complex mess of disconnected things, overlapping complicated ways of connecting them are devised, then it becomes obvious what the simpler thing to do is.
- Mr. Gunn
Great! But aren't you re-inventing something like RDF, Neil? feature/probe/value is nothing but an RDF statement...
- Pierre Lindenbaum
No, I don't want to reinvent anything. If RDF will work for me, I'll use it. I'll also use SQL, NoSQL, key-value pairs, document-oriented or whatever it takes. I just think that trying to integrate data by combining other people's large, complex representations is not working. We need to simplify the whole business.
- Neil Saunders
I think there is a middle road here - we need high level generic descriptions like what Neil is proposing (and like my "We have stuff, we do stuff to it, which makes stuff"), but also a way of pointing to more sophisticated information that might be useful in specific contexts. I think we can have the best of both worlds as long as the data representation is separated from the metadata and the organization of each can be described in a machine readable (and agreed!) form
- Cameron Neylon
I'm too old school, leaving comments on blogs... who does that any more. I’m sure you’re aware that you’ve just described a model using *triples*. Which means you could start storing these kinds of simple relationships in a triple store like virtuoso etc. As you say, you don't have to reinvent anything, just simplify the use (conventions) of existing approaches (e.g. RDF). I would like...
- Greg Tyrelle
I like blog comments :-) Yes, my example looks like RDF triples. No, that was not really my intention. Let's ask these questions: (1) what data relationships would make sense to a biologist? (2) what are the commonalities in the data, which a biologist may not have considered at an abstract level? As I wrote in the post, many datasets that look different are really different ways of looking at the same thing.
- Neil Saunders
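To make the feature/probe/value idea concrete, here is a minimal Python sketch. It is not an RDF implementation or anyone's actual schema; the feature names, probe names and values are invented for illustration, and `index_by_feature` is a hypothetical helper.

```python
from collections import defaultdict

# Hypothetical records in (feature, probe, value) form.
# All names and numbers here are invented for illustration.
triples = [
    ("geneA", "expression_array", 8.2),
    ("geneA", "gel_band", 1.0),
    ("geneB", "expression_array", 3.1),
]


def index_by_feature(rows):
    """Group (probe, value) pairs under each feature, keeping input order."""
    index = defaultdict(list)
    for feature, probe, value in rows:
        index[feature].append((probe, value))
    return dict(index)


by_feature = index_by_feature(triples)
print(by_feature["geneA"])
```

Once measurements from different experiment types share this one shape, pulling together everything known about a feature is a single dictionary lookup.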
The joys of data modelling :-) For (1): I'm afraid asking for a definition of some data relationships is building an(other?) ontology.
- Pierre Lindenbaum
Let's put it another way. What we have, presently, are quite complete, often large and complex, but useful and usable descriptions of individual experiment types. "Integration" essentially means "parse them individually and mash-up the results". That's what makes it difficult. Perhaps we need an "ontology of integration" :-) But let's keep it really, really minimal.
- Neil Saunders
I actually think you will struggle to find data commonalities across bioscience. Even the simple proposal of target, measurement, value could break down in many cases e.g. we tried ages ago to get some intensity data from a bunch of microarray experiments and we gave up because we couldn't get across what we needed. What are you really measuring? Does it mean the same thing to different...
- Cameron Neylon
I think there's a good case for storing, in the first instance, raw values. Figure out how to process them later (that's statistics). Focus on trends (up, down, stayed the same). Focus on well-defined variables that do mean the same to everyone (intensity, in theory = amount of transcript, regardless of the very real difficulties). And I think more experiments fall into...
- Neil Saunders
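One way to read "store raw values, focus on trends" is sketched below in Python. This is an assumption about what such a step might look like, not a method described in the thread: the `trend` function and its `tolerance` cutoff (in log2 units) are hypothetical.

```python
import math


def trend(before, after, tolerance=1.0):
    """Classify a change in a raw value as 'up', 'down' or 'same'.

    The comparison uses the log2 ratio of the two values; `tolerance`
    is a hypothetical cutoff, not a recommended threshold.
    """
    if before <= 0 or after <= 0:
        raise ValueError("raw values must be positive")
    change = math.log2(after / before)
    if change >= tolerance:
        return "up"
    if change <= -tolerance:
        return "down"
    return "same"


print(trend(100, 400), trend(400, 100), trend(100, 120))
```

Keeping the raw values means the cutoff can be revisited later without touching the stored data.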
@Pierre freebase is exactly what I had in mind, however the web client (the best part) is not open. @Neil Store the data first, ask questions later. Nice. One of my hopes for semantic web technology was that it could become a universal mashup system (RDF+ontologies+triplestores). But you start down that path, and you suddenly realise that the semweb is asking you to get your data...
- Greg Tyrelle
But for me your example of a gel isn't raw data. The raw data is the image. Which might have several targets or assays on it. Up/down stayed the same is only really of interest in particular types of science. And I challenge you to find any well defined variables :-) Intensity to me is a measure of optical density but questions of background, object size, masking, averaging algorithm...
- Cameron Neylon
But agree with what you and Greg are saying: first thing, get the data somewhere, with all the metadata you can automatically collect. Then worry about capturing more metadata as people do stuff with the data. Writing this grant proposal right at the moment.
- Cameron Neylon
And in microarrays, "raw" data is the image of the slide. But aside from a cursory inspection to ensure that it isn't complete rubbish, nobody much cares about that. I'd argue that there's a point in the preprocessing at which a numerical value emerges which could be called "useful" and which encapsulates the object being measured. It needs more work (e.g. normalization) to get information from it, but it's the "value" in feature/reporter/value.
- Neil Saunders
To me this is about finding something a bit like an upper ontology that describes the general category that objects (targets, assay, value, inputs, outputs, data, process, sample) fall into. That lets you do the general integration, and the more detailed local data structures become more useful as you can agree more and more on what details are important. So I absolutely agree with what...
- Cameron Neylon
Heh heh It was exactly that image that we did care about - which was the problem :-) I will admit to being an edge case, but in some ways we're all edge cases, they're just different edges...
- Cameron Neylon
Neil, may I link to this FF thread from Book of Trogool?
- D0r0th34
:-) Sure, different questions, different "levels" of data. I guess my angle is more a statistical one: how do I compare (seemingly) quite different datasets - what numbers can I extract and crunch? Less interested in the capture and description of data at every stage in the process.
- Neil Saunders
Sure, and those are very complete descriptions of experimental components. But what I want is: "I saw A on my gel, B in my LC/MS, C on my expression array and D on my SNP array and when I plug all that into some Bayesian predictor, it says cancer" :-)
- Neil Saunders
Ontologies are not the issue, it's more low level than that. I also work with microarrays, proteomics, metabolomics, and numerous physiological data sets. To keep all the data in one place I use a relational database, in this case postgresql because I like to store raw intensity values in array datatypes, along with pylons based web interfaces to display various views of the data to my...
- Greg Tyrelle
My argument would be that the reason you're less productive is not because of the RDF and ontologies per se, but because the ontologies aren't really built for what we want to do. They're for describing certain types of outcomes, not for integrating data in a discovery phase. But Neil's (entity, probe, value) is still an ontology of sorts. It is just a higher level one. My belief is...
- Cameron Neylon
But keep the discussion going - this is exactly the problem that e.g. the SAGE project will have - http://sagebase.org - and as a notional member of the data working group I could do with all the ideas and help that's out there...
- Cameron Neylon
We are thinking too much in terms of data representation here. In the end what you are looking at is a data warehousing problem. You have different front end systems and you want to be able to pull data in for offline processing into a warehouse. That's pretty much what you do at any company doing a lot of analytics/business intelligence. Different types of data being collected in...
- Deepak Singh
Neil, I was under the impression that normalization across arrays and labs wasn't actually a solved problem, yet. Surely that would have to come first before stripping things down to just assay-key-value?
- Mr. Gunn
Normalization ... aaargh! Most definitely not a solved problem
- Rajarshi Guha
Normalizing within your own experiments is hard enough, never mind across unrelated datasets. It's something we have to solve though, to make the most of public data.
- Neil Saunders
CSIRO scientists have created ‘rogue waves’ more than 20 metres high and smashed them into virtual oil and gas production platforms to compare different mooring designs.
- Neil Saunders
@digitalbio Thanks, but I know how to search. The context (high proportion of reviews) is in the FriendFeed thread.
BMC Bioinformatics, Vol. 10, No. 1. (2009), 394. BACKGROUND:Experts in peptide:MHC binding studies are often able to estimate the impact of a single residue substitution based on a heuristic understanding of amino acid similarity in an experimental context. Our aim is to quantify this measure of similarity to improve peptide:MHC binding prediction methods. This should help compensate for holes and bias in the sequence space coverage of existing peptide binding datasets. RESULTS:Here, a novel amino acid similarity matrix (PMBEC) is directly derived from the binding affinity data of combinatorial peptide mixtures. Like BLOSUM62, this matrix captures well-known physicochemical properties of amino acid residues. However, PMBEC differs markedly from existing matrices in cases where residue substitution involves a reversal of electrostatic charge. To demonstrate its usefulness, we have developed a new peptide:MHC class I binding prediction method, using the matrix as a Bayesian prior. We...
- Neil Saunders
Microbiology and molecular biology reviews : MMBR, Vol. 73, No. 4. (December 2009), pp. 775-808. Type IV secretion systems (T4SS) translocate DNA and protein substrates across prokaryotic cell envelopes generally by a mechanism requiring direct contact with a target cell. Three types of T4SS have been described: (i) conjugation systems, operationally defined as machines that translocate DNA substrates intercellularly by a contact-dependent process; (ii) effector translocator systems, functioning to deliver proteins or other macromolecules to eukaryotic target cells; and (iii) DNA release/uptake systems, which translocate DNA to or from the extracellular milieu. Studies of a few paradigmatic systems, notably the conjugation systems of plasmids F, R388, RP4, and pKM101 and the Agrobacterium tumefaciens VirB/VirD4 system, have supplied important insights into the structure, function, and mechanism of action of type IV secretion machines. Information on these systems is updated, with...
- Neil Saunders
Wondering if the most effective way to contribute to science might be to install a bunch of BOINC projects and go home ;-)
HDF is nice because scaling up is relatively straightforward, haven't tried BerkeleyDB for comparison though
- Mike Chelen
@Mike I just read the doc of HDF5. I understand it is a good choice for storing structured data but it isn't clear to me how it can be used for querying the data. e.g. find a genotype=f(sample, snp)
- Pierre Lindenbaum
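A possible answer to the genotype=f(sample, snp) question, sketched in plain Python rather than with the HDF5 API: array stores address cells by position, so one common pattern is to keep label-to-index maps next to the matrix. The nested list below stands in for an on-disk dataset, and every label, genotype call and function name is hypothetical.

```python
# Hypothetical labels and genotype calls, invented for illustration.
samples = ["s1", "s2"]
snps = ["rs1", "rs2", "rs3"]

# Rows follow `samples`, columns follow `snps`.
# This nested list stands in for a 2-D dataset in an array store.
genotypes = [
    ["AA", "AG", "GG"],  # s1
    ["AG", "AG", "AA"],  # s2
]

# Label-to-position maps stored alongside the matrix.
sample_index = {name: i for i, name in enumerate(samples)}
snp_index = {name: j for j, name in enumerate(snps)}


def genotype(sample, snp):
    """Return the call stored at (sample row, snp column)."""
    return genotypes[sample_index[sample]][snp_index[snp]]


def samples_with(snp, call):
    """Scan one SNP column and return the samples carrying `call`."""
    col = snp_index[snp]
    return [s for s in samples if genotypes[sample_index[s]][col] == call]


print(genotype("s2", "rs3"))
print(samples_with("rs2", "AG"))
```

Point lookups by label are cheap this way; anything closer to an ad hoc query (the second function) is a column scan, which is the trade-off against a relational database.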
BT boss has only broadband home
British Telecom (BT) has admitted its chairman is the only person in a village on the Oxfordshire-Buckinghamshire border with broadband. - http://neilfws.tumblr.com/post...
A new Division, CSIRO Astronomy and Space Science (CASS), has been formed today bringing together CSIRO's radio astronomy capabilities (the Australia Telescope National Facility), NASA Operations (including the Canberra Deep Space Communication Complex), CSIRO Space Sciences and Technology; and the CSIRO Boeing Advisor.
- Neil Saunders
Space station research soaring to new heights
Station residents now spend an average of about 40 crew hours per week on research, up from less than 15 hours before the expansion to a six-person crew in May. - http://neilfws.tumblr.com/post...
Among the highlights are a gruesome account of a 17th Century blood transfusion and the article in which Sir Isaac showed that white light is a mixture of other colours.
The Royal Society puts historic papers online - http://neilfws.tumblr.com/post...