Error level (near 0%) for SwissProt looks interesting. People working with protein-protein interaction data claim a 2-9% error rate on manually curated sets, and I would expect the same level from SP. Some things are better than I thought ;).
- Pawel Szczesny
Not too bad, but I think it's even worse than they do. And evidence codes won't fix this. Manual curation and standards for automatic annotation are in dire need of a revolution, and even if we get that it'll still take years to fix.
- Paul J. Davis
@Paul How would you revolutionise it if all the data is still contained in the relatively inaccessible journal article?
- Frank
Trust Pawel to see the half full (or 60% full) part of the glass. I agree with Frank though: it is not feasible to go through NR and fix the annotations manually. We just have to accept that NR, TrEMBL and KEGG are (mostly) over-annotating, and remember that when we rely on them when delving into the protein family level.
- Iddo Friedberg
Thanks Iddo. I lost count of how many times annotation errors came up in my discussion with experimentalists who lack experience with such databases. (Not surprisingly, they usually think these errors are negligible, especially when it comes to THEIR proteins.) Now I'll just send them a link to your post...
- Mickey Kosloff
Trackmate is an open source initiative to create an inexpensive, do-it-yourself tangible tracking system, allowing any computer to recognize tagged objects (and their corresponding position, rotation, and color information) when placed on a surface.
- Andrew Perry
Neil, great post. And you're right, we do make things too complicated sometimes, but do we do that at the level at which we ask questions, or at the software implementation level? My take is the latter, because you need to ask questions the way you want to, but that doesn't mean what makes it all come together has to be one complex mess.
- Deepak Singh
Glad you like it. One of those that bubbled up out of frustration at inability to achieve! I feel that science is the business of turning complex (real-world) things into simple models - and that we've moved away from that idea.
- Neil Saunders
I'm a sucker for this kind of ambitious thinking. Go Neil!
- Bill Hooker
I think it's a good sign that things like this are now obvious. Things start out as a complex mess of disconnected things, overlapping complicated ways of connecting them are devised, then it becomes obvious what the simpler thing to do is.
- Mr. Gunn
Great! But aren't you re-inventing something like RDF, Neil? feature/probe/value is nothing but an RDF statement...
- Pierre Lindenbaum
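To make the comparison with RDF concrete, here is a minimal sketch of feature/probe/value statements held as plain triples, with a wildcard lookup similar in spirit to a triple-pattern query. The `Triple` type, the `match` helper and the example values are hypothetical and made up for illustration; this is not a real triple store and not anything proposed in the post itself.

```python
from collections import namedtuple

# Hypothetical record type: one (feature, probe, value) statement,
# analogous to an RDF (subject, predicate, object) triple.
Triple = namedtuple("Triple", ["feature", "probe", "value"])


def match(triples, feature=None, probe=None):
    """Return the triples matching feature and/or probe (None acts as a wildcard)."""
    return [
        t for t in triples
        if (feature is None or t.feature == feature)
        and (probe is None or t.probe == probe)
    ]


# Made-up illustrative data, not real measurements.
store = [
    Triple("geneA", "expression_array", 7.2),
    Triple("geneA", "gel_band", 1.0),
    Triple("geneB", "expression_array", 3.1),
]

# Everything recorded about geneA, whichever assay it came from.
gene_a = match(store, feature="geneA")
```

The point of the sketch is only that the three-column shape is the same whether it lives in a list, a table or a triple store; the hard part discussed in the thread is agreeing on what the columns mean.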
No, I don't want to reinvent anything. If RDF will work for me, I'll use it. I'll also use SQL, NoSQL, key-value pairs, document-oriented or whatever it takes. I just think that trying to integrate data by combining other peoples large, complex representations is not working. We need to simplify the whole business.
- Neil Saunders
I think there is a middle road here - we need high level generic descriptions like what Neil is proposing (and like my "We have stuff, we do stuff to it, which makes stuff"), but also a way of pointing to more sophisticated information that might be useful in specific contexts. I think we can have the best of both worlds as long as the data representation is separated from the metadata and the organization of each can be described in a machine readable (and agreed!) form
- Cameron Neylon
I'm too old school, leaving comments on blogs... who does that any more. I’m sure you’re aware that you’ve just described a model using *triples*. Which means you could start storing these kinds of simple relationships in a triple store like virtuoso etc. As you say, you don't have to reinvent anything, just simplify the use (conventions) of existing approaches (e.g. RDF). I would like...
- Greg Tyrelle
I like blog comments :-) Yes, my example looks like RDF triples. No, that was not really my intention. Let's ask these questions: (1) what data relationships would make sense to a biologist? (2) what are the commonalities in the data, which a biologist may not have considered at an abstract level? As I wrote in the post, many datasets that look different are really different ways of looking at the same thing.
- Neil Saunders
The joys of data modelling :-) For (1): I'm afraid asking for a definition of some data relationships is building an(other?) ontology.
- Pierre Lindenbaum
Let's put it another way. What we have, presently, are quite complete, often large and complex, but useful and usable descriptions of individual experiment types. "Integration" essentially means "parse them individually and mash-up the results". That's what makes it difficult. Perhaps we need an "ontology of integration" :-) But let's keep it really, really minimal.
- Neil Saunders
I actually think you will struggle to find data commonalities across bioscience. Even the simple proposal of target, measurement, value could break down in many cases e.g. we tried ages ago to get some intensity data from a bunch of microarray experiments and we gave up because we couldn't get across what we needed. What are you really measuring? Does it mean the same thing to different...
- Cameron Neylon
I think there's a good case for storing, in the first instance, raw values. Figure out how to process them later (that's statistics). Focus on trends (up, down, stayed the same). Focus on well-defined variables that do mean the same to everyone (intensity, in theory = amount of transcript, regardless of the very real difficulties). And I think more experiments fall into...
- Neil Saunders
@Pierre freebase is exactly what I had in mind, however the web client (the best part) is not open. @Neil Store the data first, ask questions later. Nice. One of my hopes for semantic web technology was that it could become a universal mashup system (RDF+ontologies+triplestores). But you start down that path, and you suddenly realise that the semweb is asking you to get your data...
- Greg Tyrelle
But for me your example of a gel isn't raw data. The raw data is the image. Which might have several targets or assays on it. Up/down stayed the same is only really of interest in particular types of science. And I challenge you to find any well defined variables :-) Intensity to me is a measure of optical density but questions of background, object size, masking, averaging algorithm...
- Cameron Neylon
But agree with what you and Greg are saying, first thing get the data somewhere, with all the metadata you can automatically collect. Then worry about capturing more metadata as people do stuff with the data. Writing this grant proposal right at the moment.
- Cameron Neylon
And in microarrays, "raw" data is the image of the slide. But aside from a cursory inspection to ensure that it isn't complete rubbish, nobody much cares about that. I'd argue that there's a point in the preprocessing at which a numerical value emerges which could be called "useful" and which encapsulates the object being measured. It needs more work (e.g. normalization) to get information from it, but it's the "value" in feature/reporter/value.
- Neil Saunders
To me this is about finding something a bit like an upper ontology that describes the general category that objects (targets, assay, value, inputs, outputs, data, process, sample) fall into. That lets you do the general integration, and the more detailed local data structures become more useful as you can agree more and more on what details are important. So I absolutely agree with what...
- Cameron Neylon
Heh heh It was exactly that image that we did care about - which was the problem :-) I will admit to being an edge case, but in some ways we're all edge cases, they're just different edges...
- Cameron Neylon
Neil, may I link to this FF thread from Book of Trogool?
- D0r0th34
:-) Sure, different questions, different "levels" of data. I guess my angle is more a statistical one: how do I compare (seemingly) quite different datasets - what numbers can I extract and crunch? Less interested in the capture and description of data at every stage in the process.
- Neil Saunders
Sure, and those are very complete descriptions of experimental components. But what I want is: "I saw A on my gel, B in my LC/MS, C on my expression array and D on my SNP array and when I plug all that into some Bayesian predictor, it says cancer" :-)
- Neil Saunders
Ontologies are not the issue, it's more low level than that. I also work with microarrays, proteomics, metabolomics, and numerous physiological data sets. To keep all the data in one place I use a relational database, in this case postgresql because I like to store raw intensity values in array datatypes, along with pylons based web interfaces to display various views of the data to my...
- Greg Tyrelle
My argument would be that the reason you're less productive is not because of the RDF and ontologies per se, but because the ontologies aren't really built for what we want to do. They're for describing certain types of outcomes, not for integrating data in a discovery phase. But Neil's (entity, probe, value) is still an ontology of sorts. It is just a higher level one. My belief is...
- Cameron Neylon
But keep the discussion going - this is exactly the problem that e.g. the SAGE project will have - http://sagebase.org - and as a notional member of the data working group I could do with all the ideas and help that's out there...
- Cameron Neylon
We are thinking too much in terms of data representation here. In the end what you are looking at is a data warehousing problem. You have different front end systems and you want to be able to pull data in for offline processing into a warehouse. That's pretty much what you do at any company doing a lot of analytics/business intelligence. Different types of data being collected in...
- Deepak Singh
Neil, I was under the impression that normalization across arrays and labs wasn't actually a solved problem, yet. Surely that would have to come first before stripping things down to just assay-key-value?
- Mr. Gunn
Normalization ... aaargh! Most definitely not a solved problem
- Rajarshi Guha
Normalizing within your own experiments is hard enough, never mind across unrelated datasets. It's something we have to solve though, to make the most of public data.
- Neil Saunders
Neil, you may be interested in looking at the Ontology-Based eXtensible Data Model (OBX) that was developed by Richard Scheuermann's group at UT Southwestern. It is being used for the ImmPort database (www.immport.org). The OBX model utilizes the BFO / OBI ontology as guides in creating a data model that is robust to new datatypes. You can see a presentation about it here:...
- Burke Squires
Thanks Burke. ImmPort looks very impressive, I must say.
- Neil Saunders
This reminds me of what the TCGA is starting to do, by defining "data levels". For microarray data, Level 1 might be the raw images, Level 2, the intensity calls, Level 3, the normalized intensities, and Level 4 information on whether it's up or down regulated across multiple samples. For people like me, doing integrative analyses, it's easy to focus just on the higher level data and...
- Chris Miller
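A rough sketch of how the "data levels" idea could be encoded so that downstream integrative code can ask for "Level 3 and up" without caring how each record was produced. The four levels follow the microarray example in the comment above; the `DataLevel` names, the record layout and the file names are assumptions made for illustration, not TCGA's actual definitions or API.

```python
from enum import IntEnum


class DataLevel(IntEnum):
    """Hypothetical names for the four microarray levels described above."""
    RAW_IMAGE = 1          # scanned slide image
    INTENSITY_CALLS = 2    # per-probe intensities
    NORMALIZED = 3         # normalized intensities
    REGULATION_CALL = 4    # up/down regulation across samples


def at_or_above(records, level):
    """Keep only the records whose data level is at least `level`."""
    return [r for r in records if r["level"] >= level]


# Made-up example records; the names are placeholders, not real files.
records = [
    {"name": "slide_scan", "level": DataLevel.RAW_IMAGE},
    {"name": "normalized_table", "level": DataLevel.NORMALIZED},
    {"name": "regulation_summary", "level": DataLevel.REGULATION_CALL},
]

# What an integrative analysis might ask for: Level 3 and higher.
high_level = at_or_above(records, DataLevel.NORMALIZED)
```

Using an ordered enum rather than free-text labels keeps the comparison trivial, which is about all this sketch is meant to show.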
which is exactly why you need separation of the layers and tools to bring data together for the downstream stuff
- Deepak Singh
Neil, I think you have just explained why tab-delimited files are often more useful than complex XML representations of the same data ;-)
- Lars Juhl Jensen
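For a sense of why tab-delimited files are so convenient, here is a minimal sketch that reads one using only the Python standard library. The column names (feature, probe, value) follow the layout discussed in the thread; the data and the `read_tsv` helper are made up for illustration.

```python
import csv
import io


def read_tsv(handle):
    """Parse tab-delimited text with a header row into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


# Made-up example table, standing in for a file on disk.
text = "feature\tprobe\tvalue\ngeneA\tarray\t7.2\ngeneB\tarray\t3.1\n"

rows = read_tsv(io.StringIO(text))

# Values arrive as strings, so the numeric conversion is explicit.
values = {row["feature"]: float(row["value"]) for row in rows}
```

The flip side is that nothing in the file says what "value" means or which units it is in; that has to be agreed somewhere outside the file.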
Tab-delimited files would be grrrreat for me in my lab. If any of the rest of you would like to share our data, however, then you're completely screwed. Is the problem not that we're all duplicating each other's work by writing the same kind of parsers for the same kind of data? Proteomics (for example) has a standard (http://www.ebi.ac.uk/pride/). Is it really so hard to use / develop the community-based tools that are being generated around this standard?!?
- Neil Swainston
Well, the ratio of usable tools to schemas/ontologies is a whole other debate :-) But sure, in principle the tools are there - for individual types of data. What I highlight in the post is the difficulty of genuine data integration, as opposed to the current "write a parser for everything and mash it up" approach.
- Neil Saunders
#1 rule of data integration - if a format exists, it will be used
- Deepak Singh
...and if it doesn't exist there is a 70% chance someone will create it :-)
- Cameron Neylon
Chris M makes an important point wrt data levels, analogous to trace archives vs sequence dbs. Extending the sequence analogy, obsoleting levels will become important (it will rapidly become cheaper to resequence rather than store sequence).
- Chris Cotsapas
BioTorrents allows scientists to rapidly share their results, datasets, and software using the popular BitTorrent file sharing technology.
- Neil Saunders
ok, that's fine, but are there any research organizations that still allow p2p over their networks?
- Christina Pikas
still allowed at UC Davis where Morgan L. (who is in my lab) set up Biotorrents
- Jonathan Eisen
Not sure if my workplace has a policy, but it's not blocked - I've used it for Ubuntu ISOs. Might be a different story if I were downloading TV shows.
- Neil Saunders
Is there advantage / complement over institutional repository? Embarrassed to say I've never used bit torrent.
- Steve Koch
@Christina - I'm pretty sure my institution blocks (or forbids) BitTorrent, although I think they may allow certain IP addresses to be 'whitelisted'. AFAIK this method of distribution was considered for large diffraction datasets in the TARDIS project, but for the reasons above, it was not implemented. Also, BitTorrent works best when there are lots of seeders - for some datasets the...
- Andrew Perry
sorry - my comment reads now as really bitchy. beyond the various copyright issues, some denial of service attacks are done by zombie networks controlled through p2p and hackers can exfiltrate information via p2p so I would suspect more institutions will block p2p in the future. maybe with whitelists or only during working hours or something would work.
- Christina Pikas
@Steve bit torrent is often much faster than an institutional repository. Download the client and then grab the latest Ubuntu ISO. You will be surprised.
- Doug Hershberger
Automated Experimentation, Vol. 1, No. 1. (2009), 3. The means we use to record the process of carrying out research remain tied to the concept of a paginated paper notebook despite the advances over the past decade in web based communication and publication tools. The development of these tools offers an opportunity to re-imagine what the laboratory record would look like if it were re-built in a web-native form. In this paper I describe a distributed approach to the laboratory record which uses the most appropriate tool available to house and publish each specific object created during the research process, whether they be a physical sample, a digital data object, or the record of how one was created from another. I propose that the web-native laboratory record would act as a feed of relationships between these items. This approach can be seen as complementary to, rather than competitive with, integrative approaches that aim to aggregate relevant objects together to describe...
- Neil Saunders
Come to supercomputing. My whole talk is on this subject ;). Trying to figure out if I should start writing about it before or after
- Deepak Singh
I understand the points in the comments about using flat files for better speed. However compared to when I used to use miscellaneous scripts and data files to do my research I find that using a 'Ruby on Rails' type of database-backed approach is much better for me because of the shorter development time and how much easier the code is to maintain.
- Michael Barton
@Deepak I did consider mentioning Hadoop/NoSQL (see last paragraph of earlier draft http://bit.ly/ztjS9) as it's obvious to discuss these types of approaches when dealing with very large datasets. However I think these tools do still require a fair amount of work to maintain and use compared with a more standard kind of MySQL approach. I say that because I tried using map/reduce across the university cluster and had quite a few teething problems.
- Michael Barton
More generally, "any DB + any ORM" is A Good Thing. I can see why people stick with (My)SQL. It's tried and tested. I find a lot of the newer developments interesting, exciting, fun - but often, "too agile" for real work. Libraries change too fast, documentation (if any) goes out of date, code moves to new repositories, in the space of 3 weeks.
- Neil Saunders
@Neil I originally tried using DataMapper instead of ActiveRecord but so many Rails centric libraries assume ORM == ActiveRecord. This meant using DataMapper precluded the use of the factory_girl and shoulda libraries which I have come to find very useful. I think Rails needs to be truly ORM agnostic and that the current changes in Rails 3.0 don't go far enough to address this.
- Michael Barton
I agree. I really like DataMapper (and other ORMs - sequel, mongomapper), but using them with Rails components = ugly, not fully-functional hacks, as things stand. Be interesting to see how the new ActiveSupport looks. I'm even considering abandoning Rails for now and just plugging together components myself as required (e.g. ramaze/sinatra if web frontend required).
- Neil Saunders
Michael, I am not talking just about Hadoop/NoSQL, but the fundamental challenges of operating at high scale. How you handle disk failures, node failures, approaches to managing that data, etc. The rules change once you are working in the multi TB range (and when I talk Big Data I mean several TBs).
- Deepak Singh
This looks like a great talk! Some of the slides are the black box of death, e.g. "rRNA Tree of Life" and (unsurprisingly) the QT stuff. At least on my setup -- have only tried it on a MacBook w/ firefox3 & safari4. I'm especially curious about the empty "Gene family number" vs "Genome Number" plots: is this intentional? I imagine my Pfam office mates will be very curious about these. ;-)
- Paul Gardner
Amazingly - I would have thought slideshare could handle apple/mac PPT formatting - but NOOOOOOOOOOO - I am reuploading a PDF. As for the gene family # vs genome # - watch this talk at scivee http://www.scivee.tv/node... - it will make more sense there than in the slides
- Jonathan Eisen
Great thanks. You're a star! Will watch tonight after the Missus has gone to bed.
- Paul Gardner
By "in lab" sequencing projects, do you also mean labs doing the assembly/annotation in house using reads sequenced elsewhere?
- Rob Syme
Sure - if that is fee for sequencing at commercial entity as well as university sequencing cores (or just illumina machine in someone's lab I guess). Really just a gut feeling estimate. Something it would be interesting to quantify as time goes on. We need both streams of data I think - my concern is the place that will check quality and aggregation of that data when it comes from so many sources becomes GenBank/EMBL?
- Jason Stajich
An experiment in screencasting an entire software project from start to launch. "This video series will document the development of a moderately complex program from start to finish. Everything from planning, creating class diagrams, going over the code, and creating a .deb file for Ubuntu/Debian distribution will be covered. Key libraries used will be wxPython, OpenGL, and sqlite. Most software projects fail. Will this one fail too?! Watch this series as it is published to find out!"
- Chris Lasher