Error level (near 0%) for SwissProt looks interesting. People working with protein-protein interaction data claim a 2-9% error rate on manually curated sets, and I would expect the same level from SP. Some things are better than I thought ;).
- Pawel Szczesny
Not too bad, but I think it's even worse than they do. And evidence codes won't fix this. Manual curation and standards for automatic annotation are in dire need of a revolution, and even if we get that it'll still take years to fix.
- Paul J. Davis
@Paul How would you revolutionise it if all the data is still contained in the relatively inaccessible journal article?
- Frank
Trust Pawel to see the half full (or 60% full) part of the glass. I agree with Frank though: it is not feasible to go through NR and fix the annotations manually. We just have to accept that NR, TrEMBL and KEGG are (mostly) over-annotating, and remember that when we rely on them when delving into the protein family level.
- Iddo Friedberg
Thanks Iddo. I lost count of how many times annotation errors came up in my discussion with experimentalists who lack experience with such databases. (Not surprisingly, they usually think these errors are negligible, especially when it comes to THEIR proteins.) Now I'll just send them a link to your post...
- Mickey Kosloff
I appreciate the irony: a review of a Creative Commons licensed book, with a strong open data slant, behind a paywall. It's Nature's business model, and that's their choice, but it's a bit of a pity. I'll self-archive it when the time comes around.
- Michael Nielsen
I was surprised about the footnote - is it really 2011? How many "notes added in proof" will this mean?
- Daniel Mietchen
Daniel - The book doesn't focus on ephemera, insofar as it's possible to avoid, for the reason you mention. A few of the examples will date, but I expect the general argument to still be essentially correct decades from now. It's a book that's meant to help change hearts and minds through a powerful general argument, not TechCrunch for scientists, if that makes sense.
- Michael Nielsen
Nice review, looking forward to your book Michael
- Frank
The review now appears to be out from behind the paywall. It wasn't when I posted last night. I don't know why the change has happened.
- Michael Nielsen
When did you drop the working title "The Future of Science"? This is the first time I've seen "Reinventing Discovery" and a publication timeline...
- Bill Hooker
I changed the title a couple of months back - I like this one more, although it's still a working title. The timeline is tentative.
- Michael Nielsen
"Human geneticists have reached a private crisis of conscience, and it will become public knowledge in 2010. The crisis has depressing health implications and alarming political ones. In a nutshell: the new genetics will reveal much less than hoped about how to cure disease, and much more than feared about human evolution and inequality, including genetic differences between classes, ethnicities and races."
- Itachi
See rebuttal by Luke Jostins at The Sanger Institute "The Economist has a rather distressingly bad article by the evolutionary psychologist Geoffrey Miller, about the supposed general failure in human disease genetics over the last 5 years...." http://www.genetic-inference.co.uk/blog...
- Duncan Hull
(via pedrobeltrao) A proposed author ID system is gaining widespread support, and could help lay the foundation for an academic-reward system less heavily tied to publications and citations.
- Pierre Lindenbaum
Great start! Pullquote: "In his classic book Management Teams, UK psychologist Meredith Belbin used extensive empirical evidence to argue that effective teams require members who can cover nine key roles. These roles range from the creative 'plants' who generate novel ideas, to the disciplined 'implementers' who turn plans into action and the big-picture 'coordinators' who keep everyone...
- Björn Brembs
The dude's gettin' it: "This kind of 'microattribution' could ultimately make it possible for each researcher to have a constantly updated 'digital curriculum vitae' providing a picture of his or her contributions to science going far beyond the simple publication list."
- Björn Brembs
Fantastic article! Finally: "But perhaps the largest challenge will be cultural. Whether ORCID or some other author ID system becomes the accepted standard, the new metrics made possible will need to be taken seriously by everyone involved in the academic-reward system — funding agencies, university administrations, and promotion and tenure committees. Every role in science should be recognized and rewarded, not just those that produce high-profile publications."
- Björn Brembs
“Pdfs paralyse the proper and efficient use of scientific knowledge. By burying information in static, unconnected journal articles, scientists waste countless hours either repeating experiments that they didn't know had been performed before, or worse, trying to verify facts that they didn't know had been shown to be false.”
- Mr. Gunn
this is definitely worth a play. it's got some neat features, it's backed with great compsci and, from what i saw on launch night, there will be more good things to happen in future updates
- Matthew Llewellin
OT - For the record, I started off as a New Romantic, drifted into Old Schematics and ended up being a New Semanticism-ist.
- Graham Steel
"A proposed author ID system is gaining widespread support, and could help lay the foundation for an academic-reward system less heavily tied to publications and citations."
- Pedro Beltrao
This looks closer to reality than ever. 23 organizations supporting the idea with plans to have working code within 6 months based on Thomson Reuters' ResearcherID.
- Pedro Beltrao
Had the pleasure of visiting the exhibit a couple of months ago. Highly recommend the Wellcome Collection, if you find yourself in Euston with some time to kill.
- Neil Saunders
Thanks, Neil, that sounds like a great idea!
- Cesar Sanchez
"Don't rely on your host or anyone else to back up your important data. Do it yourself. If you aren't personally responsible for your own backups, they are effectively not happening."
- Duncan Hull
OA mandate mania - "Dramatic Growth of Open Access: December 11, 2009 early year-end edition" http://poeticeconomics.blogspot.com/2009... "2009: a great year for OA!" w00t !!
"One thing to look forward to in 2010 is a very strong probably that the world's largest scholarly journal will be an open access journal, the award-winning PLoS One. OA mandate mania is sure to continue ~ not only are discussions underway at many a university and department, but now that we have a substantial and growing set of role models the benefits of an OA mandate ~ such as increased impact and web presence ~ will become that much more obvious, inspiring yet more mandates".
- Graham Steel
RT dgmacarthur Wellcome Trust Sanger Institute employees are eugenicists who will burn in hell: http://thatthebonesyouhavecrus... On the 'Net, so true.
BMC Systems Biology, Vol. 3, No. 1. (2009), 109. BACKGROUND: DAS is a widely adopted protocol for providing syntactic interoperability among biological databases. The popularity of DAS is due to a simplified and elegant mechanism for data exchange that consists of sources exposing their RESTful interfaces for data access. As a growing number of DAS services are available for molecular biology resources, there is an incentive to explore this protocol in order to advance data discovery and integration among these resources. RESULTS: We developed DASMiner, a Matlab toolkit for querying DAS data sources that enables creation of integrated biological models using the information available in DAS-compliant repositories. DASMiner is composed by a browser application and an API that work together to facilitate gathering of data from different DAS sources, which can be used for creating enriched datasets from multiple sources. The browser is used to formulate queries and navigate data contained in...
- Daniel Swan
Nice article. Not sure about "Exaflood"?! Add that to "the metaphors of doom used to describe the phenomenal pace of data acquisition (from data floods, deluges, surging oceans and tsunamis, to icebergs, avalanches, earthquakes and explosions)"... http://pubmed.gov/19929850
- Duncan Hull
Now I have scenes from a disaster movie in my head, as scientists race to save the world from the data deluge...
- Neil Saunders
FF doesn't meet all your requirements but it does seem to work well compared to the specialized services - at least in some fields
- Jean-Claude Bradley
Well I guess that's not surprising given my biases - at some level I'm more interested in what people think I've missed than my own prejudices though. FWIW I think a clever combination of DropBox, FriendFeed and some of the elements from StackOverflow, with perhaps a bit of the coordination ability of posterous would go very close to the mark. Still need better network and filter management tools though - somehow they need more configurability but less configuration...
- Cameron Neylon
OpenWetWare is looking to make a major overhaul in the next couple months, and has a bit over 1 year of funding left. I feel like this is an opportunity to at least try to do some of the things that most people think are necessary for SS4S. Not perfect, but better so that we'd have a better idea of what is really needed. I think the time frame (now; already funded) makes "not perfect" a...
- Steve Koch
I really like what you said in point 10. It's something that I've seen far too many scientists being cavalier about. Federation, open protocols and specifications, along with open source, are very important to science.
- Christopher Granade
Might be worth seeing how far sourceforge meets your criteria. Certainly it's totally based around objects, i.e. software projects, and there are lots of high quality open source science projects whose code is hosted there. Although it has community/social networking tools I've personally never really used these and most visits I've had to sf have either been fleeting (to download...
- Dan Hagon
Steve, absolutely we need to keep evolving with the resources available. OWW is a great place to do that.
- Cameron Neylon
Dan, there was a conversation around using Github in a similar way some months ago and I think these things have a lot of potential as a back end. I think federation is important enough that you'd want to use a DVCS rather than SVN as a back end though.
- Cameron Neylon
Sourceforge has several DVCS options in addition to svn these days. Although github is great I would be wary of anything that requires scientists to learn the intricacies of git. hg and bzr are much more friendly to non-developer types that don't need the full flexibility of git. I've had some success using them to collaboratively author LaTeX documents.
- Matt Leifer
Matt, ok, I'm behind the times (nothing new there!). The intricacies are less of an issue as this would only be a back end. No SS4S that any significant proportion of scientists use is going to look _anything_ like a code repository. To start with your average scientist is never going to touch a command line. If you're dealing in Latex you're already talking about a minority I'm afraid....
- Cameron Neylon
There are several wikis that use DVCS as a backend. This could be a starting point for developing the type of thing you are interested in.
- Matt Leifer
LaTeX isn't the minority in whole areas of math, CS, physics....I guess that brings up the same old complaint: "science" is defined as all biomed, all the time. I'll try to come up with some more substantive comments though
- Christina Pikas
Christina, didn't mean to say it should be excluded just that a non-command line system is non-negotiable so most online VCS aren't going to be good enough as a front end. Support for Word, Excel, video, images, XML and Latex are all non-negotiable characteristics of any such system.
- Cameron Neylon
Matt, not sure that a wiki is the right starting point - the document model doesn't seem right to me, although I'm way behind on the most recent developments in Wikis so I may be out of date on that as well. What is in my head is a DVCS back end with APIs providing access from e.g. document authoring systems, databases, publishers, whatever. A feed system that looks a bit like friendfeed...
- Cameron Neylon
I wasn't suggesting actually using one of the wikis, just that they have already done a reasonable job of abstracting the version control functionality (in fact, some of them support more than one DVCS in this way) so there may be some things in the codebase that are useful. It is also an example of taking a command-line DVCS and giving it a more user friendly interface. In addition, if...
- Matt Leifer
Ah good to know - which do you think are the best examples of these wikis? I should take a look. In any case at this stage I'm just throwing ideas out. Have no resource to actually build anything at the moment.
- Cameron Neylon
Is there actually a need for social software for scientists? Or should scientists use and customize the existing social networking tools (FriendFeed, Facebook, LinkedIn, etc.)?
- Martin Fenner
I'm beginning to think the main issue will be that business models for consumer services are incompatible with what researchers need. So yes, customise might be better than build but if we have to go down that route we may as well have a good idea of what's required. One person's customisation is another person's build.
- Cameron Neylon
I'd be curious what you think of HubZero, Cameron.
- D0r0th34
Depends a bit on server setup. For Mercurial I like Hatta, but it requires persistent python processes, i.e. no good for most shared hosts that only allow CGI. There is a list of RCS backed wikis here: http://hatta.sheep.art.pl/Similar projects
- Matt Leifer
Cameron, I love and absolutely agree with the necessity of "scientific objects". If you lack those, then (as Martin points out) just use the general purpose sites. In that principle, I think there are some viable networks -- DVCS systems around scientific code, Mendeley around scientific publications, (eventually our BioGPS around genes). But I think we should be developing specific networks appealing to specific groups of researchers, rather than trying to serve the needs of all scientists...
- Andrew Su
Andrew agreed, but if these are federated then they can all still talk to each other. I'm thinking more framework than site or single service. Ideally all of these things can be plugged in or wired up together...my concern with general purpose sites is primarily that they don't provide the level of trust and stability that we would expect for "research enterprise"
- Cameron Neylon
Just one comment. There are protocols out there that allow different social networks to talk to each other. There are protocols out there that allow web resources to talk to each other. It's not really that hard if everyone supports some basic standards. RESTful APIs, OAuth, OpenID/Facebook Connect/Friend Connect, etc. IMO what's more important is that any sites we design have the...
- Deepak Singh
@D only really had a chance to have a quick look. First impressions are that it is very slick but looks as though everything has to be on the inside - I don't see much mention of pulling stuff in and out. The multimedia talks are nice but why not pull them in from e.g. slideshare to pick an example.
- Cameron Neylon
completely agreed, federation through standards...
- Andrew Su
Twitter is far from perfect, but look at the infrastructure that has evolved around it (e.g. 3rd party apps, services). You don't get that kind of traction around a social networking site just for scientists. Imagine what email or the WWW would look like if there were separate versions just for scientists.
- Martin Fenner
Absolutely but that actually means we can build something better, and as long as it hooks into Twitter (RSS/OAuth...Deepak's list basically) we get all the benefits and all of the functionality we want - as well as a way of drawing people in. Assuming this framework is any good of course. Imagine PubMed if it had been built for the consumer web (actually maybe not such a good example...
- Cameron Neylon
Sort of responding to Deepak a few comments earlier. Something like a social network is useful for at least one reason: recruiting scientists who aren't ready for open science, or cannot communicate openly for one reason or another. So, a reasonably secure way of making data private and shared with a limited network is a good thing, I think. I think ultimately that will lead to much more open science (my own lab started out with a private wiki before doing ONS)...
- Steve Koch
Steve, but does it have to be a social network per se, or a site for say sequencing geeks (I am looking at you SeqAnswers) with the appropriate features built in? Social networks don't have to be all in the open. Facebook is a social network. 90% of my communication on there is private and you should see how much of my Twitter usage is DMs
- Deepak Singh
Deepak, I think I was just using terminology incorrectly. I was assuming Facebook = social networking.
- Steve Koch
I think a bigger reason is they don't want people seeing errors - every experiment has some kind of "error" - especially an unmeasured quantity that turns out to be important in the end
- Jean-Claude Bradley
As I said on twitter, really think this is just an instance of the control imperative. Scientists are for the most part very uncomfortable with relinquishing direct control over anything that affects them - they want to make every decision as far as possible and the idea that for a framework to scale you have to give up control over things seems very alien. But I too would like to see some data and analysis on the issue - a fair bit of high profile data misinterpretation going on at the moment.
- Cameron Neylon
scientists' concerns can be addressed by advancing methods and best practices, and by describing case studies to provide confidence. accuracy is an inherent purpose of knowledge sharing, it is a testament to the high level of skill that it is usually accomplished
- Mike Chelen
Well, the imperative to 'show your work' can make people very uncomfortable. I see this all the time in the Open Source world, where developers sit on their code polishing it instead of releasing.
- Michael R. Bernstein
Mike, I'm with Michael on this one - in principle what you say is true, in practice it involves a huge change in attitude, psychology, and culture. You know that slightly nauseous feeling you get when pressing the button to submit a paper or grant? It's the loss of control - the worry that someone will find something you haven't - or tell someone else you're an idiot that leads to that....
- Cameron Neylon
I'm with Cam, Jean-Claude and Michael on this -- scientists are terrified of being "found out" in an error, or of having someone see something they missed. This is a fundamental issue: ideally, what we want is a culture that values willingness to take those risks as much as (or maybe more than) it values demonstrations of brilliance. It's the emphasis on the latter (science as a career is basically a lifetime of "look how smart I am") that leads to the fears that block cooperation.
- Bill Hooker
Sarah, great question. I haven't found any research quantifying how often misinterpretation (or scooping, or error-finding) actually happens when scientists share their data, but it would be fascinating and important.
- Heather Piwowar
Can anyone post more details, links, information? Actually, I find it quite amusing that some people might find 'the (scientific) interpretation' more important than 'the (raw) data'. Should we start listing the examples in which people have published wrong interpretations, and how long it took for others to reproduce the data and come up with the 'right' conclusion? To me it looks more like keeping people busy with reproducing data, even if not needed? What a waste in such resource-limited times!
- joergkurtwegner
In astronomy data are re-used all the time - scientists often re-use archival data, although there is a certain etiquette to doing this. It's led to new discoveries years after the data were taken, e.g. reprocessing of Hubble images from 1998 led to the retroactive discovery of an exoplanet, see this paper by Lafreniere et al '09 on arxiv http://arxiv.org/abs/0902.3247. Many publicly...
- Sarah Kendrew
It's unfortunate that you can't delete an uploaded file when editing. OK, this is the version with correct counts for each date - as opposed to the previous, which used the count for the first occurrence of the (multiple, same) date in the file (oops!)
- Neil Saunders
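The counting fix described in the comment above (using every row for a repeated date rather than only the first occurrence) can be sketched roughly as follows. This is a minimal illustration only: the `(date, count)` row layout, the function names and the sample values are assumptions, since the original script is not shown in the thread.

```python
from collections import defaultdict


def first_occurrence_counts(rows):
    """Earlier behaviour: keep only the count from the first row seen per date."""
    counts = {}
    for date, count in rows:
        counts.setdefault(date, count)  # later rows for the same date are ignored
    return counts


def total_counts(rows):
    """Corrected behaviour: sum the counts over every row that shares a date."""
    counts = defaultdict(int)
    for date, count in rows:
        counts[date] += count
    return dict(counts)


# Hypothetical sample data: the same date appears more than once in the file.
rows = [("2009-12-01", 3), ("2009-12-01", 5), ("2009-12-02", 2)]
print(first_occurrence_counts(rows))
print(total_counts(rows))
```

With repeated dates the two functions disagree, which would be enough to change any correlation computed from the per-date counts.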
hmm, the nice correlations disappeared :(
- Rajarshi Guha
Alas. Beautiful hypotheses slain by ugly facts :-)
- Neil Saunders
minor quibble here. Is it just me, or do others think that three-color heat maps are overused? It makes sense when showing bi-directional patterns (e.g., up- and down-regulation). But here, the middle color doesn't really have any meaning, does it? </quibble>
- Andrew Su
It's a work in progress :-) It's unclear to me which white is "middle" and which white is "no data". But I guess I'd argue that middle = neither high nor low.
- Neil Saunders
Is there any way to also track how many people have been subscribed? On an aggregate level, it doesn't look like the community is dying or growing, but if it were comments or likes per user -- would that be a better metric?
- Benjamin Tseng
Per user stats would be interesting. I'm not sure if the API returns "date user joined" - will look into that.
- Neil Saunders
what do you think of this? "By contrast, early bioinformatics work was almost invariably founded on biological concepts from the onset. A biological issue was raised and then a technique to address that issue was presented. That is, theoretical biology was the foundation on which [early] bioinformatics was built. I fear this is being lost in the mass-data and technology-hype driven bioinformatics. It seems to me that unless companies and research groups are careful many will waste time and money “stamp collecting and cataloging”. Certainly the organized data is useful, but only if it is applied with biological principles."
- Attila Csordas
I like that post a lot. I don't know that data organisation is a waste of time/money but I agree that it isn't biological science.
- Neil Saunders
I think one downside of the 'data deluge', is that it's easy to get divorced from (biological) reality and treat the data as numbers. But of course, this can be said of any field where there is a large amount of data.
- Rajarshi Guha
It could also be considered a challenge. How do you get meaning from all that data?
- Deepak Singh
But that's the point - just playing with the numbers doesn't necessarily get you to meaning (though it might). The biological context is vital - in absence of that, what is the value of any conclusions?
- Rajarshi Guha
Rajarshi, not disagreeing at all. But to be able to reach those conclusions you need to learn how to manage that data (in a meaningful biological context), which in itself is a challenge
- Deepak Singh
Then there's the "Chris Anderson" school, which says that we just store the numbers, release the robots and science will emerge :-)
- Neil Saunders
I wonder if the robots know what they're looking at?
- Deepak Singh
So was Carl Linnaeus a stamp collector doing data-driven science or a "real" biologist?
- Duncan Hull
Duncan, both :), but you can think of his stamp collection as a way of managing his data. Those days you could wear both hats
- Deepak Singh
I would be happier if this post targeted a certain, perhaps quite large, class of bioinformaticians, rather than bioinformatics. My interest in becoming a computational biologist has always been driven by my fascination with theoretical biology. My leisure reading is dominated by popular expositions of biology. I read textbooks such as _Evolution_, _Ecology_, and _Brock Biology of...
- Ruchira S. Datta
A number of comments in this thread do not reflect what was written in my article. I would encourage people to ask me at my blog site if you have concerns or are confused. I did not write "data organisation is a waste of time/money" and I *certainly* did not "tar all bioinformaticians". (I do not "tar" people; implying that I did offends me.) Ruchira: I would encourage you to re-read the article, you have it the wrong way around on several points. Best wishes to everyone!
- Grant Jacobs
Ruchira, I accept your apology, but I'm not so much "offended" as concerned at being misrepresented. This article has been fairly widely read and, bad practice that it is, many people on the internet unfortunately judge an article through commentary on it, esp. if the article is long. I've added a comment to the article on the blog to try draw attention to some points that may confuse readers, esp. as some are reading this in the present day, not from the time it was written.
- Grant Jacobs
Grant: it's really good to see you here, every time I cite somebody living here I have the hidden motive to draw the author into this community, get her/him involved, make him/her part of a 'community of equals'. The paragraph cited is itself a good source to initiate a standalone conversation here without referring to everything in your article. That said, I encourage people to continue this conversation here. :)
- Attila Csordas
A few weeks ago I had the chance to present the PLoS Article Level Metrics program to audiences at both Berkeley and UCSF (via a simulcast). The organisers allowed me to devote a full hour to our program, and as a result this is the ...
Absolutely: production + consumption, equally important, equally challenging. For me, the practical challenge is how best to do both. I know from experience that working alone on both aspects doesn't work too well sometimes - inevitably you gravitate to one or the other. Somehow we need to build teams, where all contributions are valued and data flows smoothly from production through storage to consumption.
- Neil Saunders
At the rate of production, the variety of data types, and all the other crap you have to deal with, IMO it's just not possible, even if you have the skills and knowledge. Has to be a team effort and yes, you have to value the "operational aspect"
- Deepak Singh
Agreed with Jason. I like RDF and have played with it out of interest, but not run into lots of available data in my day to day work.
- Brad Chapman
Fair enough-- I've been sitting on gigabytes of RDF for awhile, and using it in my commercial business (just because it's the early days); I'd be happy to post a few (possibly) interesting bits to you guys to try out. My only request is that we openly work together on it to do some demos possibly. Cheers!
- Eric Neumann
To Eric and everyone who has "liked" this, please explain what this means. I understand the concept of RDF. I do not understand what you want Bio* projects to do with it.
- Chris Lasher
Chris. If the public data were formatted/normalized using RDF, most parsers, SQL queries used to cross the data between two or more databases would be useless. bio2rdf is a good example: http://bio2rdf.org
- Pierre Lindenbaum
Pierre, I interpret your comment as meaning "using RDF, there would be no need to develop nor maintain data and query parsers", which I pretty much agree with. I would hope that would free most informaticists to spend more time writing analytics and views, and less on bit swapping- IMHO...
- Eric Neumann
@Eric , yes that's what I tried to say with my poor English skills ;-)
- Pierre Lindenbaum
I "like" most mentions of RDF + bioinformatics, it tends to get lively :-) I interpret the post to mean "there is more out there than we realise". If so, as the first 2 comments said, let's see it! I hear a lot of talk, I don't see much in the way of usable data or tools.
- Neil Saunders
++ to jason, brad and neil's comments. bio2rdf is a good start, but why do we even need RDF? BioMart is another warehouse that seems to work nicely on good old SQL. I think when people see a tool of comparable utility that depends on RDF they will take notice.
- Chris Mungall
Chris- I actually wasn't referring to bio2rdf- as you said it's a nice start, but not key for most informaticists; alternatively, I have converted GEO, mutdb, and pathway data into rdf. Was hoping you guys would find this useful to try and investigate?
- Eric Neumann
I'd be interested in checking it out - but I've already paid the RDF tax. What the average bioinformatics hacker would benefit from would be a paper showing off some of this stuff with some clear non-waffly reasons why rdf works better (or not better) than sql, json, plain xml, couchdb, etc.
- Chris Mungall
A demo for RDF I'd like to see: extracting a list of genes from an RDF datastore for a given organism in a given region (chr:start-end). demo n°2: those genes must be the descendants of a GO term.
- Pierre Lindenbaum
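To make the two requested demos concrete, here is a minimal, library-free Python sketch over a handful of (subject, predicate, object) triples. Everything in it is an assumption made for illustration: the identifiers, the predicate names and the `is_a` hierarchy are invented, it does not use a real RDF store or SPARQL, and "annotated with the GO term or one of its descendants" is only one reading of demo n°2.

```python
# Hypothetical triples (subject, predicate, object); all identifiers are made up.
TRIPLES = [
    ("gene:A", "organism", "taxon:9606"), ("gene:A", "chromosome", "chr1"),
    ("gene:A", "start", 1000), ("gene:A", "end", 2000), ("gene:A", "go_term", "GO:child"),
    ("gene:B", "organism", "taxon:9606"), ("gene:B", "chromosome", "chr1"),
    ("gene:B", "start", 5000), ("gene:B", "end", 6000), ("gene:B", "go_term", "GO:other"),
    ("gene:C", "organism", "taxon:9606"), ("gene:C", "chromosome", "chr1"),
    ("gene:C", "start", 1500), ("gene:C", "end", 2500), ("gene:C", "go_term", "GO:other"),
    ("gene:D", "organism", "taxon:10090"), ("gene:D", "chromosome", "chr1"),
    ("gene:D", "start", 1200), ("gene:D", "end", 1800), ("gene:D", "go_term", "GO:child"),
    ("GO:child", "is_a", "GO:parent"),
]


def objects(subject, predicate):
    """Return every object stored for a given subject and predicate."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def descendants(term):
    """Return `term` plus every term that reaches it through is_a edges."""
    found, frontier = {term}, [term]
    while frontier:
        current = frontier.pop()
        for s, p, o in TRIPLES:
            if p == "is_a" and o == current and s not in found:
                found.add(s)
                frontier.append(s)
    return found


def genes_in_region(organism, chrom, start, end, go_root=None):
    """Demo 1: genes of an organism overlapping chrom:start-end.
    Demo 2 (when go_root is given): keep only genes annotated with go_root
    or one of its descendants."""
    allowed = descendants(go_root) if go_root else None
    hits = []
    for gene in sorted({s for s, p, o in TRIPLES if p == "organism" and o == organism}):
        if chrom not in objects(gene, "chromosome"):
            continue
        g_start, g_end = objects(gene, "start")[0], objects(gene, "end")[0]
        if g_end < start or g_start > end:  # no overlap with the query region
            continue
        if allowed is not None and not allowed.intersection(objects(gene, "go_term")):
            continue
        hits.append(gene)
    return hits


print(genes_in_region("taxon:9606", "chr1", 500, 3000))
print(genes_in_region("taxon:9606", "chr1", 500, 3000, go_root="GO:parent"))
```

In a real triple store the same two questions would more likely be expressed as queries against the store itself; the sketch only shows the shape of the filtering involved.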
"RDF tax" is a funny expression; do you mean RDF-XML or RDF in general? Most of my projects don't bother with RDF-XML; they're either RDF-N3 or RDF-JSON... remember, RDF is not a syntax, it's a data source and type binding semantic! As I said initially, I think most folks have missed what RDF is really capable of (and where it saves time). I will post my data shortly...
- Eric Neumann
I think we miss what it's capable of because we don't see demonstrations. My impression with many data formats, ontologies etc. is that once designed, their creators say "OK, we've made you a nice format, you all have XML/JSON/whatever parsers, get to work!" So here's my plea: when you create the format, create a small tool that demonstrates its worth, make it public and advertise it.
- Neil Saunders
Should direct this here in this thread also: How many of you know about/use MIT's Exhibit?
- Eric Neumann
Now, that's what I'm talking about. I didn't know about/use it before - but now I do/will (or might). Thanks!
- Neil Saunders
I'd be more than happy to share all my examples and code extensions of it to all of you... some simple examples under "demo" at http://eneumann.org/
- Eric Neumann
btw, just look at the source HTML of these examples, and that's all there is to creating your own facet RDF viewing app ("steal this code")
- Eric Neumann
OK, spent a little time playing with Exhibit and I am now officially really impressed. Giving me lots of ideas. But I'm wondering how large a JSON file can be? Hundreds, thousands, millions of items?
- Neil Saunders