
Paul J. Davis › Comments

Iddo Friedberg
Gene and protein annotation: it’s worse than you thought #bioinformatics - http://bytesizebio.net/index...
I was thinking "nah, it's probably about as bad as I thought" - but you're right! - Neil Saunders
Error level (near 0%) for SwissProt looks interesting. People working with protein-protein interaction data claim a 2-9% error rate on manually curated sets, and I would expect the same level from SP. Some things are better than I thought ;). - Pawel Szczesny
Not too bad, but I think it's even worse than they do. And evidence codes won't fix this. Manual curation and standards for automatic annotation are in dire need of a revolution, and even if we get that it'll still take years to fix. - Paul J. Davis
@Paul How would you revolutionise it if all the data is still contained in the relatively inaccessible journal article? - Frank
Trust Pawel to see the half full (or 60% full) part of the glass. I agree with Frank though: it is not feasible to go through NR and fix the annotations manually. We just have to accept that NR, TrEMBL and KEGG are (mostly) over-annotating, and remember that when we rely on them when delving into the protein family level. - Iddo Friedberg
Deepak Singh
When “good enough” just doesn’t cut it - http://mndoci.com/2009...
Is this a symptom of writing software for publication and then moving on? - Michael Barton
It's a symptom surely of what the measured endpoint is - getting the data out for the paper - not producing something that has real utility. It's the good enough _for what_ bit that is the problem here surely? - Cameron Neylon
It's all of those things. Worth remembering that in many cases, academic researchers who write software are not professional software developers. I think in the past and to some extent now, many people would answer "yes, good enough is just fine". I'm encouraged to see a new generation of computational biologists who clearly have been trained in software development and care about things like re-usability, reproducibility, testing, version control, distribution and so on. It is getting better. - Neil Saunders
That's fair - constant surprise to me that I seem to know more about software development best practice than the academic researchers I talk to. I blame Greg Wilson of course...need to get Software Carpentry or similar course made compulsory for all science undergraduates :-) - Cameron Neylon
And to those who don't get why this is important: we should spell out the cost (both financial and in time) to a research project, every time a new person starts on a project and has to clean up the mess of files and code left by their predecessors. I've seen this time and time again. - Neil Saunders
Or write it into the grant conditions. That spells it out pretty clearly... - Cameron Neylon
Neil, I agree with you. I think that's going to change, as more and more software developers enter the life sciences, folks who care about maintenance, quality, etc. But the PIs are still a problem. Of course, this is not just academic research though. I've seen it in companies and perhaps that's the difference between someone who stays middle of the road and someone (someone could be an entity) who excels - Deepak Singh
couple of off topic things: I think your RSS is not working, or you might have changed it. It doesn't show up on my GReader. Also your Fork Me link to GitHub is not pointing to your account. - Paulo Nuin
Paulo, the feed seems to be OK at this end. and yep, do need to fix that Fork Me link. Thanks - Deepak Singh
Survival of the fittest will show how good things are. In the (free) open source world quality/time_to_invest will show, and for commercial world the quality/price will do the same. - joergkurtwegner
Joerg, I think that's beginning to happen, especially with open source alternatives pushing purchasing behavior. Plus expectations have changed. No one is going to use an internal search engine with a several millisecond response time, when you are used to Google - Deepak Singh
Perhaps I'm the pessimist, but if all scientific software were merely 'good enough' I'd be in heaven. Good enough would at least imply that it compiles/runs/etc. - Paul J. Davis
I think "good enough" in software is favored when an individual needs to get something done and faces limitations in terms of time or financial resources in accomplishing the task. Within those constraints, "good enough" is the best way of making progress rather than waiting 'til someone writes the best possible code. It shouldn't remain that way, but if it's cutting-edge research, a clear market demand may not have been established as an incentive for someone to create a particular piece of software. - Jill O'Neill
Jill, I've seen enough evidence where that's not the case. PIs tend to lose interest when they have papers published, or if a grad student or postdoc leaves. In the case of commercial entities, it's a cultural thing. Constraints can lead to phenomenal code. - Deepak Singh
Isn't it similar to evolutionary selection, with academics having a set of "pressures" different from those needed to develop #1-type software? Once the paper gets published, there is no pressure for the researcher to improve the code, and things remain "good enough". While in a commercial setting there is always strong pressure from the side of the customer/competition, which drives the development further. I.e. to solve the problem one needs to bring some kind of pressure element to the academic setting. - Yaroslav Nikolaev
Deepak Singh
Zachary Voase’s Blog — Bioinformatics and the Semantic Web - http://blog.zacharyvoase.com/post...
Also check out his other post: http://blog.zacharyvoase.com/post... Awesome awesome stuff. Also, I'm more than a little relieved to find out I'm not crazy. Or at least that I have company in crazy town depending on the interpretation. - Paul J. Davis
Nice post, but I don't think it would sell me on using RDF. It's not like you can't agree on syntax and meaning (or even use ontologies) without RDF. Better to show how RDF helps deal with the fact that not all databases are ever going to agree on a single set of non-overlapping concepts for describing their data, and -- more important -- with more fundamental disagreements (such as what an organism even is). - Eric Jain
Jan Aerts
If anyone has a #google #wave invite to spare: think of me... Thanks.
i'll also join the line, if somebody has an invite to spare :) - Endre Sebestyen
Perhaps we need to make a combined list? :-) - Cameron Neylon from twhirl
I know @dgmacarthur hopes to get lucky as well. - Jan Aerts from Android
Drat. Me too! Nominate me! And I'd nominate you too if I get one XD - RK
Setting the barrier low: firstname.lastname google email address :-) - Jan Aerts from Android
Me too! - Benjamin Tseng
I'd actually do something with it if I had an account :P I guess I do have a thesis defense coming up in 2 months though... - Brian Krueger - LabSpaces
I want one too!! - Alejandro Montenegro
maybe we should collect the existing science related wave plugins/robots or how do you call it, and some new ideas. and start to develop them after receiving our invites :) - Endre Sebestyen
Just a reminder there is a Google Group for aggregating ideas and code for Wave in Research: http://groups.google.com/group... and a Doodle Poll for a date for a UK meetup: http://www.doodle.com/partici... - Cameron Neylon
@Jan... I get a Mail Delivery reply on firstname.lastname@gmail.com ... :) - Egon Willighagen
@egon I was afraid the address would already be taken :-) - Jan Aerts
Anybody here got an invite already? Still no invite for me :( - RK
Nope... Website says "Google account not yet activated for Google Wave" - Jan Aerts
As of 8am GMT not seeing any further information in my inbox - will update as soon as I know anything - Cameron Neylon
Not spotted anything here either... - Egon Willighagen
Looks like we'll just have to be patient... It'll come when it'll come. - Jan Aerts
How do I invite? - Björn Brembs
Just read somewhere that invites will be available at 4pm BST. Didn't check sources, though. - Jan Aerts
100,000 invites sounds like a lot but worldwide soon thins out :( - Anthony Underwood
@Jan if you get the invite and you happen to have some invites as well, consider me please - george
Word on the street is that only existing wave users will get invites. The new invitees won't get any. - Chris Miller
Chris: But they did post that about the three groups of people that'll receive invites. *sigh* - RK
Me, too! I'll offer a drawing in return! - Kamilah Gill
Cameron: Where did you read that? If it's true then there's hope after all :D - RK
Various things going around on Twitter and in wave suggesting that the release will be at 9am PST which would be 4pm UK time I guess but I would give that the status of rumour rather than fact. Haven't seen any convincing info that anyone has a new invite yet. - Cameron Neylon from twhirl
Cameron: Yea. Saw it on Brizzly too. Anybody got an invite? Should be out now :) - RK
FWIW - sounds like invites won't be out until this evening, so that the Sydney-based googlers will be awake to troubleshoot. http://twitter.com/twephan... - Chris Miller
Yep - all information I have suggests tomorrow morning Sydney time. Even if we assume they are in at 6am that's still some hours away yet. I'm off to bed me-self. - Cameron Neylon from twhirl
Wave invites starting to roll out over the next few hours: http://twitter.com/twephan... - Chris Miller
The "request invitation" link has reappeared at wave.google.com. I wonder if I should fill it in again? - Neil Saunders
yes please - Thomas Power
I'm told that once you even get an invite it can take a couple days for the activation email to arrive. I wonder what email would be like now if they had required invites back in the day... - Paul J. Davis
So you people have invites now? - RK
I don't have one yet. - Carl Fulp
Got 4 more wave invites, pls send me your email. - Khader Shameer
That's all, my invites just ran out. - Khader Shameer
No invites yet either.:( - RK
RK, can I have your email ID please ? - Khader Shameer
I had 12 invites on Friday: http://friendfeed.com/the-lif... - Björn Brembs
Paul J. Davis
Mozilla's Raindrop has been released http://labs.mozilla.com/raindro... - "Raindrop's mission: make it enjoyable to participate in conversations from people you care about, whether the conversations are in email, on twitter, a friend's blog or as part of a social networking site."
Zing! "We aren’t trying to invent new protocols or build new messaging systems, rather focusing on building a product that lets users get a handle on the systems we already use." - Paul J. Davis
Interesting... but the download page is blank? - Colby
Ah this message just showed up "Download Raindrop There is no official download yet. The Raindrop code is still under development but you can follow along via the code repository. Please see the Hacking page." - Colby
$ hg clone http://hg.mozilla.org/labs... or use one of the links at the top of http://hg.mozilla.org/labs... to get a tarball or zip. If you have issues installing CouchDB ping us in #couchdb on irc.freenode.net - Paul J. Davis
Sounds like a better friendfeed to me! - Mr. Gunn
Michael Habib
Fwd: Facebook for scientists gets millions in funding - http://www.sfgate.com/cgi-bin... (via http://friendfeed.com/habib...) Congratulations to Cornell/Florida/Vivo on their NCRR grant: "The University of Florida, Cornell University and a handful of other schools have been awarded $12.2...
Here's a link to UF's coverage of the event: http://news.ufl.edu/2009... -- I'm curious, though about this: "The new program will draw information about scientists from official, verifiable sources and make it available using a type of technology called the Semantic Web. For example, information about researchers’ positions will come from their employers and a listing of... - Mickey Schafer
How is this different from Biomed Experts, SciLink, etc? After seeing the failure of a dozen of these sites, I'm skeptical of the premise that there's real demand for them. You can build all the semantic infrastructure you want, but if people aren't going to use it, then it's a waste. - Chris Miller
Kind of what I was thinking, too, Chris. But the UF blurb does not address these concerns, so hard to know at this point. Maybe I'll send a message to Sarah Gonzalez tomorrow (one of the UF ref librarians who jump-started the idea) and see if she'll fill me in. - Mickey Schafer
It really hurts to see money be wasted like this on a platform that doesn't really address the issues plaguing these types of sites that already exist. I think someone needs to be given 12 million to figure out how to get scientists to actually use the technology! (Or code tools we'd like to use ;) ) - Brian Krueger - LabSpaces
Brian: what are the differences between this system under development and tools that might be considered ideal? - Mike Chelen
Mickey: scientists may be more likely to get involved for those reasons if they result in an effective operation. it is exciting to hear import and export of standard formats being given a priority, yet it may be longer before anyone sees if the process is functional - Mike Chelen
Chris: anytime someone mentions "facebook for ____ " it seems a little vague and hard to understand what might differentiate the service :D - Mike Chelen
Reading the press release, it doesn't sound like this platform is going to be any different from biomedexperts. I'm not sure there is an "ideal" system. It's going to be hard to offer every discipline the proper tools and content that will drive users and spawn collaboration. Having worked on my own site for the last 3 years, I've heard many scientists say the last thing they want to do... - Brian Krueger - LabSpaces
The question that has to be answered is what is the compelling reason for scientists to trust the people they encounter on "facebook for scientists". Non science social networking is low risk... - Richard Badge from Nambu
"The goal of the program is national networking of all scientists," said Michael Conlon, interim director of biomedical informatics for the University of Florida, in a statement. "Scientists have problems finding each other. We often find that researchers have pretty good networks with students or with scientists at institutions where they received their degree or worked before. But... more... - Attila Csordas
Previous discussion here: http://friendfeed.com/cameron... - Mr. Gunn
I have the same response to hearing this that I imagine many of you would reading a grant proposal that proposes to do an experiment that others have already done and which didn't work, and the results of which aren't cited in the new proposal. They need to address how they're going to work in the face of all these past failures. If their branding strategy is any indication, I'm not sure they're aware of the past failures. - Mr. Gunn
Mr Gunn nailed it. Where is the strategy for succeeding where so many have failed? - Bill Hooker
Mr Gunn +3 saving role against hype. - Paul J. Davis
I would like to point out that the Facebook for Science line is journalists trying to market this to the public rather than the investigators trying to address this group's concerns. I think that phrasing needs to be taken with a grain of salt. That doesn't mean the other criticisms aren't legitimate. I just think it is important to evaluate the project on its own merits rather than public mass market branding of it. - Michael Habib
One point on how it is different from some other projects. It is NIH funded. I am not aware of any other solutions with such a mandate from the NIH. Second, it is a huge amount of money to devote to the problem. Neither of these differences directly addresses the concerns expressed, but they are both factors that give this project an edge in potentially addressing the issues. - Michael Habib
"The University of Florida, Cornell University and a handful of other schools" any people here from those schools funded or know the people funded and can invite them? Would love to hear their angle - Attila Csordas
I'll be doing a post doc at UF. I think I'll contact the head there and see if they need any help :) - Brian Krueger - LabSpaces
I am at UF. I have met Sarah G. (one of the initiating reference libs) while doing a guest lecture in her class. But I don't know the other people. We could just forward this discussion to one of the contacts usually listed. - Mickey Schafer
Michael Habib, I agree with your observation that "facebook for scientists" is journalist-speak. And in terms of explaining things to the UF community, it is a good analogy as my students constantly and consistently categorize social networks as either twitter or facebook. - Mickey Schafer
I forwarded it to Mike Conlon at UF. He said he'd take a look at this discussion and also for more information said we should read the RFA http://grants.nih.gov/grants.... The RFA says that it wants the platform to be a federated network distributed by partner institutions, which is novel in the SNfS field. It'll be interesting to see what they come up with. - Brian Krueger - LabSpaces
Thanks, Brian (or really, should I use some southern-ism, like "Thaaank you, sweetie" which is actually what happens here, especially at places like Waffle House?). - Mickey Schafer
The research objectives section makes for a quick and interesting read -- love the "background" info! http://grants.nih.gov/grants... - Mickey Schafer
I think that background just shows how little actual background research was done before proposing this RFA :P - Brian Krueger - LabSpaces
I wonder if they'll talk to OWW, Epernicus, SciLink, Laboratree, and the dozen other SNfS services out there to import or otherwise leverage all the data that's already been contributed by scientists. I can see it being useful as an aggregator and motivating standardization and data exchange, but would hate to just see it reinvent the square wheel - Shirley Wu from twhirl
We have a few author profiles in Scopus as well :) - Michael Habib
Richard Akerman
ok copyright gurus, help me out - are scanned page images of a 400-year-old book actually under copyright?
http://contentdm.lindahall.org/cdm4... "©2008 Linda Hall Library - All rights reserved" - Richard Akerman
http://www.rarebookroom.org/Control... "© 2004 Octavo. For research use only. All rights reserved." - Richard Akerman
The scanned images are the property of the entity that created the scans. If you can get your own hands on a physical copy of Sidereus Nuncius in the original Latin and scan those pages in, then you will own the rights to *that* set of scanned images. Not to the intellectual property (text) written by Sidereus, but to the scanned images of his words. Yes, that is truly how the law is interpreted. - Jill O'Neill
Excuse my language, but that is f-ing ridiculous. No wonder this Google Books thing is a debacle. - Richard Akerman
Wait a minute. Are you suggesting that, despite any costs incurred in scanning books, scanned images should somehow be free for the taking? As if there was no labor involved? When Dover Books reprints copies of old titles, no one suggests that Dover should be giving those printed copies away for free. - Jill O'Neill
Bear in mind that Dover frequently just used old printed versions of texts themselves, not resetting type or anything like that. - Jill O'Neill
"Sweat of the brow" is not basis for copyright in the US. - D0r0th34
Jill, you're conflating atoms with bits. I *bought* Starry Messenger, I had no expectation that the print version should be free, even 400 years later. http://www.librarything.com/work... However, I do have an expectation that digital page images from a centuries-old book should be placed in the Commons. But anyway, I don't want to rehash what I'm sure has been thoroughly covered in the Google Books battle. - Richard Akerman
"Not to the intellectual property (text) written by Sidereus, but to the scanned images of his words" - true, but the text is out of copyright now. So someone could scan it and place the entire work in the public domain. - Nick Lothian
@Richard - I'm not sure this is all that closely related to the Google Books battle. Most of the argument there is about books that are still in copyright. This work isn't, and the text is available copyright free online. I agree it would be great if someone created public domain or CC-licenced photos of the actual book, but that's a different argument to the Google Books battle. - Nick Lothian
There is a free photo of the cover available on wikipedia, with wikipedia-compatible licence: http://commons.wikimedia.org/wiki... - Nick Lothian
So perhaps the objection needs to be framed differently. I agree that some copy somewhere of public domain material should be made available at little or no cost to the public (ie as in the instance of Project Gutenberg). I just don't think we should yell at those who do demand some form of financial compensation for their effort. I don't remember the library community yelling at Dover... - Jill O'Neill
@Jill - Octavo has a right to claim revenue. I'd prefer if one of the custodians of the physical books took a scan of it and placed it in the public domain, though. (Out of interest, though - how does Octavo make revenue from that scan? Do they sell it or something?) - Nick Lothian
Octavo was a service provider; they were hired by museums and archives (and upon occasion monastery libraries) to provide them with archival quality scanned images of texts that needed to be both preserved and made accessible to scholars. They provided that service to those who protected the artifacts and that was/is their primary source of revenue; in turn, where permitted, Octavo... - Jill O'Neill
I understand the frustration of not being able to get to a "free" version of a text or document; I run up against it all the time. Richard was venting a little bit and expressing that frustration and I ought not to have turned this into a confrontation (and for that I apologize, Richard.) But I also get frustrated when I see the wrong people (translation vendors or publishers) blamed... - Jill O'Neill
Thanks, Jill, for adding the thoughtful comments to the discussion. - Peter Murray
Sorry for my tone - I'm just frustrated because I wanted to be able to show some students "look at this page from 400 years ago" and I can't. Do I have a "right" to be able to show them something that I couldn't even find before Internet search and digitisation? No. *Should* that be a reasonable right and expectation... well, I think it should be. Just for some added bizarreness, you... - Richard Akerman
Incidentally the full Latin text is available at http://www.liberliber.it/bibliot... - so then my question becomes, what are my rights to use a screenshot of that site? Would the presentation of the text and accompanying images on the web page be some sort of protected object under Italian copyright law? (There is no copyright notice on the site that I can find.) - Richard Akerman
Incidentally in reading Jill's comments, I should have made the context clear - I don't want the text, I already bought it in translation and showing the students the original Latin won't go very far - it was actually the page image itself that I was interested in showing. But naively, isn't this just "faithful reproduction" of a public domain work? Which I thought for e.g. art that's... - Richard Akerman
If you ran the scanned pages through OCR, the resulting text wouldn't be under copyright. Similarly, if I take a photo of the Gutenberg Bible, that photo is mine and I have copy rights to it, but I can't stop anyone else from taking such a photo, nor can someone else stop me from publishing my photo. (I'm not sure on that last point, as the Gutenberg Bible may be considered art, and I... - Kevin Fox
I think this will pretty much explain it all, using US & Canada law. The answer might surprise most people here, since it doesn't agree with what most of you have said. http://www.likelihoodofconfusion.com/... - April Russo (app103)
@Richard - you can show them the page from wikipedia (linked above). Also, the http://www.rarebookroom.org/Control... explicitly say "For research use", and your usage may apply. - Nick Lothian
My understanding is that the Bridgeman decision http://en.wikipedia.org/wiki.... says that Jill's argument is incorrect. Perhaps the entire Octavo CD would be considered in copyright or a Dover clip art book cover-to-cover. But not an individual page, regardless of what the website says. I am, as they say, not a lawyer. But I have covered this on my... - s t e v e
Richard — I, too, am not a lawyer, but I thought I would mention that, if your purpose is to show a single page to your students, then you could logically claim that the page is an "excerpt" for educational use. This could fall under the "fair use" clause of copyright law. You should understand, however, that "fair use" is a defense, and, in claiming it, you're admitting to infringement ("Yes, I made a copy, but I believe my copy is allowed") and the people who claim copyright can still try to sue you. - Glen Mistletoe
Actually, Glen, it's not accurate to say that you're admitting infringement. From section 107 "fair use of a copyrighted work, including such use by reproduction in copies or phonorecords or by any other means specified by that section, for purposes such as criticism, comment, news reporting, teaching (including multiple copies for classroom use), scholarship, or research, *is not an infringement of copyright.* " (emphasis added) - lris
Even so, Steve's point is that this new work may not be copyrightable because it is not sufficiently original. - lris
Iris, according to my law school friends, the courts have held that, if the claim of fair use is not found, the claim itself is tantamount to admitting infringement. Again, I ain't a lawyer. It seems that what's interesting here is what constitutes a "work"—in this case, it sounds like it's not the original document, but the digital image of it. It's tough for me to see how that would stand up under court scrutiny, but I don't have the resources to take it to trial. - Glen Mistletoe
"Slavish reproductions" of public domain works are not eligible for copyright protection. No matter how much work went into making a scan, or how much skill it may have taken to do it, it's a "slavish reproduction". Now, if they added something original to it, not found in the original work, then it would be covered by copyright. But just digitizing by scanning doesn't produce a derivative work that can be copyrighted. - April Russo (app103)
But you don't have to "claim" fair use until you are sued, yes? If I copy a few sentences from a novel and post them here, I don't have to claim anything, I just do it. - s t e v e
If you exceed fair use, then yes, it's infringement (and yes, the definitions aren't set in stone). But that doesn't mean that claiming fair use is admitting to infringement. - lris
For anyone that doesn't understand what constitutes a slavish reproduction, this is an example, that can't be copyrighted: http://www.appsapps.info/russogr... This is an example of the same works, reproduced in a way that IS covered by copyright, because it is a new work of art: http://www.appsapps.info/russogr... - April Russo (app103)
I _think_ Richard is in Canada, so fair use might not apply (I'm in Australia, and we don't have fair use. We do have exemptions for educational use, though) - Nick Lothian
Dammit! You foreigners and your, your foreign-ness! I am not an international copyright lawyer, though I am an international man of mystery. - s t e v e
I think Canada may have a copyright collection agency, which automatically collects money from educational agencies for photocopying (and - web browsing). In Australia, we have one, and I think it would cover this situation. (IANAL etc) - Nick Lothian
Has anyone asked an art professor about this? Lots of art history classes taught from slides of 'non-copyrightable works'. I would imagine there's a hefty precedent for what's OK in terms of lectures. Multi-discipline FTW :) - Paul J. Davis
I am very surprised, I have to say. I usually work with databases, and what I am always told is that it is irrelevant how much hard work goes into making it - unless the result is a "creative work", it is not protected by copyright law. I find it very hard to comprehend how scanning a book can be considered creative and give rise to copyright. If I take screen shots of things from Project Gutenberg, do I get copyright on my bitmaps? What if I print out the texts and scan them in again? - Lars Juhl Jensen
@Lars - the tendency to give monopoly privileges to non-creative things like database compilation is very different around the world. The EU is particularly bad. More here http://opendotdotdot.blogspot.com/2009... - Anders Norgaard
I am in Canada, and i think our equivalent is "fair dealing", plus we have law that covers photocopies but I think is not so clear in the digital age. There is an underlying point that one shouldn't have to be an international copyright lawyer in order to work with digital objects. If I have time I will try to track down more Canadian info on the topic. - Richard Akerman from BuddyFeed
Greg Tyrelle
If we're to get continued funding, and support JBrowse to meet all the requests that you have justifiably demanded, it's absolutely critical that we demonstrate an overwhelming demand for JBrowse from the user community -- that means you. http://sourceforge.net/mailarc...
"Also, in another subproject of the same grant, we plan to develop a JBrowse version of the GBrowse_syn synteny viewer developed by Sheldon McKay, so you can browse syntenic regions of homologous genomes side-by-side." definitely convinced me that a letter is in order. - Paul J. Davis
Thanks for posting! - Mitch Skinner
Chris Lasher
What are you using to visualize biological networks? - http://www.flickr.com/photos...
[Image: "network" by Simon Cockell, link to original source for attribution] In particular, what are you using to view networks in a dynamic way? Cytoscape seems the behemoth for visualizing biological networks, but have you used other solutions? A group mate and I have been discussing how we could *really* use a JavaScript library that rendered as SVG in a web browser. This would allow us to call for more data from our databases and present data on the fly and only-as-needed. He's convinced we could start writing one; I'm concerned it's a lot more difficult than he thinks, but we're both fishing for other ideas. - Chris Lasher from Bookmarklet
for working with SVG and Javascript, maybe try svgweb http://code.google.com/p... or Raphaël http://raphaeljs.com/ - Mike Chelen
We're using Raphael in the Protein Geometry Database <http://pgd.science.oregonstate.edu/> for rendering our graphs. It's been a huge boon to cross-browser compatibility since IE *still* can't do SVG. - Donnie Berkholz
You might try a particle demo on Chromium's canvas element. Other than that I think the optimizations for performance could end up being the bigger project. Not undoable, just maybe not worth the time. Also, specifically Chromium as their canvas element is currently the best, though hopefully FF and Safari aren't too far behind. For IE, just give them a link to Chrome Frame ;) - Paul J. Davis
Is this meant for big graphs? Is the focus on layouts etc? Prefuse is a pretty nice toolkit for graph layout/calculations. igraph also is an excellent library with a bunch of layout algos - Rajarshi Guha
Many thanks for your input, guys. @Rajarshi We have graphs of 5k-12k nodes and 20k-80k edges. I don't know if that counts as large but they're not small. @Paul I'm worried about performance (responsiveness) given that Firefox can take a pounding in some of the demos at http://www.chromeexperiments.com. That's why I wonder if Cytoscape is really the only option (right now). @Mike and... - Chris Lasher
Thanks, Chris! I certainly hope it enables some folks to do interesting science instead of just looking pretty, though. =) - Donnie Berkholz
Mr. Gunn
The Dataverse Network Project (via @communicating) - http://thedata.org/
"Via web application software, data citation standards, and statistical methods, the Dataverse Network project increases scholarly recognition and distributed control for authors, journals, archives, teachers, and others who produce or organize data; facilitates data access and analysis for researchers and students; and ensures long-term preservation whether or not the data are in the public domain." - Mr. Gunn from Bookmarklet
do uploads require java? - Mike Chelen
Don't know much about it, just kinda thought it was interesting and very ambitious sounding. - Mr. Gunn
yes, looks cool, and the sort of thing people can get running to host their own collections - Mike Chelen
Impressive - they do proper citations w/ persistent identifiers and everything: http://thedata.org/citatio... . will have a closer look at this. They use Handles for this - the DOI system is based on Handles as well, so one can resolve these IDs via the CrossRef service. http://dx.doi.org/1902... => Gary King; Langche Zeng, 2006, "Replication Data Set for 'When Can History be Our Guide? The Pitfalls of Counterfactual Inference'" hdl:1902.1/DXRXCFAWPK - 'Mummi' Thorisson
I'm willing to bet that future archeologists will have less trouble making sense of URLs than any of the other schemes that are supposed to outlive the Web... - Eric Jain
The "UNF" (checksum) is supposed to be based on format-independent, canonical representation of the data. Good luck with that! - Eric Jain
The underlying idea is spot on. But no Unicode normalization? And rounding instead of BigNum? - Paul J. Davis
The "canonical" representation ends up being just another data format that needs to be supported. - Eric Jain
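To illustrate the Unicode normalization concern raised above, here is a minimal Python sketch. It is not the UNF algorithm; the `checksum` helper and the choice of SHA-256 and NFC are assumptions made only for illustration. It shows that two strings that render identically can hash differently unless they are normalized first.

```python
import hashlib
import unicodedata


def checksum(text, normalize=False):
    """Hash a string, optionally normalizing it to NFC first (hypothetical helper)."""
    if normalize:
        text = unicodedata.normalize("NFC", text)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# The same visible word, encoded two ways.
composed = "caf\u00e9"      # "e with acute" as a single code point
decomposed = "cafe\u0301"   # "e" followed by a combining acute accent

# Without normalization the two byte sequences differ, so the hashes differ.
raw_match = checksum(composed) == checksum(decomposed)

# After NFC normalization both become the same code point sequence.
normalized_match = checksum(composed, True) == checksum(decomposed, True)
```

Any "canonical representation" that skips this step would presumably treat such values as different data.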
Eric Jain
Anyone here used HDF5 for storing data? http://queue.acm.org/detail...
Deepak's former colleagues worked with HDF5 http://twitter.com/mndoci... - Pierre Lindenbaum
I haven't used it yet, but glu-genetics, a python toolkit for gene association scans (http://code.google.com/p...) uses HDF5 through PyTables (http://www.pytables.org/moin). The relevant code is here: http://code.google.com/p... - Brad Chapman
I saw it awhile ago and thought that I should write Python wrappers. PyTables looks nice, but I think I'd try using http://h5py.alfven.org/ first as it seems simpler from scanning the docs. - Paul J. Davis
Duncan Hull
Data Management Strategies: Future of the DBMS, considering Hadoop, NoSQL, and XQuery - http://dmsblog.burtongroup.com/data_ma...
In Burton Group’s recent “2010 Planning Guide: Data Management Strategies” paper, I said, “the data management foundation that [relational] DBMSs provide is not adequate for all of the needs of modern enterprises.” Clearly, I believe the era in which enterprises use relational database servers to store all of their data is nearing an end. - Duncan Hull
"XQuery is poised to do for XML databases what SQL did for relational databases". What ? Did I just get transported back to 2001 ? - Greg Tyrelle
http://bit.ly/SY9w1 - Does Wikipedia's definition of "Enterprise Software" make sense to anyone at all? Or is the actual definition "if it can be understood by a single person, it's not Enterprise Software." - Paul J. Davis
Response from a non-friend-feedian, "Indistinguishable from parody." - Paul J. Davis
Eric Jain
Yet another high-throughput sequencing start-up: http://www.halcyonmolecular.com/
Examples of others? - Hope Leman
@Hope - Pacific Biosciences and Helicos maybe? Not sure on 454 and Illumina's origins. - Paul J. Davis
Complete Genomics - Eric Jain
NABsys - Eric Jain
Hi, guys--thanks for the info! - Hope Leman
Illumina has been around for a while and acquired Solexa, which was a next gen startup. 454 has also been around a while and is now part of Roche Applied Science. Pac Bio, Complete, there's one with nano in the name that I can't remember. Does Helicos still qualify as a startup (they are a public company)? - Deepak Singh
@Deepak, perhaps you're thinking of Oxford Nanopore Technologies? http://en.wikipedia.org/wiki... - Rob Syme
Rob, I think that's it. Thanks - Deepak Singh
Hilary
On the recent embargo breach involving GWAS data and a PNAS publication (which was recently retracted). - Hilary
Good to see people taking the ethical side of this seriously. I'm less convinced about the value of specific rules and more by the idea that this should just be seen as bad behaviour but very glad to see people coming down on it like a ton of bricks. That's what will make people feel safe - not rules, not regulations, and not compulsions either, but very strong and public responses to breaches. - Cameron Neylon
@Cameron +1. But ideally some kind of consequences/punishment would surely be in order as well, e.g. the authors responsible would not be kindly received next time they ask for ethical approval to access controlled-access data from NIH (or other) repositories. Some sort of blacklisting for 'repeat offenders'? - 'Mummi' Thorisson
Not greatly in favour of blacklisting per se. I would say that it was a disciplinary offence though that ought to consider dismissal from post. Which really amounts to the same thing. - Cameron Neylon
Please correct me if I'm wrong, but I thought the consequences ("punishment") was that their paper was retracted. - Hilary
The paper was published Aug 31, retracted Sep 9, when all the authors had to do was to ask PNAS to publish it no earlier than Sep 23 to comply with the GENEVA data embargo policy. The closeness of all the dates suggests to me that it was more a serious messup than a malicious breach of policy. http://www.pnas.org/content... - Iddo Friedberg
Hilary, I would say that the retraction is just the reversal of the act rather than punishment. Paper shouldn't have been published, therefore it was "unpublished". If (and there should definitely be a proper investigation) someone thought they could get away with playing outside the rules there should be punishment above and beyond simple reversal in my view. This is "conduct unbecoming..." etc. But as Iddo says, not clear from the dates whether it might just have been a screwup. - Cameron Neylon
Cameron: a retraction is a very bad thing to have on your record. It is for all intents and purposes synonymous with "fraud". - Iddo Friedberg from Android
Without *knowing* the intent was malicious, forcing a retraction seems a bit harsh. If data is online it should be intended for use by the public. IMO this is just another argument for mandatory DOIs and better dataset citations. On the other hand, calling out a group for not having the courtesy or awareness to contact the originating lab is a good thing. Like Cameron said, the social norms are probably the best way to play this. - Paul J. Davis
Also, don't physicists have a pretty good system for the whole idea of citing datasets? NCBI's ability to provide transparency in terms of what data came from where and when is pretty atrocious, so it's a bit weird to consider for biology. But I thought I read that the LHC data was pretty much available for citation. - Paul J. Davis
Iddo - I disagree on two counts with that. There are plenty of retractions out there that are honest mistakes or re-assessments. Embarrassing yes, emblematic of sloppy work yes, synonymous with fraud, nah. But more importantly if we take that kind of attitude then people will be too scared to correct things in the future - when we will (hopefully) have much more fine grained approaches... more... - Cameron Neylon
Paul - I think citing datasets at NCBI isn't so hard. I'm not sure that's really the problem in this case (if it is then it's a definite mark against the authors). The problem is the culture in biology that collecting the data isn't worth anything so having a highly cited dataset isn't useful on your CV - no matter how good or useful it is. Only the paper matters. I have to say I haven't actually had the time to look over this case in detail though. - Cameron Neylon
On citing datasets - that's the easy part. What people do not record properly is how they processed the data. Microarrays for instance - there are plenty of public datasets (NCBI GEO, EBI ArrayExpress). But when the associated description reads "Data were processed using the limma package in R" - and that's it - how are you to repeat the work? - Neil Saunders
This also raises issues of roles of journals, institution employing authors (often several, in different countries/legal systems, as papers now almost all multi-author), and funders in "policing" sci ethics. Lots of talk everywhere about this. Journals can publish policies and retract/correct (ensuring linking in A&I dbase searches etc) - but how can sci community deal with wider issues beyond the paper? (quite apart from the technical problems with enforcing eg "blacklisting") - Maxine
Neil - in response to your Q above - v hard in practice to be perfect but from journal's perspective: (1) consult with relevant community and state policies for standards all agree and (2) the peer-review process (advice from reviewers on repeatability). Also, of course, journals can in general encourage authors to disclose more rather than less. - Maxine
Thanks Maxine. I think that journals and data repositories should require, in addition to raw data, deposition of any code (e.g. scripts) used to process the data. Not at the journal or repository site, but somewhere on the web (Github, Google code, Sourceforge etc.) - Neil Saunders
+1 Neil and Maxine. There is too much of an expectation for "the journals" to sort this out. Publishers have an important role to play but we need to clean our own house. Or someone will do it for us. Probably the public. And probably by saying that they're not so interested in funding science any more. - Cameron Neylon
Thanks, Cameron. I agree, journals can and should help but as part of a wider process that scientists themselves (as a profession) decide is "best practice". Neil - have had this "code" discussion with eds here before - one view is that the documentation better/more meaningful to scientists (who aren't programmers in the main) - also many programmes are not open-source. Probably other points which I don't immediately recall. Nature Biotech is running community consult at the moment on this, I think. - Maxine
Andrew Su
There are 591 tRNA genes in Entrez Gene. For 64 codons, why so much redundancy?
most tRNAs have some bases modified - Pierre Lindenbaum
46 for ala, 44 for leu, 36 for lys, 35 for arg, ... - Andrew Su
Just curious, what is your query to find those numbers? - Pierre Lindenbaum
among the 46 coding for ala, 30 use the AGC codon, 10 use UGC, and 6 use CGC. Presumably pseudo-tRNAs are easily removed (when secondary structure is disrupted), so why so many real and redundant genes? - Andrew Su
admittedly a potentially imprecise hack, but I'm parsing gene_info from NCBI: <shell>gzip -cd gene_info.gz | gawk -F"\t" '$1=="9606"&&$10=="tRNA"{print $9}'</shell> and then a bunch of sed, sort, and uniq piped after that... - Andrew Su
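For anyone who would rather not chain sed, sort, and uniq, the same filter can be sketched in Python. This is a rough equivalent under the same assumptions as the gawk one-liner above (tab-delimited file, tax ID in column 1, description in column 9, gene type in column 10); the function name `count_trna_descriptions` is made up for this example.

```python
from collections import Counter


def count_trna_descriptions(lines, tax_id="9606"):
    """Tally the description column for tRNA rows of one taxon.

    Mirrors the gawk filter: $1 == tax_id and $10 == "tRNA", print $9.
    Counter plays the role of the trailing `sort | uniq -c`.
    """
    counts = Counter()
    for line in lines:
        if line.startswith("#"):          # skip a header/comment line if present
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 10 and fields[0] == tax_id and fields[9] == "tRNA":
            counts[fields[8]] += 1
    return counts


# Usage against the real file (not run here):
#   import gzip
#   with gzip.open("gene_info.gz", "rt") as handle:
#       for description, n in count_trna_descriptions(handle).most_common():
#           print(n, description)
```

Like the original, this counts whatever text is in the description column, so grouping by amino acid or anticodon would still need some extra string handling.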
ah, ok, I thought it was a request in Entrez - Pierre Lindenbaum
Hmm, number of tRNA genes seems to correlate well with the amino acid usage frequency in vertebrates (http://bit.ly/nWTXc). Perhaps this is the answer? - Andrew Su
tRNA is likely more redundant than you think. - Ami Iida
I'm also pretty sure that transcription throughput is affected by copy count. As in, you need lots of tRNAs floating around to keep the ribosomes busy. An easy way to make that happen is multiple genomic copies. - Paul J. Davis
Paul, point well taken. Just would have thought that we higher organisms would have developed more elegant regulatory solutions to take care of that. Copy number is just so ... primitive... (though I suppose so are tRNAs...) - Andrew Su
Also, there is apparently a rich literature around correlations between tRNA copy number and codon usage (e.g., http://dx.doi.org/10... ) - Andrew Su
I'm not an authority on this by any means, but think of it in terms of computers. A polymerase during transcription has effectively locked that gene copy. Thus you're rate limited by the time it takes to transcribe (roughly). Granted, I have no idea on time scale here. BioNumbers might have a bit of illumination on that part. - Paul J. Davis
Wow, bionumbers looks cool. Back when I thought I was interested in quantitative biology, I would have thought it was extra-awesome! (http://bionumbers.hms.harvard.edu/) - Richard Klancer
@Richard It's interesting stuff. I saw a seminar that biologists hated but I maintain it was cause the presenter sold it wrong to biologists. They need to position themselves like this: http://bit.ly/13Epan The hugely useful reference for numbers. Or the Farmer's Almanac for biology. - Paul J. Davis
You think you've got problems. Last I looked we had 500,000 of them! - Paul Gardner
Andrew Su
Citations of Wikipedia in the primary literature is a bad precedent http://www.ncbi.nlm.nih.gov/pubmed... (from http://laikaspoetnik.wordpress.com/2009...)
True... one should preferably cite primary literature... - Egon Willighagen
That pubmed article is a letter to the editor which is about half a step above a blog post AFAIK. Also, I'm fairly amused that after complaining about people citing wikipedia, the author cited wikipedia. Other than that, yeah, just because you read it on the internet doesn't make it true. - Paul J. Davis
If you cite the specific version of the page you're looking at, like this: http://en.wikipedia.org/w... ,then I think it's probably fine and no different from citing any other page on the internet. This remains a serious technical challenge for citing online sources, because what you need to do is cite a published object, whereas the object referred to... more... - Mr. Gunn
@MrGunn Definitely agree. Though I'd add (for academic publications at least) that citations should be peer-reviewed as well, though that gets fuzzy depending on context. - Paul J. Davis
It's quite nuanced. Depending on what the citation is for, I don't see a problem with it. If you are citing an algorithm, it's OK to cite Wikipedia with a version history. If it's a concept open to interpretation or something that should be backed by supporting data, then a Wikipedia citation is a strict no-no. - Hari
@Hari, Exactly. - Paul J. Davis
My beef is not about citing the current page versus a specific page in the history. I just tend to think that if you can find the primary, peer-reviewed citation, why not cite that? If you _can't_ find the primary citation, should you really be citing WP? I know, I tend on the traditional side of the spectrum here in terms of citations and reliable sources... (Also, exceptions of course for academic studies about WP itself...) - Andrew Su
Come on, we're talking here about an ABSTRACT in a MEDICAL journal. There are seldom references in such an abstract and certainly not to common concepts that every 1st year med student should know. Even my 10 year old can explain to me what fever (http://en.wikipedia.org/wiki...) is. That would be comparable to linking "DNA" or "molecule" in an abstract to Wikipedia. Or to Watson &... more... - Laika (Jacqueline)
I did find it strange that they linked to such common terms, but whatever. - Mr. Gunn
Hari
I am trying to write some Python code that can look at voltage-time data. Can anyone point me to some pseudocode or actual code that picks peaks from a spectrum and determines the width of each peak at the baseline and the height of the peak?
Wow this is reaching back, but it depends on your idea of a peak. Basic idea, create a model of your peak and convolve. Checking for a spike in the convolution should then be fairly simple thresholding or maximizing after smoothing. Peak modeling commonly starts with sinc functions: http://en.wikipedia.org/wiki... - Paul J. Davis
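A minimal pure-Python sketch of the convolve-then-threshold idea follows. It makes several assumptions that are not in the thread: a Gaussian is swapped in as the peak model (instead of a sinc), the baseline is a known constant, and the names `gaussian_kernel`, `convolve_same`, and `find_peaks` are invented for this example. Real voltage traces would likely need baseline estimation and noise handling on top of this.

```python
import math


def gaussian_kernel(sigma, radius):
    """Peak model: a unit-area Gaussian sampled at integer offsets."""
    weights = [math.exp(-0.5 * (i / sigma) ** 2) for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def convolve_same(signal, kernel):
    """Convolve signal with a symmetric kernel; output has the input's length.

    Indices past either end are clamped to the nearest edge sample.
    """
    radius = len(kernel) // 2
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)
            acc += w * signal[j]
        out.append(acc)
    return out


def find_peaks(signal, baseline=0.0, threshold=1.0, sigma=2.0):
    """Return index, height, and baseline width (in samples) for each peak."""
    smoothed = convolve_same(signal, gaussian_kernel(sigma, int(3 * sigma)))
    peaks = []
    for i in range(1, len(smoothed) - 1):
        # Local maximum of the smoothed trace that clears the threshold.
        is_max = smoothed[i - 1] < smoothed[i] >= smoothed[i + 1]
        if is_max and smoothed[i] - baseline > threshold:
            # Walk outwards on the raw signal until it returns to baseline.
            left = i
            while left > 0 and signal[left] > baseline:
                left -= 1
            right = i
            while right < len(signal) - 1 and signal[right] > baseline:
                right += 1
            peaks.append({"index": i,
                          "height": signal[i] - baseline,
                          "width": right - left})
    return peaks
```

Here "width" is the distance in samples between the first baseline samples on either side of the peak; multiplying by the sampling interval would convert it to time.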
Hey thanks a tonne Paul. Also thanks for your previous tip on reportlab. It's a great library that I am using to make some pretty nice reports of my crystallization setups. See http://github.com/harijay... - Hari from email
Michael Barton
Is there a best practice for microbial genome annotation?
Nope. There are a variety of pipelines that perform similar tasks. Good starting point might be IMG documentation - http://img.jgi.doe.gov/w.... - Neil Saunders
Worth remembering that there is very little "best practice" in any bioinformatics. For a long time, we made it up as we went along. It's only this new generation of bioinformaticians that have any formal software engineering education and bandy around fancy terms like "best practice" to make us feel bad ;-) - Neil Saunders
I think it's more like the Perl culture: "There is more than one way to do it!!" Best practices in bioinformatics are currently in an ad-hoc state. Just as Damian Conway's Perl Best Practices is one of the best guides to good coding practices for Perl, I hope we will also have a book on "Best Practices in Bioinformatics" soon, maybe by a group of authors from the LifeScientists room - what say? - Khader Shameer
@Khader that's why we need flexible guidelines and not the constrained best practice. Several minimal guidelines have been already worked out for the different aspects of the life science domain. MIBBI (http://www.mibbi.org/index...) can be a good starting point in this case. - Abhishek Tiwari
I think very often in bioinformatics, TIMTOWTDI. It's not like software development, with a "task" and an "optimal solution". What I think matters most is that however you do it, it's documented and repeatable. - Neil Saunders
I completely agree with you Neil, but some efforts towards developing well-defined, documented workflows/protocols (can we call these "Best Practices"?) to perform generic tasks (e.g. annotation) will be useful for the community. I think several 'standards' (e.g. MIRIAM/MIBBI) were developed to bring in a common framework for routine tasks. I believe TLS is an ideal place to get a consensus about such practices and work on a wikibook of best practices in bioinformatics. - Khader Shameer
And I agree with you. I'm all for standards and best practice. I'm also a realist and a practical bioinformatician :-) - Neil Saunders
@Abhishek: Best practices are not always "constrained", and constrained practices are impossible due to the complexity of biological systems - flexibility should be there. But my point is that even though MIBBI / other standards (http://www.mibbi.org/index...) have been available for a long time, I've never seen them in research papers - is it due to poor visibility of such projects or no interest in promoting such initiatives? - Khader Shameer
Khader, that's a good question. There seems to be a disconnect between standards developers and the people who should be using the standards. I think it's a publishing problem. Developers publish in computational journals and use computational jargon; users don't read those journals or understand the jargon. - Neil Saunders
Khader, In my opinion the main motive of guidelines is to avoid the disagreement while best practices try to bring an agreement in community. Also, people are using these guidelines. Its just lack of awareness otherwise more and more people will adopt them. Take any Biomodels database model or CellML repository model, they are well annotated according to MIRIAM guidelines. Allyson... more... - Abhishek Tiwari
I find the line "it's not like software development" to pretty much sum up some of the problems in bioinformatics. Why isn't it?!? - Neil Swainston
It's complicated :-) In part, it's because researchers are more interested in quick answers (= quick fixes) than good code. In part because it's only in recent times that bioinformaticians receive formal software training. In part, because biological problems are more complex than input -> process -> output and you don't always know exactly what you want to achieve when you start. And I guess, biological information has a lot of "context", not easily captured by simple routines. - Neil Saunders
Hi Neil. Yep, all that you say is true. Just from a personal perspective, I've found that being "disciplined" in writing code (making nice, clean, interfaces to modules, unit testing, documenting) means that in the middle-to-long-run, quick answers are easier to come by. By building up a reasonably reliable library of classes (I'm a Java-geek), sticking the bits of Lego together is... more... - Neil Swainston
Neil, I absolutely agree. It took me some time to get to the point of trying to "do things right" from the outset - libraries, documentation etc. and I'm glad I got there. I think a lot of the problems stem from how academic research is conducted. "Can you just give me a table by tomorrow?" "Sure, let me write a library." "No, I just want a table." Hack together perl script, deliver table, discard, move on. Rinse and repeat, until contract expires. Leave mess behind. - Neil Saunders
Couldn't put it better myself! I guess I'm lucky in so far as that I do have the luxury of longer timescales... until my contract expires. - Neil Swainston
Thanks Abhishek for the pointers to applications of different standards. My point is that the goal of both best practices and standards is the same: getting a consensus on how to do repetitive experiments / workflows. But as the Neils are discussing, the choice in individual bioinformatics projects is mainly to get a good fix, rather than an excellent code base. But I hope some degree of consensus can be obtained if people follow standards as a first step. - Khader Shameer
Science isn't set up to reward coding standards. Funding agencies reward quick biological results, not infrastructure and software development. I'd argue that for every 5 biological grants, the NIH should be funding one software/database/computational infrastructure grant. The amount of data is only getting bigger. - Chris Miller
I'd agree with that, Chris. Career wise, it's pretty much immaterial whether I churn out a hack or something "good" and reusable. It's quite annoying. Grrrr!! - Neil Swainston
@Michael / Neil : I am agreeing with "Science isn't set up to reward coding standards", but as a subject in the interface of science and technology - it is high time that bioinformatics should embrace the standards. For Michael's question I was trying to make a point that if there is a standard/best practice/generic protocol for microbial genome annotation - he could have just followed... more... - Khader Shameer
I think genome annotation is an excellent example of how bioinformatics is not like software development. You don't just run a program and annotate a genome. There are lots of biological features: protein-coding genes, non-protein coding genes, motifs - all with their own associated metadata, all with various, disparate tools written specifically for each type of feature. Annotation is... more... - Neil Saunders
too right Neil. is there a best practice for violin-making, vision quests, or coming-of-age experiences? ;) - Ian Holmes
:-) Exactly. The end result is what matters. - Neil Saunders
srsly tho -- there are plenty of papers describing microbial genome annotation. it's still an open research area, but there are commonalities (repeats, transposons, genes, typical errors, ...) so I guess the rough union of those vague concepts would constitute the current best practice. not exactly a recipe... - Ian Holmes
:D best practice for violin-making, vision quests, or coming-of-age experiences :D - Neil, in the current era of bioinformatics with web services and workflows, having an SOP/BP will always help you kick-start the work in minimal time, rather than going through every genome project paper for the annotation flowcharts. - Khader Shameer
@Ian: OK, finally that's something that Michael, or anyone interested in annotation, can take from this thread. - Khader Shameer
Khader, what we're saying is that in this case, there isn't an SOP/BP, because it just isn't that kind of procedure. But there is, as Ian says, plenty of advice available. I guess, in terms that CS people might understand, it's not agile. You actually have to put some work into understanding what's going on and what you want to do. - Neil Saunders
@Neil - ^(chicken|egg)? - It could and should be that kind of procedure though. All the advice in the world isn't going to help the people that actually *use* your annotations. The current 'system' for annotating anything is so mindlessly broken I'm surprised it works at all. Now all it needs is a catchy name. Blight of Bioinformatics maybe? - Paul J. Davis
Thanks for the comments everyone. I'm going to read as many genome papers as possible and try and put what I read together. - Michael Barton
Just remembered this article: http://www.nature.com/nbt... which is a good look at current annotation practices. I also finally found http://www.ncbi.nlm.nih.gov/genomes... which describes the actual parameters that NCBI uses for gene prediction. - Paul J. Davis
Neil Saunders, I agree a lot of advice is available and it is definitely helpful. For example, I was not aware of something like MIARE (thanks to Abhishek), which I am now implementing in our RNAi screen. But I can't agree with you if you define bioinformatics projects as non-agile. Everything from a simple BLAST-based sequence analysis to large-scale data analysis follows an agile approach. Think of n... more... - Khader Shameer
Thanks Paul,for the links to the articles. - Khader Shameer
Khader, your very use of the word "agile" sums up what this is all about. Clearly you are "new school" bioinformatics and appreciate software development. "Old school" bioinformatics would never even use the word :-) As I keep saying, I don't disagree with anyone here who calls for better practices, standards or "agility". Just be aware that there are still plenty of old-timers around for whom bioinformatics means "hack together something that works." - Neil Saunders
Here's a paper that describes how microbes are annotated in Swiss-Prot: http://dx.doi.org/10... - Eric Jain
Neil : Just loved the definition "hack together something that works" :) - Khader Shameer
Egon Willighagen
Dear lazyweb... does anyone know of a good introduction to random access XML file IO with indexing? Say I happen to have a CML file with 100000 molecules, and want to index them so as to quickly get the 7890th? Let's generalize this so that the file is not newline formatted, so molecule 7890 is not the XML snippet on lines 789000-789100.
Most SAX parsers should be able to do this pretty much in a flash. - Fergus Gallagher
@Fergus: I am not aware I can random access with SAX... AFAIK, I'd still have to 'pass' all the content before that specific XML snippet... - Egon Willighagen
Yes, but SAX will fly through a file that size, giving you the byte offset (from start of file) of the start and end of each element of interest. Write these out to a text file (the "index"), then use "cut" to extract the snippet of interest. Of course, I don't know anything about your specific requirements so this may be b****x in your context. Another approach might be to use SAX to... more... - Fergus Gallagher
Mmmm... it's been a while since I parsed 2 GB file with SAX... but if you think average seek times are acceptable... what SAX lib are you using? - Egon Willighagen
Is it not a job for XQuery? - Neil Swainston
I guess I wasn't clear - you only SAX-process the file once to produce a list of (start,end) offsets. But I'm now thinking that my second solution (reformat) is easier as it works with any SAX lib (not all provide access to the current offset). - Fergus Gallagher
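The one-pass offset index described here can be sketched in Python with the standard library's expat bindings, whose `CurrentByteIndex` attribute reports the byte position of the current parse event. Everything else is an assumption for illustration: the element name `molecule`, the function names `build_index` and `read_record`, the premise that molecule elements are not nested, and holding the document in memory as bytes (a real 2 GB file would want to be fed to the parser in chunks).

```python
import xml.parsers.expat


def build_index(xml_bytes, tag="molecule"):
    """Return (start, end) byte offsets for each <tag> element, in file order.

    Assumes <tag> elements are not nested inside one another.
    """
    parser = xml.parsers.expat.ParserCreate()
    index = []
    starts = []

    def on_start(name, attrs):
        if name == tag:
            # Byte position of the "<" that opens the start tag.
            starts.append(parser.CurrentByteIndex)

    def on_end(name):
        if name == tag:
            # CurrentByteIndex points at the "<" of the end tag, so the
            # element finishes just past the next ">".
            close = xml_bytes.index(b">", parser.CurrentByteIndex) + 1
            index.append((starts.pop(), close))

    parser.StartElementHandler = on_start
    parser.EndElementHandler = on_end
    parser.Parse(xml_bytes, True)
    return index


def read_record(path, index, n):
    """Fetch the n-th record (0-based) with a seek instead of a re-parse."""
    start, end = index[n]
    with open(path, "rb") as handle:
        handle.seek(start)
        return handle.read(end - start)
```

The parse happens once; after that each lookup is a seek plus a short read, and the returned bytes can be handed to any XML parser as a standalone snippet (namespace declarations on the root element would need to be re-added if the snippet relies on them).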
@Egon Ditto what Neil said, how about a native XML database? - Duncan Hull
Don't even think you need to go as far as that, Dunc. If you're a Java spod, you could try the following: http://www.xquery.com/tutoria.... Sure equivalents will exist for other languages. Seems a shame to re-invent wheels by implementing your own means of indexing an XML file. - Neil Swainston
@Neil but I thought re-inventing the wheel was what bio and chem-informatics was all about? ;-) - Duncan Hull
@Duncan, ha! nice - Rajarshi Guha
@Duncan :) - Egon Willighagen
@Neil interesting. Going to check that out now... - Egon Willighagen
How fast does the random access have to be, and how often does the index have to be updated? - Eric Jain from iPhone
Another question is what type of indexing are you wanting? If its purely UID based, then Tokyo Cabinet. If you want more flexibility and you're willing to deal with a bit of a slow down on indexing, then SQLite. - Paul J. Davis
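For the SQLite option, a small sketch using Python's built-in `sqlite3` module is below. It assumes the thing being indexed is a list of (start, end) byte offsets per record, as in the offset approach discussed earlier in the thread; the table name `record_index` and the functions `save_index` and `lookup` are hypothetical.

```python
import sqlite3


def save_index(conn, offsets):
    """Store one row per record: position in the file plus its byte range."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS record_index "
        "(n INTEGER PRIMARY KEY, start INTEGER, stop INTEGER)"
    )
    conn.executemany(
        "INSERT INTO record_index (n, start, stop) VALUES (?, ?, ?)",
        [(n, start, stop) for n, (start, stop) in enumerate(offsets)],
    )
    conn.commit()


def lookup(conn, n):
    """Return (start, stop) for record n, or None if there is no such record."""
    return conn.execute(
        "SELECT start, stop FROM record_index WHERE n = ?", (n,)
    ).fetchone()
```

For a purely positional index a plain list in memory would do the same job; a database starts to pay off once lookups by other keys (IDs, names) are wanted, at which point extra columns and indexes can be added.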
@Eric the index is made once... the use case is a file with very many molecules, sort of like a file-based database, mostly view only, though for the plain text we do have write-on-safe. The user is browsing this file, viewing some 2-4 molecules at a time, and scrolling down should take considerably less than a second... - Egon Willighagen
@Paul so, the indexing is a simple list index... the 1087th molecule in the file is the 1087th in the shown molecular table (as shown in http://bit.ly/4DGRmp where one entry in the file is shown) ... - Egon Willighagen
Need to be able to scroll in both directions (and skip records)? If this is meant to be used in a desktop application: Would users download the index, or build it themselves? If records can be edited: Still need to be able to update the rest of the data? - Eric Jain
@Eric. yes, both directions and skip. The index would be built on the fly. Changes can be cached until the real save, which could be done by a simple SAX run over the file, replacing the changed bits. - Egon Willighagen
for a possible solution in Python, check out http://ff.im/7fGFz - Adriano
So it looks like you could either have users import the file into a local database (FS, XML or key-value) and let them dump the data into a new file later when required, or you could index the file on the fly when opening it and merge changes from memory when saving. This second approach seems more appropriate if the main purpose of the application is to edit CML files? - Eric Jain
If you need to get byte offsets out of the parser, try Woodstox. Few parsers appear to support this (or they return character offsets instead). - Eric Jain
Not sure how reformatting the file to have one record per line would help -- unless each line were padded to have the same number of bytes (ugh). To jump to a specific record, you want to seek() rather than stream through the entire file (line by line, or otherwise). - Eric Jain
Khader Shameer
Is there any Genome Browser (GBrowse, Vista, UCSC) enabled resource for Leishmania genome ?
Tried searching for "leishmania gbrowse"? e.g. http://www.genedb.org/gbrowse.... Or have a go at building a custom Gbrowse yourself. It's fun! - Neil Saunders
Thanks Neil. But it is using an old version of GBrowse (1.x). Yeah, it will be interesting to build a custom Gbrowse. - Khader Shameer
You might want to check out http://jbrowse.org/ The process of building data for it looks fairly straightforward. The docs could be more detailed. I haven't quite gotten to it, but it's on my agenda for the very near future. - Paul J. Davis
Have a look at some of the notes + presenter notes from the GMOD meeting earlier this month. Loads of super-cool GBrowse stuff, including next-generation sequencing alignment displays: http://gmod.org/wiki... - 'Mummi' Thorisson
Thanks Paul, Mummi for your suggestions. If Genedb is not planning for an update on the leishmania genome ( Any one from Genedb here ? ), I would like to set up a latest genome browser enabled resource for leishmania. Any idea, what will be the minimum technical requirements to setup one ? Is it possible to host it on any public resources like bioinformatics.org ? - Khader Shameer
Wladimir Labeikovsky
How XML Threatens Big Data : Dataspora Blog - http://dataspora.com/blog...
Just blogged about this one too :) http://mndoci.com/2009... - Deepak Singh
yeah, figured that post would be deepak-bait after seeing this http://mndoci.com/2009... ;) - Wladimir Labeikovsky
Another one who doesn't grok XML resulting in a mindless rant... - Egon Willighagen
XML certainly has its shortcomings (overkill for data that fits into a tab-delimited file, and too low level for more complex data). But I don't quite see how JSON helps here... - Eric Jain
Eric, not for everything, but let's say you were pushing attributes related to a clinical study. Would you choose XML or something else for that? Or let's say metadata related to toxicology studies? Today a lot of that ends up getting pushed in SAS transport files (at least to the FDA). A lot of the standards, esp CDISC are good representational formats, but a bear as transport formats - Deepak Singh
XML is perfect when it comes to marking up what the data is. This just matters in science. - Egon Willighagen
But that's a representation. How about transport? That's the problem that I find frustrating - Deepak Singh
Don't invent new formats is good, but isn't JSON yet another format? - Duncan Hull
Duncan++ - Egon Willighagen
Is it? In the same representational sense as XML? Are there any data formats that are strictly JSON (representational formats)? I've only seen it used in API calls as a transport mechanism - Deepak Singh
XML and JSON both define a syntax (so you don't need to get creative with field separators etc). JSON may be a bit simpler, but lacks some features (e.g. support for different character sets) and is more specialized (i.e. good for "data" but not so much for "documents"). JSON appears to lack a standard for defining the structure of the data while XML has (several...) standards to describe the structure of a document (better than nothing, but too low level). - Eric Jain
@Eric: yes: it's about "data" versus "document". "data" is well-formed by itself. "documents" is a mix of anything. - Egon Willighagen
I think the documents vs. data debate is futile. It's all just data at the end of the day, no? - Duncan Hull
I think the issue the author of the blog post ran into is common and due to there not being a standard mechanism to split a single stream into logical chunks (trivial e.g. with tab-delimited text). But I don't think JSON solves that, either? XML at least has some (low-level, more or less painful) streaming APIs... - Eric Jain
Then we are talking about the same thing .. XML does work well to structure documents, but the problem is so many standards out there are re-engineering data that works in XML, mmCIF being the classic example - Deepak Singh from IM
@Deepak: Transport = e.g. HTTP? It's true that JSON is almost always used that way (and XML is only often used that way). But that's not a fundamental difference? - Eric Jain
Eric ... a couple of things 1. There's XML as a data format. You don't have to make everything XML when existing formats work well (FASTA, PDB, etc) 2. Moving data around in XML payloads just breaks down, especially when you try and encapsulate everything - Deepak Singh
One issue I have with XML schema languages such as XSD is that they are good at describing documents, but not so much for describing data. To describe data, you want to e.g. state that one kind of entity has some relationship with another entity, not that the element represented by latter is nested inside the former's element in the same document... - Eric Jain
This floated up the other day which might be of interest: http://short.ie/7wqjhp It's about the rise of JSON and talks about SGML/HTML/XML a bit too. It's a bit on the long side, but fairly interesting. - Paul J. Davis
As to streaming, JSON can be used as one object per line because all represented whitespace must be escaped. JSON only does UTF-8. I've started using JSON exclusively for all of my data. It's easy to use. It supports basic data types. There's probably already a parser written for your language of choice. It doesn't have a standard validation RFC, but I'd agree with Eric that a real validation language should probably be turing complete. - Paul J. Davis
@Paul: XML can be used as "one object per line" as well. But (as with JSON) this isn't how you can expect people to deliver the data. - Eric Jain
@Paul: According to http://www.ietf.org/rfc... JSON does support (and -- like XML -- requires parsers to auto-detect) different encodings: "UTF-8, UTF-16 (BE or LE), or UTF-32 (BE or LE)". But only one character set (Unicode) is supported. Can't comment on how well implementations follow the RFC though. - Eric Jain
@Deepak: Agree that there is no point xml-ifying everything (e.g. fasta files, or large tab-delimited files) just for the sake of it (or because you can get money for doing it)... - Eric Jain
@Eric Good catch. Though I've never seen code for anything other than UTF-8 support. And that's easy enough to get wrong I don't think I'd recommend 16 or 32. Also, re line oriented XML, 'ewww' is the only thing that comes to mind. :) - Paul J. Davis
@Paul: How is "line oriented" XML more ewww than line oriented JSON (after subtracting the basic ewww factor of each standard)? - Eric Jain
I think of JSON as a data structure, rather than a format. Its appeal, for me, is that it reflects the natural structure of many kinds of data and the structure that comes out of our code: since hashes are just key-value pairs, it's conceptually easy to visualize e.g. a ruby hash as a JSON structure. And you can often just drop your hash straight into a datastore, e.g. MongoDB. So... more... - Neil Saunders
No, it's not a data structure... it's merely a tree data structure, with nodes labeled with key-value. - Egon Willighagen
Just like XML :-) - Eric Jain
As I say, I *think of it* as a data structure, as a conceptual aid that helps me to work with it. Whether it's a data structure according to some technical definition is rather less interesting, for me. - Neil Saunders
@Eric - <document><key>stuff</key><value>here</value></document> vs. {"stuff": "here"} - Paul J. Davis
<stuff>here</stuff> vs. { "stuff" : "here" }. See the paradigm shift? Me neither :-) - Eric Jain
@Eric +1 - Egon Willighagen
And {"foo": [true, 3.4], "baz": {"more\nstuff": 1}}? Granted, XML doesn't do datatypes so that's not even a fair fight. And the other side to contemplate: "import sys; import json; json.loads(sys.stdin.readline());" I'm not labeling it a paradigm shift. More of a throwing a huge amount of code and complexity out the window. - Paul J. Davis
In the end, the important part is that everyone agrees that designing yet another impossible to parse plain text format is an absolutely terrible idea. Now to get requirements for publishing that any data formats produced require a valid reference implementation. - Paul J. Davis
I was sniggered at recently when I said I sent my list key=value data across the intertoobs tab delimited. As if it were a kind of blasphemy not to have a DTD for that. As I see it, there's only one right answer - do what's best for you. XML is great for some stuff, total poo-poo for others. Has anyone noticed that HTTP requests are not XML? Has the world ended? - Fergus Gallagher
You could just as well have a library that allows you to do xml.loads(sys.stdin.readline()) to (almost) the same effect. I'll admit that data types are more complicated (have to be defined in a separate schema). But they're also more powerful (and include basic types such as dates and times). - Eric Jain
Nothing wrong with using simpler formats -- provided you know when it's time to stop and switch to a more complex format (e.g. check out all the creative ways of stuffing information into FASTA headers). HTTP headers work just fine as key-value pairs -- though if you've ever had to parse "Accept" headers you might disagree :-) - Eric Jain
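The "one object per line" use of JSON discussed in this thread can be sketched as below. This is only an illustration using Python's standard `json` module; the function names `write_lines` and `read_lines` are hypothetical, and the sample records are the ones quoted in the comments. It relies on the fact that newlines inside strings are written as the escape sequence `\n`, so each serialized object stays on a single physical line.

```python
import io
import json

records = [
    {"foo": [True, 3.4], "baz": {"more\nstuff": 1}},
    {"stuff": "here"},
]


def write_lines(objs, fh):
    """Write each object as compact JSON followed by a newline."""
    for obj in objs:
        # json.dumps without indent emits no raw newlines; the newline in
        # the key "more\nstuff" is written as a backslash followed by n.
        fh.write(json.dumps(obj) + "\n")


def read_lines(fh):
    """Parse one JSON object per non-empty line."""
    for line in fh:
        if line.strip():
            yield json.loads(line)


buf = io.StringIO()              # stand-in for a real file or stdin
write_lines(records, buf)
buf.seek(0)
round_tripped = list(read_lines(buf))
```

As Eric notes, nothing obliges a data provider to deliver JSON (or XML) this way; the sketch only works when both sides agree on the line-per-object convention.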
Deepak Singh
"pre-historic interface, doesn't compare to modern IDE's" I have a powerful text editor already. I don't need to be required to learn new ones. - Chris Miller
I'd say "to R", of course. They're geared towards machine learning, but R does pretty much everything in my experience. And they do seem rather hung up on things that don't seem especially relevant, like pretty GUIs. - Neil Saunders
R is seconded - Rajarshi Guha
"if we want to run a job on 100 machines (e.g. in the cloud)" #facepalm - Paul J. Davis
Jan Aerts
I have 12 million sequencing reads to map to the genome. We know these reads (length = 36bp) are split so we will be mapping subreads between 10 and 26bp long. We also know what chromosome the split reads should map to. Any tips? Software? "Suffix tree" came to mind, but have never built/used one.
The Aho-Corasick algorithm is very fast for things like this, it's a suffix tree with some optimizations. Works in O(length_of_text_to_search) regardless of number of search strings! You can probably find an implementation in ${favourite_language}. You'll need quite a lot of memory, probably doable if you divide everything up by chromo first. http://en.wikipedia.org/wiki... - Andrew Clegg
Jan, I'm not sure I understand the subread portion of the question. Do you have concatenated tags? If so, do you have any logic to split them apart with a custom script? Once you have the subreads, there are tons of short read mappers. This is a good summary -- http://www.sanger.ac.uk/Users.... I've used BWA, Mosaik, and Bowtie for different projects: choosing often comes down to features you need like multiple match reporting or gapped alignments. - Brad Chapman
Bowtie comes highly recommended; Maq is perfectly functional as well. - Heather
Jan, have you talked to Nava Whiteford at Nanopore (Oxford startup)? He built the suffix array systems we used in our work years ago and recently published an open source Illumina platform in NAR (I think). I would guess he would be up to speed on the state of the art. - Cameron Neylon
Thanks for all your tips. The data that I have are readpairs that are already mapped using Maq, but where one of the reads cannot be mapped. For those we are checking if we can split the read and check if it overlays the breakpoint of an inversion. - Jan Aerts
I'm a pretty big fan of RazerS - http://www.seqan.de/project... - Paul J. Davis
/me looks at all the answers and spots the wannabe computer scientist amongst them - Andrew Clegg
Jan, you should be able to split the non-aligning pairs in half as 2 18bp reads and align each split portion separately using Maq or one of the other short read aligners. If they overlay an inversion, one side or the other will align to the same chromosome; you can set some distance from the pair to decide what size inversions you want to handle for the case where the matching half is on the other end of the inverted block. - Brad Chapman
Brad, that's about the approach that I ended up using yesterday evening. Seems to work. - Jan Aerts
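A rough sketch of the read-splitting step described in this thread, not the code Jan actually used. `halves` follows Brad's suggestion (two 18bp pieces from a 36bp read), and `split_read` enumerates every split that leaves both subreads at least 10bp long, matching the 10 to 26bp range in the original question. Both function names and the example sequence are made up; the subreads would then be handed to a short-read aligner, which is outside this sketch.

```python
def halves(seq):
    """Split a read in the middle, e.g. 36bp -> two 18bp subreads."""
    mid = len(seq) // 2
    return seq[:mid], seq[mid:]


def split_read(seq, min_len=10):
    """Yield (left, right) for every split point where both parts are >= min_len."""
    for i in range(min_len, len(seq) - min_len + 1):
        yield seq[:i], seq[i:]


read = "ACGT" * 9                      # hypothetical 36bp unmapped read
left, right = halves(read)
all_splits = list(split_read(read))    # left parts run from 10bp to 26bp
```

Splitting at every position produces 17 candidate pairs per read, so for 12 million reads the simple halving is much cheaper and is a reasonable first pass.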
Chris Miller
Encouraging good development practices for non-professional programmers? - Stack Overflow - http://stackoverflow.com/questio...
"I collaborate with a number of scientists (mostly biologists) who develop software, databases, and other tools related to the work they do. . . . Does anyone have any suggestions for how to persuade people whose primary job isn't programming that it's of benefit to their community for them to be more open with the tools they've built?" - Chris Miller from Bookmarklet
I wish I had enough karma to vote down that first answer (the one about Perl) - Deepak Singh
I wish I remembered my password to my other account to vote up that answer (the one about Perl) XD - Paul J. Davis
lol - Deepak Singh from IM
People shouldn't be getting grants without having a plan for sustainable software development and long term data storage. I wonder if grant reviewers have gotten more clueful about such issues in the past few years? - Eric Jain
Eric ... Alas no. Although they are beginning to ask the right questions - Deepak Singh from iPhone
Sure would be interesting to see the funding agencies require posting code developed using public money to a repository somewhere and then have journals require links to code before publication similar to genome papers. Even something as simple as versioned tarball hosting. - Paul J. Davis
Hari
Anybody have recommendations for a python library that does cool graphics and text outputs to pdf and png - planning on generating some reports in an app I am writing
matplotlib - Rajarshi Guha
svgfig / cairo - Michael Kuhn
reportlab - Paul J. Davis
I use mostly matplotlib, but also the Python Imaging Library to do some finishing touches like cutting and pasting, cropping, image resizing, text, and basic drawing elements. - Bosco Ho
Thanks everyone. I have started playing around with the matplotlib examples. What I want to generate is a 96-well plate with a color histogram or bars to reflect the concentrations of two to three crystallization reagents. Overlaid on this graphical representation would be the associated concentrations as numbers. - Hari
I guess you can get some idea with recent post of Rajarshi Guha where he implemented Plate Well Series Plots in R http://blog.rguha.net/?p=388 - Abhishek Tiwari
While we're on the subject can anyone point me in the direction of some matplotlib code or examples that output directly to a graphics file (i.e. without plotting to the screen). Thinking about making a simple web service for some data fitting. Actually if there is a nice example of using matplotlib to plot into a rendered screen in AppEngine that would be very cool as well. Or a list of curve fitting type services built on AppEngine. - Cameron Neylon
@Cameron, basically you need to use a backend that doesn't require X. Example http://helpful.knobs-dials.com/index... - Rajarshi Guha
Really hoping that someone has written a class that can print out reports for 96-well plates. Really hoping for something that gives fine-grained access like grid(A,12).setbackgroundcolor(red). Currently struggling with layouts and font sizes and such in reportlab. - Hari
Why not display a heatmap type image of a matrix representing the plate (http://pyweaver.worldhoppers.org/api...) - Rajarshi Guha
Hey Rajarshi, thanks for the pointer to pyweaver. I will start looking at matplotlib to generate graphics. In the meantime I have used reportlab. Thanks a lot, Paul J. Davis, for pointing me to this. Reportlab now generates pretty PDFs of my dispense lists. My PDF generation code is not super object-oriented yet, but that's a todo for the future. You can see a report (work in progress) at... more... - Hari
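The kind of per-well access Hari asks for (something like grid(A,12)) could look roughly like the sketch below. `Plate96` is a hypothetical class written for illustration, not part of matplotlib, reportlab or pyweaver. It only stores one number per well; the resulting 8 x 12 list of lists is the sort of matrix that could then be drawn as a heatmap, as Rajarshi suggests.

```python
import string


class Plate96:
    """An 8 x 12 grid of values addressed by well name, e.g. "A12" or "H1"."""

    ROWS = string.ascii_uppercase[:8]   # "ABCDEFGH"
    COLS = range(1, 13)                 # columns 1..12

    def __init__(self, fill=0.0):
        self.values = [[fill] * 12 for _ in range(8)]

    def _index(self, well):
        """Translate a well name into zero-based (row, column) indices."""
        row, col = well[0].upper(), int(well[1:])
        if row not in self.ROWS or col not in self.COLS:
            raise KeyError(well)
        return self.ROWS.index(row), col - 1

    def __getitem__(self, well):
        r, c = self._index(well)
        return self.values[r][c]

    def __setitem__(self, well, value):
        r, c = self._index(well)
        self.values[r][c] = value


plate = Plate96()
plate["A12"] = 0.5   # e.g. a reagent concentration for well A12
plate["H1"] = 2.0
```

Keeping the data model separate from the drawing code means the same plate object can feed a PDF report or an image without changes.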
Jan Aerts
Need help. Have tens of millions of readpairs mapped to the genome. Want to remove those readpairs that link two loci on the genome but are not supported by any other readpairs. Just want to keep those pairs that cluster (on _both_ sides). Any ideas on how to do that?
It's like a BAC-END mapping isn't it ? I would build a BerkeleyDB with all the pairs of forward/reverse shortread-ids where the orientation (F/R) and the distance(?) would be ok with your requirements. Then, take this list of pairs and I remove all the short reads of your first list. - Pierre Lindenbaum
I'd define a cluster test (ie, same molecule, strand, left and right loci) and then just iterate over all read pairs and add to an existing set if the test passes or create a new one. Should be able to do that fairly quickly if you index the sets as appropriate. For bonus points, compress each set's coordinates for the loci pairs. - Paul J. Davis
Try SeqAnswers - lots of folks worried about similar questions. http://seqanswers.com/ - Heather
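One way to read Paul's "define a cluster test and iterate" suggestion is sketched below, with a deliberately crude test: two readpairs belong together if both ends fall in the same fixed-size window on the same chromosomes. The function name `supported_pairs`, the tuple layout, the 500bp window and the example coordinates are all assumptions made for illustration; strand and orientation checks are left out.

```python
from collections import defaultdict


def supported_pairs(pairs, window=500, min_support=2):
    """Keep only readpairs whose two ends both cluster with at least one other pair.

    pairs: iterable of (chrom_a, pos_a, chrom_b, pos_b) tuples.
    """
    clusters = defaultdict(list)
    for pair in pairs:
        chrom_a, pos_a, chrom_b, pos_b = pair
        # The key bins both ends, so a pair needs company on both sides.
        key = (chrom_a, pos_a // window, chrom_b, pos_b // window)
        clusters[key].append(pair)
    return [
        pair
        for members in clusters.values()
        if len(members) >= min_support
        for pair in members
    ]


pairs = [
    ("chr1", 1010, "chr5", 20040),
    ("chr1", 1120, "chr5", 20110),   # supports the pair above on both sides
    ("chr1", 1200, "chr9", 777),     # left end clusters, right end does not
    ("chr2", 5000, "chr2", 90000),   # unsupported singleton
]
kept = supported_pairs(pairs)
```

Fixed windows will split clusters that straddle a bin boundary, so a real implementation would more likely sort by position and merge neighbours within a distance threshold. For tens of millions of pairs the dictionary could be replaced by an on-disk store such as the BerkeleyDB Pierre mentions.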
Andrew Clegg
Does anyone know who invented the PDB file format? And where they live?
I feel a case of actual bodily harm coming on - Andrew Clegg
After PDB, can we do Genbank and all three versions of GFF? - Paul J. Davis
someone is having fun parsing - Pedro Beltrao
I love the way there's actually loads of PDB parsers out there but none of them do quite what I want..! - Andrew Clegg
I like BioPython's parser, but yeah it doesn't do everything I want either. To be fair I'm not handling standard PDB files, but one of the many bastardizations produced by other programs. - Jason Winget
http://url.ie/1zj7 might give you some clues. Writing a PDB parser is a rite of passage for anyone doing structural bioinformatics. - Tom Walsh
Thanks Tom, all those guys will now be first up against the wall when the revolution comes - Andrew Clegg
Sadly the revolution (mmCIF, XML) has come and gone but PDB format lives on. - Tom Walsh
Will the address of Helen Berman do as a proxy? - Bosco Ho
Back in THE DAY when I was a first year grad student (early mid 90's ;), I did a rotation with Jane Richardson and I remember asking 'why do they put the residues in 13 columns' and she was perplexed as well...I had high hopes when the RCSB took over, but seems things are the same? - Mary Canady
Noah Gray
Amazing visual illusion!! If all colors but the green and "blue" in the left picture are changed to black, the image on the right appears. The spiral of "alternating colors" is actually all the same color... (via Richard Wiseman's blog at http://richardwiseman.wordpress.com/2009...)
colors.gif
colors2.jpg
Seriously amazing. I'm fighting the urge to take it to GIMP to get the codes for the individual colors... - FFing Enigma (aka Tina)
incredible! had to zoom in 1600% to start believing it... - Thomas Lemberger
That's freaky... Put it in PS to test and it is the same color... The reason why it looks different is the perpendicular lines... the ones going through the "blue" spirals are pink and the ones going through the "green" spiral are orange. The orange shifts what we see toward yellow making the blue look green. - Lindsay
Somewhat related, another neat eye trick: http://faculty.washington.edu/chudler... See the section on finding your blind spot. - Paul J. Davis
Exactly, Lindsay. It is some sort of "mental averaging" of the colors within each spiral. One is bluish green with pink (i.e. red+blue), which pushes the average towards blue. The other is the same bluish green with orange (i.e. red+some green), which pushes the average towards green. In fact, if I just scale up the left image and put my face close to the screen so that the individual lines become more obvious then the illusion disappears. - Lars Juhl Jensen
ok.. but I had to check with GIMP..... - Pierre Lindenbaum
Paulo Nuin
So, if a software has a GPL license, am I able to distribute it with an interface that I created? Even if the original author doesn't know or doesn't want?
Of course, this is the point of the GPL, and author obviously wants you to be able to redistribute the software, or otherwise he wouldn't put it under the GPL. A courtesy "thank you" to the author or the project would be nice though. And if you make some changes to the code under the GPL, you must provide the changes to everybody who wants it and uses your software (i.e. who ever you... more... - Vedran Rodic
Thanks Vedran. I was just checking because some time ago the author mentioned that he didn't want the software distributed with my interface because he wanted to control all copies of the program via registration on his webpage. I'm thinking of including his code and mentioning everything in a README and help files. - Paulo Nuin
If he is the sole author of the code and he wants to change the license on a new version, he could do that any time, but the copy (version) you have is still GPL licensed, and all GPL rules should apply. - Vedran Rodic
Also, if you're planning on a non GPL license be sure to read the fine print on what constitutes linking at fsf.org. must. not. rant... - Paul J. Davis
That's what I want to confirm, Vedran, thanks again. Paul, it will be GPL too. - Paulo Nuin
Why would anyone release code under a GPL and then say they didn't want other people to distribute it? *facepalm* - Andrew Clegg
It happened, for real. I might even have the emails to prove it. He wanted full control of all copies that were downloaded from his website, so when I politely asked to distribute the executable (and source, license, etc) along with the interface, so my life would be easier, he said he didn't want me to. I accepted, so it's my fault too. - Paulo Nuin
If he released the code under the GPL, he has no choice. You may modify it, release it and distribute it as you please. If that's not what the original author wanted, tough. The onus is on the author to RTF license; caveat lector. Thanks for your code; next time think harder about it. (The same goes with people who CC license photos on Flickr and then say "Please contact me before using this photo." I'll attribute, but you explicitly gave everyone permission to re-distribute your work.) - Chris Lasher
I'm with Chris on this one. If he gave you a copy of the code with the GPL license then he doesn't have the option of preventing you from re-distributing copies. - Paul J. Davis
Yes, I understand. At the time I didn't want any animosity with the person. Now, I don't care anymore. - Paulo Nuin
It does sound like the original author doesn't understand the license that he has slapped on the code. I hope too much love isn't lost - it's always better to sort things out under the 'social code' before resorting to the 'legal code'. - Andrew Perry
@Andrew - Well put. - Paul J. Davis
Iddo Friedberg
uBlog: Subgraph isomorphism to understand functional evolution OCCBIO 2009 http://www.occbio.org #bioinformatics
Jing Li from CWRU is the speaker - Iddo Friedberg
Took a while to understand what exactly the graphs are representing. - Iddo Friedberg
My best guess is protein protein interactions. Work is not presented clearly. - Iddo Friedberg
Goals are unclear. Seems like a lot of methodology but I do not understand the scientific question. - Iddo Friedberg
Idle curiosity, did the presenter cite a reference for the algorithm in finding subgraph isomorphism? - Paul J. Davis
@Paul sorry for being remiss. Not that I remember, sorry. - Iddo Friedberg
@Iddo No prob. I'll have to go in search of a reference. I couldn't find anything on the site, but it sounds like some interesting work. - Paul J. Davis