I've been using BWA + SAMtools for coverage estimates and SNP calling. I still find SOAP and Novoalign useful for partial alignment and conditional trimming, but BWA is definitely the critical tool in the NGS toolkit.
- Jason Stajich
@Neil You're right, I should take plasmids into account. If there are large plasmids in our sample, this could reduce our coverage even more.
- Michael Barton
-c was the option I needed - it samples properly and you can choose between weighted and equal sampling
- Jason Stajich
You're doing yeast I assume? Did you weight it toward larger chromosomes?
- Michael Barton
Did you end up using their empirical model for solexa? I am trying out a bunch of different options with PE and different sized insertions, and +/- 454. I am planning on mira and velvet comparison too so we'll see.
- Jason Stajich
actually this project is for a plant ~400Mb, but will also do some of these for some other fungi - though not yeast of ~40Mb.
- Jason Stajich
The -c option samples proportionally to chromosome size, so no need to weight it yourself. It works just fine AFAIK - need to see how it ends up when the assemblies finish. Simulating 20-50X of 400Mb genomes with several different read sizes and sampling strategies, so it's taking a long time and a lot of disk space.
- Jason Stajich
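For anyone following along, sampling "proportional to chromosome size" can be sketched roughly as below. This is a minimal illustration, not the simulator's actual code: the function name `sample_read_starts`, the fixed read length, and the chromosome lengths in the usage note are all made up for the example.

```python
import random


def sample_read_starts(chrom_lengths, n_reads, read_len, seed=None):
    """Pick (chromosome, start) pairs for simulated reads.

    Each chromosome is chosen with probability proportional to its
    length, so longer chromosomes contribute more reads and the expected
    coverage is roughly even across the genome.
    Assumes every chromosome is at least read_len long.
    """
    rng = random.Random(seed)
    names = list(chrom_lengths)
    weights = [chrom_lengths[name] for name in names]
    reads = []
    for chrom in rng.choices(names, weights=weights, k=n_reads):
        # Uniform start position such that the whole read fits.
        start = rng.randint(0, chrom_lengths[chrom] - read_len)
        reads.append((chrom, start))
    return reads
```

Calling `sample_read_starts({"chrA": 1000, "chrB": 9000}, 10000, 100)` should place roughly 90% of the reads on the hypothetical `chrB`. Equal sampling would simply drop the `weights` argument.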
I'm using 454 as we've already paid up front to use this platform, and therefore the number of reads we'll have available will be less than 1M. We're probably looking at less than 20X genome coverage for our sequencing. I should try velvet and mira too for assembly since I think minimus is quite conservative. QRSA also seems quite good from reading the article, but I haven't found an option for metasim to produce quality scores yet.
- Michael Barton
With regard to disk space, I thought about trying a log scale for genome coverage, e.g. 8 + 2X, 4X, 8X, 16X to try and interpolate the amount of coverage vs number of contigs. I'm not sure what the curve might look like though, and as you write, time is a problem, particularly for assembly.
- Michael Barton
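The coverage arithmetic behind this can be sketched as follows: expected coverage is reads x mean read length / genome size, and a doubling series gives the log-scale levels (2X, 4X, 8X, 16X). The helper names are hypothetical, and the 400 bp read length and 20 Mb genome in the usage note are placeholder values, not figures from this thread.

```python
import math


def coverage(n_reads, mean_read_len, genome_size):
    """Expected fold coverage: total sequenced bases / genome size."""
    return n_reads * mean_read_len / genome_size


def reads_needed(target_coverage, mean_read_len, genome_size):
    """Number of reads to simulate to reach a target fold coverage."""
    return math.ceil(target_coverage * genome_size / mean_read_len)


def doubling_series(start, stop):
    """Coverage levels on a log2 scale, e.g. 2, 4, 8, 16."""
    levels = []
    level = start
    while level <= stop:
        levels.append(level)
        level *= 2
    return levels


def simulation_plan(mean_read_len, genome_size, start=2, stop=16):
    """Map each coverage level to the read count it requires."""
    return {
        level: reads_needed(level, mean_read_len, genome_size)
        for level in doubling_series(start, stop)
    }
```

For example, `simulation_plan(400, 20_000_000)` lists how many 400 bp reads each level would need for a 20 Mb genome. Because each level doubles the previous one, the whole series costs about twice the disk space of the largest run alone.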
Agreed with Jason. I like RDF and have played with it out of interest, but not run into lots of available data in my day to day work.
- Brad Chapman
Fair enough-- I've been sitting on gigabytes of RDF for a while, and using it in my commercial business (just because it's the early days); I'd be happy to post a few (possibly) interesting bits for you guys to try out. My only request is that we openly work together on it, possibly to do some demos. Cheers!
- Eric Neumann
To Eric and everyone who has "liked" this, please explain what this means. I understand the concept of RDF. I do not understand what you want Bio* projects to do with it.
- Chris Lasher
Chris. If the public data were formatted/normalized using RDF, most parsers and SQL queries used to cross the data between two or more databases would be useless. bio2rdf is a good example: http://bio2rdf.org
- Pierre Lindenbaum
Pierre, I interpret your comment as meaning "using RDF, there would be no need to develop nor maintain data and query parsers", which I pretty much agree with. I would hope that would free most informaticists to spend more time writing analytics and views, and less on bit swapping- IMHO...
- Eric Neumann
@Eric, yes, that's what I tried to say with my poor English skills ;-)
- Pierre Lindenbaum
I "like" most mentions of RDF + bioinformatics, it tends to get lively :-) I interpret the post to mean "there is more out there than we realise". If so, as the first 2 comments said, let's see it! I hear a lot of talk, I don't see much in the way of usable data or tools.
- Neil Saunders
++ to jason, brad and neil's comments. bio2rdf is a good start, but why do we even need RDF? BioMart is another warehouse that seems to work nicely on good old SQL. I think when people see a tool of comparable utility that depends on RDF they will take notice.
- Chris Mungall
Chris- I actually wasn't referring to bio2rdf- as you said it's a nice start, but not key for most informaticists; alternatively, I have converted GEO, mutdb, and pathway data into RDF. Was hoping you guys would find this useful to try investigating?
- Eric Neumann
I'd be interested in checking it out - but I've already paid the RDF tax. What the average bioinformatics hacker would benefit from would be a paper showing off some of this stuff with some clear non-waffly reasons why rdf works better (or not better) than sql, json, plain xml, couchdb, etc.
- Chris Mungall
A demo for RDF I'd like to see: extracting a list of genes from an RDF datastore for a given organism in a given region (chr:start-end). Demo n°2: those genes must be the descendants of a GO term.
- Pierre Lindenbaum
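To make the two requested demos concrete, here is a toy sketch in plain Python over an in-memory list of triples. It is not a real RDF store: every identifier and predicate name (`inTaxon`, `annotatedWith`, `subClassOf`, the `ex:` terms) is invented for illustration. A real setup would presumably run a SPARQL query against a triple store instead.

```python
from collections import defaultdict

# Hypothetical triples (subject, predicate, object); all names are made up.
TRIPLES = [
    ("ex:gene1", "inTaxon", "ex:yeast"), ("ex:gene1", "chromosome", "chr1"),
    ("ex:gene1", "start", 100), ("ex:gene1", "end", 500),
    ("ex:gene1", "annotatedWith", "ex:GO_child"),
    ("ex:gene2", "inTaxon", "ex:yeast"), ("ex:gene2", "chromosome", "chr1"),
    ("ex:gene2", "start", 900), ("ex:gene2", "end", 1500),
    ("ex:gene2", "annotatedWith", "ex:GO_grandchild"),
    ("ex:gene3", "inTaxon", "ex:yeast"), ("ex:gene3", "chromosome", "chr1"),
    ("ex:gene3", "start", 5000), ("ex:gene3", "end", 6000),
    ("ex:gene3", "annotatedWith", "ex:GO_child"),
    ("ex:gene4", "inTaxon", "ex:yeast"), ("ex:gene4", "chromosome", "chr1"),
    ("ex:gene4", "start", 200), ("ex:gene4", "end", 300),
    ("ex:gene4", "annotatedWith", "ex:GO_other"),
    ("ex:GO_child", "subClassOf", "ex:GO_parent"),
    ("ex:GO_grandchild", "subClassOf", "ex:GO_child"),
]

# Index by (subject, predicate) and by (predicate, object) for lookups.
SPO = defaultdict(set)
POS = defaultdict(set)
for s, p, o in TRIPLES:
    SPO[(s, p)].add(o)
    POS[(p, o)].add(s)


def one(subject, predicate):
    """Return one object for (subject, predicate), or None if absent."""
    values = SPO.get((subject, predicate))
    return next(iter(values)) if values else None


def term_and_descendants(term):
    """The term itself plus everything reachable via subClassOf edges."""
    seen, stack = {term}, [term]
    while stack:
        current = stack.pop()
        for child in POS.get(("subClassOf", current), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def genes_in_region(taxon, chrom, start, end, go_term=None):
    """Demo 1: genes of a taxon overlapping chrom:start-end.
    Demo 2: additionally require an annotation at or below go_term."""
    allowed = term_and_descendants(go_term) if go_term else None
    hits = []
    for gene in POS.get(("inTaxon", taxon), ()):
        if one(gene, "chromosome") != chrom:
            continue
        if one(gene, "end") < start or one(gene, "start") > end:
            continue  # no overlap with the requested interval
        if allowed is not None and not (SPO.get((gene, "annotatedWith"), set()) & allowed):
            continue
        hits.append(gene)
    return sorted(hits)
```

With this toy data, `genes_in_region("ex:yeast", "chr1", 1, 2000)` covers demo 1, and passing `go_term="ex:GO_parent"` adds the descendant filter for demo 2. The transitive walk over `subClassOf` is the part that a SPARQL 1.1 property path would express in a single pattern.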
"RDF tax" is a funny expression; do you mean RDF-XML or RDF in general? Most of my projects don't bother with RDF-XML; they're either RDF-N3 or RDF-JSON... remember, RDF is not a syntax, it's a data source and type binding semantic! As I said initially, I think most folks have missed what RDF is really capable of (and where it saves time). I will post my data shortly...
- Eric Neumann
I think we miss what it's capable of because we don't see demonstrations. My impression with many data formats, ontologies etc. is that once designed, their creators say "OK, we've made you a nice format, you all have XML/JSON/whatever parsers, get to work!" So here's my plea: when you create the format, create a small tool that demonstrates its worth, make it public and advertise it.
- Neil Saunders
Should direct this here in this thread also: How many of you know about/use MIT's Exhibit?
- Eric Neumann
Now, that's what I'm talking about. I didn't know about/use it before - but now I do/will (or might). Thanks!
- Neil Saunders
I'd be more than happy to share all my examples and code extensions of it to all of you... some simple examples under "demo" at http://eneumann.org/
- Eric Neumann
btw, just look at the source HTML of these examples, and that's all there is to creating your own facet RDF viewing app ("steal this code")
- Eric Neumann
OK, spent a little time playing with Exhibit and I am now officially really impressed. Giving me lots of ideas. But I'm wondering how large a JSON file can be? Hundreds, thousands, millions of items?
- Neil Saunders
I really think this is a game changer. Multithreaded now and comparable to BLAST for a lot of things we do. We're using jackhmmer as a replacement for some of the family building from a founding sequence member. If this works, and I hope it does, it will be a great solution for aspects of pipelines. And it doesn't require massive compute or FPGA solns. I'm really stoked.
- Jason Stajich
$4400 for 45-87X coverage. So that is say 150B bases for $5k. That would get you a lot of smaller genomes quite nicely too but not sure if economy of scale will work out to multiplex with this technology.
- Jason Stajich
Agreed, but are the type of people that run these conferences the same ones who embrace social media, or those who fear/dismiss it?
- Chris Lasher
They may fear it or really just not be aware of it. The CSHL aspect is a little different. I think there is a generation of scientists who are ready to do it, just need to prime the pump a little. I say we attack with pictures first, as everyone likes to see pictures of happy people from their conferences on the web (not too happy, mind you). Maybe I'll start adding a flickr/twitter tag...
- Jason Stajich
bioperl has a mantra that I paraphrase as "working code wins" -- meaning practical implementation trumps theoretical arguments for the "right" way of doing something. Anyone know where this is described?
Ahh, dates back at least to BOSC 2002 from a talk by Jason Stajich http://open-bio.org/bosc200.... I couldn't find it on bioperl.org which is why I thought I was remembering it wrong. But perhaps this has been retired as "official" Bioperl dogma?
- Andrew Su
Ah there was formal dogma? I was almost done with Perl by 2002 :)
- Deepak Singh
You know, I too was pretty much done learning perl in 2002, but the crazy thing is that my 2002-perl knowledge still allows me to do 90% of what I want to do. Anyway, hopefully Jason will chime in here with the official line...
- Andrew Su
It hasn't changed much since 2002. I am pretty sure Jason has the official line. I just liked the basic Perl motto: TIMTOWDI.
- Deepak Singh
Once every ~6 months I convince myself that I should drop perl, but every time there is something that needs to be done for yesterday and puff, there goes the motivation. It's so hard to change when the transition is a period of veryyyyy annoying inefficiency.
- Pedro Beltrao
Pretty sure Ewan Birney popularized this; I remember it along the lines of "the first one to code it wins." The idea being that you can discuss object models and the right way to do things all day, but in the end working, easy to use code is what people are going to adopt. I don't remember if there is an official description of the thought process; probably too busy coding.
- Brad Chapman
Unfortunately, this hides the idea that code is more than an algorithm. If you had some clearly written, well documented, simple and well designed code which a new version of blast had broken, and a pile of spaghetti which happened to have been written more recently, perhaps it would make sense to dump the latter and fix the former. You can go too far with perfect design, of course, but you can go too far with macho programming as well.
- Phil Lord
I think Ewan was the originator - I think I was just restating the motto. There was some design/developer philosophy in the bioperl.pod that also probably included this. The philosophy is that lots of mailing lists (where our dev is primarily coordinated) are filled with people spouting or complaining about things not being perfect, but in the end working code would win the argument.
- Jason Stajich
oh yeah, what Brad said, now that I am reading backwards!
- Jason Stajich
By "in lab" sequencing projects, do you also mean labs doing the assembly/annotation in house using reads sequenced elsewhere?
- Rob Syme
Sure - that covers fee-for-sequencing at a commercial entity as well as university sequencing cores (or just an Illumina machine in someone's lab, I guess). Really just a gut feeling estimate. Something it would be interesting to quantify as time goes on. We need both streams of data, I think - my concern is whether the place that checks quality and aggregates that data, when it comes from so many sources, becomes GenBank/EMBL?
- Jason Stajich