I received a similar take down notice when working on a set of microformats for biology: http://dret.typepad.com/dretblo.... The disservice is all theirs.
I received a bunch about my old bioformats project, asking why they were designed outside of the sanctioned 'process', that they weren't really microformats, who did I think I was etc etc. Pretty blunt. I understand the need for some of it, but the bureaucracy around some semantic web projects is stifling.
- Matt Wood
How many of you have a professional programming qualification? Either from an educational institution (e.g. a CS degree) or something like Sun Java certification? Is it even possible, useful or desirable to get "recognised" qualifications in other programming languages? Or are most of us self-taught?
- Neil Saunders
Useful for what? For landing a job with an established software company, a CS degree seems like a good idea. For a job at an outsourcing company in India, certification is a must. But if it's actual programming skills you're after, nothing beats practice (even if it's just with an open source hobby project). Ideally you get to work with people who have more experience than you do, so you're not just "self-taught", but also "group-taught".
- Eric Jain
Useful for bioinformaticians and computational biologists. Just interested to know if this has affected career development for anyone; either within those subject areas, or if they've moved out into other jobs; e.g. people who've left academic life science research to become software developers.
- Neil Saunders
I would think in most cases, commercial and academic, a good track record with some awesome projects would trump certification. FWIW, we don't have many people with certs, but Masters and PhDs are still common.
- Matt Wood
Let me put it this way: You don't want to work for a place that filters candidates for a software development position based on their formal qualifications, rather than their experience. For what it's worth: I have some formal qualifications for being allowed near computers, but as far as I can tell that was never a factor (both in and outside of academia) for being invited to an interview or hired. Arguably that's a rather small and biased sample set, but there you go.
- Eric Jain
I got RHEL certified at one point, but it was only because work paid for it to happen. But I'm a 'biologist turned informatician' and therefore have no qualifications in any CS related field - just experience!
- Daniel Swan
I am self taught, no professional qualification. I thought of getting certifications but in the end decided not to, as I thought it wouldn't be useful getting them.
- Paulo Nuin
I have a double bachelor's degree in CS and Bio. I felt that it helped me get into a comp bio graduate school program, at any rate. As for certifications, I personally see them as a waste of time, as most biologists in academia couldn't care less how you munge their data, as long as you do it quickly and efficiently. I'd be interested to hear if things are different in industry.
- Chris Miller
The only reason I know some people got certified was because it helped them focus down and learn something they wanted to. I haven't been in any situation during a hiring decision where certification came into play. I'd rather be pointed to a website someone has developed or some code that's on SourceForge.
- Deepak Singh
I've never heard of anyone asking for professional bioinformatics certification. Publish, show your previous work, yes. But certification? Never.
- Andreas Matern
I have a bioinformatics masters degree, which was taken post-PhD. As part of that I was taught Java, but very much in a 'Java for Bioinformatics' style. Maybe my approach to projects earlier in my career would have benefited from some software engineering training, but you pick that stuff up as you go along :)
- Simon Cockell
I have no certification, just a master's in Bioinformatics where I was taught Java, Linux and R. I taught myself Ruby. I've got no interest in certification and I think Chris Wansworth's short essay is a good guide to follow - https://gist.github.com/0a2655a... . I think this echoes the same sentiments expressed above.
- Michael Barton
Surely easy deposition has to come first though?
- Cameron Neylon
Perhaps. Ideally all reads and writes would be created equal, but in reality a reasonable amount of heavy lifting is required at one end or the other. Given that data is usually written once and retrieved many times, I wonder if it's easier for those already generating and working with the information to jump through some deposition hoops once, rather than everyone being forced to do it at retrieval, time after time.
- Matt Wood
Agreed that access needs to be easy for users but put barriers in front of depositors and you will only get specific types of data (mostly big and well funded). But I don't know how to square the circle from easy to deposit blob to usefully described blob on a service.
- Cameron Neylon
Cameron, the barriers in front of depositors are cultural or systemic. The barriers in front of retrievers are often technical. Yes, we need to address both problems, but the technical challenges of retrieval are real, since we aren't retrieving small data sets any more.
- Deepak Singh
The most enthusiastic depositors are those whose peer-reviewed publication is tied to deposition. Get key journals to require deposition of data + metadata prior to publication, under a Public Domain or Attribution-required license, and the rest will follow.
- Andrew Perry
Deepak, absolutely agree - just was uncomfortable with "above all else". Tell you what, I'll let you and @mza get on with the technical challenges while I worry about the social ones. Division of labour and all that. @Andrew - this is absolutely true, but journals will not do this (and I agree with the logic on this) until they get a very strong steer from the community that this is...
- Cameron Neylon
@cameronneylon it's ridiculous to think you can separate the technical challenges from the social ones; one cannot be understood (and solved!) without the other. Trying to tackle the technical challenges without solving the social ones is like building a car without having the blueprint. Trying to tackle the social challenges without the technical ones is like building a rocket ship for the year 2150 and hoping someone will somehow magically solve the technical issues tomorrow.
- Alexander Griekspoor
Alexander, you won't get any argument from me on that, but Matt did start this manifesto with "there are no technical reasons...why an open data platform for science couldn't excel". In my view the technical problems are largely soluble with clear pathways for development, and I don't have the detailed knowledge to make a big contribution at the coalface. The social problems are much larger and require a more multipronged attack - which is directed by but not defined by the current technical capability.
- Cameron Neylon
What are the "social" issues? I'm confused by the use of the word in this context. Are we talking about cultural changes, such as acceptance of a more "open data" world, coercing people into using public repositories and so on? Or is this social as in social network? If the latter, I don't see the relevance to what Matt is discussing.
- Neil Saunders
Maybe cultural issues is better. But basically the fact that we have an entire social edifice built around control and secrecy driven by the need to publish. Fundamentally the problem that we need to rebuild the reward systems so that people actively take advantage of the potential of available technology. So yes, cultural rather than social perhaps, but I do think social networks or...
- Cameron Neylon
Searching Google Wave with "tag:the-life-scientists" will get you to "Research collaborations in Wave", a good starting point for life scientists.
- Martin Fenner
I don't get how you search in public waves. I've tried searching for tag:the-life-scientists and it gets no hits -- I think it's just searching my own waves
- Andrew Clegg
There was a thread by Kol about Wave usernames, but I couldn't find the link.
- ffcode
Aha -- with:public . They really should include a button for that
- Andrew Clegg
An undergraduate student in our lab, Caleb, just got his wave invite. I told him to look at this thread for possible people to connect with.
- Steve Koch
Afternoon all. I've written my first robot, which hopefully will embed an interactive mass spectrum into a blip whenever a UniProt name is encountered in the text, and corresponding mass spec data is found for this protein. I say "hopefully", as I've not been able to test it for real, as, alas, I have no account. When are the next batches released? If it's not for ages, does anyone fancy testing it anyway?
- Neil Swainston
Not sure you got all the different colors! I guess I can forgive you, though, as it would probably take more than 140 characters :)
- Allyson Lister
You're absolutely right. I missed out gold and chocolate and mauve and cream and crimson and silver and rose and azure and lemon and russet and grey and purple and white and pink and orange. They all inherit from super. ; )
- Matt Wood
They all inherit from super? Ah, Fridays :D
- Allyson Lister
I'd like images of the twenty standard amino acids in encapsulated postscript, but I can't find them on the web. Is it feasible for me to try and draw them in some chemistry program? Can anyone recommend where I might find these images?
If you can get an SD file containing the amino acids, I can give you the collection of EPS files using software my company has created for doing this kind of batch structure image processing. Also produces SVG and SWF.
- Rich Apodaca
Thanks for the tip. I tried mol2ps but couldn't get it to compile on OS X.
- Michael Barton
So far I've gone down the route of PNG -> JPG -> PS but the resolution is really poor.
- Michael Barton
SD file is in one form just a concatenation of molfiles joined by a "$$$$" line.
- Rich Apodaca
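For anyone assembling the file by hand, here is a minimal Python sketch of the layout Rich describes: molfiles concatenated with a "$$$$" line after each record. The function names are made up for illustration and are not part of any toolkit. It assumes plain molfiles with no extra data fields between "M  END" and the delimiter.

```python
def join_molfiles(molfiles):
    """Concatenate molfile strings into SD-file text.

    Each record is followed by a line containing only "$$$$".
    """
    return "".join(m.rstrip("\n") + "\n$$$$\n" for m in molfiles)


def split_sd_records(sd_text):
    """Split SD-file text back into individual records.

    A record ends at a line that consists of "$$$$".
    """
    records = []
    current = []
    for line in sd_text.splitlines():
        if line.strip() == "$$$$":
            records.append("\n".join(current))
            current = []
        else:
            current.append(line)
    # Keep a trailing record that lacks a final delimiter, if it is non-empty.
    if any(l.strip() for l in current):
        records.append("\n".join(current))
    return records
```

Joining the twenty individual mol files with `join_molfiles` should give a single file that batch tools can read. Real SD files may also carry data items after "M  END", which `split_sd_records` simply keeps as part of the record.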
I have mol2ps running on STITCH. I had wrapped it in mol2png, but I quickly exposed the ps file as well: if you go to http://stitch.embl.de/images... (replace 1 by your favorite PubChem compound id) then you'll get a postscript image. Please ask for only one structure at a time, since I get the SDF from PubChem
- Michael Kuhn
@Rich I put the combined mol file as well as a tgz of individual mol files at http://drop.io/amino_acid . If you have the time to produce the resulting images file that would be fantastic.
- Michael Barton
@Michael I tried using the link but I couldn't view the resulting ps output. :S
- Michael Barton
What do you get? If it's a white page, look in the lower left for the image. Works fine for me with Firefox and the Firefox PDF plugin
- Michael Kuhn
@Michael, just sent you the set of EPS files. I had to make some minor changes to the SD file, but was able to download it OK in FF 3/Linux. Feel free to redistribute the result. It's also very easy to change the look of the output (colors, line widths, atom label sizes, etc.).
- Rich Apodaca
Not in time obviously - but a lot of the functionality could be similar. The trouble is there is a fundamental break between the "participant list" functionality of both Wave and Facebook (i.e. people have to add other people to the wave) and the public participation in Friendfeed which I think is at the core of what has worked well for the research community here.
- Cameron Neylon
There are groups in Wave which allow larger collections of people to follow updates, and you can certainly import Twitter/RSS into a Wave. However, the Google Wave client app wouldn't make a good replacement as it stands: the activation energy for getting started and staying up to date is much higher than Friendfeed.
- Matt Wood
I agree the client isn't easy to use, but that is not really the point in my view. It might be possible to build something that looked a bit like Friendfeed using the embed API and a Wave server with OpenID authentication, but I just think it wouldn't behave the same way in important ways. And the client isn't the protocol or the framework.
- Cameron Neylon
I agree with Cameron. It might well be possible to build something on top of the Wave APIs, however, from what I can see, there is a fair amount of 'magic' in Friendfeed from both the smooth user experience perspective, and the technical implementation. Either way, the Friendfeed public timeline would be very useful as a seed to provide context.
- Matt Wood
No, unlikely. More general adoption of PubSubHubbub might lead to decentralization of commenting on items of interest, which would render the need for a hub like FF obsolete.
- Ian Mulvany
I think Wave is more about collaboration than aggregation.
- Euan
Agree with Ian, PubSubHubbub may provide a solution. And definitely agree with Euan - Wave is about (controlled) collaboration, FF is more about uncontrolled commenting on aggregation.
- Cameron Neylon
Very nice. I should point out though that a "most emailed" story at BBC News is not a good measure of impact. I read a post from someone who pushed a story into that list by emailing himself only 5 times :-)
- Neil Saunders
Congrats. Not what I expected at all! I hope rails will be playing a large role.
- Neil Saunders
Or perhaps ranking papers by their citations to/from the literature to suggest further reading? Consider that a wishlist item, btw :-)
- Chris Cotsapas
@Chris, that's a major feature I'm pushing for in @Mendeley, as well. Last.fm-style recommendations are coming first, since that's their background, but "times cited" will definitely be an input into that.
- Mr. Gunn
No, no, no! We don't need any more pointless recommendation engines, iphone apps, or other trivia. Just get the core software working properly, e.g. please do something about the fact that not every document on my hard disk is appropriately classified as a "paper".
- Matt Leifer
All great feedback - thanks, folks! We'll definitely have more to talk about in a few months. ; )
- Matt Wood
I would have thought that not recognizing PDFs has more to do with poor/absent metadata than recognition. I suppose you could assume large font text is the title and search on it (if text PDF) but the rest is largely publisher-side, isn't it?
- Chris Cotsapas
@Chris Parsing PDFs is a fraught and error prone process: it's really a shame that they were adopted so widely. But they're here now, and so there's plenty of opportunity to add value!
- Matt Wood
That was my understanding. There's not much you can do with an unannotated image-based file...
- Chris Cotsapas
I don't know much about parsing PDFs myself, but there's definitely a difference between the old ones, which look like scanned copies of pages, and the new ones, which were converted to PDF from whatever other format they were in.
- Mr. Gunn
Some publishers are starting to add better metadata to their PDFs, but it's still an uphill struggle across the board.
- Matt Wood
"ActiveResearch is a great opportunity to meet and greet others working with Ruby and Rails in a scientific or technology discipline."
- Martin Fenner
That is, let's assume for sake of argument we have in a single place the entire full-text of every scholarly article (in English, for sake of argument) back to the first journal (the Royal Society?) How big is that? 60 million articles? (I'm pulling a number out of the air.) How much CPU time to do a basic Lucene index, and how much time to do a full semantic index?
- Richard Akerman
The second question is "how much CPU do you need to serve up the search index", to which I would think "no more than Google currently has" would set the ceiling. Not sure what the floor would be.
- Richard Akerman
Building an index of the full Medline corpus (which is just abstracts) takes a little over 2 days on a quad-core, high-memory (32 GB) machine, as I recall. I used MG4J over Lucene though: cutting edge compression and indexing.
- Matt Wood
William Hayes told me it was about 250 CPU-hours (an admittedly hazy term) to build a semantic index using MEDLINE (18 million abstracts) and the small true open access bits of PMC (about 50k articles)
- Richard Akerman
Not enough to make the CPU/server requirements a serious barrier to just doing it, would be my answer. 250 CPU-hours, or a couple of days on a fast machine, feels about right to me. I wouldn't have thought serving it would be terribly hard: probably less than a terabyte of total data if it's just the text, and presumably a reasonably parallelizable process?
- Cameron Neylon
I think it may be an interesting point to make (perhaps an article?) to show just how (relatively) little compute power and storage it would take - achievable with Amazon S3/EC2 - to store and index the entire scientific (journal article full-text) literature back to the Royal Society days.
- Richard Akerman
@Cameron we're already into multi-TB with just the PMC complete set (about a million articles) - it will depend on the storage formats being used
- Richard Akerman
@Richard - yes obviously the format is important in that statement. Was thinking afterwards that I was probably underestimating the amount of text and other gubbins you'd need to store.
- Cameron Neylon
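The gap between "less than a terabyte" and "multi-TB" in the exchange above comes down to bytes per article, so a back-of-envelope calculation may help. This is only a sketch: the 60 million article count is Richard's out-of-the-air figure, and both per-article sizes are assumptions picked for illustration, not measurements from PMC or anywhere else.

```python
def estimate_storage_tb(n_articles, bytes_per_article):
    """Return the total corpus size in decimal terabytes (1 TB = 10**12 bytes)."""
    return n_articles * bytes_per_article / 10**12


# Hypothetical per-article sizes; neither figure comes from the thread.
ASSUMED_SIZES = {
    "plain text": 50_000,           # roughly 50 KB of extracted text
    "PDF with figures": 2_000_000,  # roughly 2 MB per file
}

N_ARTICLES = 60_000_000  # Richard's guess at the size of the literature

for fmt, size in ASSUMED_SIZES.items():
    print(f"{fmt}: {estimate_storage_tb(N_ARTICLES, size):.1f} TB")
```

Under these assumed sizes the same 60 million articles come to 3 TB as plain text and 120 TB as PDFs, which is why the storage format matters so much to the answer.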
Hi gents... I think the corpus of scholarly literature in the 21st century is upwards of 700-800 million articles. Could even be a billion documents when you consider articles in the deep web. In other words, massive.
- Dean Giustini
But this brings up the issue that Nicholas Baker (who is otherwise a jerk) and critics of Google put forth--that such initiatives tend to damage the documents that are scanned in such projects and that such projects tend to lead to the physical disposal of the original materials. It is not just a matter of computing power.
- Hope Leman
@Dean I think you need to scope "scholarly literature". I can't imagine the journal literature as we understand it (in English at least) is more than about 100 million articles - I'm pretty sure my organisation (and OhioLink, Scholars Portal) and others have a substantial portion (30%?) of it already spinning on local hard drives. I guess it's a set of questions: 1) how big is it if you get all the *currently available* (digital) journal lit in one place 2) how big if you digitise everything else...
- Richard Akerman
... 3) how big if you add in books 4) how big if you add in "everything scholarly" in text 5) how big if you add in data (I know for this last item once you add data we get into the petabytes *very* quickly)
- Richard Akerman
The main bottleneck is not so much "basic" indexing (which can often be done almost as fast as the data can be obtained) but all the additional processing required to make the index useful (smart tokenization, stemming, synonym substitution, detecting duplicates, pageranking etc). Going beyond 100M documents or 1TB of data a distributed approach seems like a must, see e.g. http://queue.acm.org/detail... which discusses hardware requirements for Nutch (based on Lucene).
- Eric Jain
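To make the distinction between "basic" indexing and the extra processing concrete, here is a toy Python sketch with a crude tokenizer, exact-duplicate detection by fingerprint, and an in-memory inverted index. It is not how Lucene, Nutch or MG4J work internally; the function names are invented for illustration, and stemming, synonym substitution and ranking are left out entirely.

```python
import hashlib
import re
from collections import defaultdict

TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return TOKEN_RE.findall(text.lower())


def fingerprint(tokens):
    """Hash the normalized token stream to spot exact duplicates."""
    return hashlib.sha1(" ".join(tokens).encode("utf-8")).hexdigest()


def build_index(docs):
    """Build an inverted index from a dict of doc_id -> text.

    Returns (index, duplicates), where index maps token -> set of doc ids
    and duplicates maps a skipped doc id -> the id it duplicates.
    """
    index = defaultdict(set)
    seen = {}
    duplicates = {}
    for doc_id, text in docs.items():
        tokens = tokenize(text)
        fp = fingerprint(tokens)
        if fp in seen:
            # Same normalized content as an earlier document: skip it.
            duplicates[doc_id] = seen[fp]
            continue
        seen[fp] = doc_id
        for tok in tokens:
            index[tok].add(doc_id)
    return index, duplicates
```

Even at this toy scale the index loop is the cheap part. Each additional normalization step adds a pass over every document, which fits Eric's point that the processing around the index is the main cost.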
@Eric great pointer - will have to think about how much Nutch experience (which was mostly for Internet Archive?) would map to indexing the scholarly literature.
- Richard Akerman
Confirmed: I'll be speaking at RailsConf! 'Orchestrating the Cloud': using Cloud approaches and Ruby to get stuff done: http://en.oreilly.com/rails20...
In case you missed it, RailsConf accepted _everyone's_ proposal today! An accident, apparently. The 'testing' jokes rang loud and clear on Twitter: http://search.twitter.com/search...
- Matt Wood
Ah darn. Of course in this case, an inevitability
- Deepak Singh
Here's one for the programming fans. I'm good at Java & competent at Perl. I'm itching to learn a new language that's less wordy than Java and less easy than Perl to write bad code with. I'm thinking Python, Ruby, or maybe Groovy or Scala to leverage my Java. Suggest a language and persuade me :-)
Python. Ease of learning, fast prototyping, lots of free packages available, "endorsed" by Google... and, one of the most important: it's a pleasure to write Python code. And, if you need to do math with it, there's a good package: http://www.sagemath.org/
- Arnaldo M Pereira
Any experience of Jython? I like the idea of being able to use all the Java libraries as well...
- Andrew Clegg
Ruby. Pure object-orientation, elegant idioms, powerful meta-programming. And just plain fun.
- Louis Simoneau
Never heard of it, until now. The idea seems weird to me..
- Arnaldo M Pereira
If you've spent 6 months learning a specific Java toolkit (in my case Apache CXF for web services) the idea of being able to keep that investment seems appealing. But only if the implementation's good...
- Andrew Clegg
Groovy would be the easiest one to pick up given your Java experience. It's much more concise than Java but it's easy to leverage the Java APIs if you need them. It'll definitely give you the most seamless fit with Java.
- Tom Walsh
Python and Ruby are broadly similar in ease of use and elegance. In general I think Python has a stronger base for the sciences, while Ruby is a little more focused on the web. For a while it looked as if Ruby was going to overtake Python, but recently Python has gained traction and is moving up the popularity ladder fairly quickly. There's a significant advantage in using a popular language: better support and less reinventing the wheel.
- Ian York
Ruby: because whitespace shouldn't matter :)
- Chris Miller
+1 What Chris said, I hate the idea of significant whitespace in a language, and it's always put me off learning Python I'm afraid :(
- Daniel Swan
You can't go wrong with ruby or python. Personally, for some of the reasons listed, I prefer Ruby
- Deepak Singh
Python. And I don't need to persuade you, the language does. You wouldn't go wrong with Ruby either.
- Paulo Nuin
In all seriousness, if you're familiar with Perl, Ruby won't be much of a stretch. In a lot of fundamental ways, it's like perl, sans the ugly syntax that gets in your way, plus real OO.
- Chris Miller
+1 ruby. Although you won't do wrong with python either.
- Jan Aerts
''C''. Yes I know, in this world of modern 'frameworks', OOP, scripting and web applications it's a little bit provocative. But 1) I've got the feeling that fewer and fewer people know C and learning it can be an asset. 2) The BLAST algorithm was written in C, and BLAT, and... 3) Managing the memory yourself can be a challenge. Oh? Did you say less wordy than Java?... hum ;-)
- Pierre Lindenbaum
You can write very bad code with ANY language. Of course, with Java/C# code is wordy so you think a little more before writing too much crap - but that's it.
- General Kafka
for something mind transforming, learn a functional programming language, such as ML.
- General Kafka
Instead of a new language entirely, you could learn some new facets of development: additional skills in testing, automation, patterns and anti-patterns are useful in any language.
- Matt Wood
Surprised no one has mentioned C or Assembly. Once you have a good foundation on exactly what a language is abstracting you'll find that learning new languages is trivial. Following Matt's guidance, learning basics of things like file systems and network protocols is a good idea.
- Paul J. Davis
Oh, missed Pierre's comment. But I think we agree on the why.
- Paul J. Davis
+1 Ruby. Maybe R if you want to do a lot of statistics :) ! There was a poll about *Which computer language are you most interested in learning (next) for bioinformatics R&D?* at Bioinformatics.org http://www.bioinformatics.org/poll...
- Khader Shameer
You all are going to hate it when we start sharing all the LabVIEW code we're writing :)
- Steve Koch
Interesting that Ruby and Python are about half and half here but Python's waaay ahead in that poll. Re. C, dabbled ages ago, but it's not where I'm at now really. And too wordy :-) @Kafka yeah I was thinking about that, hence mentioning Scala. But you can do functional in Python and Ruby too right?
- Andrew Clegg
Doesn't it depend why you're learning it? I think if you already know Perl, Python or Ruby shouldn't be much of a leap, so personally, I would only bother if it was needed for a specific project. As suggested, maybe C or network protocols if you want to get a better grasp of computing. R / Bioconductor looks good on a bioinformatics CV and is genuinely handy to know. Not a new language, but have you had a look at Moose.pm? Takes some of the *urgh* out of OO Perl.
- Cass Johnston
Groovy, Ruby and Python all support some functional programming idioms; Groovy's FP goodness is one of the things that makes it much more palatable than Java for those of us who prefer dynamic languages. If you really want to experience FP then Scala looks like the one to go for. Clojure (Lisp on the JVM) is also worth a look and there's always Haskell if you're willing to leave the Java world completely.
- Tom Walsh
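As a small illustration of the functional idioms mentioned above, here is what they might look like in Python, using only the standard library. The sequences and the GC-content helper are made-up examples, not taken from any of the packages discussed.

```python
from functools import reduce


def gc_content(seq):
    """Return the fraction of G and C bases in a sequence (0.0 if empty)."""
    gc = sum(1 for base in seq.upper() if base in "GC")
    return gc / len(seq) if seq else 0.0


seqs = ["ATGC", "GGCC", "ATAT"]  # toy sequences for illustration

# filter: keep only sequences with more than 50% GC
gc_rich = list(filter(lambda s: gc_content(s) > 0.5, seqs))

# map: apply a function to every element
lengths = list(map(len, seqs))

# reduce: fold a list down to a single value
total_length = reduce(lambda acc, n: acc + n, lengths, 0)
```

Functions are passed around as ordinary values here, which is most of what "doing functional" means in Python. Immutability and pattern matching are not enforced by the language the way they are in Scala or Haskell.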
+1 for Cass's recommendation of Moose for Perl.
- Tom Walsh
@Cass - London.pm techtalk looks v. cool and if I lived about 600 miles closer to London I'd definitely be there. Would be nice if the slides from the talks are online at some point.
- Tom Walsh
@Tom Walsh: You on the london.pm mailing list? If not, I'll give you a yell when they post the slides.
- Cass Johnston
@Cass Thanks for that Moose link. I might go along. One of my workmates uses it extensively and it's about time I learnt about it.
- Andrew Clegg
Python is definitely my vote, but like Cass said above it will be very dependent on what type of project you want to use it for. For instance, using Jython to call CXF libraries will not give as much exposure to thinking about problems in the way python programmers learn to do. That's the major advantage of learning a new language, since you can take away the concepts to your regular daily work.
- Brad Chapman
do you twitter from your phone? People who follow my twitter get texts when I update my twitter. right now that is only one person, but still.
- Anthony Salvagno
No - in fact I've only just piped friendfeed into twitter - where I realised several people had tried to raise me over there where I wasn't monitoring. What I'd like to do is bring it into a separate friendfeed stream that I can easily dip into when there is time but ignore when there isn't. I don't really want another client on my desktop either if i can avoid it
- Cameron Neylon
Anthony, the texting-on-update behavior is set per-user by the follower.
- Mr. Gunn
Good point - it would be nice to have lists for our own feeds, as well as people. Could you make an imaginary friend, which just has your twitter stream attached to it, and add it to a 'when I have time' list?
- Matt Wood
Also thinking about this. Can't you feed it into a separate room?
- Matthew Todd
There is a problem with the feed of your own stream out of twitter - they seem to all be secured via username and password. What I would like to do is pull out the stream that I see when logged in at twitter. The other approach is to set up imaginary friends for all the people I want to follow but that seems like a faff
- Cameron Neylon
Matt, that's exactly what I did. I added the pipe as an imaginary friend.
- Mr. Gunn
To Matt Wood - Hi Matt if you are reading this, could you make Euan Adie an administrator of this room? The Nature Network consolidated feed is not updating here whereas it is in Google Reader, so Euan might need to take a look. My admin rights aren't sufficient to make him a co-admin. Thanks! Maxine.
And what's the throughput of the automatic and manual annotation?
- Egon Willighagen
Egon, for things that aren't covered by automatic annotation, I need between 6 hours and a week per protein sequence to do some serious annotation. So my weekly throughput is optimistically around 9000 bases a week ;).
- Pawel Szczesny
Each sequenced sample is automatically analysed for base calling and quality: that runs a 1000 core cluster flat out. Secondary analysis (alignment, SNP calling, annotation) is performed after that.
- Matt Wood
Matt, what does that mean? All local software/data repositories get converted to Git? What (graphical) clients are people expected to use?
- Egon Willighagen
How does this integrate with non-Git specific software? A BioMart bridge? Or services?
- Egon Willighagen
This is just an additional option for those using our central source code repositories. RCS, CVS and SVN have been in place for a while - good to see Git getting some traction in the life sciences.
- Matt Wood
I would help out, even though I am no Ruby expert.
- Paulo Nuin
I wonder if something along the lines of the Peepcode or Gitcast screencasts would be cool?
- Matt Wood
Gitcasts lite would be nice. Essentially a world in which the examples are about bioinformatics-related topics and not blogs and shopping carts :)
- Deepak Singh
That's the sort of thing I had in mind - teach Ruby from a bio perspective: define classes with biological relevance, using ActiveRecord in biology, show use of @jandot's Ensembl API etc. Perhaps some podcast discussions too.
- Matt Wood
Stands in front of line. I say go for it.
- Deepak Singh
I'd be interested in learning some Ruby, especially if it's bioinformatics-related.
- Walter Jessen
Yep. And I'll try to find time to contribute :-)
- Jan Aerts
That would be great - It would be great to help introduce the rest of my lab to ruby and programming. -r
- Rob Syme
I'm in a ruby bioinformatics lab - I may be able to contribute a guest post or two, and I'd definitely read the blog.
- Chris Miller
Don't know about a blog, but I'd love to help write a wiki book. We could start by demonstrating every bioruby method by example.
- Neil Saunders
The ruby_for_bioinformatics site could be a repository on GitHub, written in the format of Tom Preston-Werner's blog engine. That way everyone can contribute sections via Git. The site could then be automatically hosted on GitHub at ruby_for_bioinformatics.github.com.
- Michael Barton