great, thanks for the link Rajarshi.
- Neil Saunders
That was a terrible post. RHIPE on the other hand rocks :)
- Deepak Singh
MapReduce does not equal Hadoop, and those R functions have very little to do with MapReduce. They just indicate that R can do functional programming. I agree with Deepak.
- Shiran Pasternak
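To make the distinction above concrete, here is a minimal sketch in Python (standing in for R, which the post under discussion used) of map and reduce as ordinary higher-order functions. Everything runs in one process on an in-memory list; the partitioning, shuffling, and fault tolerance that a MapReduce framework such as Hadoop provides are absent. The word list is made up for illustration.

```python
from functools import reduce

# Illustrative input; not taken from the post under discussion.
words = ["map", "reduce", "map", "filter"]

# "map": apply a function to every element.
lengths = list(map(len, words))

# "reduce": fold the mapped values into a single result.
total = reduce(lambda acc, n: acc + n, lengths, 0)

print(lengths, total)
```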
We all learned something from that then! There are quite a few of these "map without the reduce" articles out there.
- Neil Saunders
This should help us pick sets of diverse solvents.
- Andrew Lang
Is there a reason to use approximate clustering over an exact method? The dataset isn't very large if I recall. Is there a big difference with say standard hierarchical clustering methods?
- Rajarshi Guha
from iPhone
For small datasets it should do exact. Just to make sure I used the flag --all-pairwise. Didn't seem to change but you're right - I needed to make sure.
- Andrew Lang
"So. You, and a quarter of a million other folks, have embarked on a 1000-year voyage aboard a hollowed-out asteroid. What sort of governance and society do you think would be most comfortable, not to mention likely to survive the trip without civil war, famine, and reigns of terror?"
- Michael Nielsen
An interesting problem. But it seems problematic - if all our institutions (except, say, the church) are of approximately single-generation time spans, how do you go about hypothesizing that society X will last stably for 1000 years?
- Rajarshi Guha
How much more useful would a compchem SO be than CCL? (Well, you could easily filter out Gaussian questions!)
- Rajarshi Guha
Have you ever heard the same question twice on a mailing list? Also, fewer ads...sorry, I mean commercial announcements. You'd also have way more people learning about Open Source software. Take Linux4Chemistry for example - that's advertising Open Source software by having it on the same footing as proprietary software.
- Noel O'Boyle
Yeah, the commercial announcements actually make me lose interest in the CCL... and actually make me very reluctant to mention the CDK when people ask about it...
- Egon Willighagen
I'm not sure I understand the animus towards commercial s/w. Yes, ads bug me as well, but it's not like CCL is deluged with them (unless you consider Gaussian help requests). And why shouldn't the CDK be mentioned in the presence of commercial s/w announcements?
- Rajarshi Guha
Oh, there is no reason not to mention the CDK... it's just that I think there are already 'but our product' replies enough... So, I just stopped doing it.
- Egon Willighagen
The BlueObelisk StackExchange is more targeted and interactive; I can update my replies and give more detail.
- Egon Willighagen
@Egon, all the more reason to answer the questions first :) Then everybody else is "me too"
- Rajarshi Guha
@Pierre, thanks for the catch. I actually do get the correct data. The absence of bars seems to be a limitation of the Google Charts API and the max size of the chart image. If the X-axis is too long, the bars need to be thinner for it to fit into the plot area, and you can only go so thin before you lose readability. Like I said - this is a quick hack :) Ideally I'd use matplotlib or R as the backend for plotting, but am lazy right now
- Rajarshi Guha
interesting, and very cool. One question/comment, does the Google Charts API by default adjust the image so that the x-axis crosses at a non-zero point on the y-axis? I'm sure that breaks one of Tufte's rules about misleading graphs...
- Andrew Su
@Andrew - yes. It took me a bit to realize that the count for the first year was not always zero! The Charts API does have a way to set the y range, but I'm going via the pygooglechart Python module which doesn't seem to support that.
- Rajarshi Guha
i was about to link to my pubmed trends tool, but i see it has already been done :) it is indeed slow -- a clunky php proof of concept really. (it has been especially slow today because apache filled itself with processes that were trying to do i'm not sure what...) http://www.cotch.net/assed...
- Joe Dunckley
Bosco, this is a superb idea. Along with starting up a new journal/software hybrid, it would be great if existing journals insisted that users submit the source code, executable or VM of a bioinformatics software / database / server to a centralized repository like 'biohub.org'.
- Khader Shameer
While not linked to an actual repository (but rather providing a snapshot of the s/w and data for the article), the Journal of Statistical Software does pretty much this
- Rajarshi Guha
I would take this further and have the article text remain in the revision repo. The reviewers are sent to the article, not the other way around, and it can be forked in just the same way the software can
- Frank
from iPhone
@Frank, this makes sense, since otherwise the paper would be static and refer to old versions. But then this assumes that as the s/w is updated, so is the paper
- Rajarshi Guha
@Rajarshi not necessarily. The paper should state which version/revision it refers to. It does not have to keep up with the sw. That is what documentation is for :)
- Frank
from iPhone
The more I think about it, the more I think some big-wig bioinformaticians should do a deal with Google Code to edit a journal. That might even align with Google Scholar.
- Bosco Ho
@Frank, in that case, why bother with a VCS? Why not just put a tarball with the source code for the version that goes with the paper?
- Rajarshi Guha
Great idea, but I can't see it working for data sets. Yes, data sets evolve and should track provenance somehow, but having been in and around standards groups for some time now, I think this is an impossible task for a publishing group to take care of, especially considering the nature of big-data bioinformatics. Plus it goes against best practices for software source control (use factories, don't store your database...)
- delagoya
There are some interesting and non-trivial questions around this kind of idea as to what peer review should look like. Should such a journal provide virtualisation environments so that the code can be run? Example data should be a requirement presumably? Are peer reviewers expected to evaluate code "quality"? Any thoughts on this would be extremely useful...and help guide a project like this into reality.
- Cameron Neylon
My answers to Cameron's points: (1) no, (2) yes, sample data would probably be used to run tests which should pass, (3) quality is somewhat subjective - minimum requirement should be that code runs and generates output as expected - but reviewers could certainly suggest code improvement where appropriate.
- Neil Saunders
So if the answer to 1) is no, does that mean that you can't necessarily expect referees to actually run the code? Or compile it? Or just that you pick referees appropriately? Or conversely that "refereeing" becomes a process of building up enough positive comments or karma points in the repository...? It seems to me that you want to bring the best of versioning systems and best practice...
- Cameron Neylon
Referees should certainly be able to run code - I'm just not sure that virtualisation through the web interface is the way to do it. Seems like an additional layer of complexity that might get in the way of making this idea work.
- Neil Saunders
@Cameron & Neil: If it could be figured out how to handle the virtualization (or having remote access to machines), I think that'd be a highly valuable addition to peer review. Easy for me to say (not knowing how to implement it), but I think it's a great goal to strive for. It doesn't seem too crazy to have the journal keep a bunch of machines on hand so the authors can remotely upload / install code and referees could then remotely log in to look at and try out code.
- Steve Koch
I can't figure out where to jump into this thread. Personally, I think we just need a place to publish locations, i.e. the code is here, data is there and this is the version we used, etc. That must be maintained and being able to maintain that should become part of the funding process. Since funding agencies are the ones who are funding this research they need to include the ability to...
- Deepak Singh
My feeling is that being able to run the programs somewhere on a server without downloading them is important - but that is very much a user's perspective. I often look at useful things that are made available and just have no clue how to actually make them work. A good range of downloadable executables would probably do the job for me though. Additional question: what are the standards for web services?
- Cameron Neylon
Which is why VMs and cloud services are such a big deal for demos and provenance now. You can package up a VM with the exact stack that you want and make it available, either as a service or a VM you can launch yourself. It's too easy not to do it
- Deepak Singh
@Deepak: Cloud + VM is an interesting combination, but it should have pricing that is affordable to a larger research community
- Khader Shameer
I think there should be strict guidelines while reviewing bioinformatics software / database / servers to test the resource. I had a recent experience: a reviewer wrote an extensive list of points to reject a server that we developed without trying out what exactly it does or finding out how it differs from other existing resources. I strongly support the hybrid journal model, also it...
- Khader Shameer
Let's talk specifics. VM images are great, but you are tying your release to a particular release of a particular platform. A better approach is to start from a base OS (like a Linux distro ISO) and have a set of build instructions for system set up and application building. My favorite of the moment would be Chef.
- delagoya
Second, academics love to solve a problem with a novel algorithm and then move on. In fact it is in their best interest to move on after milking a project for all it's worth, publication wise. Maintenance, or even robust testing (cough... Tophat ... cough ... Bowtie .. cough ) is not even on the radar. Frankly I am not so sure it should be. Maintenance requirements may slow the pace of...
- delagoya
@delagoya, good point. If I have made significant improvements, why update the old paper? Better to try for a new paper!
- Rajarshi Guha
delagoya, chef's fine too. Find a common medium/mechanism that works for the community. The resources are certainly there. It's a matter of trying things out. As someone I know says, start simple, and iterate
- Deepak Singh
Khader, that's where the funding agencies come in. They need to provide mechanisms for sustainable funding here.
- Deepak Singh
The nice thing about a hybrid journal is that it might be possible to have new dois/database entries for "significant" updates. Not perhaps just place holding papers as is the case sometimes in the NAR database issue but when something has changed significantly you can get a new paper without needing a new algorithm or service. I like the idea of funding to support "orphan" code and services as well. Make it worth money and people will do it.
- Cameron Neylon
Delagoya - as a naive user I disagree. I really don't want to have to build, I want to use in the lowest stress way possible and a hosted VM seems like a good way to enable that - as well as allow for longer term preservation. We may not be able to run Linux on future hardware but will probably be able to handle VMs for longer (actually, having written that, I'm not sure it's true - would be interested in more expert perspectives)
- Cameron Neylon
I almost missed this discussion. I really like the idea but I wonder how discovery type projects fit in. I mostly use code to look for trends. If anything I might make some predictor to enhance existing data. For these reasons most of what I do is one off scripts around perl and R. Maybe this sort of project does not belong in a bioinformatics journal at all.
- Pedro Beltrao
Pedro, great question. Personally, if we included all glue code, small scripts, etc this would be unsustainable and defeat the purpose of peer review as well
- Deepak Singh
@Pedro, I don't see a journal/software hybrid as replacing all bioinformatics journals. I think there's a place for journals that discuss pure algorithms and ideas. These would do exploratory type programming. Normal journals service these papers quite well. For me, a hybrid model targets specifically those papers that describe a program that is meant to be used by other people. In that...
- Bosco Ho
Bosco, you're thinking along the lines of a communications journal, aren't you? And then people can go to work on the code if it is on github or something
- Deepak Singh
@Deepak. Yep. The disconnect I see is that pragmatically, it's the open-source project that counts. The article in the bioinformatics journal is so that we can get a place-holder to collect citations that contribute to our academic CV. The journal/software hybrid provides the most efficient way to this goal.
- Bosco Ho
Very nice summary of the problem. Really, the whole concept of a journal article about software is stupid. What does an academic article do? Alert people to a new finding/discovery. But in the case of software - well, the software is the finding. And people are "alerted" by finding it on the web, downloading it and using it. As Bosco says, the sole role of an article here is a CV tick - hence the hybrid approach. Non-academic programmers must find all of this very odd.
- Neil Saunders
Project-Focused Activity and Knowledge Tracker: A Unified Data Analysis, Collaboration, and Workflow Tool for Medicinal Chemistry Project Teams - http://pubs.acs.org/doi...
@Egon, vaguely speaking, Chempedia will attempt to make the Web a much friendlier place for chemistry. There's been a heckuva lot of innovation on the Web and Chemistry hasn't been keeping up. For example, I'll be writing up a short description of Chempedia's brand-new new substances Atom feed. There's nothing particularly new there, but it's a very simple and scalable method for...
- Rich Apodaca
Wouldn't RSS feeds break a DB with large bulk updates? I tried this with PubChem - monthly updates made the RSS feed not very useful due to the large size of the update
- Rajarshi Guha
There have been a number of great posts recently with visualization tips: - Rajarshi describes customizing R heatmaps, which are immensely useful for things like visualizing next-generation sequencing results across tiles: http://blog.rguha.net/?p=419 - Revolution computing reviews R code for ti ...
- Brad Chapman
from Posterous
the time series visualization is pretty cool
- Rajarshi Guha
Indigo is an IUPAC-compliant cross-platform open-source library for molecule and reaction 2D structural formula rendering. - http://opensource.scitouch.net/indigo...
Passing on a question from a bioinformatics colleague who needs to put together lectures and practicals on systems biology: "Is there opensource/freely available software that can be used to carry out systems biology analysis?". Any information on existing teaching resources would be welcome...
That's a tough question to answer. What specifically is he looking at? Pathways? Integrated analysis?
- Deepak Singh
If only it were possible to be specific - pinning down what exactly to cover in a systems biology course is part of the problem. The practicals will probably be limited by what is available and whether case studies or tutorials are provided (or are available on the web).
- Noel O'Boyle
Thanks Khader - that list looks very useful. Regarding the type of software, really anything from metabolomics analysis and microarray analysis to studying protein-protein interaction graphs would be useful. Again, the focus is on software for use in practicals (so it should be easy to use, and free or open source).
- Noel O'Boyle
Bioconductor for microarray analysis, igraph for graph analysis, Cytoscape for visualization, network integration
- Rajarshi Guha
JWS Online (http://jjj.biochem.sun.ac.za/) is quite handy for teaching. The models are of a reasonably small size and it's web-enabled, so no faffing around with installs and what-not.
- Neil Swainston
Very sad, I saw Warren introduce an early version of PyMol, it was an excellent piece of software for a one man effort. The mailing list announcement is here: https://www.jiscmail.ac.uk/cgi-bin...
- Greg Tyrelle
Ah i see. Nothing on the google yet. He was relatively young too, wasn't he? Sad...
- Shirley Wu
from twhirl
I met him in August and he seemed fine! very depressing
- Rajarshi Guha
He wasn't much older than me and I knew him a bit back in my modeling days, so this really sucks
- Deepak Singh
Very, very sad. Today in my graduate course on RNA one of the students spent most of the class showing us images of the ribosome that he generated with PyMol. The first thing I introduce to the students in this course is how to use PyMol. This is truly a great loss to the community.
- Tom Tullius
so sad. I only interacted with Warren via email, but it was always a pleasure. I greatly admired his support of FOSS, and was inspired by his ground-breaking work in molecular visualization. such sad news.
- tim
So sad--young guy, met him at a conference about 2 years ago. Accomplished so much--his science and entrepreneurship were an inspiration to me.
- Mary Canady
Anyone else interested in helping continue the PyMol codebase?
- Donnie Berkholz
Donnie, certainly hope that it doesn't go away
- Deepak Singh
Last I glanced at the PyMol codebase it was actually pretty scary. Sloccount says:
Totals grouped by language (dominant language first):
ansic: 477951 (85.93%)
python: 65182 (11.72%)
cpp: 12928 (2.32%)
- Anders Norgaard
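As a quick sanity check on the figures above, the sketch below recomputes each language's share from the three counts quoted. It assumes these three languages make up the whole codebase; sloccount's own total may include smaller entries that were not quoted in the comment, so the recomputed percentages can differ slightly from the ones it printed.

```python
# Line counts as quoted from sloccount in the comment above.
sloc = {"ansic": 477951, "python": 65182, "cpp": 12928}

# Assumption: these three languages are the whole codebase.
total = sum(sloc.values())

# Percentage share per language, rounded to two decimals.
shares = {lang: round(100 * n / total, 2) for lang, n in sloc.items()}

for lang, pct in shares.items():
    print(f"{lang}: {sloc[lang]} ({pct}%)")
```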
I realize I might get slapped for this but with light comes shadow (very Jungian I know). The upside of Warren releasing code as Open Source is that his work can live on and be continued. This is brilliant. The shadow is what about his young wife that he has left behind? I talked with Warren earlier this year and PyMol was helping him to create a living. But what does his family...
- Antony Williams
November’s entity of the month at ChEBI is the antimalarial drug Artemether. This accompanies release 62 of ChEBI, not just yet another incremental release but an increase of more than twentyfold in the number of entities in ChEBI, thanks to merging of data between an updated ChEBI [1] and ChEMBL [2]. ChEBI now (as of release 62) has over 455,000 total entities, compared to over 18,000 in the previous version (release 61), see ChEBI news for details. The text below on Artemether is reproduced from the ChEBI website:
- Duncan Hull
from Bookmarklet
I am very pleased that our new design is really paying off... the current applet would not have been possible with the old code base...
- Egon Willighagen
Does this mean rendering code will be merged soon into cdk master?
- Rajarshi Guha
@Rajarshi you'll have to ask Stefan, Mark or Egon: I'm not an expert on JChemPaint...
- Duncan Hull
Rajarshi, we are working hard on this... but the code is far from CDK stable... see the reports here: http://pele.farmbio.uu.se/nightly... look for the render* and control* modules... Then, there is the other thing that the current EBI applet is not based on the latest JChemPaint primary code, and we are scheduling a meeting to make that happen too... it's a lot of code, complex code, but we are getting there...
- Egon Willighagen
That said, the above linked nightly does allow you to download a working rendering and editing library already, based on CDK master (of a few weeks ago, see the git history)...
- Egon Willighagen
Question for bioinformatics experts: given a gene name, is there a good way to identify papers talking about it? An obvious approach is to do a PubMed search and see how many hits there are. But that seems a little rough. Is there anything better?
www.novoseek.com/, www.nextbio.com, biosemantics.org/geneE/search.jsf, and www.ihop-net.org are some resources...
- Jeff Kiefer
Thanks for the pointers, useful to start with
- Rajarshi Guha
The trick is not to start with a PubMed search for the gene name, but to start with the NCBI Gene database. All of the Entrez databases are linked so you can go from a Gene record to publications. Use either EUtils as Pierre and Andrew said, or go from the Gene page (e.g. http://www.ncbi.nlm.nih.gov/gene...) and follow the links.
- Neil Saunders
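A minimal sketch of the Gene-to-publications route described above, using the E-utilities ELink endpoint. It only builds the request URL and does not fetch it; the helper name `gene_to_pubmed_url` is made up for this example, and the Gene ID is just an illustrative choice.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def gene_to_pubmed_url(gene_id: str) -> str:
    """Build an ELink request URL for PubMed records linked to a Gene ID.

    Hypothetical helper for illustration; it constructs the URL only.
    """
    params = {"dbfrom": "gene", "db": "pubmed", "id": gene_id}
    return f"{EUTILS_BASE}/elink.fcgi?{urlencode(params)}"


# 7157 is the NCBI Gene ID for human TP53, used here only as an example.
url = gene_to_pubmed_url("7157")
print(url)
```

Fetching that URL should return XML listing the linked PubMed IDs; counting them would give a rough per-gene publication count, subject to NCBI's usage guidelines for E-utilities.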
geneRIFs might not be complete, but how complete do you need to be?
- Mr. Gunn
@Mr Gunn, my application doesn't really need authoritative information. Basically, I'm trying to get a rough idea of which genes are more popular than others in terms of publications. Given that a pub may mention a number of genes in passing, it's not a very reliable measure. But it's one other feature that I can use in a summary/ranking etc
- Rajarshi Guha
Autodock would be my recommendation as well unless there is something new out there that I haven't tracked.
- Deepak Singh
Autodock seems to be widely-used and well-regarded. Could also try zdock - more lightweight and good for command-line scripting, parsing.
- Neil Saunders
zdock is more protein-protein centric. Did they add small molecule stuff?
- Deepak Singh
from IM
Ah, that's true. Haven't used it in a while. Autodock would be better for ligands.
- Neil Saunders
can Autodock actually be scripted from the command line? Or am I restricted to using their GUI to create input files?
- Egon Willighagen
I believe they had developed a python based framework
- Deepak Singh
from IM
they have command line tools but I am still wrapping my head around the basics
- Pedro Beltrao
FRED (not Open Source) is very easy to use, but maybe not as accurate as others. If accuracy is the goal it seems that Glide does a good job.
- Rajarshi Guha
Glide is definitely state of the art. The one thing about docking though is no one tool is good enough. Pretty much every company I know that does a lot of virtual screening uses two tools
- Deepak Singh
from IM
Thanks for the suggestions. Does anyone know if Glide is free for academics?
- Pedro Beltrao
"For example, if you want to search for a structure that contains an amine, but you'd also be interested if it was a methylamine or an ethylamine or even an acetamide, I don't believe that literature would be displayed. You would have to manually search all of these similar structures, or you might just give up if you weren't aware that a manual search was necessary."
- Rich Apodaca
It looks like a SMARTS-style query, together with similarity cutoffs, would be a better approach than using InChIs (though IIRC, the connection layer of InChI could be used for a somewhat reduced form of substructure search). The downside is that such queries will take some time
- Rajarshi Guha
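The "similarity cutoff" part of the suggestion above can be sketched in plain Python with a Tanimoto coefficient over fingerprint bit sets. This is only an illustration: the fingerprints, molecule names, and the 0.5 threshold are all hypothetical, and the SMARTS substructure matching itself would need a cheminformatics toolkit and is not shown.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto (Jaccard) coefficient between two sets of 'on' bit indices."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical fingerprints: each set holds the indices of the bits that are set.
query = {1, 4, 7, 9}
library = {
    "mol_a": {1, 4, 7, 9, 12},
    "mol_b": {1, 4, 20},
    "mol_c": {30, 31},
}

CUTOFF = 0.5  # illustrative threshold, not a recommended value

# Keep only library entries whose similarity to the query meets the cutoff.
scores = {name: tanimoto(query, fp) for name, fp in library.items()}
hits = {name: s for name, s in scores.items() if s >= CUTOFF}
print(hits)
```

Scoring every library entry against the query is linear in the library size, which is one reason such queries take time on large collections.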
Prediction of hot spot residues at protein-protein interfaces by combining machine learning and energy-based methods - http://www.citeulike.org/user...
BMC Bioinformatics, Vol. 10, No. 1. (2009), 365. BACKGROUND: Alanine scanning mutagenesis is a powerful experimental methodology for investigating the structural and energetic characteristics of protein complexes. Individual amino-acids are systematically mutated to alanine and changes in free energy of binding (Delta Delta G) measured. Several experiments have shown that protein-protein interactions are critically dependent on just a few residues ("hot spots") at the interface. Hot spots make a dominant contribution to the free energy of binding and if mutated they can disrupt the interaction. As mutagenesis studies require significant experimental efforts, there is a need for accurate and reliable computational methods. Such methods would also add to our understanding of the determinants of affinity and specificity in protein-protein recognition. RESULTS: We present a novel computational strategy to identify hot spot residues, given the structure of a complex. We consider the basic...
- Neil Saunders
Not really, no. Yet considered publishable.
- Neil Saunders
and the thing is it's been like this for years in this particular space
- Deepak Singh
Funny - I searched for the full-length paper and found this. My personal experience is that you can do MUCH better if you don't try to build a general methodology, but rather adapt an energy-based approach to a particular type of interface.
- Mickey Kosloff
Win-Vector Blog » “I don’t think that means what you think it means;” Statistics to English Translation, Part 1: Accuracy Measures - http://www.win-vector.com/blog...
no substitute for images to describe Second Life projects
- Jean-Claude Bradley
Would be nice if one could transform a scene into something 3D portable... and I'd be happy if that would extend 100m or so, with the far scenery as images on the 100m circular wall around the scene... (tune 100m where you like)
- Egon Willighagen
Not sure what is meant by "3D portable". SL certainly has its limitations though the Meerkat viewer and its ilk can be used to archive structures offline though I haven't tried it with any molecules yet. In principle the molecules can also be uploaded to OpenSim grids which can run standalone and, I believe, in a web browser too. You could use the VR room or a holodeck/holoemitter to provide a context. Great paper btw.
- Peter Miller
Ah, I should have said that using Meerkat works fine with the protein sculpted prims, at least as far as SL is concerned. Just tried it with one of the molecules generated using Andy's older molecule rezzer and that works fine too: 145 prims rezzed and linked in about a minute.
- Peter Miller
And just confirmed that you can upload said molecule in an OpenSim grid (Intel's ScienceSim to be precise). Not sure whether Andy has already done this, nor indeed whether any of his rezzers work in OpenSim. I suspect the molecule rezzed in the same sim position it was archived in -- which could be inconvenient though Meerkat shows your own content in the mini-map clearly enough.
- Peter Miller
@Peter. Great stuff - I've never tried OpenSim - didn't know you could transfer content. Cool.
- Andrew Lang