Open Notebook Science Solubility Project

Room for chatting and information aggregation around the Solubility project of the Open Notebook Science community.
Egon Willighagen
Andrew Lang
Andrew Lang
New solubility data for 2-methyl-3,5-dinitrobenzoic acid:
Andrew, if there is an RSS/Atom feed for new data, we can add it to this group: - Egon Willighagen
Live melting point feed rdf: - Andrew Lang
How can I get the feeds to only list the latest 10 or 20? - Egon Willighagen
I don't have a script that does that yet. - Andrew Lang
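A minimal sketch of the kind of script asked about here, trimming a feed to its newest entries. It assumes a plain RSS 2.0 feed where every item has a pubDate; the function name latest_items and the sample feed are made up for illustration, and the RDF feed mentioned above would need namespace-aware lookups instead.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime


def latest_items(rss_text, n=10):
    """Return the feed as text, keeping only the n newest items.

    Assumes RSS 2.0 and that every <item> carries an RFC 822 <pubDate>.
    """
    root = ET.fromstring(rss_text)
    channel = root.find("channel")
    items = channel.findall("item")
    # Sort newest first, then drop everything past the first n.
    items.sort(
        key=lambda item: parsedate_to_datetime(item.findtext("pubDate")),
        reverse=True,
    )
    for item in items[n:]:
        channel.remove(item)
    return ET.tostring(root, encoding="unicode")


# Hypothetical three-item feed, deliberately out of date order.
sample = """<rss version="2.0"><channel><title>demo</title>
<item><title>old</title><pubDate>Mon, 01 Mar 2010 10:00:00 GMT</pubDate></item>
<item><title>newest</title><pubDate>Wed, 03 Mar 2010 10:00:00 GMT</pubDate></item>
<item><title>middle</title><pubDate>Tue, 02 Mar 2010 10:00:00 GMT</pubDate></item>
</channel></rss>"""

trimmed = latest_items(sample, n=2)
print(trimmed)
```

Calling latest_items(feed_text, n=20) would give the "latest 20" variant; the items that remain stay in their original document order.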
Andrew Lang
New solubility data for itaconic acid:
Andrew Lang
Don Pellegrino
My dissertation "Interactive visualization systems and data integration methods for supporting discovery in collections of scientific information" is now available from Drexel's Theses and Dissertations collection at
Egon Willighagen
Who wants to score some #blueobelisk exchange karma points? "What Open Source tools for Aqueous Solubility Prediction are available?"
Andrew Lang
Modeling question. Why would you use principal components as descriptors instead of the descriptors themselves?
because PCA can be viewed as a dimension reduction method, so if you don't know which N of M descriptors to choose (M being large) the hope is that the first N PC's will contain the informative descriptors. - Rajarshi Guha
depends on other things... in case of multiple linear regression, it's a must to get (somewhat) stable models... it addresses collinearity between original descriptors - Egon Willighagen
I wouldn't say PCA is a must. It's just a way to get fewer descriptors - Rajarshi Guha
What's the drawback? You can't measure the importance of individual descriptors? - Andrew Lang
your coefficients get messed up, because any linear combination will work... coefficients for descriptors can easily swap from positive to negative, resulting in multiple interpretations. e.g. xlogp increases Foo, while in another model it decreases that same Foo - Egon Willighagen
People use it because they don't understand quite what it's doing. I'd go a bit further than Rajarshi, and say that there is no reason to do it. If your descriptors have no variance in your dataset, toss them out. If they don't have any correlation with the activity toss them out. Why should the PC1 be correlated with activity? Note that if you don't scale your descriptors to the same ranges, using PC is very suspect. - Noel O'Boyle
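Egon's point about coefficients swapping sign can be seen with a toy example. This is only a sketch with made-up numbers: two hypothetical descriptors d1 and d2 that are perfectly collinear (d2 is always twice d1) and a linear model with no intercept. Real descriptor sets are rarely this extreme, but the same effect shows up in weaker form.

```python
# Two hypothetical descriptors that are perfectly collinear: d2 = 2 * d1.
d1 = [0.5, 1.0, 1.5, 2.0, 2.5]
d2 = [2 * v for v in d1]


def predict(c1, c2):
    """Linear model with no intercept: y = c1 * d1 + c2 * d2."""
    return [c1 * a + c2 * b for a, b in zip(d1, d2)]


# Any pair (c1, c2) with c1 + 2 * c2 == 3 gives identical predictions,
# so the sign of an individual coefficient carries no meaning here.
model_a = predict(3.0, 0.0)    # d1 appears to increase y, d2 is unused
model_b = predict(-1.0, 2.0)   # d1 now appears to decrease y
model_c = predict(7.0, -2.0)   # and here d2 appears to decrease y

print(model_a == model_b == model_c)
```

All three models fit the data equally well, which is why interpretations based on individual coefficients become unreliable when descriptors are collinear.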
Carl Boettiger
Question for the Open Notebook folks: I delay release of some entries in my open notebook until post-pub to satisfy reqs of collaborators, data providers or journals. Would it be better practice to post these entries under a password until release date, or publish privately (so they are invisible)?
Password option would provide an indication of what was being done, but might be practically very annoying to anyone (or informative?) trying to browse the notebook. On the other hand, it would be slightly easier to share with collaborators using a password, as private entries would require they were registered users on the notebook (I'm on a WordPress platform). - Carl Boettiger
As long as the version history is not public anyway, I think you can safely go for keeping the post "unpublished" at a place where only your collaborators have access. An option compatible with a public version history would be to encrypt the entry and to post the key for decryption later. - Daniel Mietchen
Carl - in my opinion requiring a password makes that part of the notebook completely closed. Selectively sharing data with collaborators, editors, etc. is standard practice in science, isn't it? - Jean-Claude Bradley
I'd say that it depends on what is useful to you and the project. If you want to share with collaborators even while the data is closed then a password is probably a good way to do that. Also means you can enable access for referees if appropriate. Either way if you have to do a manual release to make it public it is essentially closed until you do that. From my perspective one of the appeals of getting it up there even if not public is that it is at least easy to flick that switch when the time comes. - Cameron Neylon
But readers of the notebook will be frustrated if they think that you're hiding key/juicy things. In terms of public participation, that would be a little frustrating. The other thing I was going to say was that our IT guys at the Uni are very nervous of the idea of "switch flicking" to make bits of private books public. The reason being human error - a fleeting lapse of concentration... - Matthew Todd
The problem with that approach is that the user ends up using either the public one or the private one but not both. I know it appeals to IP and IT types but from a usability perspective it's a great way to persuade people to use paper notebooks. Any credible system will have to have a good way of keeping parts private and helping the user decide how and when to make them public... and... - Cameron Neylon
BTW this is currently a problem for OSDD - you can access the data with a password but you have to agree to keep it confidential presumably because they want to retain the option to patent. But that makes the data effectively unusable - and it may even inhibit research. Once researchers see what is there they may avoid a research track to avoid giving the appearance of "stealing IP". They can't even credit the work behind the password because it is neither published or public. - Jean-Claude Bradley
I would start a separate password protected WP blog notebook. That way there are no annoying restricted posts in among your public ones. A separate blog also keeps author confusion to a minimum (or is that just me?). On publication you can export and import to the open notebook site, which will keep all experimental publication dates etc correct. - Dave Lunt
Thanks all for the feedback. I've been keeping the notebook completely open and attempting to make sure it really contained all content without substantial delay for about a year and a half, with daily entries. Recently I started some work on data I was asked not to share before publication. I could shift to a some-content, no delay model, but separate notebooks would be rather cumbersome. As security is easy to manage I just marked these posts private. - Carl Boettiger
I could make these few posts password protected instead of private -- I was curious if that would seem more open (you'd know I was doing something, see the title, could check back and see if I released it a year later, etc) instead of private, where they would not appear to exist. Just to clarify, 94% (currently) of posts are open without delay, but I need a mechanism to ensure the 6%... - Carl Boettiger
Carl - I don't think there is a right or wrong way to do it - it just depends on what is most convenient for you. For a substantially different project it sometimes makes sense to have a separate notebook for ease of processing and directing others to follow (we did this to separate the measurement of solubility from organic synthesis reactions). For a closed project I would think it is easier to have a separate notebook. - Jean-Claude Bradley
I would find it challenging to delay openly exposing experiments until after publication - at least in my field. You never really know exactly which experiments to combine for a particular paper. Let's say you publish a synthesis paper with 5 optimized reactions - there are probably at least 50 related reactions in your notebook that are either failed or somehow flawed that you have to... - Jean-Claude Bradley
Egon Willighagen
Don Pellegrino
Fwd: Supporting scientific discovery through linkages of literature and data
I always get annoyed by this kind of work... it is so stupid that the community needs to do this kind of thing because the publishing industry and science community just messed up :( - Egon Willighagen
question: in your opinion, Egon, who should have done what instead and who should be doing what right now? - Claudia Koltzenburg
would this in any way resonate with your ideas: "Markets can be Egalitarian Meshworks with P2P Dynamics" - Claudia Koltzenburg
Thanks for the comments! I was not able to make the connection with the "Markets can be Egalitarian Meshworks with P2P Dynamics" article though. If you could describe this a bit more I think it might be a neat idea. - Don Pellegrino
Egon Willighagen
Introducing FigShare beta - a new way to share your data: Scientific publishing as it stands is an inefficient way to do science on a global scale. A lot of time and money is being wasted by groups around the world duplicating research that has already been carried out. FigShare allows you to share all of your data,...
Egon Willighagen
Jean-Claude... is that interesting for Ugi reactions? It would change your rankings a bit :) - Egon Willighagen
Patricia F. Anderson
Seeking examples of books written in blogs (aside from Engelbart Hypothesis). Any academic or science examples? I am mostly finding popular & fiction. Helping faculty member decide on book writing strategy/platform.
okay -- so this is a personal example which is by no means finished (I copied and pasted from Word, which turns out is not such a nice thing) -- but one of the reasons I chose wikidot is because it seemed fairly easy to "format" in a book like fashion -- it also has an explicit book template which I thought was pretty nice looking -- but didn't find until after I'd already dumped... - Mickey Schafer
and are 'blogged books' using Wordpress as a commodity platform for rapid publication. There are a number of posts on detailing the motivations for this and the advantages it can bring. Blogs can be particularly useful in situations where a subject is too niche for a publisher to be especially interested, or when writing about a fast moving field where traditional publishing is simply too slow to keep up. - Simon Cockell
These are excellent examples! Thank you very much! I especially like the Taverna collaboration. I found Engelbart Book and Seaside programming text but still looking. - Patricia F. Anderson
"I want to make an announcement that my book entitled The P=NP Question and Gödel’s Lost Letter is now published. It is available here at, for example. I started writing a book that became a blog. Now the blog has become a book." - JoeCamel
These are awesome! I am still working on that collection of books born as blogs / born digital for a blogpost. Need to get my Yahoo Pipes alternatives post out first ... - Patricia F. Anderson
There are so many theories about what is going on with Yahoo Pipes that I am rather confused. The range of opinions is roughly: (a) they are fine and do better than most, deal with it; (b) OMG, they killed Delicious, they'll kill Flickr & Pipes too!; (c) consarn it, Pipes seems to be having more and more trouble, what else can I use?; (d) heck, just learn PERL and PHP and do it yourself. - Patricia F. Anderson
Egon Willighagen
google.loadWorksheet("SolubilitiesSum", "Sheet1") // #bioclipse #onssolubility old code, now as general manager :)
Andrew Lang
ONSC Data: Solutes with Abraham Descriptors. Graph created with Gephi: Inspired by Don Pellegrino:
Great work! Could you provide more details about the data you used for the nodes, edges and coloring? I am excited to look at this more deeply. - Don Pellegrino
I have uploaded all the data to the wiki: Gephi's quickstart guide is really good for showing how to do colours, etc: - Andrew Lang
Thanks Andy - that was very helpful - Jean-Claude Bradley
Egon Willighagen
Thanks Egon - not sure how useful it could be since we almost always use Google Spreadsheets to input data, not Excel - Jean-Claude Bradley
interesting - I hope this fixes the formatting problem between Gdocs and Word - The PDF from Gdocs is very different (in a bad way) from the PDF exported from the "same file" in Word - Jean-Claude Bradley
Egon Willighagen
Fwd: uploaded patch with a #cdk implementation for acidic and basic group count descriptors
Egon Willighagen
Egon Willighagen
Predict metabolization sites for malaria drugs using MetaPrint2D and Bioclipse ->
Egon Willighagen
Extensive review on Slashdot on the Beautiful Data book. About 'our' chapter it writes: "Similar to the previous chapter, Chapter Sixteen focuses briefly on chemistry and describes how data was collected "to predict the solubility of a wide range of chemicals in non-aqueous solvents such as ethanol, methanol, etc." Having a very minimal chemistry...
Egon Willighagen
Can't wait for the first chemistry cards to appear ... :) Perhaps using ONS data? That what is now in SL, for example? - Egon Willighagen
Egon Willighagen
Or in Oxford Sunday August 1st. - Egon Willighagen
But I don't think I can make either date unfortunately... - Cameron Neylon
Wave with details summarized: - Egon Willighagen
I might be about if people are meeting up? - science3point0
Patricia F. Anderson
Our new article on Science2.0. :) Spallek H, O'Donnell J, Clayton M, Anderson P, Krueger A. Paradigm shift or annoying distraction - emerging implications of Web 2.0 for clinical practice. Appl Clin Inf 2010; 1: 96-115
I thought Heiko had said it was open access, but today I am having the same problem you are. How bizarre! Friday, I had no trouble getting to it. Buzzard. Graham, I'll DM you my email address. - Patricia F. Anderson
Delicious stream from the paper: . - Daniel Mietchen
The irony of this article being behind a pay wall... the upside is that more biologists will probably read it. - Greg Tyrelle
Folks, an update. The authors had been told that this would be free access. The editor believed that this would be free access. It WAS free access for a day, but then the publisher changed it. The editor agrees it needs to be free access and is negotiating with the publisher to restore this, but still in negotiations. Disappointing so far, but we are working on it. - Patricia F. Anderson
Great, keep us updated, I would like to read the article. - Greg Tyrelle
Open Notebook Science Solubility Project
Andrew, Cameron, Jean-Claude: let's add other RSS feeds with ONS (data) updates too... - Egon Willighagen
Egon I didn't think we could add live feeds - I uploaded the latest full archive of the solubility project - Jean-Claude Bradley
Andrew Lang
Using R for the very first time - need some help with Random Forests. 1. How do I pick mtry and ntree? 2. Once I have my results (see picture) how do I generate a model?
use cross-validation - Rajarshi Guha
but defaults are pretty good - since RF's don't overfit, you can just use a large value of ntree, you will just lose out on time and storage - Rajarshi Guha
you must have generated that picture from the model, right? just use it for prediction - Rajarshi Guha
Thanks Rajarshi, I used ntree = 10,000? Make mtry big too? I was looking for a way to get the coefficients, e.g. LogS = c1 VCCLogP + c2 ATSm3 + ... - Andrew Lang
woah! how many obs are there? Probably don't need to go beyond 800! As for coefficients - there aren't any as such. You could look at individual trees (see Klimt for a way to visualize ensembles of trees) - Rajarshi Guha
awesome Andy! This should help us out a lot - Jean-Claude Bradley
I've been trying to build a model - using mathematica - forward stepwise selection of descriptors (I just learned what this is called - I thought I came up with it myself): but since I want to use it to make predictions, I'm worried about overfitting. I guess random forest was the wrong choice because I need an explicit formula. So I guess I want to check cross-validation at each step in my forward stepwise selection? - Andrew Lang
Yes, that's one way. But in general stepwise selection procedures can lead to biased models since you are throwing out certain descriptors without testing their combinations with others. Why do you need an explicit formula? - Rajarshi Guha
I'd like to make a webservice: SMILES -> Solubility in Methanol. - Andrew Lang
you don't need a formula for that? Just eval descriptors and get the prediction from the pre-existing model. That's what my latest web deployed model service did (on toposome) - Rajarshi Guha
The way I do it is to get the descriptor values from the CDK-REST services and then plug them in to the formula and out pops the predicted value (all done with PHP). - Andrew Lang
Oh ok. But then you're stuck to OLS and OLS related models, right? - Rajarshi Guha
Yes - so I guess random forest is out. I wanted to try it because I am worried about overfitting and I guess introduced bias from stepwise selection now too. - Andrew Lang
How about using random forests to filter 'top' N descriptors, then do backward stepwise regression? - Andrew Lang
Yes, RF's can be used to get a rough idea of the top N descriptors - Rajarshi Guha
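A minimal Python sketch of the idea Andrew raises above: checking cross-validation at each step of a forward stepwise selection, so the result is still an explicit linear formula. Everything here is an assumption for illustration. The data are synthetic, the helper names (solve, fit_ols, cv_mse, forward_stepwise) are hypothetical, and the thread itself used Mathematica, R and PHP rather than Python. As Rajarshi notes, stepwise selection can still give biased models; cross-validation only helps decide when to stop adding descriptors.

```python
import random


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_ols(rows, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    xs = [[1.0] + list(r) for r in rows]
    p = len(xs[0])
    xtx = [[sum(x[i] * x[j] for x in xs) for j in range(p)] for i in range(p)]
    xty = [sum(x[i] * t for x, t in zip(xs, y)) for i in range(p)]
    return solve(xtx, xty)


def predict(coef, row):
    """Evaluate the explicit formula: intercept + sum(c_i * descriptor_i)."""
    return coef[0] + sum(c * v for c, v in zip(coef[1:], row))


def cv_mse(data, y, cols, k=5):
    """k-fold cross-validated mean squared error of OLS on columns `cols`."""
    n = len(y)
    errors = []
    for fold in range(k):
        test = [i for i in range(n) if i % k == fold]
        train = [i for i in range(n) if i % k != fold]
        coef = fit_ols([[data[i][c] for c in cols] for i in train],
                       [y[i] for i in train])
        errors += [(predict(coef, [data[i][c] for c in cols]) - y[i]) ** 2
                   for i in test]
    return sum(errors) / len(errors)


def forward_stepwise(data, y, k=5):
    """Add one descriptor at a time; stop when the CV error stops improving."""
    remaining = list(range(len(data[0])))
    chosen = []
    best = cv_mse(data, y, [], k)  # intercept-only baseline
    while remaining:
        score, col = min((cv_mse(data, y, chosen + [c], k), c) for c in remaining)
        if score >= best:
            break
        best = score
        chosen.append(col)
        remaining.remove(col)
    return chosen, best


# Synthetic data: 4 descriptors, only columns 0 and 2 actually matter.
rng = random.Random(0)
data = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(60)]
y = [2 * r[0] - 3 * r[2] + rng.gauss(0, 0.1) for r in data]

chosen, best = forward_stepwise(data, y)
print(sorted(chosen), round(best, 4))
```

The coefficients returned by fit_ols on the chosen columns are the c1, c2, ... of an explicit formula, so a model selected this way could still be evaluated by plugging descriptor values into it, as described in the thread.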
Andrew Lang
Is there any difference between phenacetin and phenacetinum?
They are the same according to wikipedia - let me know if they are in reality not. - Andrew Lang
no difference according to ChemSpider - well phenacetium is a salt name (which I don't understand because amides can't really form salts easily) - Jean-Claude Bradley
ok good - book lists them separately but I bet he just didn't check. - Andrew Lang
the phenacetinum MAY be a salt but it is an -um not a -ium. It could be a language issue... - Antony Williams
Michael R. Bernstein
A colleague is asking about software to help him do qualitative studies. Can anyone give him some pointers?:
Michael, this might help -- the first is archeology specific, the second is the larger frame of the first -- at least it demonstrates data collection/distribution/re-use in a realm that might be considered qualitative..."qualitative" is a big word, though, and differs depending on discipline: 1) 2) - Mickey Schafer
Mickey, it isn't my field, but I'm pretty sure he means it in the ethnographic sense. - Michael R. Bernstein
If he finds an answer, I'd love to know. I have also looked for an aid to qualitative research, but not sure that it has the kind of structured data collection that makes for good software. Once data is collected and some of the thematic analyses are begun, then something like Excel becomes useful, but I suspect that is not what he means. - Mickey Schafer
D0r0th34, he says yes. - Michael R. Bernstein
D0r0th34, I'd like a pointer to such a product too. - Meryn Stol
Livescribe has been working on a transcription program, but I don't think it is up yet. Dragon NaturallySpeaking has an excellent reputation, especially the more recent and pricey versions, and they are supposed to handle vocal variety better (most voice transcription programs are trained to a single voice). - Mickey Schafer
I am the colleague - and likely going to give it a go with TAMS Analyzer. Seems well suited for coding text transcribed from interviews and not too many features, just enough to code text adequately. See what happens this summer.... Thanks Michael for posting this question. - Mark Pugsley
Weft QDA (Open Source) - Pedro Jacobetty
Jean-Claude Bradley
