This should help us pick sets of diverse solvents.
- Andrew Lang
Is there a reason to use approximate clustering over an exact method? The dataset isn't very large if I recall. Is there a big difference with say standard hierarchical clustering methods?
- Rajarshi Guha
For small datasets it should do exact. Just to make sure I used the flag --all-pairwise. Didn't seem to change but you're right - I needed to make sure.
- Andrew Lang
Prediction of solubility of drugs and other compounds in organic solvents - recent article by Abraham on non-aqueous solubility prediction - thanks Andy http://dx.doi.org/10...
Google now indexing thumbs. Final page of solubility book: Solubilities of inorganic and organic compounds: a compilation of ..., Volume 1 By Atherton Seidell
Correct. Tis pink, looks like a form of shoewear with desert like background to me so, might this be "Toe in the Sahara, with shoe" Featuring Sting and @cromercrox :- http://www.last.fm/music...
- Graham Steel
@JC I think Egon is talking about people transcribing Seidell's solubility book. @Egon I think JC is talking about the ONS solubility book. :)
- Andrew Lang
thanks for the clarification Andy - Marshall already uploaded most of the carboxylic acids and aldehydes - yes I was referring to our own book Egon
- Jean-Claude Bradley
Ah... JC, sorry... I did not realize you were compiling your own book :) @Andrew... yes, I was talking about transcribing values from the Seidell book... I might know someone who wants to help with that (or at least try it; he's not a chemist)...
- Egon Willighagen
My mistake Egon about the confusion with the book - yes we have one coming out soon. As for help with adding data from the Seidell book I think we have most of the relevant compounds. And it would require a chemist to translate the way names were done back then - also much of it requires conversion between g/100g solvent or g/100g solution to molar, etc
- Jean-Claude Bradley
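A minimal sketch of the unit conversions mentioned above, assuming the molar mass of the solute and the density of the saturated solution are known. The function names and the numbers in the usage lines are illustrative only and are not taken from the Seidell book or the ONS data.

```python
def g_per_100g_solution_to_molar(grams, molar_mass, solution_density):
    """Convert g solute per 100 g solution to mol/L.

    molar_mass in g/mol, solution_density in g/mL (assumed known).
    """
    moles = grams / molar_mass
    # 100 g of solution occupies 100 / density mL
    litres = (100.0 / solution_density) / 1000.0
    return moles / litres


def g_per_100g_solvent_to_molar(grams, molar_mass, solution_density):
    """Convert g solute per 100 g solvent to mol/L.

    The total solution mass is 100 g of solvent plus the dissolved solute.
    """
    moles = grams / molar_mass
    litres = ((100.0 + grams) / solution_density) / 1000.0
    return moles / litres


# Illustrative values only: 10 g per 100 g solution, 100 g/mol, 1.0 g/mL
print(g_per_100g_solution_to_molar(10.0, 100.0, 1.0))
# Illustrative values only: 25 g per 100 g solvent, 100 g/mol, 1.0 g/mL
print(g_per_100g_solvent_to_molar(25.0, 100.0, 1.0))
```

Both conversions need the density of the solution, not of the pure solvent. Older compilations often do not report it, which is part of why the translation needs a chemist's judgment.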
seems like an obvious thing to do if possible - but is it true that there is no way to provide a source for data?
- Jean-Claude Bradley
I'm uneasy with the whole interface for this reason.
- Matthew Todd
I'm uneasy with the lack of attribution for specific data points and the lack of an interface for "corrections" as opposed to just comments. The data has to scale for this to work and I just don't see how they can do that without an open dataset
- Cameron Neylon
content submission policy - "When you submit a fact, set of facts, dataset, formula, or any other information to be considered for incorporation Wolfram|Alpha, you are giving it to Wolfram Alpha LLC ("we"/"us") free and clear, to do with anything and everything we choose. Your submission has to include a transfer/disclaimer of all intellectual property rights because ..."
- Andrew Lang
Andrew - what does that mean? Data flow openly in, but not openly out?
- Matthew Todd
@Matthew: I'd say that's exactly what it means. They'll happily take but won't give back. Hell, I could even live with that -- if there were sources provided for the data, rather than a blanket "Trust Us".
- Bill Hooker
Yes. I don't like it much either - I made my suggestion before I found the submission policy - maybe we shouldn't - just as a matter of principle.
- Andrew Lang
I think there is a balance here between being helpful and trying to persuade them to open up the innards more and working just to see stuff disappear into the bowels of the system. I would start by being positive and see what the response is really. After all it is open data so they can do with it what they like - question of how much effort we are prepared to put in really
- Cameron Neylon
Cameron++ This is a nice example of why people use copyleft data licenses :) If you truly want your data to be Open (CC0, PD, ...), you would not care if WA would remove source info and make the data proprietary...
- Egon Willighagen
Egon agreed. Which is why I phrased it like I did. But equally an example of community enforcement. WA are free to use the data, I would be happy for them to do so. But I'm not going to put much work into assisting them or working with them unless I can see that data being allowed back out in a useful form. To be fair to them they are allowing export of the "Mathematica Form" of their data objects which is presumably what they are holding in their databanks
- Cameron Neylon
Egon, and to be precise I object to them removing source data and making it proprietary because I think it means the service won't ultimately be as useful as it could be. If it were open and user editable then with a growing user curated data set and what appears to be a pretty good natural language parser and reasoning system we could do great things. Closed it won't be as good so I do object and will say so. I just don't think a license on our data is the right way of enforcing their good behaviour :-)
- Cameron Neylon
Don't mind at all if WA wants to vacuum up all the data it pleases, but as Cameron says it alters the motivation for being part of the experiment. I also can't be bothered with it if the data are not sourced and credited. Question to student = x. Student answers y. Student is asked "How do you know?", student replies "Because WA says so" etc
- Matthew Todd
Matthew, indeed! Google has the advantage that it keeps track of the source... this is what worries me about many chemistry databases too: where is the link to the (primary) source... WA is not that different from other recent efforts... BTW, I did see sources for some questions... e.g. the 42 answer did source to D. Adams...
- Egon Willighagen
:) The one thing nobody needs a source for...
- Matthew Todd
Given the amount of curation that goes on with the ChemSpider database it is totally unrealistic to expect that data are immutable "facts" - very dangerous foundation indeed
- Jean-Claude Bradley
Let's say we put some solubility info in there - what would be an example of a query or task that could be performed by WA?
- Jean-Claude Bradley
@ JC: Right - what are the really interesting questions you could use answers to? How about a prediction of other solvents you could try that might dissolve/precipitate a molecule with known (in)solubility in several given solvents. i.e. an extrapolation. Maybe I'm over-burdening WA with expectations, but I have in mind what Michael Nielsen was previously talking about - *discoveries* that have been made through the semantic web, or linked data. Isn't that what would separate WA from a search engine?
- Matthew Todd
@Egon -- "If you truly want your data to be Open (CC0, PD, ...), you would not care if WA would remove source info" -- not quite. I don't care if they come and get my data and do whatever with it -- as you say, that's why it's Open. What I am balking at is the idea that the community should actively provide them with data, only to see it disappear into a black hole. If they want community input they should be prepared to engage fully with the community.
- Bill Hooker
I'm also unhappy - our own studies show that there is no quality control - they hoover everything and use suspect algorithms to deduce from it. There are sources they have used - like MSDS collections that I did not use because I assumed I would breach copyright. Maybe they have paid, but maybe they have stolen
- peter murray-rust
@Bill: yes, I can relate to that. There is so much to do, and independent of license choice, I too do not want to spend (too much) time on some proprietary black hole.
- Egon Willighagen
I actually chatted with Theodore Gray at Google and offered to help with the quality of data on Wolfram Alpha. They do not seem to deal at all with stereochemistry (See the blogpost: http://tinyurl.com/yk7mkyt). I sent an email and can't get a response to even allow us to help improve the quality. If anybody has an inroad to Wolfram Alpha and can introduce me to someone who is...
- Antony Williams
I emailed Theodore Gray after scifoo and he set me up with some W|A peeps, we're converting our calc labs to W|A, but I don't know any of the data people. Separately, I did get a reply from the data people regarding the solubility data: "Thank you for your suggestion regarding Wolfram|Alpha. We are interested in hearing more about your data, please provide us with source links if possible or attach a sample of your excel file." I replied but didn't hear back.
- Andrew Lang
yes, pset predictions would be useful. I think the main concern is the use of a stepwise like feature selection procedure. Depending on the size of the descriptor pool, I'd probably just do a brute force all combination search. Crude but easy
- Rajarshi Guha
Is there a tool to do an all-combination search?
- Andrew Lang
I don't know off the top of my head - but depending on what environment you're working in it's just a matter of making all combinations and then looping over them
- Rajarshi Guha
:) would you believe I use mathematica.
- Andrew Lang
IIRC, the Subsets function will enumerate all combinations
- Rajarshi Guha
'Out of Memory' when I tried all combinations, so now I'm trying biased subsets of combinations of descriptors - not ideal I know. Was getting better R^2 values last night but fell asleep. Will try to get something up tonight. Also as Egon suggested, using 200 data points in modeling, reserving the remainder for testing.
- Andrew Lang
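A rough sketch of the "make all combinations and loop over them" idea, written in Python rather than Mathematica. It enumerates subsets lazily and keeps only the best one seen so far, so memory use stays flat even when the number of combinations is large. Whether this would have avoided the 'Out of Memory' error above is an assumption. The descriptor names and the `toy_score` function are hypothetical stand-ins for a real model fit that returns something like R^2.

```python
from itertools import combinations
from math import comb


def best_subset(names, k, score):
    """Return (best_score, best_combo) over all k-element subsets of names.

    Subsets are generated one at a time, so the full list of
    combinations is never held in memory.
    """
    best = None
    for combo in combinations(names, k):
        s = score(combo)
        if best is None or s > best[0]:
            best = (s, combo)
    return best


def toy_score(combo):
    # Hypothetical scoring function: a real one would fit a model on the
    # training set using these descriptors and return its R^2.
    weights = {"logP": 5, "MW": 1, "TPSA": 3, "HBD": 2}
    return sum(weights[name] for name in combo)


descriptors = ["logP", "MW", "TPSA", "HBD"]  # hypothetical descriptor pool
print("subsets to try:", comb(len(descriptors), 2))
print("best pair:", best_subset(descriptors, 2, toy_score))
```

The count from `comb` is worth checking before starting: run time still grows combinatorially with the pool size, so the lazy loop helps with memory but not with the number of models that have to be fitted.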
I wonder if we could automatically twitter dosol, dougi, and solsum updates for people to follow.
- Andrew Lang
Andy that might be useful if we don't flood the feed - maybe just post if there is a new top priority request?
- Jean-Claude Bradley
@Andrew: yes, that's the kind of thing I am thinking about too... BTW, I have started adding support for #MyExperiment in #Bioclipse, so that Bioclipse 'workflows' can be shared on their social network...
- Egon Willighagen
no API as of yet - today was the first day I could actually import the data. It has been very unGoogle-like in exhibiting bugs. For the future I can see this is the tool we'll use for the solubility data but just something fun to play with right now.
- Andrew Lang
Call to action - wikipedia editor questions the reliability of data from open science - please go to his talk page and add your support for the ONSchallenge: http://en.wikipedia.org/wiki...
quote from editor: "Although I am sure that this school project was fun for the kids, Wikipedia needs to have data here from verifiable sources. Reference to a university site (Oral Roberts or Harvard) is not good enough. Otherwise your work risks being deleted. NIST, CRC etc, now they are authorities. I really encourage you to consult someone before launching on what looks like a well intentioned but naively planned project."
- Andrew Lang
Please help. Go to the talk page and make a comment. Thanks!
- Andrew Lang
There is probably a value in thinking carefully about the response here. I have heard some criticisms of the methodology being used and have suggested that people make those criticisms in the project notebooks but this hasn't happened as yet. Strictly the WP guidelines do require a traditionally published (non-web) verifiable source and one could make an argument that these results have...
- Cameron Neylon
...but not in the traditional way...I agree with you but it's a subtle argument and it could get lost
- Cameron Neylon
I would also say that at the end of the day there is much more back up on our data than there is for a NIST or CRC value in many cases - at least you can tell what was measured. Just not sure whether that will be enough for the traditionalists. Just worth rehearsing the arguments I guess is what I am saying.
- Cameron Neylon
Can you link the actual page that's in dispute? Major flaw in WP talk imo, there are no auto links to the pages being discussed.
- Bill Hooker
Just noticed in the urea talk page someone complaining in 2007 that all the solubility values were wrong and unreferenced :-)
- Cameron Neylon
Added my comment. It seems to me that, as a source to be quoted in WP, open notebooks are a new category (see http://en.wikipedia.org/wiki...). They are not "third-party published sources with a reputation for fact-checking and accuracy", but they are "produced by an established expert on the topic" in the sense that JC runs the project and we judges are all scientists. More to the point, WP has never been asked to consider this kind of source before.
- Bill Hooker
@Cameron Good find! I can see the ppt slide now. Here's the quote from 2007 about the solubility in water - it seems to have never been resolved - "The article provides some specific information about the solubility of urea without giving a source. The values were out by at least a factor of 10 (probably g/L rather than g/100mL), which I have corrected by knocking off the last zero, but...
- Andrew Lang
I added my 2 cents: http://en.wikipedia.org/w... . Still do not get why none of you guys - so much engaged in doing open science - seem to have ever shown up at Citizendium, a place where expertise is actually valued. It is currently very small compared to Wikipedia, but so is Open Science compared to the rest. See also http://en.wikiversity.org/wiki... .
- Daniel Mietchen
Thanks for adding a post to the discussion on wikipedia Daniel. I will check out Citizendium.
- Andrew Lang
I believe the reason most of us stick to Wikipedia is that we believe that it is (a) the appropriate general resource and (b) since it is the source most people, and Google, go to, information will be found there.
- Deepak Singh
Certainly nothing wrong with reading or editing Wikipedia entries. But the sort of problem discussed in this thread would be much rarer over there at Citizendium (they have others), and since it tends to affect scientists quite often, I am wondering why they do not give this alternative a try. Andrew Su has done that, and he was disappointed (a feeling I share for his case, since I...
- Daniel Mietchen
Daniel - thanks for the comments and for the link to Wikiversity and Citizendium - we'll certainly check it out. A first look does not show any entries for common chemicals like methanol. Additional portals are always of interest though. The reason Wikipedia is useful is that it turns out to be a significant way people looking for specific non-aqueous solubility find our results.
- Jean-Claude Bradley
It was a little troubling to me too until I read JC's comment "[I don't think that this experiment proves very much either way - the NMR is very poor quality and the solvent peak overlaps with the reference compound. This should be redone with a reference compound that does not overlap with any peaks JCB]"
- Andrew Lang
JC, Cameron, others... This open website is a project of a PhD I spoke with yesterday over a beer, here at BMC/Uppsala/SE, and when he mentioned it, I immediately thought about the protocols used for Ugi reactions and solubility measurements... this site isolates the protocol descriptions in a social-web-like way. The PhD was very interested in feedback from chemists on his work... Can you help him out?
- Egon Willighagen
No this cannot replace a lab notebook where a detailed record of a specific experiment with all generated data is made available. We could however generate general solubility measurement protocols from the experience gained from lots of experiments. This is very similar to the myExperiment free text protocols.
- Jean-Claude Bradley
No, surely it cannot replace the lab notebook... an experiment is never an exact copy of a protocol...
- Egon Willighagen
Sorry, just trying to catch up here. It seems like a reasonably nice functionality for a protocol site but it's not clear to me what the killer feature is here that would bring more people in. It looks nicer than OWW for instance but doesn't let people edit directly. Not sure whether that is what people want but again what is bringing people in to make comments? One thing that is appealing is the hint that they are interested in linking materials across multiple protocols.
- Cameron Neylon
That is technically appealing but I don't really see how it will in and of itself bring in enough users to make it work. Like most of these things the challenge lies in getting a community together that is big enough to make the content happen.
- Cameron Neylon
Thanx for the feedback. I will point the author to your comments.
- Egon Willighagen
I've been trying to get my head around Python the last few weeks but perhaps we need duck typing for data storage and processing. Looks like a spreadsheet, quacks like a spreadsheet, but is actually just a display layer on top of something nicer for more powerful manipulation
- Cameron Neylon
Bottom line. Spreadsheets are not going away - if you want to interact with the (vast) majority of scientists we need to find ways of translating and communicating with spreadsheets
- Cameron Neylon
Well I contacted Rajarshi about this and he said it was a temporary moment of frustration brought on by extra spaces at the end of entries - we just have to make sure to check often after students make additions
- Jean-Claude Bradley
Jean-Claude... tied up in grant applications now, but will soon make the RDF generation live... and make a web front end... that can also include live validation of the spreadsheet... so that students can have the additions automatically validated...
- Egon Willighagen
Oh I appreciate the frustration - I am hacking through spreadsheets and all sorts of nonsense repetitive processing at the moment as well. Immensely frustrating. Just wanted to make the point that this is a really important circle to square - don't really have any technical ideas on how at the moment though. But hopefully this process will actually provide some guidance as the conversion from spreadsheet to RDF to whatever else flows through
- Cameron Neylon
Actually, if it was Excel I wouldn't mind so much. The problem is validation of input - unlike an RDBMS (or Excel) I can't add constraints for a column in Google spreadsheets which would simplify things a lot. Egon - how will RDF help here?
- Rajarshi Guha
The RDF would help, because the OWL behind it could put restrictions on field content... but more practically, it's just a nice spinoff when creating the RDF... as I need to do some sanity checking at that stage anyway...
- Egon Willighagen
Yes, those spaces :) - I Trim everything to remove them - actually I got the idea for trimming from Rajarshi's code - I think.
- Andrew Lang
Yes, in the end that's what I do - but if the spreadsheet is the "container of record", it'd be nice to have validation/constraints in the sheet itself.
- Rajarshi Guha
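A small sketch of the trim-then-validate step discussed above, applied to rows after they have been exported from the spreadsheet. The column names (`solute`, `solvent`, `concentration_M`) and the rules are hypothetical. This is not Rajarshi's or Andrew's actual code, and it does not add constraints to the sheet itself.

```python
def clean_row(row):
    """Strip leading/trailing whitespace from every key and string value."""
    return {
        key.strip(): (value.strip() if isinstance(value, str) else value)
        for key, value in row.items()
    }


def validate_row(row):
    """Return a list of problems found in an already-cleaned row."""
    problems = []
    for field in ("solute", "solvent"):
        if not row.get(field):
            problems.append(f"missing {field}")
    try:
        concentration = float(row.get("concentration_M", ""))
    except (TypeError, ValueError):
        problems.append("concentration_M is not a number")
    else:
        if not concentration > 0:
            problems.append("concentration_M must be positive")
    return problems


# Hypothetical rows, as they might look after a student's edit
rows = [
    {"solute": "benzoic acid  ", "solvent": " methanol", "concentration_M": "2.5 "},
    {"solute": "", "solvent": "THF", "concentration_M": "abc"},
]
for number, raw in enumerate(rows, start=1):
    cleaned = clean_row(raw)
    print(number, cleaned, validate_row(cleaned))
```

Run after each batch of student additions, a check like this would flag bad entries early, though it is still a workaround for the missing column constraints.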
Egon - thanks it will be interesting to see if that can help
- Jean-Claude Bradley
Rajarshi - is there an API to add to or edit Google Spreadsheets yet?
- Jean-Claude Bradley
got your email Carlos - looks like a very interesting study - will you make your findings public?
- Jean-Claude Bradley
Yes, I will publish an article in an academic journal and the ICMPIM 2009 Conference. I also plan to share my findings with the people engaged in open notebook science.
- Carlos Torres