My dissertation "Interactive visualization systems and data integration methods for supporting discovery in collections of scientific information" is now available from Drexel's Theses and Dissertations collection at http://hdl.handle.net/1860....
because PCA can be viewed as a dimension reduction method, so if you don't know which N of M descriptors to choose (M being large) the hope is that the first N PC's will contain the informative descriptors.
- Rajarshi Guha
depends on other things... in case of multi linear regression, it's a must to get (somewhat) stable models... it addresses co-linearity between original descriptors
- Egon Willighagen
I wouldn't say PCA is a must. It's just a way to get fewer descriptors
- Rajarshi Guha
What's the drawback? You can't measure the importance of individual descriptors?
- Andrew Lang
your coefficients get messed up, because any linear combination will work... coefficients for descriptors can easily swap from positive to negative, resulting in multiple interpretations. e.g. xlogp increases Foo, while in another model it decreases that same Foo
- Egon Willighagen
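A minimal sketch of the sign-swapping point above, on made-up data rather than anything from this thread: when one descriptor is an exact multiple of another, several coefficient pairs fit equally well. The names x1, x2 and y and all values are hypothetical.

```python
# Toy data (hypothetical): descriptor x2 is an exact multiple of x1,
# i.e. the two descriptors are perfectly collinear.
x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [2.0 * v for v in x1]
y = [3.0 * v for v in x1]  # made-up activity values


def predict(c1, c2):
    """Predictions of the linear model y = c1*x1 + c2*x2."""
    return [c1 * a + c2 * b for a, b in zip(x1, x2)]


def sse(c1, c2):
    """Sum of squared errors of the model against y."""
    return sum((p - t) ** 2 for p, t in zip(predict(c1, c2), y))


# Any pair with c1 + 2*c2 == 3 reproduces y exactly, so the sign of c1
# (and of c2) is arbitrary: the data cannot tell these models apart.
candidates = [(3.0, 0.0), (-1.0, 2.0), (5.0, -1.0)]
errors = [sse(c1, c2) for c1, c2 in candidates]

for (c1, c2), err in zip(candidates, errors):
    print(f"c1={c1:+.1f}  c2={c2:+.1f}  SSE={err}")
```

Real descriptors are rarely exactly collinear, but near-collinearity gives the same effect in a softer form: small changes in the data move the coefficients a lot.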
People use it because they don't understand quite what it's doing. I'd go a bit further than Rajarshi, and say that there is no reason to do it. If your descriptors have no variance in your dataset, toss them out. If they don't have any correlation with the activity toss them out. Why should the PC1 be correlated with activity? Note that if you don't scale your descriptors to the same ranges, using PC is very suspect.
- Noel O'Boyle
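To illustrate the scaling remark above, here is a rough sketch with two invented descriptors on very different ranges (the names mw and logp and their values are assumptions for illustration). It computes the first principal component of a 2x2 covariance matrix in closed form, before and after scaling to unit variance.

```python
import math
import statistics


def standardize(values):
    """Centre to mean 0 and scale to unit sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def first_pc(xs, ys):
    """Unit-length first principal component of two variables.

    Uses the closed form for a 2x2 covariance matrix [[a, b], [b, c]]:
    the leading eigenvector lies at angle 0.5 * atan2(2b, a - c).
    """
    a = statistics.variance(xs)
    c = statistics.variance(ys)
    mx, my = statistics.mean(xs), statistics.mean(ys)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
    theta = 0.5 * math.atan2(2.0 * b, a - c)
    return (math.cos(theta), math.sin(theta))


# Hypothetical descriptor values: one in the hundreds, one in single digits.
mw = [150.0, 200.0, 250.0, 300.0, 350.0]
logp = [1.0, 3.0, 2.0, 5.0, 4.0]

raw = first_pc(mw, logp)
scaled = first_pc(standardize(mw), standardize(logp))

print("loadings, unscaled:", raw)
print("loadings, scaled:  ", scaled)
```

On the unscaled data the first component is almost entirely the large-range descriptor, simply because its variance is numerically bigger; after scaling the two descriptors contribute equally.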
Question for the Open Notebook folks: I delay release of some entries in my open notebook until post-pub to satisfy reqs of collaborators, data providers or journals. (ac-d http://onsclaims.wikispaces.co...). Would it be better practice to post these entries under a password until release date, or publish privately (so they are invisible)?
Password option would provide an indication of what was being done, but might be practically very annoying to anyone (or informative?) trying to browse the notebook. On the other hand, it would be slightly easier to share with collaborators using a password, as private entries would require they were registered users on the notebook (I'm on a wordpress platform, http://carlboettiger.info).
- Carl Boettiger
As long as the version history is not public anyway, I think you can safely go for keeping the post "unpublished" at a place where only your collaborators have access. An option compatible with a public version history would be to encrypt the entry and to post the key for decryption later.
- Daniel Mietchen
Carl - in my opinion requiring a password makes that part of the notebook completely closed. Selectively sharing data with collaborators, editors, etc. is standard practice in science, isn't it?
- Jean-Claude Bradley
I'd say that it depends on what is useful to you and the project. If you want to share with collaborators even while the data is closed then a password is probably a good way to do that. Also means you can enable access for referees if appropriate. Either way if you have to do a manual release to make it public it is essentially closed until you do that. From my perspective one of the appeals of getting it up there even if not public is that it is at least easy to flick that switch when the time comes.
- Cameron Neylon
But readers of the notebook will be frustrated if they think that you're hiding key/juicy things. In terms of public participation, that would be a little frustrating. The other thing I was going to say was that our IT guys at the Uni are very nervous of the idea of "switch flicking" to make bits of private books public. The reason being human error - a fleeting lapse of concentration...
- Matthew Todd
The problem with that approach is that the user ends up using either the public one or the private one but not both. I know it appeals to IP and IT types but from a usability perspective it's a great way to persuade people to use paper notebooks. Any credible system will have to have a good way of keeping parts private and helping the user decide how and when to make them public....and...
- Cameron Neylon
BTW this is currently a problem for OSDD - you can access the data with a password but you have to agree to keep it confidential presumably because they want to retain the option to patent. But that makes the data effectively unusable - and it may even inhibit research. Once researchers see what is there they may avoid a research track to avoid giving the appearance of "stealing IP". They can't even credit the work behind the password because it is neither published nor public.
- Jean-Claude Bradley
I would start a separate password protected WP blog notebook. That way there are no annoying restricted posts in among your public ones. A separate blog also keeps author confusion to a minimum (or is that just me?). On publication you can export and import to the open notebook site, which will keep all experimental publication dates etc correct.
- Dave Lunt
Thanks all for the feedback. I've been keeping the notebook completely open and attempting to make sure it really contained all content without substantial delay for about a year and a half, with daily entries. Recently (http://www.carlboettiger.info/archive...) I started some work on data I was asked not to share before publication. I could shift to a some-content, no delay model, but separate notebooks would be rather cumbersome. As security is easy to manage I just marked these posts private.
- Carl Boettiger
I could make these few posts password protected instead of private -- I was curious if that would seem more open (you'd know I was doing something, see the title, could check back and see if I released it a year later, etc) instead of private, where they would not appear to exist. Just to clarify, 94% (currently) of posts are open without delay, but I need a mechanism to ensure the 6%...
- Carl Boettiger
Carl - I don't think there is a right or wrong way to do it - it just depends on what is most convenient for you. For a substantially different project it sometimes makes sense to have a separate notebook for ease of processing and directing others to follow (we did this to separate the measurement of solubility from organic synthesis reactions). For a closed project I would think it is easier to have a separate notebook.
- Jean-Claude Bradley
I would find it challenging to delay openly exposing experiments until after publication - at least in my field. You never really know exactly which experiments to combine for a particular paper. Let's say you publish a synthesis paper with 5 optimized reactions - there are probably at least 50 related reactions in your notebook that are either failed or somehow flawed that you have to...
- Jean-Claude Bradley
I always get annoyed by this kind of work... it is so stupid that the community needs to do this kind of thing because the publishing industry and science community just messed up :(
- Egon Willighagen
Thanks for the comments! I was not able to make the connection with the "Markets can be Egalitarian Meshworks with P2P Dynamics" article though. If you could describe this a bit more I think it might be a neat idea.
- Don Pellegrino
Introducing FigShare beta - a new way to share your data: http://figshare.com/figblog... Scientific publishing as it stands is an inefficient way to do science on a global scale. A lot of time and money is being wasted by groups around the world duplicating research that has already been carried out. FigShare allows you to share all of your data,...
Seeking examples of books written in blogs (aside from Engelbart Hypothesis). Any academic or science examples? I am mostly finding popular & fiction. Helping faculty member decide on book writing strategy/platform.
okay -- so this is a personal example which is by no means finished (I copied and pasted from Word, which turns out is not such a nice thing) -- but one of the reasons I chose wikidot is because it seemed fairly easy to "format" in a book like fashion -- it also has an explicit book template which I thought was pretty nice looking -- but didn't find until after I'd already dumped...
- Mickey Schafer
http://ontogenesis.knowledgeblog.org/ and http://taverna.knowledgeblog.org/ are 'blogged books' using Wordpress as a commodity platform for rapid publication. There are a number of posts on knowledgeblog.org detailing the motivations for this and the advantages it can bring. Blogs can be particularly useful in situations where a subject is too niche for a publisher to be especially interested, or when writing about a fast moving field where traditional publishing is simply too slow to keep up.
- Simon Cockell
"I want to make an announcement that my book entitled The P=NP Question and Gödel’s Lost Letter is now published. It is available here at Amazon.com, for example. I started writing a book that became a blog. Now the blog has become a book." http://rjlipton.wordpress.com/2010...
- JoeCamel
These are awesome! I am still working on that collection of books born as blogs / born digital for a blogpost. Need to get my Yahoo Pipes alternatives post out first ...
- Patricia F. Anderson
There are so many theories about what is going on with Yahoo Pipes that I am rather confused. The range of opinions is roughly: (a) they are fine and do better than most, deal with it; (b) OMG, they killed Delicious, they'll kill Flickr & Pipes too!; (c) consarn it, Pipes seems to be having more and more trouble, what else can I use?; (d) heck, just learn PERL and PHP and do it yourself.
- Patricia F. Anderson
Great work! Could you provide more details about the data you used for the nodes, edges and coloring? I am excited to look at this more deeply.
- Don Pellegrino
interesting - I hope this fixes the formatting problem between Gdocs and Word - The PDF from Gdocs is very different (in a bad way) from the PDF exported from the "same file" in Word
- Jean-Claude Bradley
Extensive review on Slashdot on the Beautiful Data book. About 'our' chapter it writes: "Similar to the previous chapter, Chapter Sixteen focuses briefly on chemistry and describes how data was collected "to predict teh solubility of a wide range of chemicals in non-aqueous solvents such as ethanol, methanol, etc." Having a very minimal chemistry...
Our new article on Science2.0. :) http://www.schattauer.de/de... Spallek H, O'Donnell J, Clayton M, Anderson P, Krueger A. Paradigm shift or annoying distraction - emerging implications of Web 2.0 for clinical practice. Appl Clin Inf 2010; 1: 96-115
I thought Heiko had said it was open access, but today I am having the same problem you are. How bizarre! Friday, I had no trouble getting to it. Buzzard. Graham, I'll DM you my email address.
- Patricia F. Anderson
The irony of this article being behind a pay wall... the upside is that more biologists will probably read it.
- Greg Tyrelle
Folks, an update. The authors had been told that this would be free access. The editor believed that this would be free access. It WAS free access for a day, but then the publisher changed it. The editor agrees it needs to be free access and is negotiating with the publisher to restore this, but still in negotiations. Disappointing so far, but we are working on it.
- Patricia F. Anderson
Great, keep us updated, I would like to read the article.
- Greg Tyrelle
Using R for the very first time - need some help with Random Forests. 1. How do I pick mtry and ntree? 2. Once I have my results (see picture) how do I generate a model?
but defaults are pretty good - since RF's don't overfit, you can just use a large value of ntree, you will just lose out on time and storage
- Rajarshi Guha
you must have generated that picture from the model, right? just use it for prediction
- Rajarshi Guha
Thanks Rajarshi, I used ntree = 10,000? Make mtry big too? I was looking for a way to get the coefficients, e.g. LogS = c1 VCCLogP + c2 ATSm3 + ...
- Andrew Lang
woah! how many obs are there? Probably don't need to go beyond 800! As for coefficients - there aren't any as such. You could look at individual trees (see Klimt for a way to visualize ensembles of trees)
- Rajarshi Guha
I've been trying to build a model - using mathematica - forward stepwise selection of descriptors (I just learned what this is called - I thought I came up with it myself): http://onschallenge.wikispaces.com/Solubil... but since I want to use it to make predictions, I'm worried about overfitting. I guess random forest was the wrong choice because I need an explicit formula. So I guess I want to check cross-validation at each step in my forward stepwise selection?
- Andrew Lang
Yes, that's one way. But in general stepwise selection procedures can lead to biased models since you are throwing out certain descriptors without testing their combinations with others. Why do you need an explicit formula?
- Rajarshi Guha
I'd like to make a webservice: SMILES -> Solubility in Methanol.
- Andrew Lang
you don't need a formula for that? Just eval descriptors and get the prediction from the pre-existing model. That's what my latest web deployed model service did (on toposome)
- Rajarshi Guha
The way I do it is to get the descriptor values from the CDK-REST services and then plug them in to the formula and out pops the predicted value (all done with PHP).
- Andrew Lang
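A rough sketch of the "plug the descriptor values into the formula" step described above, written in Python rather than the PHP mentioned. The coefficient values and the intercept are placeholders, not fitted numbers from this thread, and fetching the descriptor values from a web service is left out; predict_logs is a hypothetical helper name.

```python
# Placeholder coefficients (assumptions for illustration only); a real
# service would use the values from the fitted regression model.
INTERCEPT = 0.5
COEFFICIENTS = {"VCCLogP": -0.3, "ATSm3": 0.1}


def predict_logs(descriptors):
    """Evaluate LogS = intercept + sum(c_i * descriptor_i).

    `descriptors` maps descriptor names to numeric values, e.g. as
    returned by whatever descriptor-calculation service is in use.
    """
    missing = [name for name in COEFFICIENTS if name not in descriptors]
    if missing:
        raise KeyError(f"missing descriptor values: {missing}")
    return INTERCEPT + sum(
        coef * descriptors[name] for name, coef in COEFFICIENTS.items()
    )


# Made-up descriptor values for a single molecule.
example = predict_logs({"VCCLogP": 2.0, "ATSm3": 4.0})
print(example)
```

Keeping the coefficients in a plain mapping means the model can be refitted and swapped in without touching the prediction code, but it only works for models that reduce to a linear formula.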
Oh ok. But then you're stuck to OLS and OLS related models, right?
- Rajarshi Guha
Yes - so I guess random forest is out. I wanted to try it because I am worried about overfitting and I guess introduced bias from stepwise selection now too.
- Andrew Lang
How about using random forests to filter 'top' N descriptors, then do backward stepwise regression?
- Andrew Lang
Yes, RF's can be used to get a rough idea of the top N descriptors
- Rajarshi Guha
no difference according to ChemSpider - well phenacetium is a salt name (which I don't understand because amides can't really form salts easily)
- Jean-Claude Bradley
ok good - book lists them separately but I bet he just didn't check.
- Andrew Lang
the phenacetinum MAY be a salt but it is an -um not a -ium. It could be a language issue...
- Antony Williams
Michael, this might help -- the first is archeology specific, the second is the larger frame of the first -- at least it demonstrates data collection/distribution/re-use in a realm that might be considered qualitative..."qualitative" is a big word, though, and differs depending on discipline: 1) http://www.youtube.com/watch... 2) http://www.youtube.com/watch...
- Mickey Schafer
Mickey, it isn't my field, but I'm pretty sure he means it in the ethnographic sense.
- Michael R. Bernstein
If he finds an answer, I'd love to know. I have also looked for an aid to qualitative research, but not sure that it has the kind of structured data collection that makes for good software. Once data is collected and some of the thematic analyses are begun, then something like Excel becomes useful, but I suspect that is not what he means.
- Mickey Schafer
D0r0th34, I'd like a pointer to such a product too.
- Meryn Stol
Livescribe has been working on a transcription program, but I don't think it is up yet. Dragon NaturallySpeaking has an excellent reputation, especially the more recent and pricey versions, and they are supposed to handle vocal variety better (most voice transcription programs are trained to a single voice).
- Mickey Schafer
I am the colleague - and likely going to give it a go with TAMS Analyzer. Seems well suited for coding text transcribed from interviews and not too many features, just enough to code text adequately. See what happens this summer.... Thanks Michael for posting this question.
- Mark Pugsley