'Mummi' Thorisson
Data producers deserve citation credit - http://www.nature.com/ng...
"Datasets released to public databases in advance of (or with) research publications should be given digital object identifiers to allow databases and journals to give quantitative citation credit to the data producers and curators." - 'Mummi' Thorisson
The proverbial penny is gradually dropping: "..Because publications are currently the main source of scientific credit and because publishers have already developed citable digital object identifiers (DOI), it would seem to be their opportunity to grasp or to fumble. We propose citing DOIs that tag a combination of repository, database, accession, version, contributor and funder." - 'Mummi' Thorisson
I like the suggestion of using Nature Precedings as a simple way of getting a DOI for a data dump - Jean-Claude Bradley
But what if the data are released under a public domain dedication or CC0? There is no obligation to cite (although most would do so). Also what happens when data are forked and modified? Do we still need a DOI? Or do we always keep the data DOI in a very narrow context? - Deepak Singh
Also when data sets are really large you aren't going to dump them into Nature Precedings or to any journal and you shouldn't - Deepak Singh
Even for smaller datasets that *can* be 'dumped' into Nature Precedings: these won't be easily reusable if there are no restrictions on the formats used (PDFs, Word, Excel rubbish). So, no big improvement over the usual supplementary materials in my mind. But better than not publishing the data, obviously. - 'Mummi' Thorisson
For large-scale data, I can't see why the model used for satellite observation data etc. cannot work in the molecular biology domain as well. Unless I'm gravely mistaken, GenBank and other primary bio-data archives could join the recently-established DataCite consortium (http://www.datacite.org) and register DOIs for their contents. - 'Mummi' Thorisson
Deepak - of course for very large datasets Precedings is not a great choice. I just meant I liked it as a convenient way to share miscellaneous data that doesn't have an existing larger repository. You can always re-deposit the data elsewhere when an appropriate database becomes available. Supplementary materials is an option only if you have an accompanying paper. Precedings would let you publish the data prior to writing the full paper - if there ever is a paper. - Jean-Claude Bradley
@Deepak - even if you, as a submitter, released your stuff under CC0, with no strings attached, you would still like to be cited, wouldn't you? Problem is that it's hard to do this in a persistent way coz URLs aren't stable, a problem which DOIs of course were designed to address. - 'Mummi' Thorisson
@Deepak - on your other point on dataset modification: you can have a DOI for each revision of a published paper, all linked together as referring to the same publication, so the same would apply to datasets. - 'Mummi' Thorisson
and Precedings does have version control as well - Jean-Claude Bradley
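To illustrate the point about revisions: a minimal sketch of a dataset record in which every revision gets its own identifier while all revisions stay linked to the same dataset. The class names and the "example:" identifier scheme are hypothetical, made up for illustration; this is not how DOIs or Precedings actually mint identifiers.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetVersion:
    identifier: str  # hypothetical per-revision identifier
    note: str        # what changed in this revision


@dataclass
class Dataset:
    name: str
    versions: list = field(default_factory=list)

    def add_version(self, note):
        # Each revision gets a new identifier derived from the dataset name,
        # so older citations keep pointing at the exact revision they used.
        number = len(self.versions) + 1
        identifier = f"example:{self.name}.v{number}"
        self.versions.append(DatasetVersion(identifier, note))
        return identifier

    def latest(self):
        return self.versions[-1]


assays = Dataset("assays")
first_id = assays.add_version("initial release")
second_id = assays.add_version("corrected two measurements")
```

A paper citing `first_id` still refers to the original release, while anyone following the record can find the corrected revision through `assays.latest()`.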
@Jean-Claude - actually, I think I do like your notion of NP as a 'data dump' when a suitable repository doesn't exist (yet). How easy/streamlined is it to upload stuff and provide descriptive metadata, tags and stuff? And is NP being used in this way, do you know? - 'Mummi' Thorisson
I don't think doi's are the right way to do this but they may be a necessary evil to get buy in. The acceptance that doi's are "real" and URLs are not is pretty pervasive. But we do need to maintain the line on cc0/PDDL for data IMO - Cameron Neylon from twhirl
Mummi - NP has the standard Web2.0 tools - tags, voting, commenting, etc. We've used it for very small data sets - like malaria assays http://precedings.nature.com/documen... I think the DOI and the fact that the citation has a formal author list makes it attractive to the traditional science community. For example, I've found my NP entries cited in articles when a better reference might have been our wiki - because the citation format is recognizable. - Jean-Claude Bradley
Mummi, If being cited was a must for me, I'd release under CC-BY. If I don't care, then it would be CC0. I would assume most people would cite, or enough to make me feel good about myself :). - Deepak Singh
CC-BY will offer you no protection at all if you want citation. Best case scenario it has no effect (data not copyrightable - nor is its collection - database rights are a separate right). Worst case it prevents interoperability with other data (actually worst case is that it does that differentially in different jurisdictions). Really, Creative Commons licences (other than ccZero) are bad news for data. - Cameron Neylon
Re: "URLs aren't stable" -- DOIs provide additional "stability" by introducing an additional level of indirection. But the exact same thing can be accomplished with URLs. Seems to me that standards are sometimes more about control than technical needs... - Eric Jain
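The "level of indirection" can be sketched in a few lines: citations carry a persistent identifier, and a resolver maps that identifier to wherever the data currently lives. This is only a toy illustration; the `Resolver` class, the identifier and the example.org URLs are all invented, and real DOI or PURL resolution involves much more than a lookup table.

```python
class Resolver:
    """Toy lookup table from persistent identifier to current location."""

    def __init__(self):
        self._targets = {}

    def register(self, identifier, url):
        # Registering an existing identifier again simply updates its target.
        self._targets[identifier] = url

    def resolve(self, identifier):
        return self._targets[identifier]


resolver = Resolver()
resolver.register("dataset-42", "http://old-host.example.org/data/42")
before_move = resolver.resolve("dataset-42")

# The data moves to a new host; only the resolver entry changes,
# while the identifier used in citations stays the same.
resolver.register("dataset-42", "http://new-host.example.org/archive/42")
after_move = resolver.resolve("dataset-42")
```

The mechanism is the same whether the identifier is a DOI or a redirecting URL; it only works as long as someone keeps the mapping up to date.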
Cameron, agreeing that CC0 is the ideal situation, how would you assign licenses to data sets, especially if the core intention is allowing people to fork and modify? Note that data != database - Deepak Singh
Re: "it prevents interoperability with other data" -- it prevents *automatic* interoperability with other data. Having data under a CC license does not prevent anyone from asking if they can use or republish certain data in some way not covered by the license and be granted that right -- which is what people do anyway (either that or they ignore licensing issues altogether). So I think that at least for the time being it's less of an issue than it's made to be. - Eric Jain
Not entirely sure I quite follow your question? If the core intention is to allow people to fork and modify then your only sensible option is to place it in the public domain. If you want/need to maintain some level of control then you need a purpose built data licence that binds via a contract - probably via a click wrap agreement. But this creates strong downside risks of breaking interoperability. Hence the objections of Science Commons to the ODbL - precisely that it introduces contract law because there is no other way of doing it. - Cameron Neylon
Eric - no, the point is that CC _cannot possibly_ cover data in any form whether collected or isolated because there is no copyright that inheres. The objection is that rather than providing any clarity it confuses the issue. It is absolutely reasonable if one is looking to use one dataset to ask for special permission but this makes no sense if I want to e.g. trawl everything in the NAR database issue. If we want to do that kind of science - i.e. integrating data across the whole knowledge web - then we need to focus on interoperability - Cameron Neylon
I suppose the reason I can't ask the question properly is that I'd always use CC0, but have run into enough people who want a little control. Purpose-built licenses do not scale and you can't use a public platform or distribution services, so that's not a long term solution. - Deepak Singh
You can get interoperability by having published APIs. Developers mash up data sets of all types all the time. The documentation (and data consistency) is the key there - Deepak Singh
Or let me put that another way - twenty years ago no-one realised that giving over copyright of academic papers would prevent us from effectively text mining. My aim is to make sure we don't throw away another set of rights over our own output that prevents us doing cool stuff in 20 years time. - Cameron Neylon
Deepak - definitely feel for you over the issue of "a little control" - but there is no simple answer. We either completely change the legal framework to make it work. Or somehow magic up a licence that everyone agrees to use and that is absolutely future-proof. I think I am seeing some traction though for giving people the option of either enabling re-use indefinitely or maintaining control. People like to have a legacy. - Cameron Neylon
I agree that organizations should be encouraged to use CC0 -- especially given the current situation where the extent to which a database can be copyrighted differs between jurisdictions. But would you rather see an organization lock up their data than publish it under another CC license? - Eric Jain
No - and I agree there is a balance to be found here. I just see my role as trying to push the final balance in the right direction. So I'll keep pushing in that direction and try to find the best arguments and soundbites to keep the agenda moving on. In the end if they publish it under a CC licence I'm going to assume it's public domain anyway ;-) - Cameron Neylon
@Eric - to jump back a dozen comments or so, back to RE: "URLs aren't stable"; I'm aware that you can do the same thing with PURLs, and there's LSIDs (but let's not go there..). However, it's not just the indirection and other tech-bits - it's the whole social infrastructure brought in, as in e.g. CrossRef harassing publishers if stuff moves and the DOI metadata are not updated. - 'Mummi' Thorisson
I agree with what Cameron said that DOIs are a 'necessary evil' in terms of overhead, costs etc., but the tradeoff would be worth it I should think. They should be used only where appropriate for this specific purpose, citability of datasets as a unit, but not for say individual data elements within a large-scale dataset. - 'Mummi' Thorisson
@Jean-Claude - thanx for the NP info, very useful. In fact, I couldn't resist citing you over here: http://www.gen2phen.org/blogs... - 'Mummi' Thorisson
Deepak: often at least in academic communities, there is more concern about getting penalized by a teacher or coworker for incomplete citation (or even plagiarism) than being criticized or prosecuted for copyright infringement - Mike Chelen
Cameron: perhaps researchers submitting data or collections could publish accompanying metadata and commentary under a CC license if they choose - Mike Chelen
Mummi: it's great to get the data published in any format, provided there is an ability for other researchers to perform a conversion, and then publish the updated data set in a way that is similarly cited and discoverable. btw what's wrong with excel files?? :p - Mike Chelen
@Mike Excel is a great application, but there are much better ways to share data on the web (think databases) - Duncan Hull
Duncan: there are libraries to convert excel tables directly to sql, which is difficult to say about word documents or pdf :) - Mike Chelen
Mummi - thanks for using our NP example on your blog - I would like to use services like datacite but isn't it just limited to specific data sources? - Jean-Claude Bradley
@JC - the TIB DOI registration agency has handled data only from several primary data archives in the environmental sciences (see figure on http://www.std-doi.de), so yes it's limited. But this recently-created DataCite consortium aims to expand the TIB model into other research domains. But stuff moves slowly and it'll take some years. - 'Mummi' Thorisson
Mummi - I'll keep an eye out for when they include chemistry - Jean-Claude Bradley