Andrew Clegg
I'm thinking about putting together a proposal to bring our data resources kicking and screaming into the Semantic Web. I'm thinking RDF publishing, Linked Data compliance, microformats, ontological tagging. Has anyone been involved in a project like this? Anyone care to share experiences or ideas?
No huge amounts of experience but starting down that same road as and when we can so would very much like to share the experience and ideas - Cameron Neylon
What do you mean with "our data resources" ? :-) - Pierre Lindenbaum
I work in Christine Orengo's lab at UCL, so primarily CATH and Gene3D, any parts of our internal data warehouse BioMap that might be useful to others, and spin-off projects from these. http://cathdb.info/ - Andrew Clegg
First things that come to my mind: describe your projects using the DOAP ontology and your staff using FOAF, create an ontology for your data (e.g. see Eric Jain's work (http://friendfeed.com/ejain) on UniProt, and see the NCBO to avoid re-inventing the wheel), add RDFa to your web pages, make your data available as RDF, etc... - Pierre Lindenbaum
I like the idea of RDFa as a good way to start sneaking semantic markup in without major upheaval, thanks Pierre. DOAP for biological data resources -- interesting -- could maybe do with extending a little - Andrew Clegg
It's still unclear to me what the value of this is today, to the people doing it or anyone else ... could anyone help me out? - Donnie Berkholz
We just put in a proposal to build our data repository, and then publish our data. I would be interested too in hearing others' experiences. FYI, the Bio2RDF site has some info on how they rdfized their data - http://bio2rdf.wiki.sourceforge.net/Namespa... - Melanie
@Donnie that is a fair point and one often mentioned. Fine if you are starting from scratch like Melanie, but if you already have your data in an RDBMS, what is it that this RDBMS is not providing that RDF could? Anyone....? - Frank
@Frank easy integration via a common protocol (SPARQL) with other resources? If I want to access your data I don't need to know about your web services, or parse your files for example. If I query your database and get a result, I can directly pass it on to the next resource, no need to input it to another script or website. - Melanie
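To make the "common protocol" point concrete, here is a minimal sketch of how any SPARQL endpoint is addressed over HTTP: the query text travels in a standard `query` parameter, whatever the resource behind it. The endpoint URL and the query are assumptions for illustration, not a real service mentioned in this thread, and the sketch only builds the request URL rather than sending it.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical endpoint: substitute the SPARQL endpoint of the resource you want to query.
ENDPOINT = "http://example.org/sparql"

# Illustrative query: list ten resources together with their rdfs:label.
QUERY = """
SELECT ?protein ?label
WHERE {
  ?protein <http://www.w3.org/2000/01/rdf-schema#label> ?label .
}
LIMIT 10
"""


def build_sparql_url(endpoint: str, query: str) -> str:
    """Return a GET URL that carries the query in the standard `query` parameter."""
    return endpoint + "?" + urlencode({"query": query.strip()})


url = build_sparql_url(ENDPOINT, QUERY)
print(url)
```

The same function works unchanged against a second endpoint, which is the practical meaning of not needing to learn each provider's web services.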
@Melanie are you not making the assumption the "other" resource is in RDF? RDBMS have a common protocol, SQL. - Frank
@Frank - Indeed, the point being that it is easier with RDF. My repository is planned as an Oracle RDBMS, and I am looking into rdfizing my resource, or translating at query time. Some template queries are available at http://sparql.neurocommons.org/. Doing the same by querying individual databases would be much more painful. - Melanie
@Donnie something like DOAP allows robots to find your data (and their meaning) - Pierre Lindenbaum
Pierre has an excellent set of suggestions, and let me add one more: Write a detailed account of what you did and problems you faced along the way to make it easier for the next guy to go down that road. - Mr. Gunn
@Frank: If you think SQL is a "common protocol", then you haven't been using more than one RDBMS :-) - Eric Jain
Liked cause this is a fascinating discussion - Deepak Singh
@Gunn yep, that'll be the paper at the end :-) @Frank and Donnie -- you wouldn't want to give Johnny Random access to your RDBMS remotely would you? Even read-only, the ability to run arbitrary queries could bring it to its knees. But with RDF, the raw data is there at a sufficient level of granularity that they can run arbitrary and unlimited queries at their expense across all your data... - Andrew Clegg
... not to mention, mashups of your data with other people's. Plus, I like where this is going and how quickly it's growing: http://www4.wiwiss.fu-berlin.de/bizer... ... and I don't want to continue not being on it - Andrew Clegg
@Mr Gunn: I started to think about how to RDF-ize my data. I'm currently stuck, because I would like to handle the data with some mysterious tools (Taverna, Biomoby, semantic-ws, etc...) but I still haven't understood how to use them, while I'm sure there is something good to do with them. - Pierre Lindenbaum
@Andrew Ensembl, UCSC and GO allow anonymous access to their MySQL DBs, e.g. mysql -N --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg18 - Pierre Lindenbaum
Is that entirely sensible? As an occasional DBA and SQL developer I'm well aware that a badly-written query on a large join can be indistinguishable from a DoS attack - Andrew Clegg
Some databases (e.g. Oracle) have detailed options to restrict the resources that can be used per user or query. For databases that use one-process-per-query (e.g. PostgreSQL) you may be able to use OS-level process limits. I'd be interested to hear how Ensembl handles this! - Eric Jain
you can limit the resources of an account http://dev.mysql.com/doc... - Pierre Lindenbaum
It would still make me nervous. I've worked at organisations where even internal people outside the DW team aren't allowed to execute arbitrary queries. However I've just derailed my own thread so what do I know :-) - Andrew Clegg
In the case of UCSC I think they have a dedicated and isolated mysql server where the data is copied for this external use. - Pierre Lindenbaum
Just so we're not comparing apples to oranges: Direct access to an RDBMS via SQL is comparable to direct access to an RDF database via SPARQL. The data we can get this way (i.e. tab-delimited or RDF data, respectively) could be obtained through other channels (e.g. FTP, which is less flexible, but more suitable for large amounts of data). - Eric Jain
I kinda wish I hadn't mentioned query cost as an argument against direct SQL queries. That's actually a side issue. The most important part of RDF publishing seems to me to be the ability to mash up one source of data with another (and another and...) -- breaking down those silo walls and following chains of knowledge across the web - Andrew Clegg
That's why we're all looking for a stable common identifier for our data (remember LSID http://en.wikipedia.org/wiki... ?) and that's why you should use an already existing ontology ( http://bioportal.bioontology.org/ ) rather than creating your own vocabulary. - Pierre Lindenbaum
That might be the main argument: SQL (and tab-delimited data, and XML) have no (standard) way to express that some column/value isn't just a string, but represents some concept in another database (which we may even be able to retrieve if we need it and don't have it locally). - Eric Jain
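A small sketch of that difference, using one made-up record: in a tab-delimited dump the organism column is only the string 9606, while in an N-Triples statement the same value is a URI that points at a record in another database. The example.org subject and predicate are hypothetical, and the taxonomy URI pattern is an assumption made for the sake of the illustration.

```python
# Made-up record: a protein accession and the taxon it belongs to.
ROW = {"protein_id": "P12345", "organism": "9606"}


def as_tab_delimited(row):
    """In a tab-delimited dump, 9606 is just a string with no declared meaning."""
    return "\t".join([row["protein_id"], row["organism"]])


def as_ntriple(row):
    """In N-Triples, the object is a URI naming a resource in another database."""
    # Hypothetical subject and predicate URIs under example.org.
    subject = f"<http://example.org/protein/{row['protein_id']}>"
    predicate = "<http://example.org/vocab/organism>"
    # Assumed URI pattern for taxonomy records, used here only for illustration.
    obj = f"<http://purl.uniprot.org/taxonomy/{row['organism']}>"
    return f"{subject} {predicate} {obj} ."


print(as_tab_delimited(ROW))
print(as_ntriple(ROW))
```

A consumer of the second form can tell, without reading any documentation, that the value is a reference it could follow.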
I'm all about the not-reinventing-the-wheel Pierre - Andrew Clegg
If we were able to agree on one central, specialized identifier system and all use the same, non-overlapping vocabularies, RDF might be overkill. Until then, it seems rather useful to have a system that allows us to disambiguate and map our different wheels to each other (and disagree over what ought to be mapped to what). - Eric Jain
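One way to picture "mapping our different wheels to each other" is a sketch in which the mapping is itself just another statement (here owl:sameAs) sitting next to the data. All identifiers and facts below are invented, and the grouping logic is a plain union-find written for this illustration, not how any particular RDF store resolves equivalences.

```python
SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

# Hypothetical triples from two sources that name the same protein differently.
triples = [
    ("http://a.example.org/protein/1", "http://a.example.org/vocab/name", "kinase X"),
    ("http://b.example.org/entry/77", "http://b.example.org/vocab/fold", "alpha-beta"),
    # The mapping is an ordinary statement, so others are free to dispute or replace it.
    ("http://a.example.org/protein/1", SAME_AS, "http://b.example.org/entry/77"),
]


def canonical_ids(triples):
    """Return a function mapping each identifier to one representative of its sameAs group."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    for s, p, o in triples:
        if p == SAME_AS:
            parent[find(o)] = find(s)
    return find


def merged_facts(triples):
    """Collect all non-mapping facts under the representative identifier."""
    find = canonical_ids(triples)
    facts = {}
    for s, p, o in triples:
        if p != SAME_AS:
            facts.setdefault(find(s), {})[p] = o
    return facts


facts = merged_facts(triples)
print(facts)
```

Dropping the sameAs triple leaves the two records separate again, which is the "disagree over what ought to be mapped to what" part.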
One project mentioned on another thread is Okkam (http://www.okkam.org), an EC/FP7 funded project to build infrastructure for an entity naming system. Dunno about others, but personally I find it a bit weird that it has such a low profile online, being a big (?) project. Anybody know more about this? - 'Mummi' Thorisson
"The key objective of the [Okkam] project is to deploy a global and open service called Entity Name System (ENS), which will support the systematic reuse of identifiers for "things" on the Web and generally on the Internet" - 'Mummi' Thorisson
Someone from Elsevier mentioned Okkam in a talk at SESL. I couldn't help thinking that the goal of giving every 'entity' on the web a UID was a bit, well, unachievable. Still, aim high I suppose - Andrew Clegg
To pass on a comment from a mate (who looked into Okkam some more), regarding their low profile: "Since the project is due next year, there's not much time left to involve the community... Amazingly they do offer plenty of tools, like the Word and Protege plugin." - 'Mummi' Thorisson
How did they reach out to the community? I didn't find any reference on PubMed... - Pierre Lindenbaum
@Pierre: Their community involvement strategy is all laid out in this password-protected document: http://www.okkam.org/deliver... :-) - Eric Jain
Their website isn't very communicative, but here is a paper that explains the project: http://events.linkeddata.org/ldow200.... Still doesn't make much sense to me though... - Eric Jain
The Shared Names - http://sharedname.org/ - project is worth mentioning too. They are meeting next week (http://sharedname.org/page...) - Melanie
With all those competing projects for centralizing identifiers, you'd think that people would have realized by now how futile this approach is... - Eric Jain
@Eric, yes, I was playing devil's advocate in that sense with regard to SQL as a protocol. Of course, putting all your data in the same format will give you a single protocol, which I think is the same thing you said in your "apples and oranges" comment. I am not sure that answers the question: why RDF? - Frank
So just to clarify, I am all for publishing data as RDF, Linked Data etc. and this is something I am trying to raise the profile of where I am working now. However, the standard response so far has been, "We have our data in SQL and we can do what we want to do; when people ask for some information we give them a spreadsheet of it". @Eric, I would be interested to know if anything like this occurred in UniProt and how you counteracted it. We have FASTA, TEXT, XML; why do we need another format in RDF? - Frank
Summarising so far, I should use RDF because I get: 1) Integration with other RDF resources via a common protocol 2) No need to understand the data structure/schema, therefore no file parsing - (bioinformaticians out of jobs :) ) 3) Similar to 2, but the direct output of a query can be the direct input to, or be integrated with, other resources without the need for parsing 4) Access to data without exposing the source (e.g. no direct access to the database), a plus for security 5) User bears more of the cost of queries than the data provider, when compared to database query load (is this really true?) - Frank
6) Extra value (similar to 1) can be added to the data via mash-ups with other resources, either by the provider or by external parties. The provider has the potential to gain extra value without putting in the effort to do so 7) Ontology annotation: the ability, and a standard way, to add meaning to data that is computationally amenable, which is harder to achieve via SQL - Frank
Plus, you can annotate web services with the semantic types they accept and produce (see SAWSDL) so you know any data published with a conforming type will fit. Not sure how many people are actually doing this yet (we're not). But you certainly can't do that sort of thing with direct DB access or table dumps - Andrew Clegg
fantastic discussion :) Might be suitable to also include a link to the article about the 10th anniversary of RDF, as linked on FF by Frank, Abhishek, Yann, Pierre, etc: http://friendfeed.com/e... - Allyson Lister
Yeah, I just downloaded that one, looks like a good intro - Andrew Clegg
FASTA is perfect for people who just need the sequence data (including an identifier and a human-readable label). But it's not advisable to put much more than that in there (some have tried)... - Eric Jain
TAB-delimited data is great for "flat" data such as lists of identifiers. But it won't work well for complex data (data types, references) without a RDBMS-specific schema. Also, once the data has to be split into multiple files, it's no longer practical to process the data without loading it into an RDBMS first. - Eric Jain
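As a rough illustration of the multiple-files problem: once related data is split across two tab-delimited files, answering even a simple question means loading both into a database and joining on a key that only convention defines. The sample data is made up, and SQLite (in memory, from the Python standard library) stands in for whatever RDBMS one would really use.

```python
import csv
import io
import sqlite3

# Made-up sample data standing in for two tab-delimited export files.
PROTEINS_TSV = "protein_id\ttaxon_id\nP1\t9606\nP2\t10090\n"
TAXA_TSV = "taxon_id\tname\n9606\tHomo sapiens\n10090\tMus musculus\n"


def load(conn, table, tsv_text):
    """Create a table from the TSV header row and insert the remaining rows."""
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    header, data = rows[0], rows[1:]
    conn.execute(f"CREATE TABLE {table} ({', '.join(header)})")
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", data)


conn = sqlite3.connect(":memory:")
load(conn, "proteins", PROTEINS_TSV)
load(conn, "taxa", TAXA_TSV)

# The join key (taxon_id) is a convention the reader has to know;
# nothing in the files themselves declares it.
joined = conn.execute(
    "SELECT p.protein_id, t.name FROM proteins p "
    "JOIN taxa t ON p.taxon_id = t.taxon_id ORDER BY p.protein_id"
).fetchall()
print(joined)
```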
TEXT has a lot going for it: It's by far the cheapest solution to work with as it can be edited and displayed directly. But it gets quite ugly as the data gets more structured (natural for a growing database) and code needs to manipulate the structure in more detail rather than just passing around blobs of text. - Eric Jain
XML can get rid of the custom syntax quirks that tend to accumulate in text formats. But we now need to invest quite a lot of effort into creating human-readable/editable views of the data, and shuffling around the data (especially large amounts) is a lot more involved. There are some code generation tools, and to pass around complex data without handling it there is a generic tree data model. XQuery can be used to query within (sets of) documents. - Eric Jain
RDF addresses some of the issues XML leaves open (e.g. how to identify and reference resources globally) and provides a much more useful (for data integration, at least) non-document centric, graph-based data model (plus a standard query language). The "schema" language (OWL) is more focused on describing logical constraints than syntax (as is the case with XSD). - Eric Jain
Note that all of these formats can either be generated on the server-side based on some (more or less standard) query mechanism, or they can be pre-generated and distributed in bulk (e.g. via FTP). So that's really a separate issue. For publishing data, I consider TEXT and XML to be legacy formats (i.e. you might need to continue to support them, but wouldn't bother providing them for new databases). - Eric Jain
@Eric have you any indication that anybody is using RDF_uniprot, and what for? Or does anybody else, for that matter, know how it's being used? - Frank
The RDF distribution of UniProt was being used in various RDF demonstration projects. Don't know if anyone (outside of UniProt) has started using it (the main part, not just the smaller parts such as the taxonomy) for "real" work since then (i.e. end of 2007). - Eric Jain