Daniel Mietchen
Looking for ways to convert #XML into #MediaWiki. Any pointers/suggestions?
Direct import via Special:Import does not work for XML from places like http://www.plosone.org/article.... - Daniel Mietchen
Try using XSLT. @rdmpage has already worked on the PLOS files (e.g. http://code.google.com/p... ) and, on my side, I wrote a stylesheet, pubmed2wiki: http://code.google.com/p... - Pierre Lindenbaum
I am not sure whether XSLT can handle all of the formatting but your links look promising - will definitely dive in there. - Daniel Mietchen
Brief test, following up on @rdmpage (for background, see http://iphylo.blogspot.com/2010... ): using http://toolserver.org/~diberr... ( script variant at http://search.cpan.org/~diberr... ) to import http://iphylo.org/~rpage... (which he had converted from http://dx.doi.org/10... ) into a demo wiki gave http://www.science3point0.com/coasped... . Reasonable for a first try, but are there better HTML2wiki tools, and ones that could be accessed with (existing) scripts? I am aiming at a fully automated process, though I would not mind some manual interaction or formatting imperfections in the demo phase. - Daniel Mietchen
If you're using Semantic MediaWiki, http://www.mediawiki.org/wiki... and http://www.mediawiki.org/wiki... work pretty well. - Adam Kraut
As mentioned above, Extension:External Data is probably the best option out there. I'm not sure it even requires Semantic MediaWiki - all it does is import data from XML, CSV, MySQL or LDAP into Templates. - Laurent Alquier
Thanks for the hint at these Extensions, Adam and Laurent. I haven't tested them yet, but would they make it possible to produce something like http://www.science3point0.com/coasped... (or better, i.e. with properly formatted images, equations and so on) on the basis of the XML or HTML of the article http://dx.doi.org/10... ? - Daniel Mietchen
Also, if HTML2wiki is necessary anyway (as it is if we follow Roderic's path), is it then better to produce an HTML version from the XML, or can we simply take the article's HTML? - Daniel Mietchen
Another interesting starting point would be the PubMed Central XML - it avoids the hassle of dealing with the XML formats of all the individual publishers. As per http://www.science3point0.com/evomri... , a direct search by license is not yet possible. However, a search for the term evolution, restricted to review articles entered into PubMed over the last two years whose full text is available in PubMed Central, yields 507 articles, 52 of which are CC-BY. Assuming that this 10% rate holds when the search limiter "evolution" is removed (11094 hits then, hence about 1k under CC-BY), or both "evolution" and "review" (http://www.ncbi.nlm.nih.gov/pubmed... : 181409, hence about 18k CC-BY-licensed articles), that would be a sizeable seed for the wiki. - Daniel Mietchen
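The extrapolation above can be checked with a few lines of Python. This is only a sketch of the arithmetic, using the hit counts quoted in the comment; the function name `estimate_cc_by` is made up for illustration, and the estimate rests on the stated assumption that the CC-BY share of the sample carries over to the broader searches.

```python
# Figures quoted in the comment above.
SAMPLE_TOTAL = 507   # "evolution" review articles with full text in PMC
SAMPLE_CC_BY = 52    # of those, CC-BY licensed

# Observed CC-BY share in the sample (roughly 10%).
CC_BY_RATE = SAMPLE_CC_BY / SAMPLE_TOTAL


def estimate_cc_by(hits: int) -> int:
    """Scale a hit count by the sampled CC-BY share (hypothetical helper)."""
    return round(hits * CC_BY_RATE)


reviews = estimate_cc_by(11094)        # limiter "evolution" removed
all_articles = estimate_cc_by(181409)  # "evolution" and "review" removed

print(f"CC-BY share in sample: {CC_BY_RATE:.1%}")
print(f"Estimated CC-BY reviews: {reviews}")
print(f"Estimated CC-BY articles: {all_articles}")
```

With the sample share of about 10.3%, this gives roughly 1.1k and 18.6k articles, in line with the "about 1k" and "about 18k" quoted above.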
Just used Special:ImportXML (provided by Extension:DataTransfer) to import the nxml file from the Mar. Drugs article contained in ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/66/f2/ and got the error messages "Expected 'Pages', got 'article'", "Expected <Page>, got <front>", "Expected <Page>, got <journal-meta>" and "Expected <Page>, got <journal-id>", along with a note "0 pages will be created from the XML file." Repeated with the XML files from http://www.plosone.org/article... and http://pensoftonline.net/zookeys... - all the same. Extension:ExternalData does not appear to be suitable for importing entire articles, as it was made for grabbing smaller pieces of information from external sources. Could be helpful to import from databases, though (e.g. into infoboxes). - Daniel Mietchen
Some more info on the XML used by PLoS and PubMed (via Richard Cave): "They are all tagged to parse against the NLM/NIH journal publishing DTD. This is a standard DTD for life-sciences publishers and all articles in PubMed Central have to be published in this DTD. PLoS articles are currently using DTD v2.0 (but we'll be at v2.3 within the next few months). Information about v2.0 is at: http://dtd.nlm.nih.gov/publish... (NLM is internally converting articles to v3.0 for publishing on PubMed Central. Information on this version is at: http://dtd.nlm.nih.gov/publish... ). The NLM/NIH group has written some tools for previewing and transforming the XML. Info on these tools at: http://dtd.nlm.nih.gov/tools/ ." - Daniel Mietchen
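To give a feel for what a direct conversion from NLM-tagged XML to wikitext involves, here is a minimal Python sketch using only the standard library. It is not any of the tools mentioned in this thread. The sample document is invented, and only a handful of element names from the NLM journal publishing tag set are handled (`article-title`, `body`, `sec`, `title`, `p`, `italic`, `bold`); figures, tables, equations and references are ignored entirely.

```python
import xml.etree.ElementTree as ET

# Invented sample using a few NLM journal publishing element names.
SAMPLE = """<article>
  <front>
    <article-meta>
      <title-group><article-title>A Made-Up Example</article-title></title-group>
    </article-meta>
  </front>
  <body>
    <sec>
      <title>Introduction</title>
      <p>Plain text with <italic>italic</italic> and <bold>bold</bold> words.</p>
      <sec>
        <title>Background</title>
        <p>A nested section.</p>
      </sec>
    </sec>
  </body>
</article>"""

# Inline NLM elements mapped to MediaWiki quote markup.
INLINE = {"italic": "''", "bold": "'''"}


def inline_text(elem):
    """Flatten mixed content, wrapping italic/bold in wiki quotes."""
    parts = [elem.text or ""]
    for child in elem:
        mark = INLINE.get(child.tag, "")  # unknown inline tags keep text only
        parts.append(mark + inline_text(child) + mark)
        parts.append(child.tail or "")
    return "".join(parts)


def section_to_wiki(sec, level):
    """Turn titles into headings and paragraphs into text, recursing into <sec>."""
    lines = []
    for child in sec:
        if child.tag == "title":
            bars = "=" * level
            lines.append(f"{bars} {inline_text(child).strip()} {bars}")
        elif child.tag == "p":
            lines.append(inline_text(child).strip())
            lines.append("")  # blank line ends a wikitext paragraph
        elif child.tag == "sec":
            lines.extend(section_to_wiki(child, level + 1))
    return lines


def article_to_wiki(xml_text):
    """Return (page title, wikitext body) for one article."""
    root = ET.fromstring(xml_text)
    title = root.findtext(
        "front/article-meta/title-group/article-title", default=""
    ).strip()
    body = root.find("body")
    # Top-level <sec> elements become level-2 headings (== ... ==).
    lines = section_to_wiki(body, level=1) if body is not None else []
    return title, "\n".join(lines).strip()


page_title, wikitext = article_to_wiki(SAMPLE)
print(page_title)
print(wikitext)
```

Real PMC or PLoS files carry DOCTYPE declarations, namespaces for things like MathML and XLink, and far more element types, so a sketch like this would need substantial extension (or an XSLT stylesheet) before it could be used on actual articles.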
Some more on Data Transfer (via Yaron Koren): it requires XML in its own specific format - to get the article XML into the wiki, one would first have to convert it, probably using XSLT, into that format. Though it might be easier to just convert the XML into CSV format, i.e. a simple table of data - Data Transfer handles CSV as well, and that one's easier all around. - Daniel Mietchen
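The XML-to-CSV route suggested above could look roughly like the following Python sketch, which pulls a few metadata fields out of an NLM-style article and writes them as one CSV row. The sample XML is invented, and the column names (`Title`, `Journal`, `DOI`) are placeholders chosen for illustration; the header row that Data Transfer actually expects should be taken from the extension's documentation.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Invented sample; element names follow the NLM journal publishing tag set.
SAMPLE = """<article>
  <front>
    <journal-meta><journal-title>Example Journal</journal-title></journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.0000/example.1</article-id>
      <title-group><article-title>A Made-Up Example</article-title></title-group>
    </article-meta>
  </front>
</article>"""

# Placeholder column names, not the headers Data Transfer requires.
FIELDS = ["Title", "Journal", "DOI"]


def metadata_row(xml_text):
    """Extract title, journal and DOI from one article as a dict."""
    root = ET.fromstring(xml_text)
    doi = ""
    for node in root.findall("front/article-meta/article-id"):
        if node.get("pub-id-type") == "doi":
            doi = (node.text or "").strip()
    return {
        # findtext returns only leading text, so inline markup in titles is lost.
        "Title": root.findtext(
            "front/article-meta/title-group/article-title", default=""
        ).strip(),
        "Journal": root.findtext(
            "front/journal-meta//journal-title", default=""
        ).strip(),
        "DOI": doi,
    }


def rows_to_csv(rows):
    """Serialise a list of row dicts to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = rows_to_csv([metadata_row(SAMPLE)])
print(csv_text)
```

A table like this would suit infobox-style metadata; it does not address the full article body, which is the harder part of the problem discussed here.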
Do check out http://developer.marklogic.com/news... , especially the bibliography at the end, for a lot of information about wiki interchange formats. - Chris M
Thanks, Chris - this page is indeed full of interesting links, though I couldn't yet find one that would be directly relevant to our problem. Meanwhile, Konrad Foerstner has started to code a PMC2wiki converter, as per http://github.com/konrad... . It is not functional yet, so suggestions, improvements, fixes etc. are very welcome. He also pointed me to ftp://ftp.ncbi.nih.gov/pub/archive_dtd/tools/ViewNLM-v2.3.zip which may be useful to display PMC's XML as HTML. - Daniel Mietchen
That looks good -- I hadn't known about the ViewNLM-v2.3 distribution, or that it contained an XML to HTML transform stylesheet. - Chris M
An XML2HTML conversion is outlined at http://www.science3point0.com/coasped... . - Daniel Mietchen