Daniel Mietchen
Looking for ways to convert #XML into #MediaWiki. Any pointers/ suggestions?
Direct import via Special:Import does not work for XML from places like - Daniel Mietchen
Try XSLT. @rdmpage has already worked on the PLOS files (e.g. ) and, on my side, I wrote a stylesheet, pubmed2wiki: - Pierre Lindenbaum
I am not sure whether XSLT can handle all of the formatting but your links look promising - will definitely dive in there. - Daniel Mietchen
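To make the XSLT idea above concrete, here is a minimal sketch of the same kind of element-to-markup mapping, written in Python with the standard library instead of a stylesheet. The SAMPLE fragment is a made-up, heavily simplified NLM-style article, and the function names inline_to_wiki and article_to_wiki are hypothetical; this is not pubmed2wiki or anyone's actual converter. It handles only titles, top-level sections, paragraphs, italic and bold; images, equations, tables, references and nested sections are ignored.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified NLM-style fragment (assumption:
# real articles carry far more structure than this).
SAMPLE = """<article>
  <front><article-meta><title-group>
    <article-title>An Example Article</article-title>
  </title-group></article-meta></front>
  <body>
    <sec>
      <title>Introduction</title>
      <p>Plain text with <italic>emphasis</italic> and <bold>strong</bold> words.</p>
    </sec>
  </body>
</article>"""

# Inline elements mapped to MediaWiki quote markup; anything else is
# flattened to its text content.
INLINE_MARKUP = {"italic": "''", "bold": "'''"}


def inline_to_wiki(elem):
    """Render an element's mixed content as wikitext."""
    parts = [elem.text or ""]
    for child in elem:
        mark = INLINE_MARKUP.get(child.tag, "")
        parts.append(mark + inline_to_wiki(child) + mark)
        parts.append(child.tail or "")
    return "".join(parts)


def article_to_wiki(xml_text):
    """Convert the article title and top-level sections to wikitext."""
    root = ET.fromstring(xml_text)
    blocks = []
    title = root.find(".//article-title")
    if title is not None:
        blocks.append("= " + inline_to_wiki(title).strip() + " =")
    for sec in root.iterfind("./body/sec"):
        heading = sec.find("title")
        if heading is not None:
            blocks.append("== " + inline_to_wiki(heading).strip() + " ==")
        for para in sec.iterfind("p"):
            blocks.append(inline_to_wiki(para).strip())
    return "\n\n".join(blocks)


wikitext = article_to_wiki(SAMPLE)
print(wikitext)
```

Whether a real stylesheet or a script like this copes with all of the formatting is exactly the open question; the sketch only shows that the basic structure maps over directly.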
Brief test, following up on @rdmpage (for background, see ): using ( script variant at ) to import (which he had converted from ) into a demo wiki gave . Reasonable for a first try, but are there better HTML2wiki tools, and ones that could be accessed with (existing) scripts? I am aiming at a fully automated process, though I would not mind some manual interaction or formatting imperfections in the demo phase. - Daniel Mietchen
If you're using Semantic MediaWiki, and work pretty well. - Adam Kraut
As mentioned above, Extension:External Data is probably the best option out there. I'm not sure it even requires Semantic MediaWiki - all it does is import data from XML, CSV, MySQL or LDAP into Templates. - Laurent Alquier
Thanks for the pointer to these Extensions, Adam and Laurent. I haven't tested them yet, but would they make it possible to produce something like (or better, i.e. with properly formatted images, equations and so on) on the basis of the XML or HTML of the article? - Daniel Mietchen
Also, if HTML2wiki is necessary anyway (as it is if we follow Roderic's path), is it then better to produce an HTML version from the XML, or can we simply take the article's HTML? - Daniel Mietchen
Another interesting starting point would be the PubMed Central XML - it avoids the hassle of dealing with the XML formats of all the individual publishers. As per , a direct search by license is not yet possible, but a search for the term "evolution", restricted to review articles entered into PubMed over the last two years whose full text is available in PubMed Central, yields 507 articles, 52 of which are CC-BY. Assuming that this 10% rate holds when the search limiter "evolution" is removed (11094 hits then, hence about 1k under CC-BY), or when both "evolution" and "review" are removed ( : 181409, hence about 18k CC-BY-licensed articles), that would be a sizeable seed for the wiki. - Daniel Mietchen
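The back-of-the-envelope extrapolation above can be written out as a short calculation. The counts are the ones quoted in the comment; the assumption that the CC-BY share of the "evolution" review sample carries over to the broader searches is the comment's own, and the variable and function names are just illustrative.

```python
# Counts quoted in the comment above.
SAMPLE_HITS = 507        # "evolution" review articles with full text in PMC
SAMPLE_CC_BY = 52        # of which CC-BY
REVIEW_HITS = 11094      # same search without the "evolution" limiter
ALL_HITS = 181409        # same search without "evolution" and "review"

# Observed CC-BY share in the sample (roughly 10%).
cc_by_rate = SAMPLE_CC_BY / SAMPLE_HITS


def estimate_cc_by(hits, rate=cc_by_rate):
    """Extrapolate the CC-BY count, assuming the sample rate holds."""
    return round(hits * rate)


review_estimate = estimate_cc_by(REVIEW_HITS)
all_estimate = estimate_cc_by(ALL_HITS)

print(f"CC-BY rate in sample: {cc_by_rate:.1%}")
print(f"Estimated CC-BY review articles: {review_estimate}")
print(f"Estimated CC-BY articles overall: {all_estimate}")
```

The estimates land in the same range as the "about 1k" and "about 18k" figures quoted above; they are only as good as the assumption that the rate is uniform across subject areas and article types.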
Just used Special:ImportXML (provided by Extension:DataTransfer) to import the nxml file from the Mar. Drugs article contained in and got the error messages "Expected 'Pages', got 'article'", "Expected <Page>, got <front>", "Expected <Page>, got <journal-meta>" and "Expected <Page>, got <journal-id>", along with a note "0 pages will be created from the XML file." Repeated with the XML files from and - all the same. Extension:ExternalData does not appear to be suitable for importing entire articles, as it was made for grabbing smaller pieces of information from external sources. Could be helpful to import from databases, though (e.g. into infoboxes). - Daniel Mietchen
Some more info on the XML used by PLoS and PubMed (via Richard Cave): "They are all tagged to parse against the NLM/NIH journal publishing DTD. This is a standard DTD for life-sciences publishers and all articles in PubMed Central have to be published in this DTD. PLoS articles are currently using DTD v2.0 (but we'll be at v2.3 within the next few months). Information about v2.0 is at: (NLM is internally converting articles to v3.0 for publishing on PubMed Central. Information on this version is at: ). The NLM/NIH group has written some tools for previewing and transforming the XML. Info on these tools at: ." - Daniel Mietchen
Some more on Data Transfer (via Yaron Koren): it requires XML in its own specific format - in order to get your XML into the wiki, one would first have to convert it, probably using XSLT, into the other format. Though it might be easier to just convert the XML into CSV format, i.e. a simple table of data - Data Transfer handles CSV as well, and that one's easier all around. - Daniel Mietchen
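As a rough illustration of the CSV route described above, the sketch below pulls a few metadata fields out of a simplified NLM-style fragment and writes them as one CSV row. Several things here are assumptions rather than facts from this thread: the header convention ("Title" plus "Template[field]" columns) is how I understand Data Transfer's CSV import to work and should be checked against the extension's documentation; the template name "Article", its field names, the helper names and the SAMPLE fragment are all hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical, simplified NLM-style metadata (not a real article).
SAMPLE = """<article><front>
  <journal-meta><journal-id>Mar Drugs</journal-id></journal-meta>
  <article-meta>
    <title-group><article-title>An Example Article</article-title></title-group>
    <pub-date><year>2009</year></pub-date>
  </article-meta>
</front></article>"""


def text_of(root, path):
    """Return the flattened text of the first match, or '' if absent."""
    node = root.find(path)
    return "".join(node.itertext()).strip() if node is not None else ""


def article_to_csv_row(xml_text, template="Article"):
    """Map article metadata to columns.

    Assumption: the import expects a 'Title' column for the page name
    and 'Template[field]' columns for template parameters.
    """
    root = ET.fromstring(xml_text)
    return {
        "Title": text_of(root, ".//article-title"),
        f"{template}[journal]": text_of(root, ".//journal-id"),
        f"{template}[year]": text_of(root, ".//pub-date/year"),
    }


def rows_to_csv(rows):
    """Serialize a non-empty list of row dicts to CSV text with a header."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


csv_text = rows_to_csv([article_to_csv_row(SAMPLE)])
print(csv_text)
```

A flat table like this suits metadata (e.g. for infoboxes) much better than full article bodies, which is consistent with the limitation noted earlier in the thread.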
Do check out , especially the bibliography at the end, for a lot of information about wiki interchange formats. - Chris M
Thanks, Chris - this page is indeed full of interesting links, though I couldn't yet find one that would be directly relevant to our problem. Meanwhile, Konrad Foerstner has started to code a PMC2wiki converter, as per . It is not functional yet, so suggestions, improvements, fixes etc. are very welcome. He also pointed me to which may be useful to display PMC's XML as HTML. - Daniel Mietchen
That looks good -- I hadn't known about the ViewNLM-v2.3 distribution, that it contained an XML to HTML transform stylesheet. - Chris M
An XML2HTML conversion is outlined at . - Daniel Mietchen