Peter asks: "So why does academia use PDF? Because the publishers like it. Because it looks like a good way to preserve things" -- There are many reasons why publishers should not / do not like PDFs. But I've heard more than once that it is the scientists and academics themselves who like / are most comfortable with PDFs and that publishing companies ought to provide documents in the format that their readers want.
- Hilary
Yeah, I'm not sure that PMR is being entirely fair there. I think that if you gave most researchers of my acquaintance anything other than a pdf, they'd pitch a monumental hissy fit because it wasn't what they were used to.
- Bill Hooker
Agree with Bill. A researcher told me he didn't have full-text access to something. When I checked, he had HTML full text but not a PDF, and that's what he actually meant.
- suelibrarian
I remember reading in WSJ or NYT or similar a few years ago that there is an age divide between people who can read electronically and those who need to print it out on paper. My age (1974 birth year) was right on the boundary. I've definitely grown much more accustomed to reading electronic PDFs. However, my ability to read HTML versions of articles is awful. For example, I recently pored over a PLoS ONE article that I knew I was going to comment on. I used the PDF version for reading and note taking, and then used another laptop (dual monitors!) for occasionally commenting on the HTML version.
- Steve Koch
I haven't read PMR's post, but I can think of two reasons why I prefer PDFs. (1) I've been taught how to read scientific papers based on the printed old-school versions. Thus I'm wired for PDFs and need lots of practice to deal with HTML versions. This is probably a huge factor for most people, but I think I could get over it. (2) I like to take notes in the margin, highlight, etc. and save these for future reference. There's pretty much no good way to do this with HTML so far. As far as I know.
- Steve Koch
Thanks, Neil -- I will look into it. I've looked for it several times in the past, but never found something that I could use with confidence of never losing my notes and / or being able to share them. The first I can remember was I think called "Third Voice" in the late 90's? Recently, Cameron gave me a couple good links (http://friendfeed.com/steveko...). I am still using Evernote, but obviously it's not what we're talking about here.
- Steve Koch
In addition, many bookmark managers (I use Diigo) allow you to add notes, highlights and so on. See also: http://a.nnotate.com/
- Bill Hooker
I guess basically since PDFs are always available (for most scientific articles) AND importantly, you can count on everyone else being able to read a PDF that you share with them (I'm thinking students), I've not yet been able to switch.
- Steve Koch
If Mendeley (or others) come out with an ability to store a PDF of an article, that allows multiple people to scribble on it in their own colored ink, without sync issues, I think that's going to be great. Probably that will perpetuate whatever problem PMR is talking about (which I still haven't read :) )
- Steve Koch
Not to mention, Neil & Bill, I demand some credit for at least being able to read a paper without printing it out on paper :)
- Steve Koch
PS I'm older than Steve (I hate you accomplished whippersnappers) but I have no trouble with html and on-screen reading. It did take me a while to break the print-it-out-and-read habit though. There are applications for which printouts are still useful -- reading on the bus, for instance, if like me you can't afford a laptop -- and this should be taken into account when developing scholarly html, since regular web pages usually look like shit when printed.
- Bill Hooker
You're older than me? You seem like more of a whippersnapper definitely.
- Steve Koch
@Neil, I like Diigo quite a lot. It's fast, intuitive to use and has some neat features like highlighting (though I haven't used that yet). There are some problems -- tag import from Simpy is messed up, and tag editing in general seems to be fubar at the moment -- but the forums are active and progress does seem to be being made on the bugs. I liked Simpy much better (cleaner, everything worked) but it just got too slow and had too much downtime -- not being actively maintained -- and Diigo is the best replacement I could find (I tried more than a dozen).
- Bill Hooker
Typesetting of PDFs, in most journals, is superior to HTML, which is why I prefer to read a PDF version if it is available. It is nicer to the eyes. But it certainly does not have a friendly backend for our expected semantic future...
- Bruno C. Vellutini
Come on, everyone. PDF does suck! Big time! (HTML can too, if used badly.) We have probably spent millions of grant money on text mining. Which is money down the drain. If we had saved the facts in a useful data format, these millions could have been spent on *your* research. Think about that!
- Egon Willighagen
As I seem to be the only physicist/mathematician who comments on this sort of thing, I feel like a broken record, but math support in browsers currently sucks extremely badly and this is a primary reason why we will continue to use PDF for quite some time. Whilst I am in favor of things being stored in a semantic, searchable format such as HTML, we cannot win in physics/math until every commonly used browser supports MathML natively, MathML fonts are extended to contain ALL commonly used mathematical notation, and LaTeX to MathML tools are developed that don't require you to extensively edit the paper you are going to submit to a journal.
- Matt Leifer
Matt, I had the impression that the HTML/MathML technology was well established... there is quite a bit of MathML in this page: http://qsar.sourceforge.net/dicts... And except for a few bits of markup that got outdated (it seems), many equations show up nicely in off-the-shelf Firefox...
- Egon Willighagen
Is it historical? A lot of academics have used typesetting languages like TeX, which produce PostScript files for printing, and as PDF is just PostScript but with a free reader for the desktop, it makes sense.
- John Cooper
Just off the plane... great to see discussion. Points: (i) Why does citation need page numbers? Why not just a DOI? (ii) Has anyone tried marking up a PDF so the terms are hyperlinked? I have, it's ghastly. (iii) The reason academics use PDF is that the publishers weaned them onto it in the late 1990s; the rest of the web was fine with HTML. If we want an intelligent medium we have to go beyond pages.
- peter murray-rust
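[Editorial sketch] Point (i) above, citing by DOI rather than by page number, is easy to act on once the text is plain HTML. Here is a minimal Python sketch, not anyone's production code: the regular expression is a loose approximation of the DOI syntax, `link_dois` is a hypothetical helper name, and the DOI in the usage note is made up for illustration.

```python
import html
import re

# Loose pattern for DOIs: "10.", a registrant code, "/", then a suffix.
# Assumption: good enough for a sketch; real DOIs can be messier.
DOI_PATTERN = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')

def link_dois(text: str) -> str:
    """Turn plain text into HTML where every DOI is a resolver link."""
    parts = []
    last = 0
    for match in DOI_PATTERN.finditer(text):
        # Drop sentence punctuation that the greedy pattern swallowed.
        doi = match.group(0).rstrip(".,;)")
        end = match.start() + len(doi)
        parts.append(html.escape(text[last:match.start()]))
        safe = html.escape(doi)
        parts.append(f'<a href="https://doi.org/{safe}">{safe}</a>')
        last = end
    parts.append(html.escape(text[last:]))
    return "".join(parts)
```

Calling `link_dois("See doi:10.1234/example.5678.")` wraps the (hypothetical) DOI in a link to the doi.org resolver and leaves the trailing full stop outside the link. The same is much harder to do reliably inside a PDF, which is the "ghastly" part of point (ii).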
As an aside a comment I heard yesterday "one of the things we got out of user testing is that our online services were not enabling the very important user activity of reading in bed"
- Cameron Neylon
Yet, they seem to have no problem using the web for accessing any other data in bed...
- Egon Willighagen
At the end of the day the only way I think this will change is by journals privileging the online HTML form over the PDF and making it more functional. Once the paper printout (which for most is synonymous with PDF) is a second-rate version of the document, then things will start (slowly) to improve, I think.
- Cameron Neylon
Occurs to me that the other community that likes PDF is government. Need for a sense of control rearing its ugly head again?
- Cameron Neylon
D0r0th34, you mean your editor does not import PDFs??
- Egon Willighagen
We need proper versioning and provenance rather than immutability. Treat the root cause rather than the symptom?
- Cameron Neylon
Yes, I'm beginning to feel that we mostly agree, to the extent that people feel they need to take contrary positions on this :-) How do we actually break the conversation out to the people causing the problems...?
- Cameron Neylon
Scientists should talk to their peers... reviewers and authors talking to editors (who are scientists too). Publishers are listening...
- Egon Willighagen
PDF for article publishing is like TIFF for image publishing -- the practice started a long time ago and even though there are better standards for each (HTML and JPEG2000, respectively) the old ways continue to hang on. Inertia, perhaps?
- Peter Murray
One nice thing about a PDF, though, is that it is an all-encompassing file format. With HTML, you have separate files for the marked up text, the style sheet, the graphics, and so forth. We don't have a similar widely-adopted file structure for bundling up all of the parts of an HTML page into one physical file. (Although WARC sounds promising.)
- Peter Murray
I really think it's more about typesetting and formatting. Publishers like to make things look the same everywhere, and this cannot be achieved using HTML (Unless you're extremely anal and write a CSS file for every known browser). Peter's comment above also gets at another problem with using HTML. It's not all encompassing in that inserted images are not a part of the document. They must be externally linked files (You don't need a separate CSS or javascript file, programmers just do that to keep their code clean and non-redundant.)
- Brian Krueger - LabSpaces
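[Editorial sketch] On the "images must be externally linked files" point: HTML can carry images inline as base64 `data:` URIs (RFC 2397), so a single-file HTML article is possible in principle, although older browsers limit or lack support. A minimal Python sketch under stated assumptions: `inline_images` is a hypothetical helper, images are local files referenced by relative path, and a regular expression stands in for a real HTML parser.

```python
import base64
import mimetypes
import re
from pathlib import Path

# Matches the src="..." attribute of an <img> tag. A regex is enough for a
# sketch; a real tool would use an HTML parser.
IMG_SRC = re.compile(r'(<img\b[^>]*?\bsrc=")([^"]+)(")', re.IGNORECASE)

def inline_images(html_text: str, base_dir: Path) -> str:
    """Replace local image references with base64 data URIs."""
    def to_data_uri(match: re.Match) -> str:
        src = match.group(2)
        path = base_dir / src
        # Leave remote URLs, existing data URIs, and missing files untouched.
        if "://" in src or src.startswith("data:") or not path.is_file():
            return match.group(0)
        mime = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
        payload = base64.b64encode(path.read_bytes()).decode("ascii")
        return f"{match.group(1)}data:{mime};base64,{payload}{match.group(3)}"

    return IMG_SRC.sub(to_data_uri, html_text)
```

The trade-off is size: base64 inflates each image by roughly a third, and the result is harder to cache than separate files, which may be part of why this never became the default way to ship an article.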
I'm really enjoying this conversation, by the way. Just wanted to point you to an (open source) HTML/Markdown/Runtime-rendering with plain text fallback (that's a mouthful) documentation system I developed called Mandown: http://wittman.org/mandown/ . It's aimed at building accessible 'How To' manuals rather than academic/research compositions, but I'd certainly invite any feedback as to how it might be useful or modified for that context.
- Micah
Egon - I know that LaTeX to HTML/MathML is possible (although I'd recommend tex4ht as the current best tool rather than ttm). There are still plenty of problems with it - too many to go into here, but perhaps I'll write a blog post on it at some point. However, perhaps the main problem is that it is not currently integrated into scientists' workflows. I don't know any user-friendly LaTeX authoring system that offers a one-click "publish as HTML on the web" option. Most journals do not produce an HTML version and neither does the arXiv. It would be fairly easy for them to do this by incorporating a tex4ht script into their existing LaTeX processing system. Additionally, even on things like blogs, forums and wikis (including Wikipedia), it is still much more common to see LaTeX equations rendered as images rather than as MathML, which provides no more semantic content for the equations than a PDF. Although MathML scripts have existed for many years, people are still aware of the fact that images produce more reliable output.
- Matt Leifer
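[Editorial sketch] To make the "semantic content" point concrete: an equation stored as MathML can be queried like any other XML, which an equation rendered as an image cannot. A minimal Python sketch using only the standard library; the MathML snippet is hand-written for illustration, and `identifiers` is a hypothetical helper, not part of any MathML toolchain.

```python
import xml.etree.ElementTree as ET

MATHML_NS = "http://www.w3.org/1998/Math/MathML"

# Presentation MathML for E = m c^2, written by hand for this sketch.
SNIPPET = f"""
<math xmlns="{MATHML_NS}">
  <mi>E</mi><mo>=</mo><mi>m</mi><msup><mi>c</mi><mn>2</mn></msup>
</math>
""".strip()

def identifiers(mathml: str) -> list:
    """Return the text of every <mi> (identifier) element, in document order."""
    root = ET.fromstring(mathml)
    return [element.text for element in root.iter(f"{{{MATHML_NS}}}mi")]
```

Here `identifiers(SNIPPET)` returns `['E', 'm', 'c']`. Nothing comparable is possible with a PNG of the same equation short of image recognition, which is the text-mining cost discussed earlier in the thread.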
I'm not sure I entirely agree with the premise that PDF only exists in academia. All but a few of my online bills come in PDF format. My insurance company ships me forms and documents in PDF format. Just providing some counter-examples.
- Peter Murray
I think there is no doubt that journals should always provide a hypertext full-version of articles so people can access and read within the browser, extract data with ease and build semantically driven databases for any purpose. It is scientifically better. However, if journals will still print hard copies they need typesetting. And why not provide a PDF with good quality formatting for personal printing or even offline reading? Simply an additional option if you just want to download a single file to read an article. As typesetting is a not-so-easy task, online journals or journals who can't afford it would simply not provide a PDF with no loss for the future of science (as articles would be officially in HTML).
- Bruno C. Vellutini
Peter Murray: a lossless compression format such as TIFF is preferred over JPEG at times when the image quality is significantly more important than the file size.
- Mike Chelen
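[Editorial sketch] The lossless-versus-lossy distinction can be shown in a few lines. This is a toy Python illustration, not an image codec: `zlib` implements Deflate, which is one of the lossless schemes a TIFF file can use, while the quantization step below is only a crude stand-in for the information a lossy codec such as baseline JPEG discards. Both function names are hypothetical.

```python
import zlib

def lossless_round_trip(pixels: bytes) -> bytes:
    """Compress and decompress with Deflate; every byte survives."""
    return zlib.decompress(zlib.compress(pixels))

def lossy_round_trip(pixels: bytes, step: int = 16) -> bytes:
    """Crude stand-in for lossy coding: snap each value down to a multiple of step."""
    return bytes((value // step) * step for value in pixels)

# A synthetic one-row "image": a smooth gradient of 256 grey levels.
gradient = bytes(range(256))

# Lossless: the original is recovered exactly.
assert lossless_round_trip(gradient) == gradient

# Lossy: fine gradations are gone and cannot be recovered.
assert lossy_round_trip(gradient) != gradient
```

For archival or measurement images the first property is usually the one that matters; for a figure meant only to be looked at, the smaller lossy file may be acceptable.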