
August 06, 2007

Open, online journals != PDF ?

I note that Volume 2, Number 1 of the International Journal of Digital Curation (IJDC) has been announced with a healthy looking list of peer-reviewed articles.  Good stuff.

I mention this partly because I helped set up the technical infrastructure for the journal using the Open Journal System, an open source journal management and publishing system developed by the Public Knowledge Project, while I was still at UKOLN - so I have a certain fondness for it.

Odd though, for a journal that is only ever (as far as I know) intended to be published online, to offer the articles using PDF rather than HTML.  Doing so prevents any use of lightweight 'semantic' markup within the articles, such as microformats, and tends to make re-use of the content less easy.
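To make the re-use point concrete, here is a minimal sketch of what lightweight markup makes possible. It assumes an article marks up its author with the hCard microformat class names (`vcard`, `fn`, `org`); the sample HTML, the author name and the `extract_hcards` helper are all made up for illustration and have nothing to do with how IJDC or OJS actually work. It uses only Python's standard library and expects well-nested markup.

```python
from html.parser import HTMLParser

# Hypothetical article fragment using hCard class names.
ARTICLE_HTML = """
<p>This article was written by
  <span class="vcard">
    <span class="fn">Jane Example</span>,
    <span class="org">Example University</span>
  </span>.
</p>
"""


class HCardParser(HTMLParser):
    """Collect 'fn' and 'org' text found inside elements with class 'vcard'."""

    PROPERTIES = ("fn", "org")

    def __init__(self):
        super().__init__()
        self.cards = []   # one dict per vcard element encountered
        self._stack = []  # role of each open element: 'vcard', a property, or None

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        role = None
        if "vcard" in classes:
            role = "vcard"
            self.cards.append({})
        else:
            for prop in self.PROPERTIES:
                if prop in classes:
                    role = prop
                    break
        self._stack.append(role)

    def handle_endtag(self, tag):
        # Assumes well-nested markup with no unclosed void elements.
        if self._stack:
            self._stack.pop()

    def handle_data(self, data):
        if not self.cards or not self._stack:
            return
        role = self._stack[-1]
        # Only record text that sits directly in a property element inside a vcard.
        if role in self.PROPERTIES and "vcard" in self._stack:
            card = self.cards[-1]
            card[role] = card.get(role, "") + data.strip()


def extract_hcards(html):
    """Return a list of dicts, one per hCard found in the HTML string."""
    parser = HCardParser()
    parser.feed(html)
    return parser.cards


if __name__ == "__main__":
    print(extract_hcards(ARTICLE_HTML))
```

Nothing comparable is this easy with a PDF, where the author's name is just more text on the page.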

In short, choosing to use PDF rather than HTML tends to make the content less open than it otherwise could be.  That feels wrong to me, especially for an open access journal!  One could just about justify this approach for a journal destined to be published both on paper and online (though even in that case I think it would be wrong) but surely not for an online-only 'open' publication?

Comments

Go find 'em a workflow that produces good HTML as well as PDF, and I'm sure they'll sign right on.

I'm right with you on this one. I HATE these PDF journals and articles -- or rather, my rather ancient computer, which functions happily until it meets a PDF file, hates them.
No, the answer is not to update my computer since I've grown rather fond of the old thing. It still does everything (except reading large PDF files) that I want it to do including updating my websites running DreamWeaver (admittedly MX 2004 but what the heck -- it works!)

Absolutely right - PDF files are the scourge of the Internet in my opinion :-) What's the point of needing your browser to access another application before you can read the file? In an age when you can see pictures embedded in text, and video embedded in blogs, it's positively stone-age to do this!

Check out Information Research for a scholarly journal that sticks with html - or, rather, xhtml.
(http://Informationr.net/ir/)

An alternative that we try to do at BioMed Central is to have pdf AND html. People like to read printed-out pdfs (over 90% of accesses to the fulltext are of the pdf version) - but machines like to read marked-up text. We also make the xml versions available for precisely this purpose.

PDFs are there cos people want to print them and read them on the train/in the bath etc. However, I agree that there should always be at least a html version alongside this.

@Dorothea: Your comment is spot-on, it's the tools.

I have blogged a response.

I'm sorry if this comes out a little bit brash, but as someone who has been wrestling with problems related to formats and formatting for a long time I feel the need to comment and come to PDF's defense (as silly as it is to defend a *file format*, but whatever).

I've been hand-coding HTML for close to 10 years and there are many reasons why using PDF or PDF + HTML can be the right choice for your journal. Here are a few:

1. HTML is an interpreted markup language, which gives you a virtual guarantee that it will render incorrectly in somebody's browser. Do you want to be responsible for your author's beautifully formatted article looking like total crap in Netscape Navigator 4.7 on a Mac, or on something else that is as incompatible with modern web standards as can be and that you can't feasibly test unless you curate a museum of personal computing or happen to own a time machine? I don't think so.

2. Printing HTML is still horribly inconsistent and will remain so in the foreseeable future, because the format was never meant for printing (that's what we have PS for). What happens to the page layout, tables, numbering when you print can vary from browser to browser, OS to OS and article to article. The creative variations are wonderful, but only if you like dealing with enraged authors.

3. PDF is arguably more portable if the document contains images, charts etc. People like to email articles, stick them on thumb drives or move them around in other ways. Is it "the Open Access spirit" to prevent them from doing that?

4. PDF != Adobe. A lot of people don't seem to be aware of that. I wholeheartedly agree that Adobe Reader's browser plug-in sucks, but that's not the file format's fault. Setting Firefox to download PDFs instead of rendering them inline and using something more lightweight instead of Adobe Reader (Foxit Reader is decent on Windows) helps. Hazel, if your box can run an incredible resource hog such as Dreamweaver it can surely render PDF files.

5. "Odd though, for a journal that is only ever (as far as I know) intended to be published online, to offer the articles using PDF rather than HTML. Doing so prevents any use of lightweight 'semantic' markup within the articles."

Yeah, but if you really want semantic markup why not do it right and use XML? The problematic thing with OJS (at least to some extent) is/was that XML article versions are not the basis for the "derived" PDF and HTML, which deal almost purely with visuals. XML is true semantic markup and therefore the best way to store articles in the long term (who knows what formats we'll have 20 years from now?). HTML can clearly never fill that role - it's not its job either. From what I've heard OJS will implement XML (and through it neat things such as OpenOffice editing of articles while they're in the workflow) via Lemon8 in the future.

6. "One could just about justify this approach for a journal destined to be published both on paper and online (though even in that case I think it would be wrong) but surely not for an online-only 'open' publication?"

Wait, who are we to decide whether an article can be read online, offline, on a PC, on a mobile device or on paper? I know that's not what you meant, but effectively shunning PDF means giving readers less reliability when it comes to printing and file mobility because HTML was never optimized for these things. It means giving them fewer options to choose from. Is "Open Access" really "the *correct* way we want you to use our stuff"?

7. "An alternative that we try to do at BioMed Central is to have pdf AND html. People like to read printed-out pdfs (over 90% of accesses to the fulltext are of the pdf version) - but machines like to read marked-up text. We also make the xml versions available for precisely this purpose."

I know I'm being annoying here, but what exactly would it be that machines "like" about mark-up? Last time I checked, Googlebot was perfectly happy indexing all of our PDFs in full and finding lots of useful search results in them. From a spider's point of view 95% of HTML is useless anyway as it just makes things look pretty. The issue of PDFs not being crawled is a historic one and while there are lots of good arguments for XML when it comes to indexing, I can't find any for HTML.

I'm surely no traditionalist (heck, I hope to publish my thesis as a wiki), but PDF is "stone-age" only if our entire concept of "documents" is stone-age, and from what I see most people seem to disagree with that. Certainly the scientific article is bound to evolve beyond its current form in its new Web environment, but if I look around the vast majority of what I see is still very much print-looking articles. Using a format that was developed for publishing in contrast to one that was developed for building web sites while that is still the case seems quite sensible to me. My impression is that there's a common association of PDF with Adobe and its sucky reader and bad experiences with inline rendering. But that doesn't make HTML the silver bullet or "more OA" than PDF.

You make some very valid points but I think people are a little harsh on PDF. Personally, I end up mostly downloading PDFs of articles when available. But both formats should be available. HTML is great for browsing and getting a quick sense of an article as well as opening up the other possibilities that you mention.

I was just checking through some OJS-based journals and noticed that several of them are only in PDF. Hmmm, but a few are in HTML and PDF. It has been a couple of years since I've examined OJS but it seems that OJS provides the tools to generate both HTML and PDF, no? Ironically, I was going to do a quick check of the OJS documentation but found that it's mostly only in PDF!

I suspect if a journal decides not to provide HTML then it has some perceived limitations with HTML. Often, for scholarly journals, that revolves around the lack of pagination. I noticed one OJS-based journal using paragraph numbering but some editors just don't like that and insist on page numbers for citations. Hence, I would bet that's why they chose PDF only.

As an academic, I prefer the XHTML + PDF option myself. There are times I just want to quickly view an article in a browser without the hassle of PDF. There are other times I want to print it and read it "on the train."

With new developments like microformats and RDFa, I'd really like to see a time soon where I can even copy-and-paste content from HTML articles into my manuscripts and have the citation metadata travel with it.

I've edited an OA journal for 11 years. For the first 9 we published in PDF and HTML. Back in 1996 readers had all sorts of problems installing and using Adobe Reader and virtually everyone accessed the HTML versions of the articles. Over time it transitioned to almost everyone accessing the PDF version and two years ago we quit publishing in HTML.

Converting the manuscripts into HTML took an extra 1/2 hour or so. Not much time but it adds up on top of all the other work it takes to operate an OA peer reviewed journal. At the time I was operating the journal myself and just said the hell with it, it's not worth publishing in two formats.

I don't disagree with the complaints about PDF format but Peter Sefton makes some good points as well. The fact is most readers want PDFs. Journals like Postgraduate Medicine and Journal of Medical Internet Research actually make money selling the PDF version so they can publish the HTML version for free. I think that pretty much says it all.

I suspect the vast majority of articles are still written in MS Word, and that an easy MS Word --> HTML converter would probably do a lot to encourage journal submissions in both PDF and HTML. Using Word's own HTML export features and then HTML Tidy (see http://tidy.sourceforge.net/) is still much more painful than printing to PDF.

Hey, thanks for all the comments on IJDC, PDF etc. Actually I made a blog posting a few days earlier, partly explaining WHY we use PDF (which I don't like doing), explaining how we plan to take this forward, and also asking for suggestions on how we might do some other things better (notably adding access to data).

See the blog entry at http://digitalcuration.blogspot.com/2007/07/ijdc-issue-2.html.

The comments to this entry are closed.
