
July 12, 2007

Journal articles, metadata formats and woes

In a post on his Digital Library Technology Jester weblog, Peter Murray of OhioLINK points to an XML format developed by the Directory of Open Access Journals (DOAJ) for representing descriptions of journal articles.

First, I think I'd qualify Peter's point that:

"Prior to this addition the only scheme available was Dublin Core, which as a metadata schema for describing article content is woefully inadequate. (Dublin Core, of course, was never designed to handle the complexity of the description of an average article.)"

I think the reference here to "Dublin Core" is really to the specific "DC application profile" (or description set profile, as we are starting to refer to these things) commonly known as "Simple DC", i.e. the use of (only) the 15 properties of the Dublin Core Metadata Element Set with literal values, for which the oai_dc XML format defined by the OAI-PMH spec provides a serialisation. On that basis, I'd be inclined to agree that the Simple DC profile is not the tool for the task at hand: the Simple DC profile is intended to support simple, general descriptions of a wide range of resources, and it doesn't in itself offer the "expressiveness" that may be required to support all the requirements of individual communities, or more detailed description specific to particular resource types.
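To make the limitation concrete, here is a minimal sketch of a Simple DC description serialised in the oai_dc format, built with Python's standard library. The two namespace URIs are the ones used by oai_dc and the DCMES; the article, its authors and the `simple_dc_record` helper are invented for illustration. Note how the journal title, volume, issue and pages all have to be squeezed into a single literal.

```python
import xml.etree.ElementTree as ET

OAI_DC = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("oai_dc", OAI_DC)
ET.register_namespace("dc", DC)


def simple_dc_record(fields):
    """Serialise (property, literal) pairs as an oai_dc:dc element.

    Hypothetical helper: Simple DC allows only the 15 DCMES
    properties, each with a plain literal value.
    """
    root = ET.Element(f"{{{OAI_DC}}}dc")
    for prop, literal in fields:
        child = ET.SubElement(root, f"{{{DC}}}{prop}")
        child.text = literal
    return root


# Hypothetical article, invented for illustration only.
article = simple_dc_record([
    ("title", "An example article"),
    ("creator", "Smith, A."),
    ("creator", "Jones, B."),
    # Journal, volume, issue and pages collapse into one string.
    ("source", "Example Journal, 12(3), 45-67"),
    ("date", "2007"),
])

xml_text = ET.tostring(article, encoding="unicode")
print(xml_text)
```

A consumer that wants the volume or the page range has to parse that `dc:source` string by convention, which is the kind of gap a richer profile is meant to close.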

However, the framework of the DCMI Abstract Model offers the sort of extensibility that enables communities to develop other profiles to meet those requirements for richer, more specific descriptions.

I guess DCMI still has its work cut out to try to convey the message that "Dublin Core" doesn't begin and end with the DCMES.

But perhaps more specifically pertinent to the topic of the DOAJ format is the fact that the work carried out last year on the ePrints DC Application Profile, led by Andy and Julie Allinson of UKOLN, applied exactly this approach to the area of scholarly works, including journal articles. From the outset, the initiative recognised that the Simple DC profile was insufficient to meet the requirements which had been articulated, and shifted its focus to the development of a new profile, based on applying a subset of the FRBR entity-relationship model to the "eprint" domain.

I haven't yet compared the DOAJ format and the ePrints DCAP closely enough to say whether the latter would support the representation of all the information represented by the former. I guess it's quite likely that the two initiatives were simply not aware of each other's efforts. Or it may be that the DOAJ folks felt that the ePrints DCAP was more complex than they needed for the task at hand.

But it is a pity that we seem to have ended up with two specs, developed at almost the same time and applying to pretty much the same "space", leaving implementers who harvest data from multiple providers with the likelihood of needing to work across both.

(Hmmm, it occurs to me that a quick spot of GRDDL-ing might make that less painful than it appears... Watch this space.)




I think we might just have to accept the fact that there can be no single spec in this space. I'm in fact involved in a third effort based on RDF. Our scope is in some sense much broader than these two efforts (the entirety of scholarly citation), while in other senses still narrow (to my mind, the focus is not libraries or repositories, but enabling end-user scholarship with tools like Zotero).

If you look at examples of the two that you mention here, I wouldn't really blame the DOAJ people for not using the ePrints work. If you compare the syntax (which is what most developers look at), the DOAJ format is clear and straightforward, and easy to process with standard XML tools. The ePrints stuff as encoded in the DCMI abstract model XML syntax is not (it is complex, obtuse, etc.).

The issue of the right target complexity in modeling and syntax is a hard problem. In doing the RDF work, I'm constantly tripped up on one particularly annoying problem: the fact that contributions are often ordered. XML is of course ordered by default; RDF (and the DCMI AM?) is not. The result is that we're headed toward using blank nodes to represent contributions, which I absolutely hate to do, but I see no other reasonable way.
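The ordering problem can be sketched without any RDF library, treating a graph as a plain Python set of triples. This is only an illustration under stated assumptions: the `ex:` property names, the `_:c1`-style node labels and both helper functions are hypothetical, and are not taken from the RDF effort described above or from any published vocabulary.

```python
# A graph as a set of (subject, predicate, object) triples.
# Like an RDF graph, a set carries no ordering of its members.
flat = {
    ("ex:article", "ex:creator", "Smith, A."),
    ("ex:article", "ex:creator", "Jones, B."),
}
# Nothing in `flat` records which creator was listed first.


def contribution_triples(article, names):
    """Give each contribution its own blank-node-style resource
    that carries both the agent and an explicit position."""
    triples = set()
    for position, name in enumerate(names, start=1):
        node = f"_:c{position}"  # hypothetical blank node label
        triples.add((article, "ex:contribution", node))
        triples.add((node, "ex:agent", name))
        triples.add((node, "ex:position", position))
    return triples


def ordered_names(triples, article):
    """Recover the author order from the explicit positions."""
    def lookup(node, predicate):
        return next(o for s, p, o in triples
                    if s == node and p == predicate)

    nodes = [o for s, p, o in triples
             if s == article and p == "ex:contribution"]
    nodes.sort(key=lambda node: lookup(node, "ex:position"))
    return [lookup(node, "ex:agent") for node in nodes]


graph = contribution_triples(
    "ex:article", ["Smith, A.", "Jones, B.", "Lee, C."])
print(ordered_names(graph, "ex:article"))
```

The cost is visible in the triple count: three authors need nine triples and three intermediate nodes, where an ordered XML format would simply list three elements in sequence.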

As I say, then, I think we're going to have to be content with a variety of different formats, for different kinds of communities. Perhaps as you note, though, GRDDL will indeed allow them to be better harmonized.

Hey, Pete,

You are, of course, quite correct in pointing out my sloppy use of the phrase "Dublin Core" -- one suspects we're going to have the same problem with "OAI" versus "OAI-PMH" and "OAI-ORE" when the latter makes its debut.

I'm sure the DOAJ folks will chime in if I get this wrong, but the DOAJ article schema first came into existence to support their efforts to get article metadata from OA journal publishers for their (relatively new) articles index. It isn't clear to me whether DOAJ is using OAI-PMH harvesting to get that article metadata from journal publishers, but I get the sense from conversations with them that it isn't. Since they had that schema lying around, they put it into their OAI-PMH repository.

I'll be interested in your comparison posting, should you find time to put it together...

I too leapt on the "Dublin Core" comment and am glad to see what Peter really meant. I was intrigued by Bruce's comment about the ordering of articles. I've spent a fair amount of time this past week pondering the future of small print journals, and I'm curious what work is being done in the space of providing... I don't know how to describe it... an ability to reclaim the intellectual integrity of a collection of articles known as an "issue."



