
May 08, 2009

The Nature of OAI, identifiers and linked data

In a post on Nascent, Nature's blog on web technology and science, Tony Hammond writes that Nature now offer an OAI-PMH interface to articles from over 150 titles dating back to 1869.

Good stuff.

Records are available in two flavours - simple Dublin Core (as mandated by the protocol) and Prism Aggregator Message (PAM), a format that Nature also use to enhance their RSS feeds.  (Thanks to Scott Wilson and TicTocs for the Jopml listing).

Taking a quick look at their simple DC records (example) and their PAM records (example) I can't help but think that they've made a mistake in placing a doi: URI rather than an http: URI in the dc:identifier field.

Why does this matter?

Imagine you are a common-or-garden OAI aggregator.  You visit the Nature OAI-PMH interface and you request some records.  You don't understand the PAM format so you ask for simple DC.  So far, so good.  You harvest the requested records.  Wanting to present a clickable link to your end-users, you look to the dc:identifier field only to find a doi: URI:

doi:10.1038/nature01234

If you understand the doi: URI scheme you are fine because you'll know how to convert it to something useful:

http://dx.doi.org/10.1038/nature01234

But if not, you are scuppered!  You'll just have to present the doi: URI to the end-user and let them work it out for themselves :-(
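For what it's worth, the conversion a DOI-aware aggregator has to perform is tiny. The following is only a sketch: the function name is made up for illustration, the resolver base is the one quoted above, and the percent-encoding step is my own assumption about how to treat DOIs containing unusual characters.

```python
from urllib.parse import quote

# Resolver base as quoted in the post.
DOI_RESOLVER = "http://dx.doi.org/"


def doi_to_http(identifier: str) -> str:
    """Turn 'doi:10.1038/nature01234' (or a bare '10.1038/...') into an http: URI."""
    value = identifier.strip()
    # Drop the 'doi:' scheme prefix if it is present.
    if value.lower().startswith("doi:"):
        value = value[4:]
    # Every DOI name starts with the '10.' directory indicator.
    if not value.startswith("10."):
        raise ValueError(f"not a DOI: {identifier!r}")
    # Assumption: percent-encode anything unusual, keeping '/' readable.
    return DOI_RESOLVER + quote(value, safe="/")


print(doi_to_http("doi:10.1038/nature01234"))
# prints http://dx.doi.org/10.1038/nature01234
```

Trivial, yes, but only for software that already knows what a doi: URI is, which is exactly the problem.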

Much better for Nature to put the http: URI form in dc:identifier.  That way, any software that doesn't understand DOIs can simply present the http: URI as a clickable link (just like any other URL).  Any software that does understand DOIs, and that desperately wants to work with the doi: URI form, can do the conversion for itself trivially.
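To make the aggregator's position concrete, here is a rough sketch of what a harvester that knows nothing about DOIs might do with a simple DC record. The records are made up for the demonstration (they are not copied from Nature's interface) and both function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional

DC_NS = "http://purl.org/dc/elements/1.1/"
OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"


def make_record(identifiers: List[str]) -> str:
    """Build a tiny, made-up oai_dc record (values are not XML-escaped)."""
    fields = "".join(f"<dc:identifier>{value}</dc:identifier>" for value in identifiers)
    return (
        f'<oai_dc:dc xmlns:oai_dc="{OAI_DC_NS}" xmlns:dc="{DC_NS}">'
        f"<dc:title>An example article</dc:title>{fields}</oai_dc:dc>"
    )


def first_clickable_identifier(record_xml: str) -> Optional[str]:
    """Return the first dc:identifier that is an http(s) URI, or None."""
    root = ET.fromstring(record_xml)
    for element in root.iter(f"{{{DC_NS}}}identifier"):
        value = (element.text or "").strip()
        if value.startswith(("http://", "https://")):
            return value
    return None


# A record carrying only the doi: form gives the aggregator nothing to link to.
print(first_clickable_identifier(make_record(["doi:10.1038/nature01234"])))
# prints None

# A record carrying the http: form just works.
print(first_clickable_identifier(make_record(["http://dx.doi.org/10.1038/nature01234"])))
# prints http://dx.doi.org/10.1038/nature01234
```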

Of course, Nature could simply repeat the dc:identifier field and offer both the http: URI form and the doi: URI form side-by-side.  Unfortunately, this would run counter to the W3C recommendation not to mint multiple URIs for the same resource (section 2.3.1 of the Architecture of the World Wide Web):

A URI owner SHOULD NOT associate arbitrarily different URIs with the same resource.

On balance I see no value (indeed, I see some harm) in surfacing the non-HTTP forms of DOI:

10.1038/nature01234

and

doi:10.1038/nature01234

both of which appear in the PAM record (somewhat redundantly?).

The http: URI form

http://dx.doi.org/10.1038/nature01234

is sufficient.  There is no technical reason why it should be perceived as a second-class form of the identifier (e.g. on persistence grounds).

I'm not suggesting that Nature gives up its use of DOIs - far from it.  Just that they present a single, useful and usable variant of each DOI, i.e. the http: URI form, whenever they surface them on the Web, rather than provide a mix of the three different forms currently in use.

This would be very much in line with recommended good practice for linked data:

  • Use URIs as names for things
  • Use HTTP URIs so that people can look up those names.
  • When someone looks up a URI, provide useful information.
  • Include links to other URIs, so that they can discover more things.


Comments

I agree with the point that for the (non-DOI-aware) majority of consuming applications, surfacing the http URI is more "useful" than surfacing the non-http DOI.

I guess I'd probably take a softer line on surfacing the non-http doi alongside the http DOI.

I don't really think we can ignore the fact that those identifiers are "out there". So if we can provide the information that tells a non-DOI-aware application (i.e. one that doesn't know anything about the mapping between the two forms) that http://dx.doi.org/doi:10.1038/nature01234 and doi:10.1038/nature01234 both identify the same resource, then that means that "dumb" application can infer that any info it has about the thing identified by http://dx.doi.org/doi:10.1038/nature01234 also applies to the thing identified by doi:10.1038/nature01234 .
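As a rough illustration of that inference (a sketch only: the function name, the data and the property names are invented, and a real application would use an RDF toolkit and owl:sameAs statements rather than Python tuples), a "dumb" application could pool what it knows about each alias like this:

```python
def merge_descriptions(descriptions, same_as_pairs):
    """Pool the facts known about URIs that are declared to be equivalent.

    descriptions:  dict mapping a URI to a set of (property, value) facts
    same_as_pairs: iterable of (uri_a, uri_b) equivalence statements
    """
    parent = {}

    def find(uri):
        # Union-find lookup with path halving.
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]
            uri = parent[uri]
        return uri

    for uri_a, uri_b in same_as_pairs:
        parent[find(uri_a)] = find(uri_b)

    # Collect the facts for each equivalence class under its representative.
    pooled = {}
    for uri, facts in descriptions.items():
        pooled.setdefault(find(uri), set()).update(facts)

    # Every known URI gets a copy of the pooled facts of its class.
    return {uri: set(pooled.get(find(uri), set())) for uri in list(parent)}


http_uri = "http://dx.doi.org/10.1038/nature01234"
doi_uri = "doi:10.1038/nature01234"

merged = merge_descriptions(
    {
        http_uri: {("dc:title", "An example article")},
        doi_uri: {("dc:creator", "A. N. Author")},
    },
    [(http_uri, doi_uri)],
)
print(merged[doi_uri] == merged[http_uri])
# prints True
```

The application never needs to know how the two forms map onto each other; it only needs to be told that they are equivalent.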

I agree that in an ideal world, it would be better to have avoided having the aliases, but given that they are out there, and there is data using the URI doi:10.1038/nature01234, then that horse has probably bolted, and I suspect we have to find the best ways of managing them.

I don't think that the argument that the horse has bolted is a good one for continuing to mint, use and share multiple URIs for the same thing.

The point is that we don't have to keep pushing out new stuff with multiple URIs associated with it. The sooner we stop - the less mess there will be to clean up?

The situation is actually worse than I feared because both

http://dx.doi.org/10.1038/nature01234

and

http://dx.doi.org/doi:10.1038/nature01234

resolve to the same thing.

For a single agent to create multiple URIs for the same resource is harmful. Full stop! It is harmful to scholarly communication, which is what we are supposed to be supporting.

If one person uses the first of these as a bookmark on del.icio.us and another person uses the second, the two records will not be recognised as being about the same resource :-( Communication will be broken. This is not a step forward.

There are situations where multiple URIs for the same thing are unavoidable - typically where multiple agents are working separately but want to describe the same thing. But that is not the situation here.

If you can stand it, there was some discussion on the code4lib mailing list that touched on the use of http URIs vs other types of URIs (specifically info: URIs)

http://www.mail-archive.com/code4lib@listserv.nd.edu/msg05010.html

http://www.mail-archive.com/code4lib@listserv.nd.edu/msg05018.html

and others

Hi Andy:

I will aim to make a proper response to your post since you raise some interesting points. Meantime I just wanted to address your comment regarding the putative URI

http://dx.doi.org/doi:10.1038/nature01234

This as far as I know has *never* been used or advocated.

One should not confuse the functionality of a service access point with an identifier. The DOI resolver is merely applying Postel's Law in being liberal with what it receives.

This looks to me like your own cooked-up construct and not one that the CrossRef community would recognize or endorse.

Cheers,

Tony

Thanks Tony. Actually, I copy-and-pasted it from the HTML view of the simple DC OAI record you serve at

http://www.nature.com/oai/request?verb=GetRecord&identifier=10.1038/nature01234&metadataPrefix=oai_dc

Mouse-over the dc:identifier field and you'll see what I mean - unless I'm doing something stupid.

Lol...

Well that will be a bug then. :))

What do you want? A finder's fee? (We may not be as confident as DEK in terms of paying out.)

One argument I can think of is that by publishing only the identifier, without a service connected to it, a party wanting to reuse the data is free to connect whatever service they want. An example of this is the use of a local OpenURL server to redirect users first to a local copy, if available, instead of the copy registered at doi.org, to which the user might not have access.

Marco,
there is nothing in the 'http' prefix of the http URI that says, "this must be dereferenced using HTTP". In that sense, there is no single 'service' permanently associated with the http URI - rather, there happens to be a current, and very helpful, default de-referencing mechanism available.

At the point that HTTP dies, which it surely will at some point, people will build alternative de-referencing mechanisms (which might be based on Handle, or LDAP, or whatever replaces HTTP). The reason they will have to build something else is that the weight of deployed http URIs will demand it.

That's the reasoning behind my, "the only good long term identifier, is a good short term identifier" mantra.

I was thinking more of alternative services also based on HTTP. A reusing party might use the DC record to direct its users via the URL http://localOpenURLserver/resolve?ID=doi:10.1038/nature01234 to a copy of the article at the local repository or some aggregator. I know that's also possible by parsing the identifier out of the DOI URL but that doesn't seem logical to me.

Marco,

Couldn't the reusing party do exactly the same thing simply using the http URI, i.e. something like

http://localOpenURLserver/resolve?ID=http://dx.doi.org/10.1038/nature01234

Pete

I find this a very interesting discussion, and I would like to make a few observations:

(*) In principle, I agree with PeteJ that one shouldn't make a sport of minting multiple identifiers for the same resource. But the DOI people would not be the only ones to do so. Actually, it is happening abundantly in the Linked Data effort of which Tim Berners-Lee himself is a strong advocate. As PeteJ indicates, this can leave one with a significant (de-duplication) mess to clean up. As a matter of fact, my main take-away from the LDOW2009 Workshop in Madrid was that one of the major challenges the Linked Data effort faces is related to exactly this problem: too many URIs for the "same" thing floating around, and only one (very strong) relationship (owl:sameAs) to conflate (graph-merge) them.

(*) Now, I don't think the DOI people are causing that big a problem by using both http://dx.doi.org/10.1038/nature01234 and doi:10.1038/nature01234 to refer to the same article. A few reasons:

- There is a use case for both variants: the HTTP variant is the one to use in a Web context (for obvious reasons; so totally agreed with both Andy and PeteJ on this), and the doi: variant is technology independent (and publishers want that, I think, if only to print it on paper).

- Both identifiers are, in essence, minted within the boundaries of the same (distributed) organization. That is in contrast to all those Linked Data URIs that are minted by unconnected entities. Obviously, it is considerably more straightforward for the DOI people to "clean up their own mess" (to use PeteJ's words) internally than it is to do so in a setting of unconnected entities that mint different URIs for the same thing. Their problem starts with trying to figure out, preferably in automated ways, whether two URIs indeed stand for the same thing. It's quite a problem, as the paper "Managing Co-Reference on the Semantic Web" (http://events.linkeddata.org/ldow2009/papers/ldow2009_paper11.pdf) illustrates. The DOI people don't have this problem: they just _know_ about the relationship between those two identifiers. Hence, it is straightforward for the DOI people to express a form of equivalence between http://dx.doi.org/10.1038/nature01234 and doi:10.1038/nature01234. Which, I feel, they should do.

- As a matter of fact, with ORE, we did some thinking on behalf of the DOI people (arrogant we are!). We thought (rightfully so, I guess) that owl:sameAs would not be appropriate as an equivalence expression because way too strong. Just think about the multiple copy problem (multiple copies of a paper with the same identifier residing in multiple repositories - same doi: identifier different http:// identifier) and it's kind of clear that owl:sameAs doesn't do it. I happily refrain from more philosophical arguments. Anyhow, for the DOI case and for cases like it, we introduced a weaker equivalence relationship named ore:similarTo. And one of the main uses we saw for it is to point (in Resource Maps) from the HTTP URI for a resource to the non-HTTP URI of it. The DOI identifiers totally fit the bill. Well, almost. See below.

(*) There is, however, an elephant in the room: doi:10.1038/nature01234 is not a URI in any registered URI scheme. It is an identifier, not a URI. Don't get me wrong: it is totally valid to use it as a value of dc:identifier because the description of that DC element says: "Recommended best practice is to identify the resource by means of a string conforming to a formal identification system. " So, identifiers that are not URIs are allowed. But one still needs to wonder why this variant is included in the Nature metadata records, and not its info-URI version info:doi/10.1038/nature01234? Maybe exactly because of what I mentioned above: doi:10.1038/nature01234 is a technology-neutral variant, whereas both info:doi/10.1038/nature01234 and http://dx.doi.org/10.1038/nature01234 are (URI) technology based? Maybe because doi:10.1038/nature01234 is what publishers want to see appear in citations made by journal articles and in print? Whatever the reason, the equivalence that I referred to above could not be expressed between http://dx.doi.org/10.1038/nature01234 and doi:10.1038/nature01234 (or at least not using ore:similarTo) because the latter does not identify a web resource because it is not a URI.

(*) To cut what has become a very long story short, I would suggest that both the doi:10.1038/nature01234 and the http://dx.doi.org/10.1038/nature01234 variants be included in dc:identifier fields. Why not? It allows for technology-based and technology-neutral applications, pleasing both users of scholarly information that continuously live in a technology-based world, and publishers that ... I don't know ...

Disclaimer: I would like to end by stating that I am (and have always been) a big fan of the DOI system. The infrastructure that was deployed around DOIs has demonstrated that it has flexibility that allows it to be leveraged in many different ways, cf CrossRef's OpenURL DOI lookup service, redirection of DOI (HTTP version) dereferencing requests to institutional linking servers, etc. I just wish they would now start looking into ORE. As I told Tony Hammond in a private e-mail exchange, the http://dx.doi.org/10.1038/nature01234 URI would make a perfect ORE Aggregation URI. And we could make machine-readable variants for all those publisher splash-pages. And then we would be able to ... Well that is another story.

@Pete: As Marco already suggested, using HTTP URIs for DOIs would be possible from a technical perspective. But the OpenURL 0.1 specification (e.g. http://www.exlibrisgroup.com/?catid=%7B2F09A3E3-A22B-4CB1-8434-930CDB7264A7%7D) encourages using the DOI value only. Therefore, I wouldn't expect that many link resolvers support such detection mechanisms.

OpenURL 1.0 meanwhile established URIs for all resource identifiers, thus the usage of HTTP URIs is valid as well. However, providing a DOI as an HTTP URI would still require some additional logic to parse the DOI value from the URL.
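To give a feel for how much "additional logic" that is, here is a minimal sketch. The function names are hypothetical, the resolver base is the placeholder http://localOpenURLserver/resolve used earlier in this thread, and only the dx.doi.org host mentioned in the post is recognised.

```python
from typing import Optional
from urllib.parse import unquote, urlencode, urlsplit

# Only the resolver host mentioned in the post is recognised here.
DOI_HOSTS = {"dx.doi.org"}


def doi_from_http_uri(uri: str) -> Optional[str]:
    """Extract the bare DOI value from an http: DOI URI, or return None."""
    parts = urlsplit(uri)
    if parts.scheme not in ("http", "https") or parts.hostname not in DOI_HOSTS:
        return None
    value = unquote(parts.path.lstrip("/"))
    # Tolerate the '/doi:10...' path variant that the resolver also accepts.
    if value.lower().startswith("doi:"):
        value = value[4:]
    return value if value.startswith("10.") else None


def local_resolver_link(uri: str, base: str = "http://localOpenURLserver/resolve") -> Optional[str]:
    """Build a link to a (placeholder) local resolver for the DOI in `uri`."""
    doi = doi_from_http_uri(uri)
    if doi is None:
        return None
    # urlencode percent-encodes the ':' and '/' inside the query value.
    return base + "?" + urlencode({"ID": "doi:" + doi})


print(doi_from_http_uri("http://dx.doi.org/10.1038/nature01234"))
# prints 10.1038/nature01234
print(local_resolver_link("http://dx.doi.org/10.1038/nature01234"))
```

Note that a DOI containing "?" or "#" would need to arrive percent-encoded in the URI for this to work, which is one more thing a link resolver would have to get right.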

Herbert,
on the multiple-URIs thing... as I said in my response to Pete, the wider case of people minting multiple URIs for the same resource because they are working independently is understandable and unavoidable. You suggest that the case of a DOI owner (inadvertently) minting multiple URIs simply by virtue of assigning a DOI to something is more 'straightforward' because the DOI system 'knows' how to "clean up the mess". I accept this is true. But it is only really true within the bounds of the DOI system (i.e. within the bounds of those people who know something about doi URIs), at least currently. It is not (yet) true on the wider Web because the Web-scale way of solving this problem is via OWL assertions, which is not a widely deployed solution. So, for the time being at least, the DOI system is polluting the Web with multiple forms of URIs, with no real, deployable way of tidying up the mess right now.

Note: In some ways the discussion about whether doi:... and http://dx.doi.org/... identify the same thing is a moot point? DOI-folk might well suggest that they do not (I think Tony Hammond has suggested this to me in the past). The real point is not, "is there one thing with two URIs (doi and http) or two things, each with its own URI?". Rather, "if there are two things here, let's give them both a separate http URI".

On the issue of whether doi:... is a URI or not - here I think we do disagree. In a sense, it is up to the DOI-folk to tell us whether it is a URI or not - it's not for us to say from the outside. The DOI Handbook makes a number of assertions on this, including the following:

A specification for a DOI name as a URI exists as an Internet Draft: this document defines the 'doi' Uniform Resource Identifier (URI) scheme for DOI names, which allows a DOI name to be referenced by a URI for Internet applications. The current revision of the URI specification, plus ongoing debate within the IETF and W3C communities on several proposed URI specifications, have delayed the processing of this Draft. DOI System implementation does not depend on implementation of this specification.

http://www.doi.org/handbook_2000/enumeration.html#2.9.2

I don't know if I read that to be a categorical statement that doi:... is a URI but I think it comes pretty close. The doi: URI certainly complies with the URI specification.

On that basis, I'd say that the doi URI does fall within the bounds of this discussion *as a URI*. The fact that it is not a registered URI scheme just makes the situation worse (but is not a justification for saying it is not a URI).

In passing, I don't understand what you mean when you say "because the latter [doi:...] does not identify a web resource because it is not a URI". Presumably non-URIs can identify Web resources, just as URIs can identify non-Web resources? Things identify what people say they identify! :-)

Note: like you I have nothing against DOIs per se, nor the underlying rationale for the work on the info URI - I only disagree with the implementation choice of inventing new (and in the case of DOI, multiple) URI schemes where the use of http URIs would have sufficed and would have been significantly better for the coherence of the wider Web.

@herbert just found another post about the inappropriateness of owl:sameAs (and alternatives for it) which you may find interesting: http://apassant.net/blog/2009/05/17/inconsistencies-lod-cloud
