WorldCat Institution Registry and Identifiers
(Or: PeteJ bangs on about http URIs again!)
In a couple of recent posts, Lorcan highlights the availability of a new WorldCat Institution Registry service provided by OCLC. The registry stores, manages and makes available descriptions of institutions, mainly libraries and library consortia, and I agree with Lorcan that this is potentially a very valuable service.
Lorcan's second post cites a post by Ed Summers in which Ed notes:
If you drill into a particular institution you’ll see a pleasantly cool uri:
…which would serve nicely as an identifier for the Browne Popular Culture Library. The institution pages are HTML instead of XML–however there is a link to an XML representation:
http://worldcat.org/webservices/registry/content/Institutions/89073
This URL isn’t bad but it would be rather nice if the former could return XML if the Accept: header had text/xml slotted before text/html.
As I was rushing out the door last night, I posted a comment to Ed's post expressing my whole-hearted agreement with both of these points, but also expressing a degree of concern at another aspect of the use of identifiers I'd observed in the data exposed by the registry. As far as I can see, my comment doesn't seem to have made it through to Ed's weblog (yet, anyway - maybe Ed moderates his comments and I'm just being impatient, or maybe - more likely - in my haste to get to the pub I clicked the wrong button on his submission form!) so I'm posting a note here.
Like Ed, my immediate reaction was that the first URI above would make a good identifier for the institution. An http URI can be used to identify a resource of any type, so no problem with the choice of URI scheme. The form of the URI itself seems to take into account all of Berners-Lee's suggestions for "coolness". Is it persistent? Well, I guess that depends on OCLC's commitment to maintaining their ownership of the worldcat.org domain and to managing the URIs they assign within that URI-space, e.g. by not assigning the URI above to a different institution (or indeed to a document or a DDC term or a research project or something else entirely) at some point in the future. I haven't checked whether OCLC publish any policy statement about their commitment to these URIs. If they do, that would contribute to my faith in their having some degree of persistence, but for now I'll take it on trust that OCLC will manage this URI-space in a controlled manner, and that these URIs have some reasonable degree of persistence.
Furthermore, the URI is reasonably concise and human-readable/citable: I could scribble one of those URIs down in an email or read one over the phone.
And the URI also has the helpful characteristic that, if I am given (by email, on a piece of paper, in a telephone conversation) only the URI, then with no additional information, I can "look up" that URI using a tool which supports the HTTP protocol (e.g. my desktop Web browser) and get back some information, provided by the owner of the URI (OCLC), about the resource identified by the URI, in the form of an HTML document. Or, in the terms of the Web Architecture, I can dereference the URI and obtain a representation of the state of the resource identified by the URI. This functionality is available to me because there is a message protocol which makes use of the http URI scheme, and that protocol is widely supported in software tools.
And that representation (well, OK, a second HTML page linked from it) also contains the URIs of other related institutions, so that I can follow hyperlinks from this institution to another - and obtain representations of those other institutions too. For example (choosing a different institution from Ed's examples to better highlight the hyperlinking), Bowling Green State University is related to a number of "branches" and I can obtain a representation of the list of branches which includes the URIs of those branch institutions.
And as Ed notes, it would be useful if OCLC added support for HTTP content negotiation so that if my software application obtained the URI, it could obtain a representation in a format which made explicit the semantics that the HTML document conveys to a human reader e.g. an instance of the WorldCat XML format, as is available by dereferencing the second URI above.
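To make the content negotiation idea a little more concrete, here is a minimal sketch in Python of how a server might pick between the HTML and XML representations from the Accept header. This is purely illustrative and says nothing about how OCLC's service is built: the function names are my own inventions, and using header order as a tie-break between equally weighted media types is a simplification of what HTTP actually specifies.

```python
def parse_accept(header):
    """Return (media_type, q) pairs from an Accept header, most preferred first."""
    prefs = []
    for position, item in enumerate(header.split(",")):
        parts = [p.strip() for p in item.split(";")]
        media_type = parts[0].lower()
        if not media_type:
            continue
        q = 1.0  # default weight when no q parameter is given
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed weight as "not acceptable"
        prefs.append((media_type, q, position))
    # Highest q first; header order breaks ties (a simplification).
    prefs.sort(key=lambda p: (-p[1], p[2]))
    return [(m, q) for m, q, _ in prefs]


def choose_representation(accept_header, available=("text/html", "text/xml")):
    """Pick the media type to serve, or None if nothing acceptable is available."""
    if not accept_header.strip():
        return available[0]  # no stated preference: serve the default
    for media_type, q in parse_accept(accept_header):
        if q <= 0:
            continue
        if media_type in available:
            return media_type
        if media_type == "*/*":
            return available[0]
        if media_type.endswith("/*"):
            prefix = media_type[:-1]  # e.g. "text/"
            for candidate in available:
                if candidate.startswith(prefix):
                    return candidate
    return None
```

With something like this in place, a browser sending a typical HTML-first Accept header would still get the HTML page, while an application asking for text/xml first would get the XML instance from the same URI.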
So far, so good. Better than good. Great stuff, in fact.
However, when I looked a bit harder at the WorldCat XML representation, my heart sank slightly. Here's an (edited, reformatted, emphasis added) version of one of those XML instances (minus XML Namespace declarations, XML schema location declarations etc):
<institution>
  <identifier>info:rfa/oclc/Institutions/2226</identifier>
  <nameLocation>
    <institutionName>Bowling Green State University</institutionName>
    <!-- stuff snipped -->
  </nameLocation>
  <!-- stuff snipped -->
  <branches>
    <branch>info:rfa/oclc/Institutions/73034</branch>
    <branch>info:rfa/oclc/Institutions/88815</branch>
    <branch>info:rfa/oclc/Institutions/88840</branch>
    <branch>info:rfa/oclc/Institutions/89072</branch>
    <branch>info:rfa/oclc/Institutions/89073</branch>
  </branches>
  <!-- stuff snipped -->
</institution>
What struck me here was that, in the XML representation, a different set of URIs is exposed, a set of URIs using the info URI scheme. Certainly, these URIs share some of the (useful) characteristics that I listed above for the http URIs. They are pretty much as concise, human-readable/citable as the http URIs. My faith in their persistence is subject to pretty much exactly the same considerations as above: will the owners of the "rfa" info namespace continue to own that namespace, and will they manage the assignment of URIs within that namespace so that a single URI is not assigned to two different resources over time? (Hmm, actually, I don't see an entry for "rfa" in the info URI Namespace Registry, so I guess that does raise some doubts in my mind about the answer to the first of those questions!)
Where the info URIs aren't so useful, of course, is that I can't - at least without some additional information - take one of those URIs that has been given to me by email, by telephone etc, and obtain a representation of the resource. While the info URI scheme shares some of the characteristics of the http URI scheme as an identification mechanism, there is no global, widely deployed mechanism for dereferencing an info URI. The scheme does provide for individual "namespace authorities" to specify dereferencing mechanisms for URIs on a "per namespace" basis, and to disclose those methods via the info URI Namespace Registry, but clearly that introduces additional complexity - and ultimately cost - for dereferencing, at least in comparison with the http URI scheme. (And as I note above, there is currently no entry in the registry for the "rfa" namespace, so I don't know whether there is any dereferencing mechanism available.)
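One rough way to see the difference is to look at what a generic URI parser can tell you about each form. The sketch below uses Python's standard urllib.parse module; the lookup_hint function is a hypothetical helper of my own, there only to illustrate the point, and is not part of any library.

```python
from urllib.parse import urlsplit


def lookup_hint(uri):
    """Say what, if anything, the URI alone tells a generic client about lookup."""
    parts = urlsplit(uri)
    if parts.scheme in ("http", "https") and parts.netloc:
        # The URI names a host, and HTTP says how to ask that host for a representation.
        return "HTTP GET to host " + parts.netloc
    # No authority component and no protocol implied by the scheme itself.
    return "no built-in lookup for scheme '" + parts.scheme + "'"


http_parts = urlsplit("http://worldcat.org/registry/Institutions/2226")
info_parts = urlsplit("info:rfa/oclc/Institutions/2226")
```

The http URI parses into a host to contact plus a path to request, whereas the info URI yields only a scheme and an opaque path: any knowledge of where to send a request has to come from somewhere outside the URI.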
So that raises the question of why the registry system should coin, assign and expose one set of identifiers for institutions in the XML representation (URIs using the info URI scheme, like info:rfa/oclc/Institutions/2226) and a second set of identifiers for the same institutions in the HTML representation (URIs using the http URI scheme, like http://worldcat.org/registry/Institutions/2226). To a human reader and to a software application those two URIs are different URIs, and there is no indication of whether or not they identify the same resource. Of course the URI owner could make available the information that they do in fact identify the same resource. Certainly that enables consumers of those two different URIs to establish that they are referring to the same resource, but it does so at the cost of additional complexity: every system that uses the URIs needs to process the two URIs as equivalents. The Web Architecture document counsels against such an approach for exactly these reasons:
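As a small illustration of that extra cost, here is roughly the sort of normalization step every consuming system would need. It is a sketch only, and it rests on a big assumption that nothing in the registry data confirms: that an info URI and an http URI ending in the same number are declared by the URI owner to identify the same resource. The prefixes are taken from the examples above; the function names and the "institution:" key format are hypothetical.

```python
# ASSUMPTION: the URI owner has stated that these two URI patterns are aliases
# whenever the trailing number matches. The registry data itself does not say so.
INFO_PREFIX = "info:rfa/oclc/Institutions/"
HTTP_PREFIX = "http://worldcat.org/registry/Institutions/"


def canonical_key(uri):
    """Map either alias to one internal key; leave unrecognized URIs untouched."""
    for prefix in (INFO_PREFIX, HTTP_PREFIX):
        if uri.startswith(prefix):
            local_id = uri[len(prefix):]
            if local_id.isdigit():
                return "institution:" + local_id
    return uri


def same_resource(uri_a, uri_b):
    """Compare two URIs after alias normalization."""
    return canonical_key(uri_a) == canonical_key(uri_b)
```

A plain string comparison treats the two URIs as different, so any application that merges data keyed on these identifiers has to carry a table or rule like this around with it, and keep it up to date.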
Good practice: Avoiding URI aliases
A URI owner SHOULD NOT associate arbitrarily different URIs with the same resource.
That's not to say that we don't have to deal with cases where we wish to assert that two URIs identify the same resource - of course we do and we have mechanisms to do so - but let's try to avoid making life unnecessarily difficult for ourselves!
Now, of course, it may well be that my question "why coin, assign and expose two identifiers for the same resource?" is based on a completely false assumption! It may be that the intent is that these two sets of identifiers identify not one single set of resources, but two different sets of resources e.g. the info URIs identify the institutions, and the http URIs identify documents which describe those institutions. And the fact that, when I dereference the http URI http://worldcat.org/registry/Institutions/2226, the server returns a status code of 200 suggests that this is indeed the case (because according to the W3C Technical Architecture Group's resolution to the httpRange-14 issue, that response code indicates that the resource is an "information resource"). That would be a very good argument for coining and exposing two sets of identifiers, though I think it also brings with it a requirement on the system which exposes those identifiers to make clear to users - both human users and software applications - exactly what resource is identified by each URI.
However, while it is a very good argument for using two different sets of URIs, I'm still not convinced that it is a good argument for choosing to use URIs based on the info URI scheme to identify the institution: as Ed pointed out, http URIs will do the job perfectly well - and they have the (hugely significant, I think) benefit that they can be dereferenced using widely available tools. If it is necessary to identify both the institution and the document then that could still be done using http URIs in both cases e.g.
Pattern 1:
- Document: http://worldcat.org/registry/Institutions/2226
- Institution: http://worldcat.org/registry/Institutions/2226#inst
Pattern 2:
- Document: http://worldcat.org/registry/Institutions/2226
- Institution: http://worldcat.org/registry/Institutions/2226/inst
(In this case, when the second URI is dereferenced the server could/should observe the rules specified by the resolution to the httpRange-14 issue, and use a response code of 303 to redirect to the document describing the institution. Incidentally, there's an excellent, clear, practical discussion of this and related issues in a paper by Leo Sauermann and Richard Cyganiak mentioned in a recent message to the Semantic Web Education and Outreach Interest Group mailing list.)
Pattern 3: (assuming a suitable top-level PURL domain is registered)
- Document: http://worldcat.org/registry/Institutions/2226
- Institution: http://purl.org/worldcat/registry/Institutions/2226
(The PURL server uses a response code of 302, rather than 303, so this approach wouldn't meet all the requirements of the TAG resolution mentioned above.)
Any of these patterns allows a clear distinction to be made between the identifier of the institution and the identifier of the document, while continuing to allow for the dereferencing of both sets of identifiers, using a protocol that is supported by widely-deployed tools (and it leaves room for the flexibility suggested by Ed of offering multiple representations via HTTP content negotiation).
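For what it's worth, the behaviour behind Patterns 1 and 2 can be sketched in a few lines of Python. This is a toy, not a description of any real server: the respond and document_for functions are hypothetical names, and the URI layout is simply the one used in the patterns above.

```python
from urllib.parse import urldefrag

BASE = "http://worldcat.org"
DOC_PREFIX = "/registry/Institutions/"


def respond(path):
    """Pattern 2: return (status, headers) for a request path."""
    if not path.startswith(DOC_PREFIX):
        return 404, {}
    rest = path[len(DOC_PREFIX):]
    if rest.isdigit():
        # The document describing the institution: an information resource.
        return 200, {"Content-Type": "text/html"}
    local_id, sep, tail = rest.partition("/")
    if local_id.isdigit() and sep and tail == "inst":
        # The institution itself: redirect to the document that describes it.
        return 303, {"Location": BASE + DOC_PREFIX + local_id}
    return 404, {}


def document_for(hash_uri):
    """Pattern 1: a client drops the fragment before making the request."""
    document_uri, _fragment = urldefrag(hash_uri)
    return document_uri
```

In Pattern 1 the server never sees the "#inst" part at all, so no special handling is needed on the server side; in Pattern 2 the 303 response is what tells the client that the URI it asked about is not itself a document.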
Hmm. I think I've taken a long time to say what Andy said more concisely back here!
Pete, sorry the commenting on my blog didn't seem to work out. Luckily I found your posting anyhow :-) Great points about the use of identifiers in the XML responses themselves. As the info-uri registrar I imagine there is a fair amount of pressure on oclc to actually use info-uris! I totally agree...I like the idea of not making things difficult for ourselves (or others), and rather making resolution of identifiers with familiar tools easy as pie. Also thanks for the link to Andy's post of last year on the same set of topics.
Posted by: Ed Summers | February 22, 2007 at 10:35 AM