July 20, 2009

On names

There was a brief exchange of messages on the jisc-repositories mailing list a couple of weeks ago concerning the naming of authors in institutional repositories. When I say naming, I really mean identifying, because a name, as in a string of characters, doesn't guarantee any kind of uniqueness - even locally, let alone globally.

The thread started from a question about how to deal with the situation where one author writes under multiple names (is that a common scenario in academic writing?) but moved on to a more general discussion about how one might assign identifiers to people.

I quite liked Les Carr's suggestion:

Surely the appropriate way to go forward is for repositories to start by locally choosing a scheme for identifying individuals (I suggest coining a URI that is grounded in some aspect of the institution's processes). If we can export consistently referenced individuals, then global services can worry about "equivalence mechanisms" to collect together all the various forms of reference that.

This is the approach taken by the Resist Knowledgebase, which is the foundation for the (just started) dotAC JISC Rapid Innovation project.

(Note: I'm assuming that when Les wrote 'URI' he really meant 'http URI').

Two other pieces of current work seem relevant and were mentioned in the discussion. Firstly, the JISC-funded Names project, which is working on a pilot Names Authority Service. Secondly, the RLG Networking Names report. I might be misunderstanding the nature of these bits of work but both seem to me to be advocating rather centralised, registry-like, approaches. For example, both talk about centrally assigning identifiers to people.

As an aside, I'm constantly amazed by how many digital library initiatives end up looking and feeling like registries. It seems to be the DL way... metadata registries, metadata schema registries, service registries, collection registries. You name it and someone in a digital library will have built a registry for it.

My favoured view is that the Web is the registry. Assign identifiers at source, then aggregate appropriately if you need to work across stuff (as Les suggests above). The <sameAs> service is a nice example of this:

The Web of Data has many equivalent URIs. This service helps you to find co-references between different data sets.

As Hugh Glaser says in a discussion about the service:

Our strong view is that the solution to the problem of having all these URIs is not to generate another one. And I would say that with services of this type around, there is no reason.
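To make the "assign at source, aggregate later" idea a little more concrete, here is a minimal sketch of how an aggregating service might bundle locally coined URIs into sets of co-references. It is not how the <sameAs> service is implemented (I don't know its internals); the function name and the example.org URIs are invented purely for illustration.

```python
from collections import defaultdict


def coreference_bundles(pairs):
    """Group URIs into bundles of equivalents from pairwise 'same as' claims.

    Uses a small union-find structure; no new identifier is coined for a
    bundle, it is just the set of URIs that already exist.
    """
    parent = {}

    def find(uri):
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]  # path halving
            uri = parent[uri]
        return uri

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    bundles = defaultdict(set)
    for uri in list(parent):
        bundles[find(uri)].add(uri)
    return list(bundles.values())


# Hypothetical URIs coined locally by three different repositories.
claims = [
    ("http://repo-a.example.org/person/42", "http://repo-b.example.org/staff/jsmith"),
    ("http://repo-b.example.org/staff/jsmith", "http://repo-c.example.org/id/9001"),
    ("http://repo-a.example.org/person/7", "http://repo-c.example.org/id/12"),
]
bundles = coreference_bundles(claims)
```

The point of the design is that each repository keeps its own identifiers; the aggregator only records which of them are claimed to refer to the same person.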

In thinking about some of the issues here I had cause to go back and re-read a really interesting interview by Martin Fenner with Geoffrey Bilder of CrossRef (from earlier this year). Regular readers will know that I'm not the world's biggest fan of the DOI (on which CrossRef is based), partly for technical reasons and partly on governance grounds, but let's set that aside for the moment. In describing CrossRef's "Contributor ID" project, Geoff makes the point that:

... “distributed” begets “centralized”. For every distributed service created, we’ve then had to create a centralized service to make it useable again (ICANN, Google, Pirate Bay, CrossRef, DOAJ, ticTocs, WorldCat, etc.). This gets us back to square one and makes me think the real issue is - how do you make the centralized system that eventually emerges accountable?

I think this is a fair point but I also think there is a very significant architectural difference between a centralised service that aggregates identifiers and other information from a distributed base of services, in order to provide some useful centralised function, and a centralised service that assigns identifiers which it then pushes out into the wider landscape. It seems to me that only the former makes sense in the context of the Web.

October 03, 2007

Names

I spent last Thursday in London at a meeting of the (rather grandly named!) Expert Panel of the JISC-funded Names Project, the principal partners of which are MIMAS at the University of Manchester and The British Library. The project's aims are

to scope the requirements of UK institutional and subject repositories for a service that will reliably and uniquely identify names of individuals and institutions.

and

[...] to develop a prototype service which will test the various processes involved. This will include determining the data format, setting up an appropriate database, mapping data from different sources, populating the database with records and testing the use of the data.

The project is managed on behalf of MIMAS by Amanda Hill (from her new homestead in rural Ontario!), and Amanda led the meeting on Thursday. She concentrated on presenting three documents, which I think should all be available from the project Web site shortly: a project plan, a review of the "name authority files" landscape, and a small set of "usage scenarios" that the project might seek to support. There are certainly some issues to consider anyway.

The "landscape" document, by Amanda and Alan Danskin & Richard Moore of the BL, summarises some of the standards and specifications used for the representation of descriptions of persons and organisations, and some of the existing systems and services that hold and make available such data. The document concentrates exclusively on (what I think of as) fairly "formal" sources of data (like the Library of Congress/NACO Names Authority File and OCLC's WorldCat Identities), and excludes sources such as Wikipedia - though it may well be the case that Wikipedia's coverage of many of the persons and institutions of interest in this context is limited.

One of the issues that came up quite early in the meeting was that of the constraints imposed by the legal context within which the project is operating. Given the project's focus on supporting - not exclusively, but primarily - systems that deal largely with works created by living individuals, the storage and use of information about these persons is typically covered by legislation - in the UK, by the Data Protection Act and related legislation. Two of the core principles of the DPA are that:

  • Data may only be used for the specific purposes for which it was collected.
  • (Subject to some qualifications) data must not be disclosed to other parties without the consent of the individual whom it is about.

Further, there are limitations on the jurisdictions within which the information can be transferred.

There are probably implications here for the Names project, both in terms of obtaining permission to use existing data sources, and in terms of addressing the DPA requirements for the data Names itself holds. Names is funded under the JISC Shared Infrastructure Services programme. Typically these services aren't primarily in the business of providing "user-facing" functions; rather they aggregate and make available data which other applications, developed by other agencies, then access and use to deliver such functions. Given this sort of context, I imagine it may be quite difficult for the Names project itself to specify fully the purposes for which data is being collected: in theory, those third-party services might perform functions on the data that the Names project itself can not predict.

As part of my pre-meeting truffling, I had a look at the (relatively) recent draft of the Functional Requirements for Authority Data (FRAD) specification. FRAD is another product of IFLA, and it is a sibling document to, or extension of, the (probably better known) Functional Requirements for Bibliographic Records (FRBR) specification. More specifically it's the product of an IFLA group called the "Working Group on Functional Requirements and Numbering of Authority Records (FRANAR)", with the rather confusing (to the outsider) consequence that the acronym FRANAR is sometimes used to refer to this area of work too, but I think the intent is that the model is referred to as FRAD.

Like FRBR, FRAD describes an entity-relational model, with the focus of FRAD on the entities related to "authority data" rather than to the "bibliographic record" itself. IIRC, I had looked at an earlier draft of FRAD quite some time ago, but the current version seems to have come on a long way from that version, and - from a fairly cursory reading on my part - it looks as if it may be a very useful document, both for those (like the Names project) seeking to develop applications in this area, but also for the non-librarians (like me) who want to have a better understanding of librarians' conceptualisations of the world, e.g. the relationships between persons (or personas), names, and access points.

February 21, 2007

WorldCat Institution Registry and Identifiers

(Or: PeteJ bangs on about http URIs again!)

In a couple of recent posts, Lorcan highlights the availability of a new WorldCat Institution Registry service provided by OCLC. The registry stores, manages and makes available descriptions of institutions, mainly libraries and library consortia, and I agree with Lorcan that this is potentially a very valuable service.

Lorcan's second post cites a post by Ed Summers in which Ed notes:

If you drill into a particular institution you’ll see a pleasantly cool uri:

http://worldcat.org/registry/Institutions/89073

…which would serve nicely as an identifier for the Browne Popular Culture Library. The institution pages are HTML instead of XML–however there is a link to an XML representation:

http://worldcat.org/webservices/registry/content/Institutions/89073

This URL isn’t bad but it would be rather nice if the former could return XML if the Accept: header had text/xml slotted before text/html.

As I was rushing out the door last night, I posted a comment to Ed's post expressing my whole-hearted agreement with both of these points, but also expressing a degree of concern at another aspect of the use of identifiers I'd observed in the data exposed by the registry. As far as I can see, my comment doesn't seem to have made it through to Ed's weblog (yet, anyway - maybe Ed moderates his comments and I'm just being impatient, or maybe - more likely - in my haste to get to the pub I clicked the wrong button on his submission form!) so I'm posting a note here.

Like Ed, my immediate reaction was that the first URI above would make a good identifier for the institution. An http URI can be used to identify a resource of any type, so no problem with the choice of URI scheme. The form of the URI itself seems to take into account all of Berners-Lee's suggestions for "coolness". Is it persistent? Well, I guess that depends on OCLC's commitment to maintaining  their ownership of the worldcat.org domain and to managing the URIs they assign within that URI-space e.g. not to assign the URI above to a different institution (or indeed to a document or a DDC term or a research project or something else entirely) at some point in the future. I haven't checked whether OCLC publish any policy statement about their commitment to these URIs. If they did, that would contribute to my faith in their having some degree of persistence, but for now I'll take it on trust that OCLC will manage this URI-space in a controlled manner, and these URIs have some reasonable degree of persistence.

Furthermore, the URI is reasonably concise and human-readable/citable: I could scribble one of those URIs down in an email or read one over the phone.

And the URI also has the helpful characteristic that, if I am given (by email, on a piece of paper, in a telephone conversation) only the URI, then with no additional information, I can "look up" that URI using a tool which supports the HTTP protocol (e.g. my desktop Web browser) and get back some information about the resource identified by the URI provided by the owner of the URI (OCLC), in the form of an HTML document. Or, in the terms of the Web Architecture, I can dereference the URI and obtain a representation of the state of the resource identified by the URI. This functionality is available to me because there is a message protocol which makes use of the http URI scheme, and that protocol is widely supported in software tools.

And that representation (well, OK, a second HTML page linked from it) also contains the URIs of other related institutions, so that I can follow hyperlinks from this institution to another - and obtain representations of those other institutions too. For example (choosing a different institution from Ed's examples to better highlight the hyperlinking), Bowling Green State University is related to a number of "branches" and I can obtain a representation of the list of branches which includes the URIs of those branch institutions.

And as Ed notes, it would be useful if OCLC added support for HTTP content negotiation so that if my software application obtained the URI, it could obtain a representation in a format which made explicit the semantics that the HTML document conveys to a human reader e.g. an instance of the WorldCat XML format, as is available by dereferencing the second URI above.
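As a rough illustration of what that might involve on the server side, here is a minimal sketch of choosing a representation from an Accept header. It is not OCLC's code, and it deliberately simplifies HTTP's rules: it honours q-values, breaks ties by the order in which the client listed the types (an assumption of this sketch, echoing Ed's "slotted before", not something HTTP itself requires), and ignores media type parameters other than q. The function names are invented for the example.

```python
def parse_accept(header):
    """Return (media_type, q) pairs from an Accept header, most preferred first."""
    entries = []
    for position, part in enumerate(header.split(",")):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0  # default weight when no q parameter is given
        for field in fields[1:]:
            name, _, value = field.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed weight as "not acceptable"
        entries.append((q, position, media_type))
    # Highest q first; listing order breaks ties (a choice made by this sketch).
    entries.sort(key=lambda e: (-e[0], e[1]))
    return [(media_type, q) for q, _, media_type in entries]


def choose_representation(accept_header, available):
    """Pick one of the server's available media types, or None if none fits."""
    for media_type, q in parse_accept(accept_header):
        if q <= 0:
            continue
        if media_type in available:
            return media_type
        if media_type == "*/*":
            return available[0]
        if media_type.endswith("/*"):
            prefix = media_type[:-1]  # e.g. "text/"
            for candidate in available:
                if candidate.startswith(prefix):
                    return candidate
    return None


formats = ["text/html", "text/xml"]
for_browser = choose_representation("text/html, text/xml;q=0.9", formats)
for_application = choose_representation("text/xml, text/html", formats)
```

With something like this in place, a browser and a software application could both use the single institution URI, and each would get back the format that suits it.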

So far, so good. Better than good. Great stuff, in fact.

However, when I looked a bit harder at the WorldCat XML representation, my heart sank slightly. Here's an (edited, reformatted, emphasis added) version of one of those XML instances (minus XML Namespace declarations, XML schema location declarations etc):

<institution>
  <identifier>info:rfa/oclc/Institutions/2226</identifier>
  <nameLocation>
    <institutionName>Bowling Green State University</institutionName>
    <!-- stuff snipped -->
  </nameLocation>
  <!-- stuff snipped -->
  <branches>
    <branch>info:rfa/oclc/Institutions/73034</branch>
    <branch>info:rfa/oclc/Institutions/88815</branch>
    <branch>info:rfa/oclc/Institutions/88840</branch>
    <branch>info:rfa/oclc/Institutions/89072</branch>
    <branch>info:rfa/oclc/Institutions/89073</branch>
  </branches>
  <!-- stuff snipped -->
</institution>

What struck me here was that, in the XML representation, a different set of URIs is exposed, a set of URIs using the info URI scheme. Certainly, these URIs share some of the (useful) characteristics that I listed above for the http URIs. They are pretty much as concise, human-readable/citable as the http URIs. My faith in their persistence is subject to pretty much exactly the same considerations as above: will the owners of the "rfa" info namespace continue to own that namespace, and will they manage the assignment of URIs within that namespace so that a single URI is not assigned to two different resources over time? (Hmm, actually, I don't see an entry for "rfa" in the info URI Namespace Registry, so I guess that does raise some doubts in my mind about the answer to the first of those questions!)
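For what it's worth, here is a small sketch of how a consuming application might pull those identifiers out of the XML and split each info URI into its namespace and the rest. It works on a cut-down copy of the edited sample above, so it ignores the XML Namespace declarations that the real documents carry; the function names are invented for the example.

```python
import xml.etree.ElementTree as ET

# A cut-down copy of the edited sample; the real instances declare XML namespaces.
SAMPLE = """
<institution>
  <identifier>info:rfa/oclc/Institutions/2226</identifier>
  <nameLocation>
    <institutionName>Bowling Green State University</institutionName>
  </nameLocation>
  <branches>
    <branch>info:rfa/oclc/Institutions/73034</branch>
    <branch>info:rfa/oclc/Institutions/89073</branch>
  </branches>
</institution>
"""


def split_info_uri(uri):
    """Split an info URI into (namespace, identifier)."""
    scheme, _, rest = uri.partition(":")
    if scheme.lower() != "info" or "/" not in rest:
        raise ValueError("not an info URI: %r" % uri)
    namespace, _, identifier = rest.partition("/")
    return namespace, identifier


def describe(xml_text):
    """Extract the identifier, name and branch identifiers from one record."""
    root = ET.fromstring(xml_text.strip())
    return {
        "identifier": root.findtext("identifier"),
        "name": root.findtext("nameLocation/institutionName"),
        "branches": [branch.text for branch in root.findall("branches/branch")],
    }


record = describe(SAMPLE)
namespace, identifier = split_info_uri(record["identifier"])
```

Note that nothing in this processing tells the application where to go next: it ends up holding the strings "rfa" and "oclc/Institutions/2226", with no standard way to turn them into a request.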

Where the info URIs aren't so useful, of course, is that I can't - at least without some additional information - take one of those URIs that has been given to me by email, by telephone etc, and obtain a representation of the resource. While the info URI scheme shares some of the characteristics of the http URI scheme as an identification mechanism, there is no widely deployed, global mechanism for dereferencing an info URI. The scheme does provide for individual "namespace authorities" to specify dereferencing mechanisms for URIs on a "per namespace" basis, and to disclose those methods via the info URI Namespace Registry, but clearly that introduces additional complexity - and ultimately cost - for dereferencing, at least in comparison with the http URI scheme. (And as I note above, there is currently no entry in the registry for the "rfa" namespace, so I don't know whether there is any dereferencing mechanism available.)

So that raises the question of why the registry system should coin, assign and expose one set of identifiers for institutions in the XML representation (URIs using the info URI scheme, like info:rfa/oclc/Institutions/2226) and a second set of identifiers for the same institutions in the HTML representation (URIs using the http URI scheme, like http://worldcat.org/registry/Institutions/2226). To a human reader and to a software application those two URIs are different URIs, and there is no indication of whether or not they identify the same resource. Of course the URI owner could make available the information that they do in fact identify the same resource. Certainly that enables consumers of those two different URIs to establish that they are referring to the same resource, but it does so at the cost of additional complexity: every system that uses the URIs needs to process the two URIs as equivalents. The Web Architecture document counsels against such an approach for exactly these reasons:

Good practice: Avoiding URI aliases

A URI owner SHOULD NOT associate arbitrarily different URIs with the same resource.

That's not to say that we don't have to deal with cases where we wish to assert that two URIs identify the same resource - of course we do and we have mechanisms to do so - but let's try to avoid making life unnecessarily difficult for ourselves!
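To give a feel for that extra cost, here is a sketch of the sort of alias handling every consuming system would need. It rests on a big assumption, flagged in the code too: that info:rfa/oclc/Institutions/N and http://worldcat.org/registry/Institutions/N identify the same resource. Nothing published by the registry confirms that mapping; it is a guess used only for illustration.

```python
# ASSUMPTION: these two prefixes name the same set of resources.
# The registry itself does not state this; the mapping is hypothetical.
INFO_PREFIX = "info:rfa/oclc/Institutions/"
HTTP_PREFIX = "http://worldcat.org/registry/Institutions/"


def canonical(uri):
    """Rewrite the info form onto the http form; leave other URIs untouched."""
    if uri.startswith(INFO_PREFIX):
        return HTTP_PREFIX + uri[len(INFO_PREFIX):]
    return uri


def same_resource(a, b):
    """True if two URIs are equal once known aliases have been collapsed."""
    return canonical(a) == canonical(b)


match = same_resource(
    "info:rfa/oclc/Institutions/2226",
    "http://worldcat.org/registry/Institutions/2226",
)
```

It is only a few lines, but they have to be written, tested and kept up to date in every application that encounters both forms.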

Now, of course, it may well be that my question "why coin, assign and expose two identifiers for the same resource?" is based on a completely false assumption! It may be that the intent is that these two sets of identifiers identify not one single set of resources, but two different sets of resources e.g. the info URIs identify the institutions, and the http URIs identify documents which describe those institutions. And indeed the fact that when I dereference the http URI http://worldcat.org/registry/Institutions/2226 the server returns a status code of 200 suggests that this is indeed the case (because according to the W3C Technical Architecture Group's resolution to the httpRange-14 issue, that response code indicates that the resource is an "information resource"). That is a very good argument for coining and exposing two sets of identifiers, though I think it also brings with it a requirement on the system which exposes those identifiers to make clear to users - both human users and software applications - exactly what resource is identified by each URI.

However, while it is a very good argument for using two different sets of URIs, I'm still not convinced that it is a good argument for choosing to use URIs based on the info URI scheme to identify the institution: as Ed pointed out, http URIs will do the job perfectly well - and they have the (hugely significant, I think) benefit that they can be dereferenced using widely available tools. If it is necessary to identify both the institution and the document then that could still be done using http URIs in both cases e.g.

Pattern 1:

  • Document: http://worldcat.org/registry/Institutions/2226
  • Institution: http://worldcat.org/registry/Institutions/2226#inst

Pattern 2:

  • Document: http://worldcat.org/registry/Institutions/2226
  • Institution: http://worldcat.org/registry/Institutions/2226/inst

(In this case, when the second URI is dereferenced the server could/should observe the rules specified by the resolution to the httpRange-14 issue, and use a response code of 303 to redirect to the document describing the institution. Incidentally, there's an excellent, clear, practical discussion of this and related issues in a paper by Leo Sauermann and Richard Cyganiak mentioned in a recent message to the Semantic Web Education and Outreach Interest Group mailing list.)

Pattern 3: (assuming a suitable top-level PURL domain is registered)

  • Document: http://worldcat.org/registry/Institutions/2226
  • Institution: http://purl.org/worldcat/registry/Institutions/2226

(The PURL server uses a response code of 302, rather than 303, so this approach wouldn't meet all the requirements of the TAG resolution mentioned above.)

Any of these patterns allow a clear distinction to be made between the identifier of the institution and the identifier of the document, while continuing to allow for the dereferencing of both sets of identifiers, using a protocol that is supported by widely-deployed tools (and introducing the flexibility suggested by Ed by offering multiple representations via HTTP content negotiation).
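Patterns 1 and 2 can be sketched in a few lines. In Pattern 1 the client strips the fragment before making the request, so the server only ever sees the document URI; in Pattern 2 the server answers a request for the institution URI with a 303 pointing at the document. The tiny "server" below is a plain function returning a status code and headers rather than a real HTTP server, and its names and data are invented for the example.

```python
from urllib.parse import urldefrag

# Paths of the documents this hypothetical server can return directly.
DOCUMENTS = {"/registry/Institutions/2226"}
SUFFIX = "/inst"  # Pattern 2: path suffix marking the institution itself


def request_target(uri):
    """Pattern 1: return the URI a client actually requests (fragment removed)."""
    url, _fragment = urldefrag(uri)
    return url


def respond(path):
    """Pattern 2: return (status, headers) for a GET on the given path."""
    if path in DOCUMENTS:
        # An information resource: serve a representation of the document.
        return 200, {"Content-Type": "text/html"}
    if path.endswith(SUFFIX) and path[: -len(SUFFIX)] in DOCUMENTS:
        # The institution is not a document: redirect to one that describes it.
        return 303, {"Location": path[: -len(SUFFIX)]}
    return 404, {}


hash_target = request_target("http://worldcat.org/registry/Institutions/2226#inst")
doc_status, _ = respond("/registry/Institutions/2226")
inst_status, inst_headers = respond("/registry/Institutions/2226/inst")
```

Either way, a client given only the institution's URI ends up with the describing document, and the two resources keep distinct identifiers.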

Hmm. I think I've taken a long time to say what Andy said more concisely back here!
