
February 21, 2007

WorldCat Institution Registry and Identifiers

(Or: PeteJ bangs on about http URIs again!)

In a couple of recent posts, Lorcan highlights the availability of a new WorldCat Institution Registry service provided by OCLC. The registry stores, manages and makes available descriptions of institutions, mainly libraries and library consortia, and I agree with Lorcan that this is potentially a very valuable service.

Lorcan's second post cites a post by Ed Summers in which Ed notes:

If you drill into a particular institution you’ll see a pleasantly cool uri:


…which would serve nicely as an identifier for the Browne Popular Culture Library. The institution pages are HTML instead of XML–however there is a link to an XML representation:


This URL isn’t bad but it would be rather nice if the former could return XML if the Accept: header had text/xml slotted before text/html.

As I was rushing out the door last night, I posted a comment to Ed's post expressing my whole-hearted agreement with both of these points, but also expressing a degree of concern at another aspect of the use of identifiers I'd observed in the data exposed by the registry. As far as I can see, my comment doesn't seem to have made it through to Ed's weblog (yet, anyway - maybe Ed moderates his comments and I'm just being impatient, or maybe - more likely - in my haste to get to the pub I clicked the wrong button on his submission form!) so I'm posting a note here.

Like Ed, my immediate reaction was that the first URI above would make a good identifier for the institution. An http URI can be used to identify a resource of any type, so no problem with the choice of URI scheme. The form of the URI itself seems to take into account all of Berners-Lee's suggestions for "coolness". Is it persistent? Well, I guess that depends on OCLC's commitment to maintaining their ownership of the worldcat.org domain and to managing the URIs they assign within that URI-space, e.g. not assigning the URI above to a different institution (or indeed to a document or a DDC term or a research project or something else entirely) at some point in the future. I haven't checked whether OCLC publish any policy statement about their commitment to these URIs. If they do, that would contribute to my faith in their having some degree of persistence, but for now I'll take it on trust that OCLC will manage this URI-space in a controlled manner, and that these URIs have some reasonable degree of persistence.

Furthermore, the URI is reasonably concise and human-readable/citable: I could scribble one of those URIs down in an email or read one over the phone.

And the URI also has the helpful characteristic that, if I am given (by email, on a piece of paper, in a telephone conversation) only the URI, then with no additional information, I can "look up" that URI using a tool which supports the HTTP protocol (e.g. my desktop Web browser) and get back some information about the resource identified by the URI provided by the owner of the URI (OCLC), in the form of an HTML document. Or, in the terms of the Web Architecture, I can dereference the URI and obtain a representation of the state of the resource identified by the URI. This functionality is available to me because there is a message protocol which makes use of the http URI scheme, and that protocol is widely supported in software tools.

And that representation (well, OK, a second HTML page linked from it) also contains the URIs of other related institutions, so that I can follow hyperlinks from this institution to another - and obtain representations of those other institutions too. For example (choosing a different institution from Ed's examples to better highlight the hyperlinking), Bowling Green State University is related to a number of "branches" and I can obtain a representation of the list of branches which includes the URIs of those branch institutions.

And as Ed notes, it would be useful if OCLC added support for HTTP content negotiation so that if my software application obtained the URI, it could obtain a representation in a format which made explicit the semantics that the HTML document conveys to a human reader e.g. an instance of the WorldCat XML format, as is available by dereferencing the second URI above.
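To make Ed's suggestion a little more concrete, here is a minimal sketch of how a server might pick a representation from the Accept header. It is not a description of what OCLC's server does: the function names are hypothetical, wildcards other than */* are ignored, and position in the header is used only as a tie-break between equal q values (which is what makes "text/xml slotted before text/html" work here).

```python
def parse_accept(header):
    """Parse an Accept header into (media_type, q) pairs, most preferred first."""
    prefs = []
    for position, item in enumerate(header.split(",")):
        parts = [p.strip() for p in item.split(";")]
        media_type = parts[0].lower()
        if not media_type:
            continue
        q = 1.0  # default quality value when none is given
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat an unparsable q as "not acceptable"
        prefs.append((media_type, q, position))
    # Highest q first; header order breaks ties (an assumption of this sketch).
    prefs.sort(key=lambda p: (-p[1], p[2]))
    return [(media_type, q) for media_type, q, _ in prefs]


def choose_representation(accept_header, available, default):
    """Return the media type to serve for a given Accept header."""
    for media_type, q in parse_accept(accept_header):
        if q <= 0:
            continue
        if media_type in available:
            return media_type
        if media_type == "*/*":
            return default
    # A stricter server might answer 406 here; this sketch falls back to the default.
    return default


AVAILABLE = ["text/html", "text/xml"]
print(choose_representation("text/xml, text/html", AVAILABLE, "text/html"))
```

With this in place, a browser that lists text/html first would still get the HTML page, while an application that lists text/xml first would get the XML from the same URI.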

So far, so good. Better than good. Great stuff, in fact.

However, when I looked a bit harder at the WorldCat XML representation, my heart sank slightly. Here's an (edited, reformatted, emphasis added) version of one of those XML instances (minus XML Namespace declarations, XML schema location declarations etc):

    <institutionName>Bowling Green State University</institutionName>
    <!-- stuff snipped -->
  <!-- stuff snipped -->
  <!-- stuff snipped -->

What struck me here was that, in the XML representation, a different set of URIs is exposed, a set of URIs using the info URI scheme. Certainly, these URIs share some of the (useful) characteristics that I listed above for the http URIs. They are pretty much as concise, human-readable/citable as the http URIs. My faith in their persistence is subject to pretty much exactly the same considerations as above: will the owners of the "rfa" info namespace continue to own that namespace, and will they manage the assignment of URIs within that namespace so that a single URI is not assigned to two different resources over time? (Hmm, actually, I don't see an entry for "rfa" in the info URI Namespace Registry, so I guess that does raise some doubts in my mind about the answer to the first of those questions!)

Where the info URIs aren't so useful, of course, is that I can't - at least without some additional information - take one of those URIs that has been given to me by email, by telephone etc, and obtain a representation of the resource. While the info URI scheme shares some of the characteristics of the http URI scheme as an identification mechanism, there is no widely deployed, global mechanism for dereferencing an info URI. Instead, the info URI scheme provides for individual "namespace authorities" to specify dereferencing mechanisms for URIs on a "per namespace" basis, and to disclose those methods via the info URI Namespace Registry. Clearly that introduces additional complexity - and ultimately cost - for dereferencing, at least in comparison with the http URI scheme. (And as I note above, there is currently no entry in the registry for the "rfa" namespace, so I don't know whether there is any dereferencing mechanism available.)
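The difference shows up as soon as an application has to decide what to do with a URI it has been handed. The following is only a sketch: the function name and the returned strings are hypothetical, and the set of "widely dereferenceable" schemes is my own assumption rather than any official list.

```python
from urllib.parse import urlsplit

# Schemes with a widely deployed dereferencing protocol.
# This set is an assumption made for the sketch, not an official list.
WIDELY_DEREFERENCEABLE = {"http", "https"}


def dereferencing_route(uri):
    """Describe what a client would have to do to look up the given URI."""
    scheme = urlsplit(uri).scheme.lower()
    if scheme in WIDELY_DEREFERENCEABLE:
        return "fetch directly over HTTP"
    if scheme == "info":
        # For info URIs the namespace is the segment between ':' and the first '/'.
        namespace = uri.split(":", 1)[1].split("/", 1)[0]
        return "consult the info URI registry entry for namespace '%s'" % namespace
    return "no known route"


print(dereferencing_route("http://worldcat.org/registry/Institutions/2226"))
print(dereferencing_route("info:rfa/oclc/Institutions/2226"))
```

The http case needs nothing beyond the URI itself. The info case needs an extra lookup per namespace, and that lookup only helps if the namespace is actually registered.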

So that raises the question of why the registry system should coin, assign and expose one set of identifiers for institutions in the XML representation (URIs using the info URI scheme, like info:rfa/oclc/Institutions/2226) and a second set of identifiers for the same institutions in the HTML representation (URIs using the http URI scheme, like http://worldcat.org/registry/Institutions/2226). To a human reader and to a software application those two URIs are different URIs, and there is no indication of whether or not they identify the same resource. Of course the URI owner could make available the information that they do in fact identify the same resource. Certainly that enables consumers of those two different URIs to establish that they are referring to the same resource, but it does so at the cost of additional complexity: every system that uses the URIs needs to process the two URIs as equivalents. The Web Architecture document counsels against such an approach for exactly these reasons:

Good practice: Avoiding URI aliases

A URI owner SHOULD NOT associate arbitrarily different URIs with the same resource.

That's not to say that we don't have to deal with cases where we wish to assert that two URIs identify the same resource - of course we do and we have mechanisms to do so - but let's try to avoid making life unnecessarily difficult for ourselves!
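To illustrate the cost, suppose, purely hypothetically, that the URI owner had declared the two URIs for institution 2226 to be equivalent. Every consumer would then need something like the following before it could compare identifiers. The table and function names are inventions for this sketch, and the equivalence itself is an assumption, not something the registry states.

```python
# Hypothetical equivalence table: alias URI -> preferred URI.
# Whether these two URIs really identify the same resource is not stated anywhere.
SAME_AS = {
    "info:rfa/oclc/Institutions/2226": "http://worldcat.org/registry/Institutions/2226",
}


def canonical(uri):
    """Map a known alias to its preferred form; leave other URIs untouched."""
    return SAME_AS.get(uri, uri)


def same_resource(a, b):
    """Compare two URIs after alias resolution."""
    return canonical(a) == canonical(b)


print(same_resource(
    "info:rfa/oclc/Institutions/2226",
    "http://worldcat.org/registry/Institutions/2226",
))
```

A plain string comparison would have answered "different", which is exactly the behaviour a consumer gets if it never sees the equivalence table.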

Now, of course, it may well be that my question "why coin, assign and expose two identifiers for the same resource?" is based on a completely false assumption! It may be that the intent is that these two sets of identifiers identify not one single set of resources, but two different sets of resources e.g. the info URIs identify the institutions, and the http URIs identify documents which describe those institutions. And indeed the fact that when I dereference the http URI http://worldcat.org/registry/Institutions/2226 the server returns a status code of 200 suggests that this is indeed the case (because according to the W3C Technical Architecture Group's resolution to the httpRange-14 issue, that response code indicates that the resource is an "information resource"). That is a very good argument for coining and exposing two sets of identifiers, though I think it also brings with it a requirement on the system which exposes those identifiers to make clear to users - both human users and software applications - exactly what resource is identified by each URI.

However, while it is a very good argument for using two different sets of URIs, I'm still not convinced that it is a good argument for choosing to use URIs based on the info URI scheme to identify the institution: as Ed pointed out, http URIs will do the job perfectly well - and they have the (hugely significant, I think) benefit that they can be dereferenced using widely available tools. If it is necessary to identify both the institution and the document then that could still be done using http URIs in both cases e.g.

Pattern 1:

  • Document: http://worldcat.org/registry/Institutions/2226
  • Institution: http://worldcat.org/registry/Institutions/2226#inst

Pattern 2:

  • Document: http://worldcat.org/registry/Institutions/2226
  • Institution: http://worldcat.org/registry/Institutions/2226/inst

(In this case, when the second URI is dereferenced the server could/should observe the rules specified by the resolution to the httpRange-14 issue, and use a response code of 303 to redirect to the document describing the institution. Incidentally, there's an excellent, clear, practical discussion of this and related issues in a paper by Leo Sauermann and Richard Cyganiak mentioned in a recent message to the Semantic Web Education and Outreach Interest Group mailing list.)

Pattern 3: (assuming a suitable top-level PURL domain is registered)

  • Document: http://worldcat.org/registry/Institutions/2226
  • Institution: http://purl.org/worldcat/registry/Institutions/2226

(The PURL server uses a response code of 302, rather than 303, so this approach wouldn't meet all the requirements of the TAG resolution mentioned above.)

Any of these patterns allows a clear distinction to be made between the identifier of the institution and the identifier of the document, while continuing to allow for the dereferencing of both sets of identifiers, using a protocol that is supported by widely-deployed tools (and introducing the flexibility suggested by Ed by offering multiple representations via HTTP content negotiation).
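Patterns 1 and 2 can be sketched from the client's point of view. In Pattern 1 the client strips the fragment before making the request, so the server only ever sees the document URI. In Pattern 2 the server answers a request for the institution URI with a 303 pointing at the document. The toy respond function below stands in for a server and is entirely hypothetical; no network access is involved.

```python
from urllib.parse import urldefrag

DOCUMENT = "http://worldcat.org/registry/Institutions/2226"


def respond(request_uri):
    """Toy stand-in for a server following Pattern 2: returns (status, location)."""
    if request_uri == DOCUMENT:
        return 200, None          # the document itself
    if request_uri == DOCUMENT + "/inst":
        return 303, DOCUMENT      # the institution: "see other" document
    return 404, None


def dereference(uri, max_hops=5):
    """Return the list of (requested_uri, status) pairs seen while dereferencing."""
    fetch_uri, _fragment = urldefrag(uri)  # Pattern 1: fragment never reaches the server
    trail = []
    for _ in range(max_hops):
        status, location = respond(fetch_uri)
        trail.append((fetch_uri, status))
        if status != 303 or location is None:
            break
        fetch_uri = location
    return trail


print(dereference(DOCUMENT + "#inst"))
print(dereference(DOCUMENT + "/inst"))
```

In both cases the client ends up with a representation of the document, but the institution keeps its own URI, and in neither case does the server answer 200 for the institution URI itself.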

Hmm. I think I've taken a long time to say what Andy said more concisely back here!

February 20, 2007

TypeKey and the UK Access Management Federation

There's an interesting announcement from the UK Access Management Federation for Education and Research about their support for TypeKey identities.  A slightly odd choice of technology it seems to me, as opposed to supporting OpenID for example, but interesting nonetheless and definitely seems to be a step in the right direction.  Presumably, adding support for OpenID at this point wouldn't be very difficult?

February 17, 2007

Ed Barker joins the Eduserv Foundation

A quick note, just to say welcome to Ed Barker who joined us at the beginning of this week as Researcher and Grants Coordinator for the Eduserv Foundation.  Amongst other things, Ed will be looking after our programme of grants to the community.  Ed joins us from Intrallect in Edinburgh and will probably already be known to many of you in the UK education community.

February 15, 2007

More ruminations on compoundness and complexity (and metadata)

This is a somewhat belated post that I started a few days ago, but put to one side while we concentrated on reading through the pile of Eduserv Research Grant proposals.

A couple of weeks ago I attended the workshop on describing complex objects that Andy referred to, and at which he gave a presentation (I was in the happy position of being able to sit in the back row and nod enthusiastically).

The programme featured presentations on three fairly widely used "packaging formats": MPEG-21 DIDL (by Frances Knudson of Los Alamos National Laboratory), METS (by Markus Enders of Goettingen State and University Library) and IMS Content Packaging (by Sheila Macneill of CETIS and University of Strathclyde).

The programme also included a presentation by Wilbert Kraan of CETIS on an IEEE LTSC project called RAMLET (Resource Aggregation Model for Learning, Education and Training), which has developed an ontology that can be used as the basis for mapping between instances of different "packaging formats".

Andy's presentation was the last of the five, and, leaving aside the DC-specific aspects, I thought probably his key point was that metadata is at the heart of what we call "content packaging" - metadata that describes certain specific characteristics of resources in order to allow applications to perform certain specific functions, certainly, but ultimately a key part of a "package" is some set of "statements" about some resources - and more specifically about relationships between resources.

So, when I create the content of a <structMap> element in a METS instance or of an <organization> element in an IMS CP instance, I'm describing relationships between resources. (I'm consciously not commenting further on DIDL here as I'm much less familiar with the specification and after Frances' presentation I feel I need to go away and read up a bit more before making (probably quite misguided) comments about it!) To take a very simple example, if I create a <structMap> like (rough outline for illustration purposes only - I don't promise that this is a complete/valid METS XML fragment!):

   <structMap>
      <div LABEL="My paper">
         <div LABEL="My section 1">
            <fptr FILEID="file001" />
         </div>
         <div LABEL="My section 2">
            <fptr FILEID="file002" />
         </div>
         <div LABEL="My section 3">
            <fptr FILEID="file003" />
         </div>
      </div>
   </structMap>

or an IMS CP <organization> like (caveats as above!):

    <organization>
      <item identifier="item1">
        <title>My paper</title>
        <item identifier="item2" identifierref="file0001">
          <title>My section 1</title>
        </item>
        <item identifier="item3" identifierref="file0002">
          <title>My section 2</title>
        </item>
        <item identifier="item4" identifierref="file0003">
          <title>My section 3</title>
        </item>
      </item>
    </organization>

then in each case I'm "saying" that one resource (titled "My paper") is composed of a sequence of component resources titled "My section 1", "My section 2" and "My section 3". Elsewhere in the METS or IMS CP document I provide URIs of those resources. OK, it's a bit more complicated than that, but for the purposes of this argument, I'll stick to a simple case. And I could "say" exactly the same thing by constructing a Dublin Core metadata description set or an RDF graph, e.g. using the Turtle syntax for RDF:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix ex: <http://example.org/terms/> .

_:resource1 dc:title "My paper" ;
            ex:hasOrganisation [ a rdf:Seq ;
                                 rdf:_1 <http://example.org/docs/1> ;
                                 rdf:_2 <http://example.org/docs/2> ;
                                 rdf:_3 <http://example.org/docs/3> ] .

<http://example.org/docs/1> dc:title "My section 1" .
<http://example.org/docs/2> dc:title "My section 2" .
<http://example.org/docs/3> dc:title "My section 3" .

And I could probably do something similar using various other metadata specifications that allow me to describe relationships between things. I'm conscious that I'm over-simplifying somewhat, and METS and IMS CP provide other features that go beyond describing relationships, particularly in terms of describing how to embed representations of resources within an instance, but nevertheless I think Andy's point is a good one: a description of relationships between resources is a form of metadata. (Footnote: Sheila doesn't sound completely convinced!)
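One way to see that a structMap is "just" a set of statements is to walk it and write the statements out. The sketch below does that for a METS-like fragment using only the Python standard library. The fragment is the rough outline from above (no namespaces, not promised to be valid METS), and the function name and the shape of the output tuples are my own choices.

```python
import xml.etree.ElementTree as ET

# METS-like fragment for illustration only; namespaces are omitted.
STRUCT_MAP = """
<structMap>
  <div LABEL="My paper">
    <div LABEL="My section 1"><fptr FILEID="file001"/></div>
    <div LABEL="My section 2"><fptr FILEID="file002"/></div>
    <div LABEL="My section 3"><fptr FILEID="file003"/></div>
  </div>
</structMap>
"""


def structural_statements(xml_text):
    """Return (parent_label, position, child_label) for each parent/child div pair."""
    root = ET.fromstring(xml_text)
    statements = []

    def walk(parent):
        # findall("div") returns direct children only, in document order.
        for position, child in enumerate(parent.findall("div"), start=1):
            statements.append((parent.get("LABEL"), position, child.get("LABEL")))
            walk(child)

    for top in root.findall("div"):
        walk(top)
    return statements


for statement in structural_statements(STRUCT_MAP):
    print(statement)
```

Each tuple carries the same information as one rdf:_n arc in the Turtle example: which resource is the parent, which is the component, and where it sits in the sequence.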

The other key point emerging from Andy's presentation, which he also highlighted in his earlier post, is that resources are of different types and relationships between resources are of different types, and he proposed a distinction between "compound objects" and "complex objects" on the basis of the different categories of relationship being described.

It seems to me that METS and IMS CP are fundamentally about describing what I think of as structural relationships (Andy's "compound object" case): when I construct a METS structMap or an IMS CP organization, I'm "saying" resource W has component resources X, Y and Z. Further, I think METS and IMS CP support a specific subset of structural relationships, i.e. they deal essentially with (ordered?) tree structures, where a "parent" resource has as components a sequence of "child" resources.

And (rather more tentatively!) I'd venture that the types of resources with which METS and IMS CP are concerned are, more or less, what the Web Architecture categorises (albeit somewhat vaguely!) as "information resources" i.e.

We do not limit the scope of what might be a resource. The term "resource" is used in a general sense for whatever might be identified by a URI. It is conventional on the hypertext Web to describe Web pages, images, product catalogs, etc. as "resources". The distinguishing characteristic of these resources is that all of their essential characteristics can be conveyed in a message. We identify this set as "information resources."

This document is an example of an information resource. It consists of words and punctuation symbols and graphics and other artifacts that can be encoded, with varying degrees of fidelity, into a sequence of bits. There is nothing about the essential information content of this document that cannot in principle be transfered in a message. In the case of this document, the message payload is the representation of this document.

But Andy went on to consider the example of the ePrints DC Application Profile, which is concerned with the description of resources of several different types, at least some of which - agents, for example - are not "information resources", and the description of various types of relationship which are not structural, e.g. relationships like is-created-by, is-published-by, and so on. While it is quite possible to describe relationships of these types between resources using statements in a Dublin Core metadata description set or using RDF, it seems to me I cannot describe such relationships using a METS structMap or an IMS CP organization.

The point I'm trying to make here is not that I think Dublin Core is "better" than METS or IMS CP, but rather that, in order to make decisions about which specifications we use in this area, it's important to understand what each of the "packaging formats" allows us to "say" about "things in the world". From this viewpoint, the syntactic structure of, say, a METS XML instance is of less interest than what information such a document allows us to convey about resources and the relationships between them, i.e. what models underpin those formats - not models of the packaging instance itself (which I think is what is described by e.g. the IMS Content Packaging Information Model) but of the resources described or referred to within that instance. 

Such considerations will be important in the context of the OAI ORE initiative: for example, if an existing "packaging format" is used to serialise the ORE model, then it becomes critical that we understand fully any model inherent in that format (any built-in assumptions about the nature of the resources referenced or described, and the nature of any relationships between resources that are expressed within the format), and that we ensure that any such serialisation accurately reflects the ORE model.

February 14, 2007

Donations to Creative Commons and Wikimedia

One of the great things about working at the Eduserv Foundation is our ability to give money to projects and activities that we feel are of benefit to the education community.  I am therefore very pleased to announce that we have given $10,000(US) each to Creative Commons and the Wikimedia Foundation (press release, PDF).

Creative Commons has fundamentally changed, and continues to change, our attitude to the way content is created and shared. By 'our' I am primarily thinking about the UK education community, though clearly the impact of CC is much wider than that - part of CC's attraction is that the underlying principles are understandable and applicable globally. CC has liberated us from thinking first and foremost about protecting and restricting content and has given us the ability to focus on sharing, which is fundamental to both learning and research. Sure, we have a long way to go in fully realising the benefits of CC, that's why it is important for organisations like the Eduserv Foundation to continue to support CC, but it seems to me that in a very real sense CC has changed the landscape in which we operate. The basis of the community's discussion about content is fundamentally different now because people come to the table with CC as a viable option.

In a similar way, our donation to the Wikimedia Foundation recognises the growing importance of the suite of activities they undertake, notably Wikipedia, in the context of learning and teaching. This is not just because Wikipedia has become such a valuable resource to teachers and learners in its own right, but because it has demonstrated the real potential of the Web; the potential for building very significant and valuable encyclopedic resources collaboratively, using a highly distributed knowledge base, in ways that were unimaginable to most of us even 3 or 4 years ago. I fully expect Wikipedia and their other offerings to continue to grow in importance within our community over the coming years.

February 08, 2007

Microsoft and OpenID

Scott's blog entry says it all really... this feels like a very significant development.

February 06, 2007

Computer Games: Learning, Meaning and Method

Diane Carr kindly invited me to her "Computer Games: Learning, Meaning and Method" event at the London Knowledge Lab at the end of January.  The event provided an opportunity for Diane to share some of the outcomes of her work on the Digital Technology, Learning and 'Game Formats' - Computer games, motivation and gender in learning contexts project, an activity that we have funded since January 2004.  This included feedback on Diane's experiment to get several LKL staff to use World of Warcraft, each approaching it from a different perspective, recording their impressions and experiences as they went.

Overall the day was quite broad, with some very interesting presentations on a variety of topics. If I'm completely honest, I suppose that some of the material went a little over my head(!), being someone who is not well versed in the theory of games and gaming research. But interesting nonetheless.

Some of the Powerpoint slides and other materials from the day are now available on the LKL Web site - I hope others follow. 

As an aside, I note that the JISC-funded report Learning in Immersive Worlds: a review of game based learning by Sara de Freitas, which:

scopes out the current use of games for learning in UK HE and post-16 education and has been produced to inform practitioners who are considering using games and simulations in their practice

is now available.  Should be worth a read.


