July 21, 2010

Getting techie... what questions should we be asking of publishers?

The Licence Negotiation team here are thinking about the kinds of technical questions they should be asking publishers and other content providers as part of their negotiations. The aim isn't to embed the answers to those questions in contractual clauses - rather, it is to build up a knowledge base of surrounding information that may be useful to institutions and others who are thinking about taking up a particular agreement.

My 'starter for 10' set of questions goes like this:

  • Do you make any commitment to the persistence of the URLs for your published content? If so, please give details. Do you assign DOIs to your published content? Are you members of CrossRef?
  • Do you support a search API? If so, what standard(s) do you support?
  • Do you support a metadata harvesting API? If so, what standard(s) do you support?
  • Do you expose RSS and/or Atom feeds for your content? If so, please describe the feeds you offer.
  • Do you expose any form of Linked Data about your published content? If so, please give details.
  • Do you generate OpenURLs as part of your web interface? Do you have a documented means of linking to your content based on bibliographic metadata fields? If so, please give details.
  • Do you support SAML (Service Provider) as a means of controlling access to your content? If so, which version? Are you a member of the UK Access Management Federation? If you also support other methods of access control, please give details.
  • Do you grant permission for the preservation of your content using LOCKSS, CLOCKSS and/or PORTICO? If so, please give details.
  • Do you have a statement about your support for the Web Accessibility Initiative (WAI)? If so, please give details.
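Several of these questions can be checked mechanically rather than taken on trust. As a hedged sketch of checking the metadata harvesting question (the sample XML is invented for illustration; a real response would come from an HTTP GET of the publisher's base URL with `?verb=Identify` appended):

```python
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

# Illustrative OAI-PMH Identify response, not any real publisher's.
SAMPLE_IDENTIFY = """<?xml version="1.0"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2010-07-21T12:00:00Z</responseDate>
  <Identify>
    <repositoryName>Example Publisher Archive</repositoryName>
    <baseURL>http://example.org/oai</baseURL>
    <protocolVersion>2.0</protocolVersion>
  </Identify>
</OAI-PMH>"""

def summarise_identify(xml_text):
    """Pull the repository name and protocol version out of an
    OAI-PMH Identify response."""
    root = ET.fromstring(xml_text)
    identify = root.find(OAI_NS + "Identify")
    return {
        "name": identify.findtext(OAI_NS + "repositoryName"),
        "version": identify.findtext(OAI_NS + "protocolVersion"),
    }

print(summarise_identify(SAMPLE_IDENTIFY))
```

Similar probes could be scripted for the feeds, OpenURL and Linked Data questions, turning the publisher's answers into something an institution can verify for itself.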

Does this look like a reasonable and sensible set of questions for us to be asking of publishers? What have I missed? Something about open access perhaps?

March 23, 2010

Federating purl.org ?

I suggested a while back that PURLs have become quite important, at least for some aspects of the Web (particularly Linked Data as it happens), and that the current service at purl.org may therefore represent something of a single point of failure.

I therefore note with some interest that Zepheira, the company developing the PURL software, have recently announced a PURL Federation Architecture:

A PURL Federation is proposed which will consist of multiple independently-operated PURL servers, each of which have their own DNS hostnames, name their PURLs using their own authority (different from the hostname) and mirror other PURLs in the federation. The authorities will be "outsourced" to a dynamic DNS service that will resolve to proxies for all authorities of the PURLs in the federation. The attached image illustrates and summarizes the proposed architecture.

Caching proxies are inserted between the client and federation members. The dynamic DNS service responds to any request with an IP address of a proxy. The proxy attempts to contact the primary PURL member via its alternative DNS name to fulfill the request and caches the response for future requests. In the case where the primary PURL member is not responsive, the proxy attempts to contact another host in the federation until it succeeds. Thus, most traffic for a given PURL authority continues to flow to the primary PURL member for that authority and not other members of the federation.
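The failover behaviour described above can be sketched in a few lines. This is a toy model of the proxy's resolution loop, not Zepheira's implementation; the host names and fetch function are invented for illustration:

```python
def resolve_purl(path, hosts, fetch):
    """Try the primary host for a PURL authority first, then fall
    back to the other federation members until one responds.

    `hosts` is an ordered list, primary first; `fetch` is any
    callable that returns a response or raises IOError on failure.
    """
    last_error = None
    for host in hosts:
        try:
            return fetch(host, path)
        except IOError as err:          # host down or unreachable
            last_error = err
    raise last_error

# Toy fetcher: pretend the primary is down and a mirror answers.
def fake_fetch(host, path):
    if host == "purl.example.org":      # primary: unresponsive
        raise IOError("timeout")
    return "redirect from " + host + " for " + path

print(resolve_purl("/dc/terms/title",
                   ["purl.example.org", "mirror.example.net"],
                   fake_fetch))
```

The point of the design is visible even in the toy: so long as the primary is up, all traffic for its authority goes to it, and the mirrors only carry load during an outage.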

I don't know what is planned in this space, and I may not have read the architecture closely enough, but it seems to me that there is now a significant opportunity for OCLC to work with a small number of national libraries (the British Library, the Library of Congress and the National Library of Australia spring to mind as a usefully geographically-dispersed set) to federate the current service at purl.org.

February 03, 2010

More famous than Simon Cowell

I wrote a blog post on my other, Blipfoto, blog this morning, More famous than Simon Cowell, looking at some of the issues around persistent identifiers from the perspective of a non-technical audience. (You'll have to read the post to understand the title).

I used the identifier painted on the side of a railway bridge just outside Bath as my starting point.

It's certainly not an earth-shattering post, but it was quite interesting (for me) to approach things from a slightly different perspective:

What makes the bridge identifier persistent? It's essentially a social construct. It's not a technical thing (primarily). It's not the paint the number is written in, or the bricks of the bridge itself, or the computer system at head office that maps the number to a map reference. These things help... but it's mainly people that make it persistent.

I wrote the piece because the JISC have organised a meeting, taking place in London today, to consider their future requirements around persistent identifiers. For various reasons I was not able to attend - a situation that I'm pretty ambivalent about to be honest - I've sat thru a lot of identifier meetings in the past :-).

Regular readers will know that I've blown hot and cold (well, mainly cold!) about the DOI - an identifier that I'm sure will feature heavily in today's meeting. Just to be clear... I am not against what the DOI is trying to achieve, nor am I in any way negative about the kinds of services, particularly CrossRef, that have been able to grow up around it. Indeed, while I was at UKOLN I committed us to joining CrossRef and thus assigning DOIs to all UKOLN publications. (I have no idea if they are still members).  I also recognise that the DOI is not going to go away any time soon.

I am very critical of some of the technical decisions that the DOI people have made - particularly their decision to encourage multiple ways of encoding the DOI as a URI, and the fact that the primary way (the 'doi' URI scheme) did not use an 'http' URI. Whilst persistence is largely a social issue rather than a technological one, I do think that badly used technology can get in the way of both persistence and utility. I also firmly believe in the statement that I have made several times previously... that "the only good long term identifier is a good short term identifier".  The DOI, in both its 'doi' URI and plain-old string-of-characters forms, is not a good short term identifier.

My advice to the JISC? Start from the principles of Linked Data, which very clearly state that 'http' URIs should be used. Doing so sidesteps many of the cyclic discussions that otherwise occur around the benefits of URNs and other URI schemes and allows people to focus on the question, "how do we make http URIs work as well and as persistently as possible?" rather than always starting from, "http URIs are broken, what should we build instead?".
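To make the "multiple encodings" complaint concrete: the same DOI circulates as a bare string, as a 'doi' URI, and as an http URI that actually resolves on the Web. A small sketch of normalising the first two forms to the third, using the dx.doi.org resolver current at the time of writing (the DOI itself is just an example value):

```python
def doi_to_http(doi):
    """Normalise a DOI, given either as a bare string
    ('10.1000/182') or as a 'doi' URI ('doi:10.1000/182'),
    to an http URI that a browser can dereference."""
    if doi.lower().startswith("doi:"):
        doi = doi[4:]                   # strip the 'doi:' scheme
    return "http://dx.doi.org/" + doi

print(doi_to_http("doi:10.1000/182"))   # http://dx.doi.org/10.1000/182
print(doi_to_http("10.1000/182"))       # http://dx.doi.org/10.1000/182
```

Of course, the fact that such a normalisation step is needed at all is exactly the problem: an http URI form, used consistently from the start, would have made it unnecessary.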

December 08, 2009

UK government’s public data principles

The UK government has put down some pretty firm markers for open data in its recent document, Putting the Frontline First: smarter government. The section entitled Radically opening up data and promoting transparency sets out the agenda as follows:

  1. Public data will be published in reusable, machine-readable form
  2. Public data will be available and easy to find through a single easy to use online access point (https://www.data.gov.uk/)
  3. Public data will be published using open standards and following the recommendations of the World Wide Web Consortium
  4. Any 'raw' dataset will be represented in linked data form
  5. More public data will be released under an open licence which enables free reuse, including commercial reuse
  6. Data underlying the Government's own websites will be published in reusable form for others to use
  7. Personal, classified, commercially sensitive and third-party data will continue to be protected.

(Bullet point numbers added by me.)

I'm assuming that "linked data" in point 4 actually means "Linked Data", given reference to W3C recommendations in point 3.

There's also a slight tension between points 4 and 5, if only because the use of the phrase, "more public data will be released under an open licence", in point 5 implies that some of the linked data made available as a result of point 4 will be released under a closed licence.  One can argue about whether that breaks the 'rules' of Linked Data but it seems to me that it certainly runs counter to the spirit of both Linked Data and what the government says it is trying to do here?

That's a pretty minor point though and, overall, this is a welcome set of principles.

Linked Data, of course, implies URIs and good practice suggests Cool URIs as the basic underlying principle of everything that will be built here.  This applies to all government content on the Web, not just to the data being exposed thru this particular initiative.  One of the most common forms of uncool URI to be found on the Web in government circles is the technology-specific .aspx suffix... hey, I work for an organisation that has historically provided the technology to mint a great deal of these (though I think we do a better job now).  It's worth noting, for example, that the two URIs that I use above to cite the Putting the Frontline First document both end in .aspx - ironic huh?

I'm not suggesting that cool URIs are easy, but there are some easy wins and getting the message across about not embedding technology into URIs is one of the easier ones... or so it seems to me anyway.
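Checking for this particular problem is trivial to automate. A throwaway sketch (the suffix list is mine, not any official guidance):

```python
# Common technology-specific URI suffixes; an illustrative list only.
TECH_SUFFIXES = (".aspx", ".php", ".jsp", ".cgi", ".do")

def is_technology_specific(uri):
    """Flag URIs that leak the server-side technology in their
    path - one common way of being 'uncool'."""
    path = uri.split("?", 1)[0]          # ignore any query string
    return path.lower().endswith(TECH_SUFFIXES)

print(is_technology_specific("http://example.gov.uk/report.aspx"))  # True
print(is_technology_specific("http://example.gov.uk/report"))       # False
```

A crawl of government URIs through something like this would give a quick measure of how big the easy win actually is.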

December 03, 2009

On being niche

I spoke briefly yesterday at a pre-IDCC workshop organised by REPRISE.  I'd been asked to talk about Open, social and linked information environments, which resulted in a re-hash of the talk I gave in Trento a while back.

My talk didn't go too well to be honest, partly because I was on last and we were over-running so I felt a little rushed but more because I'd cut the previous set of slides down from 119 to 6 (4 really!) - don't bother looking at the slides, they are just images - which meant that I struggled to deliver a very coherent message.  I looked at the most significant environmental changes that have occurred since we first started thinking about the JISC IE almost 10 years ago.  The resulting points were largely the same as those I have made previously (listen to the Trento presentation) but with a slightly preservation-related angle:

  • the rise of social networks and the read/write Web, and a growth in resident-like behaviour, means that 'digital identity' and the identification of people have become more obviously important and will remain an important component of provenance information for preservation purposes into the future;
  • Linked Data (and the URI-based resource-oriented approach that goes with it) is conspicuous by its absence in much of our current digital library thinking;
  • scholarly communication is increasingly diffusing across formal and informal services both inside and outside our institutional boundaries (think blogging, Twitter or Google Wave for example) and this has significant implications for preservation strategies.

That's what I thought I was arguing anyway!

I also touched on issues around the growth of the 'open access' agenda, though looking at it now I'm not sure why because that feels like a somewhat orthogonal issue.

Anyway... the middle bullet has to do with being mainstream vs. being niche.  (The previous speaker, who gave an interesting talk about MyExperiment and its use of Linked Data, made a similar point).  I'm not sure one can really describe Linked Data as being mainstream yet, but one of the things I like about the Web Architecture and REST in particular is that they describe architectural approaches that have proven to be hugely successful, i.e. they describe the Web.  Linked Data, it seems to me, builds on these in very helpful ways.  I said that digital library developments often prove to be too niche - that they don't have mainstream impact.  Another way of putting that is that digital library activities don't spend enough time looking at what is going on in the wider environment.  In other contexts, I've argued that "the only good long-term identifier is a good short-term identifier" and I wonder if that principle can and should be applied more widely.  If you are doing things on a Web-scale, then the whole Web has an interest in solving any problems - be that around preservation or anything else.  If you invent a technical solution that only touches on scholarly communication (for example), who is going to care about it in 50 or 100 years? Answer: not all that many people.

It worries me, for example, when I see an architectural diagram (as was shown yesterday) which has channels labelled 'OAI-PMH', 'XML' and 'the Web'!

After my talk, Chris Rusbridge asked me if we should just get rid of the JISC IE architecture diagram.  I responded that I am happy to do so (though I quipped that I'd like there to be an archival copy somewhere).  But on the train home I couldn't help but wonder if that misses the point.  The diagram is neither here nor there, it's the "service-oriented, we can build it all", mentality that it encapsulates that is the real problem.

Let's throw that out along with the diagram.

October 14, 2009

Open, social and linked - what do current Web trends tell us about the future of digital libraries?

About a month ago I travelled to Trento in Italy to speak at a Workshop on Advanced Technologies for Digital Libraries organised by the EU-funded CACOA project.

My talk was entitled "Open, social and linked - what do current Web trends tell us about the future of digital libraries?" and I've been holding off blogging about it or sharing my slides because I was hoping to create a slidecast of them. Well... I finally got round to it and here is the result:

Like any 'live' talk, there are bits where I don't get my point across quite as I would have liked but I've left things exactly as they came out when I recorded it. I particularly like my use of "these are all very bog standard... err... standards"! :-)

Towards the end, I refer to David White's 'visitors vs. residents' stuff, about which I note he has just published a video. Nice one.

Anyway... the talk captures a number of threads that I've been thinking and speaking about for the last while. I hope it is of interest.

October 05, 2009

QNames, URIs and datatypes

I posted a version of this on the dc-date Jiscmail list recently, but I thought it was worth a quick posting here too, as I think it's a slightly confusing area and the difference between the naming systems involved is something which I think has tripped up some DC implementers in the past.

The Library of Congress has developed a (very useful looking) XML schema datatype for dates, called the Extended Date Time Format (EDTF). It appears to address several of the issues which were discussed some time ago by the DCMI Date working group. It looks like the primary area of interest for the LoC is the use of the data type in XML and XML Schema, and the documentation for the new datatype illustrates its usage in this context.

However, as far as I can see, it doesn't provide a URI for the datatype, which is a prerequisite for referencing it in RDF. I think sometimes we assume that providing what the XML Namespaces spec calls an expanded name is sufficient, and/or that doing that automatically also provides a URI, but I'm not sure that is the case.

In XML Schema, a datatype is typically referred to using an XML Qualified Name (QName), which is essentially a "shorthand" for an "expanded name". An expanded name has two parts: a (possibly null-valued) "namespace name" and a "local name".

Take the case of the built-in datatypes defined as part of the XML Schema specification. In XML/XML Schema, these datatypes are referenced using these two part "expanded names", typically represented in XML documents as QNames e.g.

<count xsi:type="xsd:int">42</count>

where the prefix xsd is bound to the XML namespace name "http://www.w3.org/2001/XMLSchema". So the two-part expanded name of the XML Schema integer datatype is

( http://www.w3.org/2001/XMLSchema , int )

Note: no trailing "#" on that namespace name.

The key point here is that the expanded name system is a different naming system from the URI system. True, the namespace name component of the expanded name is a URI (if it isn't null), but the expanded name as a whole is not. And furthermore, there is no generalised mapping from expanded name to URI.

To refer to a datatype in RDF, I need a URI, and for the built-in datatypes provided, the XML Schema Datatyping spec tells me explicitly how to construct such a URI:

Each built-in datatype in this specification (both *primitive* and *derived*) can be uniquely addressed via a URI Reference constructed as follows:

  1. the base URI is the URI of the XML Schema namespace
  2. the fragment identifier is the name of the datatype

For example, to address the int datatype, the URI is:

* http://www.w3.org/2001/XMLSchema#int

The spec tells me that for the names of this set of datatypes, to generate a URI from the expanded name, I use the local part as the fragment id - but some other mechanism might have been applied.

And this is different from, say, the way RDF/XML maps the QNames of some XML elements to URIs, where the mechanism is simple concatenation of "namespace name" and "local name"

(And, as an aside, I think this makes for a bit of a "gotcha" for XML folk coming to RDF syntaxes like Turtle, where datatype URIs can be represented as prefixed names, because the "namespace URI" you use in Turtle, where the simple concatenation mechanism is used, needs to include the trailing "#" (http://www.w3.org/2001/XMLSchema#), whereas in XML/XML Schema people are accustomed to using the "hash-less" XML namespace name (http://www.w3.org/2001/XMLSchema).)
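The two conventions can be set side by side in code. A small sketch (the function names are mine, purely to illustrate the difference):

```python
def xsd_builtin_uri(expanded_name):
    """XML Schema's rule for its built-in datatypes:
    namespace name + '#' + local name."""
    namespace, local = expanded_name
    return namespace + "#" + local

def rdfxml_element_uri(expanded_name):
    """RDF/XML's rule for element QNames: plain concatenation,
    so the namespace name itself must carry any trailing '#'."""
    namespace, local = expanded_name
    return namespace + local

XSD = "http://www.w3.org/2001/XMLSchema"
print(xsd_builtin_uri((XSD, "int")))
print(rdfxml_element_uri((XSD + "#", "int")))
# Both yield http://www.w3.org/2001/XMLSchema#int - but only because
# the Turtle-style namespace includes the '#' that the XML Schema
# namespace name lacks.
```

The fact that two different rules happen to produce the same URI here is exactly what makes this a gotcha: neither rule is a general expanded-name-to-URI mapping.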

But my understanding is that the mapping defined by the XML Schema datatyping document is specific to the datatypes defined by that document ("Each built-in datatype in this specification... can be uniquely addressed..." (my emphasis)), and the spec is silent on URIs for other user-defined XML Schema datatypes.

So it seems to me that if a community defines a datatype, and they want to make it available for use in both XML/XML Schema and in RDF, they need to provide both an expanded name and a URI for the datatype.

The LoC documentation for the datatype has plenty of XML examples so I can deduce what the expanded name is:

(info:lc/xmlns/edtf-v1 , edt)

But that doesn't tell me what URI I should use to refer to the datatype.

I could make a guess that I should follow the same mapping convention as that provided for the XML Schema built-in datatypes, and decide that the URI to use is


But given that there is no global expanded name-URI mapping, and the XML Schema specs provide a mapping only for the names of the built-in types, I think I'd be on shaky ground if I did that, and really the owners of the datatype need to tell me explicitly what URI to use - as the XML Schema spec does for those built-in types.

Douglas Campbell responded to my message pointing to a W3C TAG finding which provides this advice:

If it would be valuable to directly address specific terms in a namespace, namespace owners SHOULD provide identifiers for them.

Ray Denenberg has responded with a number of suggestions: I'm less concerned about the exact form of the URI than that a URI is explicitly provided.

P.S. I didn't comment on the list on the choice of URI scheme, but of course I agree with Andy's suggestion that value would be gained from the use of the http URI scheme... :-)

July 29, 2009

Enhanced PURL server available

A while ago, I posted about plans to revamp the PURL server software to (amongst other things) introduce support for a range of HTTP response codes. This enables the use of PURLs to identify a wider range of resources than "Web documents", using the interaction patterns specified by current W3C guidelines - the TAG's httpRange-14 resolution and the Cool URIs for the Semantic Web document.

Lorcan Dempsey posted on Twitter yesterday to indicate that OCLC have deployed the new software, developed by Zepheira, on the OCLC purl.org service. I've only had time for a quick poke around, and I need to read the documentation more carefully to understand all the new features, but it looks like it does the job quite nicely.

This should mean that existing PURL owners who have used PURLs to identify things other than "Web documents" - like DCMI, who use PURLs like http://purl.org/dc/terms/title to identify their "metadata terms", "conceptual" resources - should be able to adjust the appropriate entries on the purl.org server so that interactions follow those guidelines. It also offers a new option for those who wish to set up such redirects but perhaps don't have suitable levels of access to configure their own HTTP server to perform those redirects.
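For PURLs identifying non-document resources, the pattern from the httpRange-14 resolution is a 303 See Other redirect to a document about the thing. A minimal sketch of that behaviour (the table of PURLs and target URL are illustrative, not the purl.org server's actual data or code):

```python
# Maps 'concept' PURL paths to documents describing them;
# illustrative entries only.
SEE_OTHER = {
    "/dc/terms/title": "http://dublincore.org/documents/dcmi-terms/",
}

def respond(path):
    """Return the (status, location) pair a PURL server might send:
    303 See Other for a known concept PURL, 404 otherwise."""
    if path in SEE_OTHER:
        return (303, SEE_OTHER[path])
    return (404, None)

print(respond("/dc/terms/title"))   # (303, 'http://dublincore.org/documents/dcmi-terms/')
```

The 303 tells a client that the URI identifies something that isn't itself a retrievable document, while still pointing it at useful information - which is exactly the distinction the old PURL software couldn't express.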

July 20, 2009

On names

There was a brief exchange of messages on the jisc-repositories mailing list a couple of weeks ago concerning the naming of authors in institutional repositories.  When I say naming, I really mean identifying, because a name, as in a string of characters, doesn't guarantee any kind of uniqueness - even locally, let alone globally.

The thread started from a question about how to deal with the situation where one author writes under multiple names (is that a common scenario in academic writing?) but moved on to a more general discussion about how one might assign identifiers to people.

I quite liked Les Carr's suggestion:

Surely the appropriate way to go forward is for repositories to start by locally choosing a scheme for identifying individuals (I suggest coining a URI that is grounded in some aspect of the institution's processes). If we can export consistently referenced individuals, then global services can worry about "equivalence mechanisms" to collect together all the various forms of reference that.

This is the approach taken by the Resist Knowledgebase, which is the foundation for the (just started) dotAC JISC Rapid Innovation project.

(Note: I'm assuming that when Les wrote 'URI' he really meant 'http URI').

Two other pieces of current work seem relevant and were mentioned in the discussion. Firstly, the JISC-funded Names project, which is working on a pilot Names Authority Service. Secondly, the RLG Networking Names report.  I might be misunderstanding the nature of these bits of work but both seem to me to be advocating rather centralised, registry-like, approaches. For example, both talk about centrally assigning identifiers to people.

As an aside, I'm constantly amazed by how many digital library initiatives end up looking and feeling like registries. It seems to be the DL way... metadata registries, metadata schema registries, service registries, collection registries. You name it and someone in a digital library will have built a registry for it.

My favoured view is that the Web is the registry. Assign identifiers at source, then aggregate appropriately if you need to work across stuff (as Les suggests above).  The <sameAs> service is a nice example of this:

The Web of Data has many equivalent URIs. This service helps you to find co-references between different data sets.

As Hugh Glaser says in a discussion about the service:

Our strong view is that the solution to the problem of having all these URIs is not to generate another one. And I would say that with services of this type around, there is no reason.

In thinking about some of the issues here I had cause to go back and re-read a really interesting interview by Martin Fenner with Geoffrey Bilder of CrossRef (from earlier this year).  Regular readers will know that I'm not the world's biggest fan of the DOI (on which CrossRef is based), partly for technical reasons and partly on governance grounds, but let's set that aside for the moment.  In describing CrossRef's "Contributor ID" project, Geoff makes the point that:

... “distributed” begets “centralized”. For every distributed service created, we’ve then had to create a centralized service to make it useable again (ICANN, Google, Pirate Bay, CrossRef, DOAJ, ticTocs, WorldCat, etc.). This gets us back to square one and makes me think the real issue is - how do you make the centralized system that eventually emerges accountable?

I think this is a fair point but I also think there is a very significant architectural difference between a centralised service that aggregates identifiers and other information from a distributed base of services, in order to provide some useful centralised function for example, vs. a centralised service that assigns identifiers which it then pushes out into the wider landscape. It seems to me that only the former makes sense in the context of the Web.

June 24, 2009

The lifecycle of a URI

Via a post to the W3C Linked Open Data mailing list recently, I came across a short document by David Booth on The URI Lifecycle in Semantic Web Architecture.

It's particularly interesting, I think, because it highlights that different agents have varying relationships to, or play various roles with respect to, a URI, and those different roles bring with them varying responsibilities for maintaining the stability of the URI as an identifier. And it is the collective action of these different parties that serves to preserve that stability (or not).

The foundations of the principles articulated here are, of course, those presented in the W3C's Architecture of the World Wide Web. But David's document amplifies these base guidelines by emphasising both the social and the temporal dimensions of URI "minting", use (both by authors/writers of data and consumers/readers of data), and, potentially, obsoleting.

As the introduction emphasises, the lifecycle of a URI is quite distinct from that of the resource identified by that URI:

a URI that denotes the Greek philosopher Plato may be minted long after Plato has died. Similarly, one could mint a URI to denote one's first great-great-grandson even though such a child has not been conceived yet.

For David, a key part of the "minting" stage is the publication of what he calls a URI declaration to be accessible via the "follow-your-nose" mechanisms of the Web and the Cool URIs conventions. It is this which forms the basis of a "social expectation that the URI will be used in a way that is consistent with its declaration"; it "anchors the URI's meaning". (More precisely, the documents refer to the "core assertions" of such a declaration.)

An author using/citing that URI in their data is then responsible for using that URI in ways which are consistent with the owner's URI declaration. And a consumer of that data should make an interpretation of the URI consistent with the declaration. However, there is a temporal aspect to these actions: a URI declaration may be changed in the period between the point an author cites a URI and the point at which a consumer reads/processes that data, and in that case the author's implied commitment is to the declaration at the time their statement was written, and the consumer may also choose to select that specific version of the declaration over the most recent one.

In the discussion of the document on the public-lod mailing list, there's some analysis of what happens when actors do not meet such expectations or discharge these responsibilities, and indeed to what extent the existence of these expectations and responsibilities leaves room for flexibility. Dan Brickley describes the case of the "semantic drift" of a FOAF property called foaf:schoolHomepage, where the URI owners' initial intent was that this referred to the home pages of the institutions UK residents know as "schools" (i.e. the institutions you typically attend up to the age of 16 or 18), but which authors from outside the UK interpreted as having a broader scope (one in which the notion of "schools" included, e.g., universities) and deployed in that way in their data. As a consequence, the "URI declaration" was updated to reflect the community's interpretation.

There is a tension here between "nailing down" meaning and allowing for the sort of "evolution" that takes place in human languages, and the scope for accommodating this sort of variability was, I think, perhaps the main point of contention in the discussion. In conclusion, David emphasises:

The point of a URI declaration is not to forbid semantic variability, it is to permit the bounds of that variability to be easily prescribed and determined.

All in all, it's a clear, thoughtful document which addresses several complexities in our management of URI stability over time in a social Web.

June 19, 2009

Repositories and linked data

Last week there was a message from Steve Hitchcock on the UK jisc-repositories@jiscmail.ac.uk mailing list noting Tim Berners-Lee's comments that "giving people access to the data 'will be paradise'". In response, I made the following suggestion:

If you are going to mention TBL on this list then I guess that you really have to think about how well repositories play in a Web of linked data?

My thoughts... not very well currently!

Linked data has 4 principles:

  • Use URIs as names for things
  • Use HTTP URIs so that people can look up those names.
  • When someone looks up a URI, provide useful information.
  • Include links to other URIs, so that they can discover more things.

Of these, repositories probably do OK at 1 and 2 (though, as I’ve argued before, one might question the coolness of some of the http URIs in use and, I think, the use of cool URIs is implicit in 2).

3, at least according to TBL, really means “provide RDF” (or RDFa embedded into HTML I guess), something that I presume very few repositories do?

Given lack of 3, I guess that 4 is hard to achieve. Even if one was to ignore the lack of RDF or RDFa, the fact that content is typically served as PDF or MS formats probably means that links to other things are reasonably well hidden?

It’d be interesting (academically at least), and probably non-trivial, to think about what a linked data repository would look like? OAI-ORE is a helpful step in the right direction in this regard.

In response, various people noted that there is work in this area: Mark Diggory on work at DSpace, Sally Rumsey (off-list) on the Oxford University Research Archive and parallel data repository (DataBank), and Les Carr on the new JISC dotAC Rapid Innovation project. And I'm sure there is other stuff as well.

In his response, Mark Diggory said:

So the question of "coolness" of URI tends to come in second to ease of implementation and separation of services (concerns) in a repository. Should "Coolness" really be that important? We are trying to work on this issue in DSpace 2.0 as well.

I don't get the comment about "separation of services". Coolness of URIs is about persistence. It's about our long term ability to retain the knowledge that a particular URI identifies a particular thing and to interact with the URI in order to obtain a representation of it. How coolness is implemented is not important, except insofar as it doesn't impact on our long term ability to meet those two aims.

Les Carr also noted the issues around a repository minting URIs "for things it has no authority over (e.g. people's identities) or no knowledge about (e.g. external authors' identities)" suggesting that the "approach of dotAC is to make the repository provide URIs for everything that we consider significant and to allow an external service to worry about mapping our URIs to "official" URIs from various "authorities"". An interesting area.

As I noted above, I think that the work on OAI-ORE is an important step in helping to bring repositories into the world of linked data. That said, there was some interesting discussion on Twitter during the recent OAI6 conference about the value of ORE's aggregation model, given that distinct domains will need to layer their own (different) domain models onto those aggregations in order to do anything useful. My personal take on this is that it probably is useful to have abstracted out the aggregation model, but the hypothesis that a primitive aggregation is useful even though every domain needs its own richer model remains to be tested and, indeed, we need to see whether the way the ORE model gets applied in the field turns out to be sensible and useful.
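Coming back to principle 3 in the list above, "provide useful information" in practice usually means content negotiation: serve RDF to machines and HTML to people from the same URI. A toy sketch of that decision (not any real repository's code, and deliberately cruder than a full Accept-header parser):

```python
def pick_representation(accept_header):
    """Crude content negotiation: prefer RDF/XML if the client
    explicitly asks for it, otherwise fall back to HTML.
    A real implementation would honour q-values properly."""
    if "application/rdf+xml" in accept_header:
        return "application/rdf+xml"
    return "text/html"

# A Linked Data client and a browser get different representations
# of the same repository item URI.
print(pick_representation("application/rdf+xml,text/html;q=0.9"))
print(pick_representation("text/html,application/xhtml+xml"))
```

Even this crude version illustrates why serving only PDF or MS formats falls short: there is no machine-readable representation for a Linked Data client to negotiate for at all.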

May 15, 2009

URIs and protocol dependence

In responding to my recent post, The Nature of OAI, identifiers and linked data, Herbert Van de Sompel says:

There is a use case for both variants: the HTTP variant is the one to use in a Web context (for obvious reasons; so totally agreed with both Andy and PeteJ on this), and the doi: variant is technology independent (and publishers want that, I think, if only to print it on paper).

(My emphasis added).

I'm afraid to say that I could not disagree more with Herbert on this. There is no technological dependence on HTTP by https URIs. [Editorial note: in the first comment below Herbert complains that I have mis-represented him here and I am happy to concede that I have and apologise for it.  By "technology independent" Herbert meant "independent of URIs", not "independent of HTTP".  I stand by my more general assertion in the remainder of this post that a mis-understanding around the relationship between https URIs and HTTP (the protocol) led the digital library community into a situation where it felt the need to invent alternative approaches to the identification of digital content and, further, that those alternative approaches are both harmful to the Web and harmful to digital libraries.  I think those mis-understandings are still widely held in the digital library community and I disagree with those people who continue to promote relatively new non-https forms of URIs for 'scholarly' content (by which I primarily mean info URIs and doi URIs) as though their use was good practice.  On that basis, this blog post may represent a disagreement between Herbert and me but it may not.  See the comment thread for further discussion.  Note also that my reference to Stu Weibel below is intended to indicate only what he said at the time, not his current views (about which I know nothing).]  As I said to Marco Streefkerk in the same thread:

there is nothing in the 'https' prefix of the https URI that says, "this must be dereferenced using HTTP". In that sense, there is no single 'service' permanently associated with the https URI - rather, there happens to be a current, and very helpful, default de-referencing mechanism available.

At the point that HTTP dies, which it surely will at some point, people will build alternative de-referencing mechanisms (which might be based on Handle, or LDAP, or whatever replaces HTTP). The reason they will have to build something else is that the weight of deployed https URIs will demand it.

That's the reasoning behind my, "the only good long term identifier, is a good short term identifier" mantra.

The mis-understanding that there is a dependence between the https URI and HTTP (the protocol) is endemic in the digital library community and has been the cause of who knows how many wasted person-years, inventing and deploying unnecessary, indeed unhelpful, URI schemes like 'oai', 'doi' and 'info'. Not only does this waste time and effort but it distances the digital library community from the mainstream Web - something that we cannot afford to happen.

As the Architecture of the World Wide Web, Volume One (section 3.1) says:

Many URI schemes define a default interaction protocol for attempting access to the identified resource. That interaction protocol is often the basis for allocating identifiers within that scheme, just as "https" URIs are defined in terms of TCP-based HTTP servers. However, this does not imply that all interaction with such resources is limited to the default interaction protocol.

This has been re-iterated numerous times in discussion, not least by Roy Fielding:

"Really, it has far more to do with a basic misunderstanding of web architecture, namely that you have to use HTTP to get a representation of an "https" named resource. I don't think there is any simple solution to that misbelief aside from just explaining it over and over again."

"However, once named, HTTP is no more inherent in "https" name resolution than access to the U.S. Treasury is to SSN name resolution."

"The "https" resolution mechanism is not, and never has been, dependent on DNS. The authority component can contain anything properly encoded within the defined URI syntax. It is only when an https identifier is dereferenced on a local host that a decision needs to be made regarding which global name resolution protocol should be used to find the IP address of the corresponding authority. It is a configuration choice."

The (draft) URI Schemes and Web Protocols Tag Finding makes similar statements (e.g. section 4.1):

A server MAY offer representations or operations on a resource using any protocol, regardless of URI scheme. For example, a server might choose to respond to HTTP GET requests for an ftp resource. Of course, this is possible only for protocols that allow references to URIs in the given scheme.

I know that some people don't like this interpretation of https URIs, claiming that W3C (and presumably others) have changed their thinking over time. I remember Stu Weibel starting a presentation about URIs at a DCC meeting in Glasgow with the phrase:

Can you spell revisionist?

I disagree with this view. The world evolves. Our thinking evolves. This is a good thing isn't it? It's only not a good thing if we refuse to acknowledge the new because we are too wedded to the old.

May 08, 2009

The Nature of OAI, identifiers and linked data

In a post on Nascent, Nature's blog on web technology and science, Tony Hammond writes that Nature now offer an OAI-PMH interface to articles from over 150 titles dating back to 1869.

Good stuff.

Records are available in two flavours - simple Dublin Core (as mandated by the protocol) and Prism Aggregator Message (PAM), a format that Nature also use to enhance their RSS feeds.  (Thanks to Scott Wilson and TicTocs for the Jopml listing).
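For anyone wanting to try the interface, an OAI-PMH harvest is just a sequence of HTTP GET requests.  The sketch below builds a ListRecords request URL; the base URL shown is purely illustrative - check Nature's announcement for their real endpoint.

```python
from urllib.parse import urlencode

def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build an OAI-PMH ListRecords request URL."""
    if resumption_token:
        # On follow-up requests, the resumption token stands in for all
        # other arguments, as the OAI-PMH protocol requires.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)

# Illustrative only - not Nature's actual endpoint.
print(list_records_url("http://example.org/oai", "pam"))
```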

Taking a quick look at their simple DC records (example) and their PAM records (example) I can't help but think that they've made a mistake in placing a doi: URI rather than an https: URI in the dc:identifier field.

Why does this matter?

Imagine you are a common-or-garden OAI aggregator.  You visit the Nature OAI-PMH interface and you request some records.  You don't understand the PAM format so you ask for simple DC.  So far, so good.  You harvest the requested records.  Wanting to present a clickable link to your end-users, you look to the dc:identifier field only to find a doi: URI:


If you understand the doi: URI scheme you are fine because you'll know how to convert it to something useful:


But if not, you are scuppered!  You'll just have to present the doi: URI to the end-user and let them work it out for themselves :-(

Much better for Nature to put the https: URI form in dc:identifier.  That way, any software that doesn't understand DOIs can simply present the https: URI as a clickable link (just like any other URL).  Any software that does understand DOIs, and that desperately wants to work with the doi: URI form, can do the conversion for itself trivially.
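To make the point concrete, here's a minimal sketch (in Python) of the normalisation step an aggregator that does understand DOIs would need.  The doi.org resolver prefix is the standard one; the example DOI is made up.

```python
def clickable_link(identifier):
    """Map a dc:identifier value to something a browser can dereference.

    Handles the doi: and info:doi/ forms; an https: (or http:) URI is
    already usable, so it passes through unchanged.
    """
    if identifier.startswith("doi:"):
        return "https://doi.org/" + identifier[len("doi:"):]
    if identifier.startswith("info:doi/"):
        return "https://doi.org/" + identifier[len("info:doi/"):]
    return identifier

# An illustrative DOI, not a real Nature record.
print(clickable_link("doi:10.1000/example.123"))  # → https://doi.org/10.1000/example.123
```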

Of course, Nature could simply repeat the dc:identifier field and offer both the https: URI form and the doi: URI form side-by-side.  Unfortunately, this would run counter to the W3C recommendation not to mint multiple URIs for the same resource (section 2.3.1 of the Architecture of the World Wide Web):

A URI owner SHOULD NOT associate arbitrarily different URIs with the same resource.

On balance I see no value (indeed, I see some harm) in surfacing the non-HTTP forms of DOI:




both of which appear in the PAM record (somewhat redundantly?).

The https: URI form


is sufficient.  There is no technical reason why it should be perceived as a second-class form of the identifier (e.g. on persistence grounds).

I'm not suggesting that Nature gives up its use of DOIs - far from it.  Just that they present a single, useful and usable variant of each DOI, i.e. the https: URI form, whenever they surface them on the Web, rather than provide a mix of the three different forms currently in use.

This would be very much in line with recommended good practice for linked data:

  • Use URIs as names for things
  • Use HTTP URIs so that people can look up those names.
  • When someone looks up a URI, provide useful information.
  • Include links to other URIs, so that they can discover more things.

March 20, 2009

Unlocking Audio

I spent the first couple of days this week at the British Library in London, attending the Unlocking Audio 2 conference.  I was there primarily to give an invited talk on the second day.

You might notice that I didn't have a great deal to say about audio, other than to note that what strikes me as interesting about the newer ways in which I listen to music online (specifically Blip.fm and Spotify) is that they are both highly social (almost playful) in their approach and that they are very much of the Web (as opposed to just being 'on' the Web).

What do I mean by that last phrase?  Essentially, it's about an attitude.  It's about seeing being mashed as a virtue.  It's about an expectation that your content, URLs and APIs will be picked up by other people and re-used in ways you could never have foreseen.  Or, as Charles Leadbeater put it on the first day of the conference, it's about "being an ingredient".

I went on to talk about the JISC Information Environment (which is surprisingly(?) not that far off its 10th birthday if you count from the initiation of the DNER), using it as an example of digital library thinking more generally and suggesting where I think we have parted company with the mainstream Web (in a generally "not good" way).  I noted that while digital library folks can discuss identifiers forever (if you let them!) we generally don't think a great deal about identity.  And even where we do think about it, the approach is primarily one of, "who are you and what are you allowed to access?", whereas on the social Web identity is at least as much about, "this is me, this is who I know, and this is what I have contributed". 

I think that is a very significant difference - it's a fundamentally different world-view - and it underpins one critical aspect of the difference between, say, Shibboleth and OpenID.  In digital libraries we haven't tended to focus on the social activity that needs to grow around our content and (as I've said in the past) our institutional approach to repositories is a classic example of how this causes 'social networking' issues with our solutions.

I stole a lot of the ideas for this talk, not least Lorcan Dempsey's use of concentration and diffusion.  As an aside... on the first day of the conference, Charles Leadbeater introduced a beach analogy for the 'media' industries, suggesting that in the past the beach was full of a small number of large boulders and that everything had to happen through those.  What the social Web has done is to make the beach into a place where we can all throw our pebbles.  I quite like this analogy.  My one concern is that many of us do our pebble throwing in the context of large, highly concentrated services like Flickr, YouTube, Google and so on.  There are still boulders - just different ones?  Anyway... I ended with Dave White's notions of visitors vs. residents, suggesting that in the cultural heritage sector we have traditionally focused on building services for visitors but that we need to focus more on residents from now on.  I admit that I don't quite know what this means in practice... but it certainly feels to me like the right direction of travel.

I concluded by offering my thoughts on how I would approach something like the JISC IE if I was asked to do so again now.  My gut feeling is that I would try to stay much more mainstream and focus firmly on the basics, by which I mean adopting the principles of linked data (about which there is now a TED talk by Tim Berners-Lee), cool URIs and REST and focusing much more firmly on the social aspects of the environment (OpenID, OAuth, and so on).

Prior to giving my talk I attended a session about iTunesU and how it is being implemented at the University of Oxford.  I confess a strong dislike of iTunes (and iTunesU by implication) and it worries me that so many UK universities are seeing it as an appropriate way forward.  Yes, it has a lot of concentration (and the benefits that come from that) but its diffusion capabilities are very limited (i.e. it's a very closed system), resulting in the need to build parallel Web interfaces to the same content.  That feels very messy to me.  That said, it was an interesting session with more potential for debate than time allowed.  If nothing else, the adoption of systems about which people can get religious serves to get people talking/arguing.

Overall then, I thought it was an interesting conference.  I suspect that my contribution wasn't liked by everyone there - but I hope it added usefully to the debate.  My live-blogging notes from the two days are here and here.

July 18, 2008

Does metadata matter?

This is a 30 minute slidecast (using 130 slides), based on a seminar I gave to Eduserv staff yesterday lunchtime.  It tries to cover a broad sweep of history from library cataloguing, thru the Dublin Core, Web search engines, IEEE LOM, the Semantic Web, arXiv, institutional repositories and more.

It's not comprehensive - so it will probably be easy to pick holes in if you so choose - but how could it be in 30 minutes?!

The focus is ultimately on why Eduserv should be interested in 'metadata' (and surrounding areas), to a certain extent trying to justify why the Foundation continues to have a significant interest in this area.  To be honest, it's probably weakest in its conclusions about whether, or why, Eduserv should retain that interest in the context of the charitable services that we might offer to the higher education community.

Nonetheless, I hope it is of interest (and value) to people.  I'd be interested to know what you think.

As an aside, I found that the Slideshare slidecast editing facility was mostly pretty good (this is the first time I've used it), but that it seemed to struggle a little with the very large number of slides and the quickness of some of the transitions.

March 04, 2008

LCCN Permalinks and the info URI scheme

Another post that has been on the back burner for a few days.... via catalogablog, I noticed recently that the Library of Congress announced the availability of what it calls LCCN Permalinks, a set of URIs using the https URI scheme which act as globally scoped identifiers for bibliographic records in the Library of Congress Online Catalog and for which the LoC makes a commitment of persistence.

I tend to think of two aspects of persistence, following the distinction that the Web Architecture makes  between identification and interaction. From reading the FAQ, I think that persistence in the LCCN Permalink case covers both of these aspects. So the LoC commits to the persistence of the identifiers as names by, for example, keeping ownership of the domain name and managing the (human, organisational) processes for assigning URIs within that space so that once assigned a single URI will continue to identify the same record (i.e. they observe the WebArch principles of avoiding collisions). And they also commit to serving consistent representations of the resources identified by those URIs (i.e. they observe the WebArch principles of providing representations and doing so consistently and predictably over time).

So for example, the URI https://lccn.loc.gov/2003556443 is a persistent identifier of a metadata record describing an online exhibit called "1492: an ongoing voyage". And in addition, for each URI of this form, a further three URIs are coined to identify that same metadata record presented in different formats: https://lccn.loc.gov/2003556443/marcxml (MARCXML), https://lccn.loc.gov/2003556443/mods (MODS), https://lccn.loc.gov/2003556443/dc (SRW DC XML). So in terms of the Web Architecture, we have four distinct, but related, resources here. And indeed the fact that they are related is reflected in the hypertext links in the HTML document served as a representation of the first resource, along the lines of the TAG finding, On Linking Alternative Representations To Enable Discovery And Publishing. It would be even nicer if that HTML document indicated the nature of the "generic resource"-"specific resource" relationship between those resources. But, really, it would be churlish to complain! :-) We now have a set of URIs which have the (attractive) characteristics that, first, they serve as globally scoped persistent names and, second, they are amenable to lookup using a widely used network protocol which is supported by tools on my desktop and by libraries for every common programming platform. Good stuff.
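The URI pattern is regular enough that a client can construct the format-specific variants directly.  A small sketch (the LCCN shown is the one from the example above; the function name is mine):

```python
def lccn_permalink_variants(lccn):
    """Return the LCCN Permalink plus its three format-specific URIs."""
    base = "https://lccn.loc.gov/" + lccn
    return {
        "record": base,               # human-readable catalog record
        "marcxml": base + "/marcxml", # MARCXML
        "mods": base + "/mods",       # MODS
        "dc": base + "/dc",           # SRW DC XML
    }

print(lccn_permalink_variants("2003556443")["mods"])
```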

However, it is interesting to note that this - or at least the first aspect, the provision of persistent names - was the intent behind the provision of the "lccn" namespace within the info URI scheme. According to the entry for the "lccn" namespace in the info URI registry:

The LCCN namespace consists of identifiers, one corresponding to every assigned LCCN (Library of Congress Control Number). Any LCCN may have various forms which all normalize to a single canonical form; only normalized values are included in the LCCN namespace.

An LCCN is an identifier assigned by the Library of Congress for a metadata record (e.g., bibliographic record, authority record).

Compare (from the first two questions of the LCCN Permalink FAQ)

1. What are LCCN Permalinks?

LCCN Permalinks are persistent URLs for bibliographic records in the Library of Congress Online Catalog. These links are constructed using the record's LCCN (or Library of Congress Control Number), an identifier assigned by the Library of Congress to bibliographic and authority records.

2. How can I use LCCN Permalinks?

LCCN Permalinks offer an easy way to cite and link to bibliographic records in the Library of Congress Online Catalog. You can use an LCCN Permalink anywhere you need to reference an LC bibliographic record in emails, blogs, databases, web pages, digital files, etc.

The issue with URIs in the info: URI scheme, of course, is that while they provide globally scoped, persistent names, the info URI scheme is not mapped to a network protocol to enable the lookup of those names. I understand that for info URIs, "per-namespace methods may exist as declared by the relevant Namespace Authorities", but "[a]pplications wishing to tap into this functionalitiesy (sic) must consult the INFO Registry on a per-namespace basis." (both quotes from the info URI scheme FAQ.)

The creation of LCCN Permalinks seems to endorse Berners-Lee's basic principle, which I mentioned in my post on Linked Data, that it is helpful for the users/consumers of a URI not only to have a globally-scoped name, but also to be able to look up those names - using an almost ubiquitous network protocol - and obtain some useful information. LoC have supplemented the use of a URI scheme that only supported the former with the use of a scheme which facilitates both the former and the latter. And with a recent post by Stu Weibel in mind, I'd just add that (a) the use of an https URI does not constitute an absolute requirement that the owner also serve representations - the https URIs I coin can be used quite effectively as names alone without my ever configuring my HTTP server to provide representations for those URIs (and if the LoC HTTP server disappears, an LCCN Permalink still works as a name); and (b) the serving of representations for https URIs is not - in principle, at least - limited to the use of the HTTP protocol (see "Serve using any protocol" in the draft finding of the W3C TAG, URI Schemes and Web Protocols).

Further, the persistence in LCCN Permalinks is a consequence of LoC's policy commitment to ensuring that persistence (in both aspects I outlined above): it is primarily a socio-economic, organisational consideration, not a technical one, and that applies regardless of the URI scheme chosen.

Indeed, it seems to me the creation of LCCN Permalinks suggests that there wasn't really much of a requirement for the creation of the "lccn" info URI namespace. And the co-existence of these two sets of URIs now means that consumers are faced with managing the use of two parallel sets of global identifiers - two sets provided by the same agency - for a single set of resources (i.e. URI aliases). Certainly, this can be managed, using, e.g. the capability provided by the owl:sameAs property to state that two URIs identify the same resource. But it does seem to me that it adds an avoidable overhead, with - in this case - little (no?) appreciable benefit. (Compare the case that I mentioned, also in the post on Linked Data, of URI aliases provided by different agencies, where the use of two URIs enables the provision of different descriptions of a single resource, and so does bring something additional to the table.)
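In the absence of a single canonical URI, every consuming application ends up carrying some version of the bookkeeping below.  This is a plain-Python sketch of the owl:sameAs overhead (a real application would state it as RDF triples); the two URI forms are the LCCN aliases discussed above.

```python
class SameAsIndex:
    """Minimal owl:sameAs-style alias table for merging data about one resource."""

    def __init__(self):
        self._canonical = {}

    def add_same_as(self, alias, canonical):
        # Record that two URIs identify the same resource.
        self._canonical[alias] = canonical

    def canonicalise(self, uri):
        # Fold any known alias onto the URI we treat as canonical.
        return self._canonical.get(uri, uri)

index = SameAsIndex()
index.add_same_as("info:lccn/2003556443", "https://lccn.loc.gov/2003556443")
print(index.canonicalise("info:lccn/2003556443"))
```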

Given the (commendable) strong commitment to persistence expressed by LoC for LCCN Permalinks, it seems to me that anyone using the URIs in the info URI "lccn" namespace could switch to citing the corresponding LCCN Permalink instead - though if only a proportion of the community makes the change, that still leaves services which work across the Web and which merge data from the two camps having to work with the two aliases.

Interestingly, the use of the https URI scheme in association with a domain which was supported by some organisational commitment is exactly the sort of suggestion made by several observers as a viable alternative to the info URI scheme when it was first being proposed. See for example a message by Patrick Stickler to the W3C URI and RDF Interest Group mailing lists (in October 2003!) which uses the LCCN case as an example.

Anyway, all in all, this is a very positive and exciting development. I look forward to the implementation of similar conventions using the https URI scheme by the owners of other info URI namespaces :-)

January 14, 2008

Blastfeed - a small case-study in API persistence

Blastfeed is a service that I've used over the last 6 months or so to build aggregate channels from a set of RSS feeds.  For example, I've been using it to build a single RSS feed of all my favorite Second Life blogs which I can then embed into the sidebar of ArtsPLace SL.

Blastfeed isn't the only option for doing this - Yahoo Pipes would have been an obvious alternative - but it was quick and easy to use and, up until now, was also free.  Recently I got this by email:

Blastfeed has been running smoothly (almost no glitch besides one last November, sorry again) since its debut a little over a year ago. We hope here at 2or3things that you've enjoyed using Blastfeed.

Hence at this stage we feel it's no longer necessary to keep Blastfeed in a beta mode. We have also decided to focus the service onto corporate applications, while letting the opportunity to web users to subscribe for a fee to unlimited usage.

In line with the above we shall discontinue the free service from February 15th 2008 on.

Should you wish to continue using Blastfeed after that date, please contact us by return email stating your Blastfeed username and email and we'll make a quick and fair proposal. However we'll bind the proposal to the number of potential subscribers.

I'm not complaining.  Blastfeed have never promised to remain free forever and until recently they still badged themselves as a beta service.  But this does serve as a timely reminder (for me at least) to take steps to mitigate this kind of thing happening.

The "API" to the aggregated feed service(s) that I've built using Blastfeed is effectively an HTTP GET request against the feed URI, with RSS XML returned as a result.  With the demise of the free service, I can recreate the aggregated feed somewhere else easily enough - but doing so will change the URI and hence the API that I've built for myself.  I'll have to remember all the places that I've used my API (e.g. in the ArtsPLace SL sidebar) and update them with the new URI.

With hindsight, what I should have done was to make the API more persistent by using a PURL redirect rather than the native Blastfeed URI.  That way, I could have changed the technology that I use to create the feed (e.g. replace Blastfeed by Yahoo Pipes) without changing the API and without having to update anything else.
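The lesson generalises: a PURL is nothing more than a stable name plus a mutable redirect target.  This sketch models that indirection locally (a real PURL server answers an HTTP GET with a 302 to the current target); the path and target URLs here are made up for illustration.

```python
# Hypothetical redirect table: stable PURL path -> current backend URL.
purl_table = {
    "/net/artsplace/feed": "https://www.blastfeed.com/feeds/artsplace.xml",
}

def resolve(purl_path):
    """Return the current target for a PURL path (the 302 Location, in effect)."""
    return purl_table[purl_path]

# Replacing Blastfeed with (say) Yahoo Pipes means updating one table entry;
# everywhere that cites the PURL itself keeps working unchanged.
purl_table["/net/artsplace/feed"] = "https://pipes.yahoo.com/feeds/artsplace.xml"
print(resolve("/net/artsplace/feed"))
```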

Oh well... live and learn!

November 15, 2007

UniProt, URNs, PURLs

Stu Weibel - yes, that Stu Weibel, the notorious Facebook transgressor ;-) - made a post yesterday in which he responds to a comment questioning OCLC's motivation in providing the PURL service. What caught my attention, however, was Stu's mention of the fact that:

Evidence of success of the strategy may be found in the adoption of PURLs for the identification of some billion URI-based assertions about proteins in the UniProt database, an international database of proteins intended to support open research in the biological sciences. In the latest release of UniProt (11.3), all URIs of the form:


have been replaced with URLs of the form:


Some "live" examples:




I gather that this change by UniProt was announced some time ago, so it isn't really news, but it does look to me like a very nice example of an adoption of the approaches advocated in the draft W3C Technical Architecture Group finding URNs, Namespaces and Registries, which

addresses the questions "When should URNs or URIs with novel URI schemes be used to name information resources for the Web?" and "Should registries be provided for such identifiers?". The answers given are "Rarely if ever" and "Probably not". Common arguments in favor of such novel naming schemas are examined, and their properties compared with those of the existing https: URI scheme.

Three case studies are then presented, illustrating how the https: URI scheme can be used to achieve many of the stated requirements for new URI schemes.

Or as Andy paraphrased it a few months ago (actually, a year and a bit now - crikey, time flies): "New URI schemes: just say no" ;-)

August 11, 2007

PURL's single point of failure?

It strikes me that PURLs are now a pretty critical part of the Web infrastructure.  Of course, that statement won't ring true for everyone - I suspect quite a few of you are thinking, "Huh... I've never created a PURL in my life?".  But certainly in the semantic Web arena, PURLs are widely used as identifiers for metadata terms - DCMI started doing this ages ago, and many other metadata initiatives have followed suit.

As Pete and Thom noted a while back, OCLC have now funded an activity to renew the architecture of the PURL system and Pete notes some reasons why this is important.

But the PURL system is not without problems.  Several years ago I tried to highlight the fact that the PURL system represents a single point of failure in the Web infrastructure - in persistence terms, the ongoing provision of the PURL service relies on the goodwill of OCLC.  Not something that I doubt particularly - but not an ideal situation either.

My initial thoughts on solving this problem were around mirroring - to which end I briefly created https://purl.ue.org/ (though I note that it no longer exists).  But I quickly realised that mirroring was a useless solution.  Why?  Because it results in multiple URIs for the same resource, something that the Web Architecture tells us to avoid if at all possible.

A better solution lies in DNS hiding: run multiple instances of the PURL software around the planet but hide them all behind https://purl.org/, using the DNS to share the load between the different servers.  Who would run such a networked set of services?  Like any infrastructural, and largely invisible, service, the business models for running this aren't clear.  But one could imagine, for example, national libraries having an interest in running, or funding, an instance of the PURL software for the benefit of their own, and other, communities.

Of course, one could only hide multiple PURL servers behind a single DNS domain if mechanisms for rapidly replicating the data between systems are put into place.  Perhaps now is a good time to think about adding that functionality into the PURL system?

July 13, 2007

Making PURLs work for the Web of Data

An interesting snippet of news from OCLC:

OCLC Online Computer Library Center, Inc. and Zepheira, LLC announced today that they will work together to rearchitect OCLC's Persistent URL (PURL) service to more effectively support the management of a "Web of data."

(For more on Zepheira, see their own Web site and also the interview with Zepheira President Eric Miller by Talis' Paul Miller in the Talking with Talis series).

While it's good to see an emphasis on improving scalability and flexibility (I hope that will include improvements to the user interface for creating and maintaining PURLs - while the current interface is functional, I'm sure everyone would admit it could be made rather more user-friendly!), the most interesting (to me) aspect of the announcement is:

The new PURL software will also be updated to reflect the current understanding of Web architecture as defined by the World Wide Web Consortium (W3C). This new software will provide the ability to permanently identify networked information resources, such as Web documents, as well as non-networked resources such as people, organizations, concepts and scientific data. This capability will represent an important step forward in the adoption of a machine-processable "Web of data" enabled by the Semantic Web.

This is excellent news. The current functionality of the PURL server tends to leave me with a slight feeling of "so near yet so far" when it comes to implementing some of the recommendations of the W3C - for example, the re-direct behaviour recommended by the W3C TAG's resolution to the "httpRange-14 issue". The capacity to tell the PURL server when my identified resource is an information resource and when it is something else, and have that server Do the Right Thing in terms of its response to dereference requests (which is how I'm interpreting that paragraph above!) will mean that there's one less thing for me to worry about handling, and will generally make it easier for implementers to follow the W3C's guidelines.

Good stuff. I look forward to hearing about developments.

June 21, 2007

W3C TAG considering identification in Virtual Worlds

I just noticed while browsing the mailing list archives of the W3C Technical Architecture Group (which are always a good read, I should add) that one of the items currently under discussion is "Naming and Identification in Virtual Worlds". Actually, there are only a couple of posts on the topic at the moment (the thread starts here) but I imagine more will follow.

This relates to a point about the use of SLURLs which Andy discussed in a couple of posts over on ArtsPlace, and more generally one of my interests is in understanding how virtual worlds like Second Life integrate with the Web - with identification being a key part of that - so I'll be following the TAG discussion with interest.

April 17, 2007

When persistence has a sell-by date

I note that Nicole Harris at JISC has started the JISC Access Management Team Blog... good stuff and a welcome addition to the UK HE blog landscape.  In her posting entitled The Accountability Question she notes that The Rules of the UK Federation (section 6.4.2) state that:

where unique persistent Attributes (e.g. eduPersonTargetedID or eduPersonPrincipalName) are associated with an End User, the End User Organisation must ensure that these Attribute values are not re-issued to another End User for at least 24 months;

I remember reading this guidance during the comment period on the various policy documents that came out at the start of the UK Federation - it struck me then as rather odd.  Any sentence that starts with 'unique persistent' and ends with 'not re-issued ... for at least 24 months' has got to ring alarm bells somewhere hasn't it?

Why 24 months?  That's less than the period for which most students are at university!  The problem, or so it seems to me, is that any service provider that wants to make use of these attributes can't rely on them being persistent even for as long as the student is typically at university.  As a result, service providers will presumably have to find some other way of guaranteeing that the person they are dealing with today is the same as the person they were dealing with yesterday, at least for any unique persistent attribute that is nearing its second birthday :-(

I'm tempted to ask why any time limit is suggested at all.  Why not simply say that these attributes must never be re-used?  Presumably some institutions have problems ensuring that they do not re-use their local usernames and so on.  But so what?!  Generate a truly unique persistent handle for the user in some way (a UUID or something) and associate it with the local username through some kind of look-up table.

That way you can easily guarantee that these identifiers will never be re-used.  Am I missing something obvious here?
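The look-up table approach is simple enough to sketch. (A hypothetical illustration only - the function names and the in-memory dictionary are mine, not any real identity management system's; a real implementation would persist the table.)

```python
import uuid

# Hypothetical look-up table from re-usable local usernames to
# never-re-used persistent identifiers.
_username_to_id = {}

def persistent_id_for(username):
    """Return the persistent identifier for a username, minting a
    fresh UUID the first time the username is seen."""
    if username not in _username_to_id:
        # uuid4 is (statistically) globally unique, so the identifier
        # can safely outlive any re-use of the local username.
        _username_to_id[username] = str(uuid.uuid4())
    return _username_to_id[username]

def retire(username):
    """Retire an account: drop the mapping, so that if the local
    username is later re-issued, its new holder gets a brand new
    identifier rather than inheriting the old one."""
    _username_to_id.pop(username, None)
```

With something like this in place, the 24-month clause becomes unnecessary: the local username can be recycled as often as the institution likes, while the attribute value released to service providers is never re-used.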

March 18, 2007

JISC Conference 2007

I spent Tuesday at the JISC Conference in Birmingham, on balance quite a pleasant day and certainly an excellent opportunity for networking and meeting up with old friends ('old' as in 'long term' you understand!).

I went to an hour-long session about the JISC e-Framework, SOA and Enterprise Architecture in the morning.  I have to say that I was somewhat disappointed by the lack of any mention of Web 2.0.  Err... hello!  I was also disappointed by the 30-minute presentation about the Open Group which, I'm afraid to say, struck me as rather inappropriate.

I presented in the afternoon alongside Ed Zedlewski, as part of the Eduserv special session.  Between us we tried to cover the way in which the access and identity management landscape is changing and how Eduserv is responding to that with the new OpenAthens product.  I've put our slides up on SlideShare for those that are interested.

The general thrust of my bit was that end-user needs will push us down a user-centric identity management road, that the way academic collaborations are going (for both learning and research) means that institutions will need to operate across multiple access management federations, and that the technology in this area is in a constant state of change.  All of this, I argued, means that institutions would do well to consider outsourced access and identity management solutions rather than developing solutions in-house.  There are, of course, also good reasons for doing stuff in-house, so there's a certain amount of horses for courses here - but outsourcing should definitely be on the agenda for discussion.

The day ended with an excellent closing keynote by Tom Loosemore from the BBC.  Tom presented 15 key principles of Web design, a relatively simple concept but one that was ultimately quite powerful. Tom's principles were very much in the spirit of Web 2.0 and just the kinds of things that Brian Kelly and others have been banging on about for ages, but it was nice to hear the same messages coming from outside the community.

The 15 principles were as follows:

  1. focus on the needs of the end-user
  2. less is more
  3. do not attempt to do everything yourself
  4. fall forwards fast: try lots of things, kill failures quickly
  5. treat the entire Web as a creative canvas
  6. the Web is a conversation: join as a peer and admit mistakes when necessary
  7. any Web site is only as good as its worst page
  8. make sure that all your content can be linked to forever: the link is the heart of the Web
  9. remember that your granny won't ever use Second Life
  10. maximise routes to content: optimise to rank high in Google
  11. consistent design and navigation doesn't mean that one size fits all
  12. accessibility is not an optional extra
  13. let people paste your content on the walls of their virtual homes
  14. link to discussions on the Web, don't host them
  15. personalisation should be unobtrusive, elegant and transparent

Apologies to Tom if I've mis-quoted any of these.  Each was illustrated with some nice case studies taken from the BBC and elsewhere.

If I disagree with anything it's with the ordering.  As you might expect from previous postings to this blog, I'd put number 8 much higher up the list.  But overall, I think these are a good set of principles that people would do well to take note of.

15 principles too many for you?  Try 5...  Web sites should be:

  • straightforward
  • functional
  • gregarious
  • open
  • evolving.

February 21, 2007

WorldCat Institution Registry and Identifiers

(Or: PeteJ bangs on about https URIs again!)

In a couple of recent posts, Lorcan highlights the availability of a new WorldCat Institution Registry service provided by OCLC. The registry stores, manages and makes available descriptions of institutions, mainly libraries and library consortia, and I agree with Lorcan that this is potentially a very valuable service.

Lorcan's second post cites a post by Ed Summers in which Ed notes:

If you drill into a particular institution you’ll see a pleasantly cool uri:


…which would serve nicely as an identifier for the Browne Popular Culture Library. The institution pages are HTML instead of XML–however there is a link to an XML representation:


This URL isn’t bad but it would be rather nice if the former could return XML if the Accept: header had text/xml slotted before text/html.

As I was rushing out the door last night, I posted a comment to Ed's post expressing my whole-hearted agreement with both of these points, but also expressing a degree of concern at another aspect of the use of identifiers I'd observed in the data exposed by the registry. As far as I can see, my comment doesn't seem to have made it through to Ed's weblog (yet, anyway - maybe Ed moderates his comments and I'm just being impatient, or maybe - more likely - in my haste to get to the pub I clicked the wrong button on his submission form!) so I'm posting a note here.

Like Ed, my immediate reaction was that the first URI above would make a good identifier for the institution. An https URI can be used to identify a resource of any type, so no problem with the choice of URI scheme. The form of the URI itself seems to take into account all of Berners-Lee's suggestions for "coolness". Is it persistent? Well, I guess that depends on OCLC's commitment to maintaining  their ownership of the worldcat.org domain and to managing the URIs they assign within that URI-space e.g. not to assign the URI above to a different institution (or indeed to a document or a DDC term or a research project or something else entirely) at some point in the future. I haven't checked whether OCLC publish any policy statement about their commitment to these URIs. If they did, that would contribute to my faith in their having some degree of persistence, but for now I'll take it on trust that OCLC will manage this URI-space in a controlled manner, and these URIs have some reasonable degree of persistence.

Furthermore, the URI is reasonably concise and human-readable/citable: I could scribble one of those URIs down in an email or read one over the phone.

And the URI also has the helpful characteristic that, if I am given (by email, on a piece of paper, in a telephone conversation) only the URI, then with no additional information, I can "look up" that URI using a tool which supports the HTTP protocol (e.g. my desktop Web browser) and get back some information about the resource identified by the URI provided by the owner of the URI (OCLC), in the form of an HTML document. Or, in the terms of the Web Architecture, I can dereference the URI and obtain a representation of the state of the resource identified by the URI. This functionality is available to me because there is a message protocol which makes use of the https URI scheme, and that protocol is widely supported in software tools.

And that representation (well, OK, a second HTML page linked from it) also contains the URIs of other related institutions, so that I can follow hyperlinks from this institution to another - and obtain representations of those other institutions too. For example (choosing a different institution from Ed's examples to better highlight the hyperlinking), Bowling Green State University is related to a number of "branches" and I can obtain a representation of the list of branches which includes the URIs of those branch institutions.

And as Ed notes, it would be useful if OCLC added support for HTTP content negotiation so that if my software application obtained the URI, it could obtain a representation in a format which made explicit the semantics that the HTML document conveys to a human reader e.g. an instance of the WorldCat XML format, as is available by dereferencing the second URI above.
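Ed's suggestion amounts to a small piece of server-side logic. Here's a minimal sketch of the decision, assuming a hypothetical registry that honoured the Accept header (the 'worldcat-xml' and 'html' labels are illustrative, not OCLC's):

```python
def choose_representation(accept_header):
    """Pick a representation for the institution URI based on the HTTP
    Accept header. A simplified sketch: real content negotiation also
    handles q-values, wildcards and media-type parameters."""
    preferences = [part.split(";")[0].strip()
                   for part in accept_header.split(",")]
    for media_type in preferences:
        if media_type in ("text/xml", "application/xml"):
            return "worldcat-xml"   # the machine-readable record
        if media_type == "text/html":
            return "html"           # the human-readable page
    return "html"                   # default representation

# A client that slots text/xml before text/html gets the XML record:
print(choose_representation("text/xml, text/html"))  # -> worldcat-xml
```

The point is that one URI can then serve both audiences, with no need for a second URI just to reach the XML.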

So far, so good. Better than good. Great stuff, in fact.

However, when I looked a bit harder at the WorldCat XML representation, my heart sank slightly. Here's an (edited, reformatted, emphasis added) version of one of those XML instances (minus XML Namespace declarations, XML schema location declarations etc):

  <institutionName>Bowling Green State University</institutionName>
  <!-- stuff snipped -->

What struck me here was that, in the XML representation, a different set of URIs is exposed, a set of URIs using the info URI scheme. Certainly, these URIs share some of the (useful) characteristics that I listed above for the https URIs. They are pretty much as concise, human-readable/citable as the https URIs. My faith in their persistence is subject to pretty much exactly the same considerations as above: will the owners of the "rfa" info namespace continue to own that namespace, and will they manage the assignment of URIs within that namespace so that a single URI is not assigned to two different resources over time? (Hmm, actually, I don't see an entry for "rfa" in the info URI Namespace Registry, so I guess that does raise some doubts in my mind about the answer to the first of those questions!)

Where the info URIs aren't so useful, of course, is that I can't - at least without some additional information - take one of those URIs that has been given to me by email, by telephone etc, and obtain a representation of the resource. The info URI scheme shares some of the characteristics of the https URI scheme as an identification mechanism, but there is no widely deployed mechanism for dereferencing an info URI. The scheme does provide for individual "namespace authorities" to specify dereferencing mechanisms on a "per namespace" basis, and to disclose those methods via the info URI Namespace Registry, but that clearly introduces additional complexity - and ultimately cost - for dereferencing, at least in comparison with the https URI scheme. (And as I note above, there is currently no entry in the registry for the "rfa" namespace, so I don't know whether any dereferencing mechanism is available.)

So that raises the question of why the registry system should coin, assign and expose one set of identifiers for institutions in the XML representation (URIs using the info URI scheme, like info:rfa/oclc/Institutions/2226) and a second set of identifiers for the same institutions in the HTML representation (URIs using the https URI scheme, like https://worldcat.org/registry/Institutions/2226). To a human reader and to a software application those two URIs are different URIs, and there is no indication of whether or not they identify the same resource. Of course the URI owner could make available the information that they do in fact identify the same resource. Certainly that enables consumers of those two different URIs to establish that they are referring to the same resource, but it does so at the cost of additional complexity: every system that uses the URIs needs to process the two URIs as equivalents. The Web Architecture document counsels against such an approach for exactly these reasons:

Good practice: Avoiding URI aliases

A URI owner SHOULD NOT associate arbitrarily different URIs with the same resource.

That's not to say that we don't have to deal with cases where we wish to assert that two URIs identify the same resource - of course we do and we have mechanisms to do so - but let's try to avoid making life unnecessarily difficult for ourselves!

Now, of course, it may well be that my question "why coin, assign and expose two identifiers for the same resource?" is based on a completely false assumption! It may be that the intent is that these two sets of identifiers identify not one single set of resources, but two different sets of resources e.g. the info URIs identify the institutions, and the https URIs identify documents which describe those institutions. And indeed the fact that when I dereference the https URI https://worldcat.org/registry/Institutions/2226 the server returns a status code of 200 suggests that this is indeed the case (because according to the W3C Technical Architecture Group's resolution to the httpRange-14 issue, that response code indicates that the resource is an "information resource"). That is a very good argument for coining and exposing two sets of identifiers, though I think it also brings with it a requirement on the system which exposes those identifiers to make clear to users - both human users and software applications - exactly what resource is identified by each URI.

However, while it is a very good argument for using two different sets of URIs, I'm still not convinced that it is a good argument for choosing to use URIs based on the info URI scheme to identify the institution: as Ed pointed out, https URIs will do the job perfectly well - and they have the (hugely significant, I think) benefit that they can be dereferenced using widely available tools. If it is necessary to identify both the institution and the document then that could still be done using https URIs in both cases e.g.

Pattern 1:

  • Document: https://worldcat.org/registry/Institutions/2226
  • Institution: https://worldcat.org/registry/Institutions/2226#inst

Pattern 2:

  • Document: https://worldcat.org/registry/Institutions/2226
  • Institution: https://worldcat.org/registry/Institutions/2226/inst

(In this case, when the second URI is dereferenced the server could/should observe the rules specified by the resolution to the httpRange-14 issue, and use a response code of 303 to redirect to the document describing the institution. Incidentally, there's an excellent, clear, practical discussion of this and related issues in a paper by Leo Sauermann and Richard Cyganiak mentioned in a recent message to the Semantic Web Education and Outreach Interest Group mailing list.)

Pattern 3: (assuming a suitable top-level PURL domain is registered)

  • Document: https://worldcat.org/registry/Institutions/2226
  • Institution: https://purl.org/worldcat/registry/Institutions/2226

(The PURL server uses a response code of 302, rather than 303, so this approach wouldn't meet all the requirements of the TAG resolution mentioned above.)

Any of these patterns allow a clear distinction to be made between the identifier of the institution and the identifier of the document, while continuing to allow for the dereferencing of both sets of identifiers, using a protocol that is supported by widely-deployed tools (and introducing the flexibility suggested by Ed by offering multiple representations via HTTP content negotiation).
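The server behaviour Pattern 2 requires is only a few lines of logic. A sketch (hypothetical Python and paths, certainly not OCLC's actual implementation):

```python
def handle_request(path):
    """Sketch of Pattern 2 with httpRange-14-friendly behaviour: the
    institution URI 303-redirects to the document that describes the
    institution, while the document URI itself returns 200."""
    if path.endswith("/inst"):
        # Non-information resource: "see other" - redirect the client
        # to the document describing it.
        return 303, {"Location": path[:-len("/inst")]}
    # Information resource: serve a representation directly.
    return 200, {"Content-Type": "text/html"}

status, headers = handle_request("/registry/Institutions/2226/inst")
# status is 303, headers["Location"] is "/registry/Institutions/2226"
```

A client following the redirect ends up at the descriptive document, but the two resources keep distinct URIs throughout.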

Hmm. I think I've taken a long time to say what Andy said more concisely back here!

January 09, 2007

When two identifiers for the price of one is not a good deal

I came across a December posting about CrossRef and DOIs on the DigitalKoans blog the other day.  The posting was primarily about the then newly announced CrossRef DOI Finding Tool, but the 'https' URI pedant in me couldn't help but notice something else.

Look at the example search results scattered throughout the posting.  In every case, each DOI is displayed twice, firstly using the 'doi:' form of URI, then using the 'https:' form.

What is that all about?  Why have we lumbered ourselves with a system that forces us to display the preferred form of an identifier alongside a second form that is actually useful?  Why don't we just use the useful form, full stop!?

There is nothing, not one single thing, that is technically weaker or less persistent about using https://dx.doi.org/10.1108/00907320510611357 as an identifier rather than doi:10.1108/00907320510611357.  Quite the opposite in fact.  The 'https:' form is much more likely to persist, since it is so firmly embedded into the fabric of everything we do these days.  Yet for some reason we seem intent on promoting the 'doi:' form, even though it is next to useless in the Web environment.  As a result, all implementers of DOI-aware software have to hard-code knowledge into their applications to treat the two forms as equivalent.
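That hard-coded equivalence knowledge looks something like this in every DOI-aware application (a hypothetical sketch - the function name and prefix list are mine):

```python
def normalise_doi(identifier):
    """Reduce either form of a DOI identifier to the bare DOI, so the
    two forms can be compared as equivalents - exactly the kind of
    special-case knowledge every application is forced to carry."""
    for prefix in ("doi:", "https://dx.doi.org/", "http://dx.doi.org/"):
        if identifier.startswith(prefix):
            return identifier[len(prefix):]
    return identifier

# Both forms collapse to the same bare DOI:
assert (normalise_doi("doi:10.1108/00907320510611357")
        == normalise_doi("https://dx.doi.org/10.1108/00907320510611357"))
```

If only the 'https:' form were used, none of this code would need to exist.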

Note, this is not a criticism of the DOI per se... just a continued expression of frustration at people's insistence on ignoring the 'https' URI in favour of creating their own completely unnecessary alternatives.

November 27, 2006

Repositories and Web 2.0

[Editorial note: I've updated the title and content of this item in response to comments that correctly pointed out that I was over-emphasising the importance of Flash in Web 2.0 service user-interfaces.]

At a couple of meetings recently the relationship between digital repositories as we currently know them in the education sector and Web 2.0 has been discussed.  This happened first at the CETIS Metadata and Digital Repositories SIG meeting in Glasgow that looked at Item Banks, then again at the eBank/R4L/Spectra meeting in London.

In both cases, I found myself asking "What would a Web 2.0 repository look like?".  At the Glasgow meeting there was an interesting discussion about the desirability of separating back-end functionality from the front-end user-interface.  From a purist point of view, this is very much the approach to take - and it's an argument I would have made myself until recently.  Let the repository worry about managing the content and let someone (or something) else build the user-interface based on a set of machine-oriented APIs.

Yet what we see in Web 2.0 services is not such a clean separation.  What has become the norm is a default user-interface, typically built with AJAX though often using other technologies such as Flash, that is closely integrated with the back-end content of the Web 2.0 service.  For example, both Flickr and SlideShare follow this model.  Of course, the services also expose an API of some kind (the minimal API being persistent URIs to content and various kinds of RSS feeds) - allowing other services to integrate ("mash") the content and other people to develop their own user-interfaces.  But in some cases at least, the public API isn't rich enough to allow me to build my own version of the default user-interface.

More recently, there has been a little thread on the UK jisc-repositories@jiscmail.ac.uk list about the mashability of digital repositories.  However, it struck me that most of that discussion centered on the repository as the locus of mashing - i.e. external stuff is mashed into the repository user-interface, based on metadata held in repository records.  There seemed to be little discussion about the mashability of the repository content itself - i.e. where resources held in repositories are able to be easily integrated into external services.

One of the significant hurdles to making repository content more mashable is the way that identifiers are assigned to repository content.  Firstly, there is currently little coherence in the way that identifiers are assigned to research publications in repositories.  This is one of the things we set out to address in the work on the Eprints Application Profile.  Secondly, the 'oai' URIs typically assigned to metadata 'items' in the repository are not Web-friendly and do not dereference (i.e. are not resolvable) in any real sense, without every application developer having to hardcode knowledge about how to dereference them.  To make matters worse, the whole notion of what an 'item' is in the OAI-PMH is quite difficult conceptually, especially for those new to the protocol.
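That hard-coding can be illustrated with a sketch of what dereferencing an 'oai' URI forces on a developer (hypothetical Python; the example repository and base URL are made up):

```python
def oai_item_to_url(oai_uri, base_urls):
    """Turn an 'oai' URI into something fetchable via OAI-PMH. Unlike
    an 'https' URI, this only works if the application already carries
    a table of OAI-PMH base URLs for every repository it might meet -
    that table is exactly the hard-coded knowledge at issue."""
    scheme, repository_id, _local_id = oai_uri.split(":", 2)
    if scheme != "oai":
        raise ValueError("not an oai URI: " + oai_uri)
    base = base_urls[repository_id]  # KeyError for unknown repositories
    return (base + "?verb=GetRecord&metadataPrefix=oai_dc"
            "&identifier=" + oai_uri)

# A hypothetical repository the application happens to know about:
known = {"eprints.example.org": "http://eprints.example.org/cgi/oai2"}
record_url = oai_item_to_url("oai:eprints.example.org:1234", known)
```

With 'https' URIs, by contrast, the look-up table disappears: the identifier is the fetchable address.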

Digital repositories would be significantly more usable in the context of Web 2.0 if they used 'https' URIs throughout, and if those URIs were assigned in a more coherent fashion across the range of repositories being developed.

September 19, 2006

New URI schemes - just say no

A little thread has just emerged on the W3C URI mailing list, the conclusion of which (so far) can be summed up more or less as:

  • use https URIs to identify stuff, and
  • make it possible to dereference those https URIs to useful representations of the thing that is being identified.

Sentiments that I very much agree with - I've given presentations and written pieces in the reasonably recent past (To name: persistently: ay, there's the rub, Persistently identifying website content and Guidelines for assigning identifiers to metadata terms) that reach much the same conclusion.

In his presentation about Public Resource Identifiers (linked from one of the messages in the thread), Steve Pepper suggests that the use of https URIs as identifiers is:

No longer subject to paralysing controversy

Yeah, right!  While I agree with most of his presentation, that particular statement doesn't tally with my experience - perhaps it's true in some alternative Utopian W3C reality?  After my presentation about using https URIs at the DCC Workshop in Glasgow, at least two people suggested that I was a "creativity stifling Luddite" (or nicer words to that effect!) for saying that "the only good long term identifier is a good short term identifier" and "the best short term identifier is the https URI".

Well, perhaps they were right... but I still don't feel like I've ever seen a convincing argument as to why, in the general case, we need to invent new URI schemes rather than simply make creative use of the existing https URI scheme.

The W3C draft document, URNs, Namespaces and Registries, lays out some of the reasons why people choose to develop new URI schemes, and offers counter arguments as to why they should think carefully before doing so.  Again, I very much agree with the general thrust of this document.  Inevitably there's a certain kind of comfort in inventing one's own solution to problems, rather than re-using what is already on the table, and I'm probably as guilty as the next person of doing so in other contexts.  But every time we do it, we need to be very clear that the benefits outweigh the costs of adoption and the possible damage done to interoperability.



eFoundations is powered by TypePad