How uncool? Repository URIs...
I've been using the OpenDOAR API to take a quick look at the coolness of the URIs that people in the UK are assigning to their repositories. Why does coolness matter? Because uncool URIs are more likely to break than cool URIs and broken URIs for the content of academic scholarly repositories will probably cause disruption to the smooth flow of scholarly communication at some point in the future.
So what is an uncool URI? An uncool URI is one that is unlikely to be persistent, typically because the person who first assigned it didn't think hard enough about likely changes in organisational structure, policy or technology and the impact that changes in those areas might have on the persistence of the URI into the future.
In short - URIs don't break... people break them and, usually, the seeds for that breakage are sown at the point that a URI is minted.
OK, so first... hats off to the OpenDOAR team for providing such an easy to use API, one which made it simple to get at the data I was interested in - the URIs of the home pages of all the institutional repositories in the UK - by using the following link:
http://www.opendoar.org/api13.php?co=gb&rt=2&show=min
This provides a list of 107 repositories (as of 23 Feb 2009) as an XML file. Here are just the URIs of the repository home pages, broken out into the first part of the domain name, the institutional part of the domain name, the port, and the rest of the path (as a comma-separated file for easy loading into a spreadsheet).
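For anyone wanting to repeat the exercise, here is a minimal sketch of how that break-out could be done in Python. It is not the script behind the file above. The function name and the example URI are made up for illustration, and it assumes that the first dot-separated label of the host is the 'first part', that everything after it is the institutional part, and that a missing port means port 80.

```python
from urllib.parse import urlsplit


def split_repository_uri(uri):
    """Split a repository home page URI into (first part of the domain name,
    institutional part, port, path). Hypothetical helper for illustration."""
    parts = urlsplit(uri)
    host = parts.hostname or ""
    # Assumption: the first label is the "first part" of the domain name and
    # the remainder is the institutional part.
    first, _, institution = host.partition(".")
    # Assumption: plain http, so no explicit port means port 80.
    port = parts.port if parts.port is not None else 80
    return first, institution, port, parts.path


# Made-up URI, not one taken from the OpenDOAR data.
print(split_repository_uri("http://eprints.example.ac.uk:8080/dspace/index.php"))
# prints ('eprints', 'example.ac.uk', 8080, '/dspace/index.php')
```

Writing the four values out with the standard csv module would then give a file ready to load into a spreadsheet.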
In the following analysis, I'm making the assumption that the URI of the repository home page is carried through into the URIs of all the items within that repository and that, if the home page URI is uncool, then it is likely that the URIs for everything within that repository will be likewise. This feels like a reasonable assumption to me, though I haven't actually checked it out.
So... what do we find?
Looking at the first part of the domain name, we find 7 institutions using 'dspace' as part of the repository domain name and 35 using 'eprints'. Both are, presumably, derived from the technology being used as the repository platform. Building this information into the URL is potentially uncool (because that technology might well be changed in the future). Now, in both cases, I suspect that institutions might argue that they would stick with their use of 'eprints' and 'dspace' (particularly the former) even if the underlying technology were to change (on the basis that these terms have become somewhat generic). I'm not totally convinced by that argument, though I think it holds more water in the case of 'eprints' than it does in the case of 'dspace'. In any case, I would argue that this is something definitely worth thinking about.
Note that 10 institutions (with some cross-over between the two counts) have built 'dspace' into the path part of the repository URL, which is uncool for the same reasons.
3 institutions have built a non-standard port (i.e. not port 80) into their repository URL. Whilst this isn't necessarily uncool, it does warrant a question as to why it has been done and whether it will cause maintenance headaches into the future.
Looking at the path part of the URLs, 3 institutions have built the underlying technology (.htm, .php and .aspx) into their URLs - again, this is uncool because of the likelihood of future changes in technology.
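The checks so far (platform names in the domain or path, a port other than 80, and a technology-specific extension) are mechanical enough to script. What follows is only a rough sketch under my own assumptions, not the method used to produce the counts above: the function name, the wording of the findings and the example URI are all invented, and the lists of names and extensions are just the ones mentioned in this post.

```python
import posixpath
from urllib.parse import urlsplit

# Taken from the post: platform names and extensions that give away technology.
PLATFORM_NAMES = ("dspace", "eprints")
TECH_EXTENSIONS = (".htm", ".php", ".aspx")


def uncool_features(uri):
    """Return a list of potentially uncool features found in a URI.
    Hypothetical helper; the finding strings are illustrative only."""
    parts = urlsplit(uri)
    host = (parts.hostname or "").lower()
    path = parts.path.lower()
    findings = []
    for name in PLATFORM_NAMES:
        if name in host:
            findings.append(f"platform name '{name}' in domain")
        if name in path:
            findings.append(f"platform name '{name}' in path")
    # Anything other than no explicit port or port 80 counts as non-standard.
    if parts.port not in (None, 80):
        findings.append(f"non-standard port {parts.port}")
    extension = posixpath.splitext(path)[1]
    if extension in TECH_EXTENSIONS:
        findings.append(f"technology extension '{extension}' in path")
    return findings


# Made-up URI that trips every check.
for finding in uncool_features("http://dspace.example.ac.uk:8080/dspace/index.php"):
    print(finding)
```

A simple substring test like this will over-count (any host or path that merely contains 'eprints' gets flagged), so the output is a prompt for a closer look rather than a verdict.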
A small number of institutions have built the library into their repository URLs. This is probably OK but reflects a commitment to organisational structure that may not be warranted in the longer term?
Similarly, a larger number have built a, err..., 'jazzy' project name into their repository URL. I would have thought it might be safer to stick to 'functional' labels like 'research' than, say 'opus', at least for the URLs, since this seems less likely to change because of political or other organisational issues into the future.
Finally, 4 institutions have outsourced their repository to openrepository.com, resulting in URLs under that domain. Outsourcing is good (I say that not least because I work for a charity that is in that business!) but I would strongly suggest outsourcing in a form that keeps your institutional domain name as part of the URL so that your URLs don't break if your host goes bust or you decide to move things back into the institution or to another provider.
Overall then, it's another 'could try harder' mid-term report from me to the Repository 101 course members.
It's also a bit uncool that the OpenDoar API exposes it's own server-side technology in its URIs (i.e. in the "api.php" part).
Seems to me something like
http://opendoar.org/repos?x=xxx&y=yyy&z=zzz
could work just as well.
Posted by: PeteJ | March 04, 2009 at 11:14 AM
Argh, nooo, bad apostrophe!
I meant "its own", obviously. :-)
Posted by: PeteJ | March 04, 2009 at 11:19 AM
Make your repository and its URIs cool by fixing some of the things Andy is talking about in this post (or even follow his lead and start playing with some of those APIs), funding announcement coming soon: http://infteam.jiscinvolve.org/2009/03/03/information-environment-rapid-innovation-grants/
Posted by: David | March 04, 2009 at 01:54 PM
Some of these are uncool, at least in part because they were early, experimental, prototype developments, that "just growed", and then became a service. They will not have wanted to break the old URIs, so they got built in.
Certainly in the DSpace case, an uncool repository URI does not translate into a similarly uncool content item URI, as the latter is a Handle (which you might well dislike for different reasons).
I remember having this discussion with an institution I used to work for (who had both an eprints and a DSpace repository, intended in the former case for published and the latter case for working papers), and I did want them to use more meaningful, less technology-related names. They have now done that, and moved to a new single service name, with the result that I find my contributed content hard to find (via a browse from their web site)! However, both a Google search for any content item, and the original content item Handle will still find it. So the URIs there are persistent if not cool.
Posted by: Chris Rusbridge | March 04, 2009 at 02:04 PM
This is a good post, I've thought about the same thing in the service URLs I've inherited, and have been gradually changing them to make them more long-term. (Meaning I'm breaking them now, but might as well get it over with).
But I'm confused about your assertion that using ".htm" in the URL is a problem. To me ".htm" is a synonym for ".html" -- are you complaining about ".htm" specifically, or would your critique be equally against ".html"? While I guess it's probable that technology will change at SOME point in the future so we are no longer using HTML... It's a pretty standard thing to end URLs that will deliver html in ".html", I'm not sure it would be wise to abandon that.
Posted by: Jonathan Rochkind | March 04, 2009 at 02:40 PM
Jonathan,
I hear what you say, but taking a long term view means it is best to remove all technology-specific information from the URL. The Web provides perfectly usable typing mechanisms other than the file extension.
That said, if your URI explicitly identifies a particular format of document, then leaving the technology-specific bit in place seems acceptable (to me) - but I don't think that is the case for the repository home page.
Posted by: AndyP | March 04, 2009 at 02:54 PM
Dear Pot, can you do anything about your blog URIs?
e.g. http://efoundations.typepad.com/efoundations/2009/03/how-uncool-repository-uris.html
typepad.com? .html?
Yours truly, Kettle
Posted by: Les Carr | March 04, 2009 at 08:59 PM
Your comments are obviously bang on - and a basic rule of cool information management for cool webmasters.
But as you have demonstrated, there is no right answer. That's because in the non-digital world, everything is always changing. You can't coin a URI that is guaranteed immune to change if it is grounded in real service delivery to a real organisation. Organisational structure will change ("library" becomes "information service"), policy will change (a research-only repository will find itself accepting teaching material) and projects will be renamed and re-sponsored (opus will become oeuvre). And when quantum computing becomes a commercial reality, your blog will be renamed qfoundations.
Posted by: Les Carr | March 04, 2009 at 09:27 PM
Good point, Andy. A particular document URL is arguably identifying a particular format that content is being requested in. But a repository home page URL certainly isn't.
Posted by: Jonathan Rochkind | March 05, 2009 at 12:33 AM