Repository usability
In response to my previous post, Freedom, Google-juice and institutional mandates, Chris Rusbridge used one of his Ariadne articles as an illustrative example.
By way of, err... reward, I want to take a quick look (in what I'm going to broadly call 'usability' terms) at the way in which that article is handled by the Edinburgh Research Archive (ERA). Note that I'm treating the ERA as an example here - I don't consider it to be significantly different to other institutional repositories and, on that basis, I assume that most of what I am going to say will also apply to other repository implementations.
Much of this is basic Web 101 stuff...
The original Ariadne article is at http://www.ariadne.ac.uk/issue46/rusbridge/ - an HTML document containing embedded links to related material in the References section (internally linked from the relevant passage in the text). The version deposited into ERA is a 9 page PDF snapshot of the original article. I assume that PDF has been used for preservation reasons, though I'm not sure. Hypertext links in the original HTML continue to work in the PDF version.
So far, so good. I would naturally tend to assume that the HTML version is more machine-readable than the PDF version and on that basis is 'better', though I admit that I can't provide solid evidence to back up that statement.
The repository 'jump-off' page for the article is at http://www.era.lib.ed.ac.uk/handle/1842/1476 though the page itself tells us (in a human-readable way) that we should use http://hdl.handle.net/1842/1476 for citation purposes.
So we already have 4 URLs for this article and no explicit machine-readable knowledge that they all identify the same resource. Further, the URLs that 15 years of using a Web browser lead me to use most naturally (those of the jump-off page, the original Ariadne article or the PDF file) are not the one that the page asks me to use for citation purposes. So, in Web usability terms, I would most naturally bookmark (e.g. using del.icio.us) the wrong URL for this article and where different scholars choose to bookmark different URLs, services like del.icio.us are unlikely to be able to tell that they are referring to the same thing (recent experience of Google Scholar notwithstanding).
OK, so now let's look more closely at the jump-off page...
Firstly, what is the page title (as contained in the HTML <title> tag)? Something useful like "Excuse Me... Some Digital Preservation Fallacies?". No, it's "Edinburgh Research Archive : Item 1842/1476". Nice!? Again, if I bookmark this page in del.icio.us, that is the label that is going to appear next to the URL, unless I manually edit it.
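As a rough sketch of what I mean (not how DSpace or ERA actually build their pages), the title could be assembled from the item's own title followed by the repository name. The function name page_title and the separator are my own assumptions for illustration:

```python
from html import escape


def page_title(item_title: str, repository_name: str) -> str:
    """Build a <title> element that leads with the item's own title."""
    # escape() guards against characters such as & or < in the metadata.
    return f"<title>{escape(item_title)} - {escape(repository_name)}</title>"


print(page_title("Excuse Me... Some Digital Preservation Fallacies?",
                 "Edinburgh Research Archive"))
```

Putting the item title first means a bookmarking service picks up a label that actually says what the page is about.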
Secondly, what other metadata and/or micro-formats are embedded into this page? All that nice rich Dublin Core metadata that is buried away inside the repository? Nah. Nothing. A big fat zilch. Not even any <meta name="keywords" ...> stuff. I mean, come on. The information is there on the page right in front of me... it's just not been marked up using even the most basic of HTML tags. Most university Web site managers would get shot for turning out this kind of rubbish HTML.
Note I'm not asking for embedded Dublin Core metadata here - I'm asking for useful information to be embedded in useful (and machine-readable) ways where there are widely accepted conventions for how to do that.
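To make that concrete, here is a minimal sketch of turning a metadata record into basic HTML <meta> elements. The record is a plain dictionary and the field names (keywords, abstract, author) are assumptions of mine, as is the example keyword; a real repository would map these from whatever it stores internally:

```python
from html import escape


def meta_tags(record: dict) -> str:
    """Render basic <meta> elements from a metadata record (a plain dict here)."""
    fields = [
        ("keywords", ", ".join(record.get("keywords", []))),
        ("description", record.get("abstract", "")),
        ("author", record.get("author", "")),
    ]
    # Skip empty fields rather than emitting empty content attributes.
    return "\n".join(
        f'<meta name="{name}" content="{escape(value)}">'
        for name, value in fields
        if value
    )


# Hypothetical record: the keyword is an illustrative placeholder.
example = {
    "keywords": ["digital preservation"],
    "author": "Chris Rusbridge",
}
print(meta_tags(example))
```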
So, let's look at those human-readable keywords again. Why aren't they hyperlinked to all other entries in ERA that use those keywords (in the way that Flickr and most other systems do with tags)? Yes, the institutional repository architectural approach means that we'd only get to see other stuff in ERA, not all that useful I'll grant you, but it would be better than nothing.
Similarly, what about linking the author's name to all other entries by that author? Ditto with the publisher's name. Let's encourage a bit of browsing here shall we? This is supposed to be about resource discovery after all!
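Something along these lines would do, assuming the repository offers some kind of browse-by-field page. The /browse path and its type and value parameters are hypothetical, chosen only to show the shape of the idea:

```python
from html import escape
from urllib.parse import urlencode


def browse_link(field: str, value: str, base: str = "/browse") -> str:
    """Wrap a keyword, author or publisher in a link to other items sharing it."""
    # urlencode() handles spaces and punctuation in the value;
    # escape() then makes the resulting URL safe inside an HTML attribute.
    href = f"{base}?{urlencode({'type': field, 'value': value})}"
    return f'<a href="{escape(href)}">{escape(value)}</a>'


# Hypothetical keyword, used only as an illustration.
print(browse_link("subject", "digital preservation"))
print(browse_link("author", "Chris Rusbridge"))
```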
So finally, let's look at the links on the page. There at the bottom is a link labelled 'View/Open' which takes me to the PDF file - phew, the thing I'm actually looking for! Not the most obvious spot on the page but I got there in the end. Unfortunately, I assume that every single other item in ERA uses that exact same link text for the PDF (or other format) files. Link text is supposed to indicate what is being linked to - it's a kind of metadata for goodness sake.
And then, right at the bottom of the page, there's a button marked "Show full item record". I have no idea what that is but I'll click on it anyway. Oh, it's what other services call "Additional information". But why use an HTML form button to hide a plain old hypertext link? Strange or what?
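Both of these are easy to put right. As a sketch only, with hypothetical URLs and function names, the file link can carry the item's title and format as its link text, and the full record can be reached through an ordinary anchor rather than a form button:

```python
from html import escape


def file_link(href: str, item_title: str, file_format: str) -> str:
    """Link to the deposited file using text that says what is being linked to."""
    text = f"{item_title} ({file_format})"
    return f'<a href="{escape(href)}">{escape(text)}</a>'


def full_record_link(href: str) -> str:
    """Expose the full item record as a plain hypertext link."""
    return f'<a href="{escape(href)}">Show full item record</a>'


# Both paths below are made up for illustration.
print(file_link("/files/example.pdf",
                "Excuse Me... Some Digital Preservation Fallacies?", "PDF"))
print(full_record_link("/handle/1842/1476?show=full"))
```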
OK, I apologise... I've lapsed into sarcasm for effect. But the fact remains that repository jump-off pages are potentially some of the most important Web pages exposed by universities (this is core research business after all) yet they are nearly always some of the worst examples of HTML to be found on the academic Web. I can draw no other conclusion than that the Web is seen as tangential in this space.
I've taken 10 minutes to look at these pages... I don't doubt that there are issues that I've missed. Clearly, if one took time to look around at different repositories one would find examples that were both better and worse (I'm honestly not picking on ERA here... it just happened to come to hand). But in general, this stuff is atrocious - we can and should do better.
Yes. I've been screaming at DSpace for these and other usability/SEO solecisms for years, and fixing things as best I can in the DSpace repos I've run. (If you go to my local repo, she said proudly, page titles work, authors and subjects are hyperlinked, and Zotero-compatible COinS are embedded. Some of that is Manakin, which fixed the idiotic page title thing, but much of it is me swearing at XSLT. I haven't done meta tags, I admit.)
Thank you for your sarcasm. It may work where years of more-or-less patient persuasion have failed.
The multiple-link thing is a poser, but it did come to my mind too as I considered the SEO question. I don't know what is best to do about it, honestly, and would welcome suggestions.
Posted by: Dorothea Salo | February 10, 2009 at 03:21 PM
from a usability point of view - as you describe well - it is entirely clear that we have to keep some URLs persistent, even though we may use the handle system. but once you have to commit to keeping URLs persistent what is there to gain from handles? that's something i was never able to understand.
Posted by: robert forkel | February 10, 2009 at 03:29 PM
Well, the persistent URL that one commits to oughtn't necessarily be the one slapped on the repo. Systems change, services merge and rebrand themselves, and NONE of that will damage a handle.
Posted by: Dorothea Salo | February 10, 2009 at 03:42 PM
@robert As @Dorothea suggests, the argument put forward is that the extra level of indirection helps to buy more persistence. I don't buy that argument myself - but that is what people tend to argue.
Let's unpick this a little...
First, the repository jump-off page URL could (I think) be helpfully shortened a little to http://era.ed.ac.uk/1842/1476 - the 'lib' bit of the current URL is particularly unhelpful since it is an obvious point of failure (since responsibility for ERA might shift to another part of the University). For similar reasons one might also argue that 'research' would be better than 'era'.
On that basis, which of http://research.ed.ac.uk/1842/1476 and http://hdl.handle.net/1842/1476 is more likely to be persistent? Hard to judge to be honest but I don't see any a priori reason to think that the ed.ac.uk specific URL is going to break significantly before the handle.net one.
I don't understand what the handle.net business model is, so I don't understand its likely sustainability. I do understand the Edinburgh University business model and while it's not totally risk free, persistence doesn't feel like an overly risky bet in that particular case?
In short, most of the persistence problems with URLs are caused by bad planning (as in this case) not by the technology per se.
Posted by: AndyP | February 10, 2009 at 03:54 PM
Sorry... quick follow-up. As I point out in the main body of the entry, whatever so-called 'persistent' URI scheme is adopted by the repository, chances are that real-world users will end up using the native (and in this case poorly designed) repository URLs in any case (because that is what browsers and other Web tools encourage them to do) so the use of handles, or anything else, is largely irrelevant - we have to make the URLs work persistently.
Posted by: AndyP | February 10, 2009 at 03:59 PM
As ERA manager I feel duty bound to comment on the criticism raised in this post!
I agree that repository software is out of step with the way people now expect to use the web. It’s unfortunate that you chose to pick on ERA as we are stuck using a version of DSpace that is at least 3 years behind the latest release. We’re way overdue an upgrade so your comments form a good starting point for a requirements analysis for updating the interface :)
However, you are way off the mark with some of your conclusions. I certainly don’t see the Web as tangential for repositories, nor do the users who download ~60,000 files per month. Just because something is currently done badly doesn’t mean it’s not worth doing and improving upon.
Posted by: Theo Andrew | February 10, 2009 at 04:11 PM
AndyPowell sez "So we already have 4 URLs for this article and no explicit machine-readable knowledge that they all identify the same resource"
Those URLs don't all identify the same resource.
URL#1 -> published version of an article stored on publisher's website
URL#2 -> the author's postprint, stored in the repository
URL#3 -> the repository record about the resource identified by URL2
URL#4 -> an indirect URL that resolves to exactly the same resource as URL#3, by redirecting the browser to URL#3.
There are two manifestations of the same work (each with identifiably different URLs) and there is a location-based URL for its "library record" together with an explicit alias for that URL.
Posted by: Les carr | February 10, 2009 at 04:17 PM
A note on handles: it's possible to run one's own handle server, rather than be dependent on handle.net's business model. We do at MPOW, and that means (among other things) our handles don't have gunk in them. (Our non-handle URLs are admittedly not quite the cleanest.)
The handle spec is actually quite interesting and cool things can be done with it behind the scenes... but nobody does, so that's that.
As a cautionary tale, the repo I run is going to be debranded and consolidated with the digital library atop a new platform in the medium term. Its non-handle URLs are going the way of the dodo. One of my next headaches is figuring out how to communicate that and to whom.
Posted by: Dorothea Salo | February 10, 2009 at 04:18 PM
Okay, Les, so how do we do FRBR over repositories?
Posted by: Dorothea Salo | February 10, 2009 at 04:19 PM
@Theo Once again, other than the obvious fact that I was criticising ERA, I didn't intend to pick on you in particular. My experience (possibly limited) is that this is what repositories are like.
Re: "Just because something is currently done badly doesn’t mean it’s not worth doing and improving upon." I completely agree and if I implied otherwise I certainly didn't intend to. Ultimately my post was intended as constructive criticism, albeit delivered in a tongue in cheek way.
Re: the Web being tangential for repositories. Well, the fact that we, as a repository community, spend a lot of energy focusing on the OAI-PMH (e.g. to paraphrase the current discussions on the jisc-repository lists, "the point of deposit is unimportant because we can harvest between repositories using the OAI-PMH") and almost no energy thinking about Web usability and SEO tells a story IMHO.
Posted by: AndyP | February 10, 2009 at 04:20 PM
Andy, I could not agree more with your last statement. Irritates the living daylights out of me, the idea that machines talking eliminates the need to pay attention to human-computer interaction.
I tend to think that SWORD is going to solve part of this problem on the ingest side, because we'll finally be able to get away from repo software's horrendous ingest UI... but we're a LONG way from solving it on the consumption side still.
Theo, I quickly rechecked my comment to make sure I was beating on DSpace instead of you. Whew. I know full well it's not your fault!
Posted by: Dorothea Salo | February 10, 2009 at 04:26 PM
@les OK, fair cop - I agree with you (and as one of the authors of SWAP I should have known better). If I'd written, "So, from the point of view of the typical end-user, we already have 4 URLs for this article and no widespread understanding or explicit machine-readable knowledge about how they are related" would you have agreed with me?
In usability terms, that is still a problem. Agreed?
Posted by: AndyP | February 10, 2009 at 04:28 PM
@andy Oh alright then. We agree. I was just showing off.
@ds FRBR isn't good enough. A "work" is just an arbitrary stratum in the chop-suey of academic research :-) We need a model that is based on scholarly workflows, not publishing industry workflows!
Posted by: Les carr | February 10, 2009 at 04:52 PM
Okay, Les, I can buy that. Consider the question suitably modified. :)
I know somebody did a taxonomy of the article lifecycle (in fact, it's probably lurking in my Zotero, but I'm lazy today). Taking that (including its correctness) as a given, what can we do to solve the end-user problem Andy correctly points out?
Posted by: Dorothea Salo | February 10, 2009 at 05:08 PM
ironically, i think what may make handles more persistent (at least seemingly) is the fact that they are less usable. my suspicion is that - at least in some time - we may see handles break just like URLs, because of bad planning/maintenance. we just won't notice so much, because they don't see that much use.
at least in the cases where i've seen handles used, there is either not much planning done about how to actually maintain them (for the case resolving has to change), or there's so much infrastructure put into place, that apache and mod_rewrite look like toys.
Posted by: robert forkel | February 10, 2009 at 05:45 PM
Gulp! Sorry Theo, I didn't mean to unleash this on you, honest!
Much of Andy's criticism seems to me quite reasonable, BTW, but driven by DSpace design or implementation issues rather than choices that Edinburgh made. I can't remember, but I suspect we had a conversation about the problem that the Ariadne article was in HTML, which either DSpace or ERA weren't comfortable with. I suspect I cobbled up a PDF for them from a near-final Word file, rather than the PDF being a snapshot of the HTML.
On the persistent IDs issue, even in the case Dorothea mentions of a complete restructuring, I think URIs can be as persistent as Handles (it's a question of where the redirection happens). A local Handle server sounds like one (possibly expensive and partial) way to make your URIs persistent.
But to remind folks, my comment was originally intended to illustrate that the multi-version issue seemed to be handled reasonably well in the real world of Google Scholar. I think one would have to find a much better query than I suggested to really test Andy's Googlejuice reduction hypothesis, though!
Posted by: Chris Rusbridge | February 10, 2009 at 06:08 PM
Just to add, Lorcan picked up this thread for the JISC RPAG (Repositories & Preservation Advisory group, I think), and I responded partly with the text below:
"I hope [Andy] repeats the rant on a victim with an ePrints repository, so that we can see whether it has been a better web resident. Picking on Les Carr, who both runs and deposits in his repository, I found http://eprints.ecs.soton.ac.uk/14352/, which appears better than the DSpace example on a couple of quick tests (eg the HTML page title, etc). I note that its Google Scholar search button produces no result, and my Google Scholar search ("schraefel carr aktivesa") finds the Computer Journal version but not the repository version!"
So Andy, could you do your analysis with one of Les's articles?
BTW I went on to suggest that JISC could/should support some improvements to the DSpace platform (and maybe ePrints, if needed)...
Posted by: Chris Rusbridge | February 11, 2009 at 10:14 AM
" I would naturally tend to assume that the HTML version is more machine-readable than the PDF version and on that basis is 'better', though I admit that I can't provide solid evidence to back up that statement."
The SPECTRa-T project encountered problems with the machine readability of PDFs - documented to some extent in (ironically a PDF!) http://pubs.or08.ecs.soton.ac.uk/16/3/ATonge_OR08.pdf page/slide 10 onwards
Also today Peter Sefton did a talk at dev8D on why many papers end up in PDF, and what he (and others) are doing about it - see http://dev8d.jiscinvolve.org/2009/02/11/lightning-talk-beyond-the-pdf/
Posted by: Owen Stephens | February 11, 2009 at 01:16 PM
I hate to see Andy in such pain. So, I took a stab at a possible solution that builds on existing technologies. Since the description is a bit too long to embed here, I put it up at http://public.lanl.gov/herbertv/papers/Repository_Usability.htm
Posted by: Herbert Van de Sompel | February 11, 2009 at 08:58 PM
@chrisr In fact, the query "schraefel carr aktivesa" does not pick up the paper in Google either, but the query "aktivesa" by itself does! So it is definitely being indexed by Google. An inspection of the logs shows that the PDF was last crawled on Feb 1st and the abstract was last crawled on Feb 7th. It gets freakier. Google Query "aktivesa" works. Google Query "aktivesa carr" works. Google Query "aktivesa carr schraefel" doesn't work. GS Queries on any of the above always pick up the Computer Journal paper but ignore the repository version. HOWEVER when the Google Query does work, the result gets annotated with all the usual trappings of a successful Google Scholar search including a link to the GS cluster for this paper. But in this case, following the GS "show all versions" link takes you to a whole bunch of random rubbish. GS doesn't always get things right! In fact it commonly gets the authors wrong, even though they are marked up in DC metadata on the page!
Posted by: Les Carr | February 12, 2009 at 11:46 AM
@les Well, I just repeated the Google search on schraefel carr aktivesa, cut and pasted from your comment. On the first page of results, the CJ article comes in as number 4. Number 1 was a different article (Info Fusion rather than Tech Demo), number 2 appeared to be the Tech Demo (ie CJ) article on a wiki at http://tw.rpi.edu/wiki/AKTiveSA:_A_Technical_Demonstrator_System_For_Enhanced_Situation_Awareness, and number 3 was this discussion. Yes, the Google results did have GS-style "cited by 1" and related article links. Still no sign of the ECS ePrints version.
So you and I are getting different results for the same search!
Posted by: Chris Rusbridge | February 13, 2009 at 10:50 AM
It's been fascinating to watch this discussion from the 'eye of the storm'...
I'm particularly interested in seeing the possible application of OAI ORE (thanks Herbert). Subject to resourcing, this is certainly on our radar for future development.
And I'm totally on board in terms of the usability issues. This isn't the only part of the Edinburgh University Library online presence which needs attention. At present, we're looking at our homepage, and about to start some usability testing, but I'm keen to widen the process.
On a more positive note, I'll take this opportunity to make the first, cautious, public statement about the adoption of an open access publications policy at Edinburgh!
This has yet to be minuted by Senate (hence the caution) but a paper has been approved. Excellent news, and a driver to resource further development work on ERA! The policy will come into place from the start of 2010.
Posted by: Simon Bains | February 13, 2009 at 03:17 PM
Huh. Everybody see this? http://searchengineland.com/canonical-tag-16537
Posted by: Dorothea Salo | February 14, 2009 at 03:37 AM