November 04, 2011

JISC Digital Infrastructure programme call

JISC currently has a call, 16/11: JISC Digital infrastructure programme, open for project proposals in a number of "strands"/areas, including the following:

  • Resource Discovery: "This programme area supports the implementation of the resource discovery taskforce vision by funding higher education libraries archives and museums to make open metadata about their collections available in a sustainable way. The aim of this programme area is to develop more flexible, efficient and effective ways to support resource discovery and to make essential resources more visible and usable for research and learning."

This strand advances the work of the UK Discovery initiative, and is similar to the "Infrastructure for Resource Discovery" strand of the JISC 15/10 call under which the SALDA project (in which I worked with the University of Sussex Library on the Mass Observation Archive data) was funded. There is funding for up to ten projects of between £25,000 and £75,000 per project in this strand

First, I should say this is a great opportunity to explore this area of work and I think we're fortunate that JISC is able to fund this sort of activity. A few particular things I noticed about the current call:

  • a priority for "tools and techniques that can be used by other institutions"
  • a focus on unique resources/collections not duplicated elsewhere
  • should build on lessons of earlier projects, but must avoid duplication/reinvention
  • a particular mention of "exploring the value of integrating structured data into webpages using microformats, microdata, RDFa and similar technologies" as an area in scope
  • an emphasis on sharing the experience/lessons learned: "The lessons learned by projects funded under this call are expected to be as important as the open metadata produced. All projects should build sharing of lessons into their plans. All project reporting will be managed by a project blog. Bidders should commit to sharing the lessons they learn via a blog"

Re that last point, as I've said before, one of the things I most enjoyed about the SALDA and LOCAH projects was the sense that we were interested in sharing the ideas as well as getting the data out there.

I'm conscious the clock is ticking towards the submission deadline, and I should have posted this earlier, but if anyone reading is considering a proposal and thinks that I could make a useful contribution, I'd be interested to hear from you. My particular areas of experience/interest are around Linked Data, and are probably best reflected by the posts I made on the LOCAH and SALDA blogs, i.e. data modelling, URI pattern design, identification/selection of useful RDF vocabularies, identification of potential relationships with things described in other datasets, construction of queries using SPARQL, etc. I do have some familiarity with RDFa, rather less with microdata and microformats. I'm not a software developer, but I can do a little bit of XSLT (and maybe enough PHP to be dangerous hack together rather flakey basic demonstrators). And I'm not a technical architect, but I did get some experience of working with triple stores in those recent projects.

My recent work has been mainly with archival metadata, and I'd be particularly interested in further work which complements that. I'm conscious of the constraint in the call of not repeating earlier work, so I don't think "reapplying" the sort of EAD to RDF work I did with LOCAH and SALDA would fit the bill. (I'd love to do something around the event/narrative/storytelling angle that I wrote about recently here, for example.) Having said that, I certainly don't mean to limit myself to archival data. Anyway, if you think I might be able to help, please do get in touch (

August 15, 2011

Two ends and one start

The end of July formally marked the end of two projects I've been contributing to recently:

There are still some things to finish for LOCAH, particularly the publication of the Copac data.

A few (not terribly original or profound) thoughts, prompted mainly by my experience of working with the Archives Hub data:

  • Sometimes data is "clean" and consistent and "normalised", but more often than not it has content in inconsistent forms or is incomplete: it is "messy". Data aggregated from multiple sources over a long period of time is probably going to be messier. (With an XML format like EAD, with what I think of as its somewhat "hybrid" document/data character, the potential for messiness increases).
  • Doing things with data requires some sort of processing by software, and while there are off-the-shelf apps and various configurable tools and frameworks that can provide some functions, some element of software development is usually required.
  • It may be relatively easy to identify in advance the major tasks where developer effort is required, and to plan for that, but sometimes there are additional tasks which it's more difficult to anticipate; rather they emerge as you attempt some of the other processes (and I think messy data probably makes that more likely).
  • Even with developers on board, development work has to be specified, and that is a task in itself, and one that can be quite complex and time-consuming (all the more so if you find yourself trying to think through and describe what to do with a range of different inputs from a messy data source).

It's worth emphassing that most of the above is not specific to generating Linked Data: it applies to data in all sorts of formats, and it applies to all sorts of tasks, whether you're building a plain old Web site or exposing RSS feeds or creating some other Web application to do something with the data.

Sort of related to all of the above, I was reminded that my limited programming skills often leave me in the position where I can identify what needs doing but I'm not able to "make it happen", and that is something I'd like to try to change. I can "get by" in XSLT, and I can do a bit of PHP, but I'd like to get to the point where I can do more of the simpler data manipulation tasks and/or pull together simple presentation prototypes.

I've enjoyed working on both the projects. I'm pleased that we've managed to deliver some Linked Data datasets, and I'm particularly pleased to have been able to contribute to doing this for archival description data, as it's a field in which I worked quite a long time ago.

Both projects gave me opportunities to learn by actually doing stuff, rather than by reading about it or prodding at other people's data. And perhaps more than anything, I've enjoyed the experience of working together as a group. We had a lot of discussion and exchange of ideas, and I'd like to think that this process was valuable in itself. In an early post on the SALDA blog, Karen Watson noted:

it is perhaps not the end result that is most important but the journey to get there. What we hope to achieve with SALDA is skills and knowledge to make our catalogues Linked Data and use those skills and that knowledge to inform decisions about whether it would be beneficial for make all our data Linked Data.

Of course it wasn't all plain sailing, and particularly near the start there were probably times when we ran up against differences of perception and terminology. Jane Stevenson has written about some of these issues from her point of view on the LOCAH blog (e.g. here, here and here). As the projects progressed, I think we moved closer towards more of a shared understanding - and I think that is a valuable "output" in its own right, even if it is one which it may be rather hard to "measure".

So, a big thank you to everyone I worked with on both projects.

Looking forward, I'm very pleased to be able to say that Jane prepared a proposal for some further work with the Archives Hub data, under JISC's Capital Funded Service Enhancements initiative, and that bid has been successful, and I'll be contributing some work to that project as a consultant. The project is called "Linking Lives" and is focused on providing interfaces for researchers to explore the data (as well as making any enhancements/extensions to the data and the "back-end" processes that are required to enable that). More on that work to come once we get moving.

Finally, as I'm sure many of you are aware, JISC recently issued some new calls for projects, including call 13/11 for projects "to develop services that aim to make content from museums, libraries and archives more visible and to develop innovative ways to reuse those resources". If there are any institutions out there who are considering proposals and think that I could make a useful contribution - despite my lamenting my limitations above, I hope my posts on the LOCAH and SALDA blogs give an idea of the sorts of areas I can contribute to! - , please do get in touch.

May 11, 2011

LOCAH releases Linked Archives Hub dataset

The LOCAH project, one of the two JISC-funded projects to which I've been contributing, this week announced the availability of an initial batch of data derived from a small subset of the Archives Hub EAD data as linked data. The homepage for what we have called the "Linked Archives Hub" dataset is

The project manager, Adrian Stevenson of UKOLN, provides an overview of the dataset, and yesterday I published a post which provides a few example SPARQL queries.

I'm very pleased we've got this data "out there": it feels as if we've been at the point of it being "nearly ready" for a few weeks now, but a combination of ironing out various technical wrinkles (I really must remember to look at pages in Internet Explorer now and again) and short working weeks/public holidays held things up a little. It is very much a "first draft": we have additional work planned on making more links with other resources, and there are various things which could be improved (and it seems to be one of those universal laws that as soon as you open something up, you spot more glitches...). But I hope it's enough to demonstrate the approach we are taking to the data, and to provide a small testbed that people can poke at and experiment with.

(If you have any specific comments on content of the LOCAH dataset, it's probably better to post them over on the LOCAH blog where other members of the project team can see and respond to them).

April 08, 2011

Scholarly communication, open access and disruption

I attended part of UKSG earlier this week, listening to three great presentations in the New approaches to research session on Monday afternoon (by Philip Bourne, Cameron Neylon and Bill Russell) and presenting first thing Tuesday morning in the Rethinking 'content' session.

(A problem with my hearing meant that I was very deaf for most of the time, making conversation in the noisy environment rather tiring, so I decided to leave the conference early Tuesday afternoon. Unfortunately, that meant that I didn't get much of an opportunity to network with people. If I missed you, sorry. Looking at the Twitter stream, it also meant that I missed what appear to have been some great presentations on the final day. Shame.)

Anyway, for what it's worth, my slides are below. I was speaking on the theme of 'open, social and linked', something that I've done before, so for regular readers of this blog there probably won't be too much in the way of news.

With respect to the discussion of 'social' and it's impact on scholarly communication, there is room for some confusion because 'social' is often taken to mean, "how does one use social media like Facebook, Twitter, etc. to support scholarly communication?". Whilst I accept that as a perfectly sensible question, it isn't quite what I meant in this talk. What I meant was that we need to better understand the drivers for social activity around research and research artefacts, which probably needs breaking down into the various activities that make up the scholarly research workflow/cycle, in order that we can build tools that properly support that social activity. That is something that I don't think we have yet got right, particularly in our provision of repositories. Indeed, as I argued in the talk, our institutional repository architecture is more or less in complete opposition to the social drivers at play in the research space. Anyway... you've heard all this from me before.

Cameron Neylon's talk was probably the best of the ones that I saw and I hope my talk picked up on some of the themes that he was developing. I'm not sure if Cameron's UKSG slides are available yet but there's a very similar set, The gatekeeper is dead, long live the gatekeeper, presented at the STM Innovation Seminar last December. Despite the number of slides, these are very quick to read thru, and very understandable, even in the absence of any audio. On that basis, I won't re-cap them here. Slides 112 onwards give a nice summary: "we are the gatekeepers... enable, don't block... build platforms, not destinations... sell services, not content... don't think about filtering or control... enable discovery". These are strong messages for both the publishing community and libraries. All in all, his points about 'discovery defecit' rather than 'filter failure' felt very compelling to me.

On the final day there were talks about open access and changing subscription models, particularly from 'reader pays' to 'author pays', based partly on the recently released study commissioned by the Research Information Network (RIN), JISC, Research Libraries UK (RLUK), the Publishing Research Consortium (PRC) and the Wellcome Trust, Heading for the open road: costs and benefits of transitions in scholarly communications. We know that the web is disruptive to both publishers and libraries but it seemed to me (from afar) that the discussions at UKSG missed the fact that the web is potentially also disruptive to the process of scholarly communication itself. If all we do is talk about shifting the payment models within the confines of current peer-review process we are missing a trick (at least potentially).

What strikes me as odd, thinking back to that original hand-drawn diagram of the web done by Tim Berners-Lee, is that, while the web has disrupted almost every aspect of our lives to some extent, it has done relatively little to disrupt scholarly communication except in an 'at the margins' kind of way. Why is that the case? My contention is that there is such a significant academic inertia to overcome, coupled with a relatively small and closed 'market', that the momentum of change hasn't yet grown sufficiently - but it will. The web was invented as a scholarly device, yet it has, in many ways, resulted in less transformation there than in most other fields. Strange?

Addendum: slides for Philip Bourne's talk are now available on Slideshare.

March 28, 2011

Waiter, my resource discovery glass is half-empty

Old joke...

A snail goes into a pub and says to the barman, "I've just been mugged by two tortoises". The barman looks a bit shocked and says, "Oh no, that's terrible. What happened?".

The snail responds, "I dunno, it all happened so fast".


I had a bit of a glass half-empty moment last week, listening to the two presentations in the afternoon session of the ESRC Resource Discovery workshop, the first by Joy Palmer about the MIMAS-led Resource Discovery Task Framework Management Framework and the second by Lucy Bell about the UKDA resource discovery project. Not that there was anything wrong with either presentation. But it struck me that they both used phrases that felt very familiar in the context of resource discovery in the cultrual heritage and education space over the last 10 years or so (probably longer) - "content locked in sectoral silos", "the need to work across multiple websites, each with their own search idiosyncracies", "the need to learn and understand multiple vocabularies", and so on.

In a moment of panic I said words to the effect of, "We're all doomed. Nothing has changed in the last 10 years. We're going round in circles here". Clearly rubbish... and, looking at the two presentations now, it's not clear why I reached that particular conclusion anyway. I asked the room why this time round would be different, compared with previous work on initiatives like the UK JISC Information Environment, and got various responses about community readiness, political will, better stakeholder engagement and what not. I mean, for sure, lots of things have changed in the last 10 years - I'm not even sure the alphabet contained the three letters A, P and I back then and the whole environment is clearly very different - but it is also true that some aspects of the fundamental problem remain largely unchanged. Yes, there are a lot more cultural heritage, scientific and educational resources out there (being made available from within those sectors) but it's not always clear the extent to which that stuff is better joined up, open and discoverable than it was at the turn of the century?

There is a glass half-full view of the resource discovery world, and I try to hold onto it most of the time, but it certainly helps to drink from the Google water fountain! Hence the need for initiatives like the UK Resource Discovery Task Force to emphasise the 'build better websites' approach. We're talking about cultural change here, and cultural change takes time. Or rather, the perceived rate of cultural change tends to be relative to the beholder.

March 25, 2011

RDTF metadata guidelines - next steps

A few weeks ago I blogged about the work that Pete and I have been doing on metadata guidelines as part of the JISC/RLUK Resource Discovery Task Force, RDTF metadata guidelines - Limp Data or Linked Data?.

In discussion with the JISC we have agreed to complete our current work in this area by:

  • delivering a single summary document of the consultation process around the current draft guidelines, incorporating the original document and all the comments made using the JISCpress site during the comment period; and
  • suggesting some recommendations about any resulting changes that we would like to see made to the draft guidelines.

For the former, a summary view of the consultation is now available. It's not 100% perfect (because the links between the comments and individual paragraphs are not retained in the summary) but I think it is good enough to offer a useful overview of the draft and the comments in a single piece of text. Furthermore, the production of this summary was automated (by post-processing the export 'dump' from Wordpress), so the good news is that a similar view can be obtained for any future (or indeed past) JISCpress consultations.

For the latter, this blog post forms our recommendations.

As noted previously, there were 196 comments during the comment period (which is not bad!), many of which were quite detailed in terms of particular data models, formats and so on. On the basis that we do not know quite what form any guidelines might take from here on (that is now the responsibility of the RDTF team at MIMAS I think), it doesn't seem sensible to dig into the details too much. Rather, we will make some comments on the overall shape of the document and suggest some areas where we think it might be useful for JISC and RLUK to undertake additional work.

You may recall that our original draft proposed three approaches to exposing metadata, which we refered to as the community formats approach, the RDF data approach and the Linked Data approach. In light of comments (particularly those from Owen Stephens and Paul Walk) we have been putting some thought into how the shape of the whole document might be better conceptualised. The result is the following four-quadrant model:

Like any simple conceptualisation, there is some fuzziness in this but we think it's a useful way of thinking about the space.

Traditionally (in the world of library, museum and archives at least), most sharing of metadata has happened in the bottom-left quadrant - exchanging bulk files of MARC records for example. And, to an extent, this approach continues now, even outside those sectors. Look at the large amount of 'open data' that is shared as CSV files on sites like for example. Broadly speaking, this is what we refered to as the community formats approach (though I think our inclusion of the OAI-PMH in that area probably muddied the waters a little - see below).

One can argue that moving left to right across the quadrants offers semantically richer metadata in a 'small pieces, loosely joined' kind of way (though this quickly becomes a religious argument with no obvious point of closure! :-) ) and that moving bottom to top offers the ability to work with individual item descriptions rather than whole collections of them - and that, in particular, it allows for the assignment of 'http' URIs to those descriptions and the dereferencing of those URIs to serve them.

Our three approaches covered the bottom-left, bottom-right and top-right quadrants. The web, at least in the sense of serving HTML pages about things of interest in libraries, museums and archives, sits in the top-left quadrant (though any trend towards embedded RDFa in HTML pages moves us firmly towards the top-right).

Interestingly, especially in light of the RDTF mantra to "build better websites", our guidelines managed to miss that quadrant. In their comments, Owen and Paul argued that moving from bottom to top is more important than moving left to right - and, on balance, we tend to agree.

So, what does this mean in terms of our recommendations?

We think that the guidelines need to cover all four quadrants and that, in particular, much greater emphasis needs to be placed on the top-left quadrant. Any guidance needs to be explicit that the 'http' URIs assigned to descriptions served on the web are not URIs for the things being described; that, typically, multiple descriptions may be served for the things being described (an HTML page and an XML document for example, each of which will have separate URIs) and that mechanisms such as '<link rel="alternative" ...>' can be used to tie them together; and that Google sitemaps (on the left) and semantic sitemaps (on the right) can be used to guide robots to the descriptions (either individually or in collections).

Which leaves the issue of the OAI-PMH. In a sense, this sits along-side the top-left quadrant - which is why, I think, it didn't fit particularly well with our previous three approaches. If you think about a typical repository for example, it is making descriptions of the content it holds available as HTML 'splash' pages (sometimes with embedded links to descriptions in other formats). In that sense it is functioning in top-left, "page per thing", mode. What the OAI-PMH does is to give you a protocol mechanism for getting at that those descriptions in a way that is useful for harvesting.

Several people noted that Atom and RSS might be used as an alternative to both sitemaps and the OAI-PMH, and we agree - though it may be that some additional work is needed to specify the exact mechanisms for doing so.

There were some comments on our suggestion to follow the UK government guidelines on assigning URIs. On reflection, we think it would make more sense to recommend only the W3C guidelines on Cool URIs for the Semantic Web, particularly on the separation of things from the desriptions of things, suggesting that it may be sensible to fund (or find) more work in this area making specific recommendations around persistent URIs (for both things and their descriptions).

Finally, there were a lot of comments on the draft guidelines about our suggested models and formats - notably on FRBR, with many commenters suggesting that this was premature given significant discussion around FRBR elsewhere. We think it would make sense to separate out any guidance on conceptual models and associated vocabularies, probably (again) as a separate piece of work.

To summarise then, we suggest:

  • that the four-quadrant model above is used to frame the guidelines - we think all four quadrants are useful, and that there should probably be some guidance on each area;
  • that specific guidance be developed for serving an HTML page description per 'thing' of interest (possibly with associated, and linked, alternative formats such as XML);
  • that guidance be developed (or found) about how to sensibly assign persistent 'http' URIs to everything of interest (including both things and descriptions of things);
  • that the definition of 'open' needs more work (particularly in the context of whether commercial use is allowed) but that this needs to be sensitive to not stirring up IPR-worries in those domains where they are less of a concern currently;
  • that mechanisms for making statement of provenance, licensing and versioning be developed where RDF triples are being made available (possibly in collaboration with Europeana work); and
  • that a fuller list of relevant models that might be adopted, the relationships between them, and any vocabularies commonly associated with them be maintained separately from the guidelines themselves (I'm trying desperately not to use the 'registry' word here!).

February 25, 2011

RDTF metadata guidelines - Limp Data or Linked Data?

Having just finished reading thru the 196 comments we received on the draft metadata guidelines for the UK RDTF I'm now in the process of wondering where we go next. We (Pete and I) have relatively little effort to take this work forward (a little less than 5 days to be precise) so it's not clear to me how best we use that effort to get something useful out for both RDTF and the wider community.

By the way... many thanks to everyone who took the time to comment. There are some great contributions and, if nothing else, the combination of the draft and the comments form a useful snapshot of the current state of issues around library, museum and archival metadata in the context of the web.

Here's my brief take on what the comments are saying...

Firstly, there were several comments asking about the target audience for the guidelines and whether, as written, they will be meaningful to... well... anyone I guess! It's worth pointing out that my understanding is that any guidelines we come up with thru the current process will be taken forward as part of other RDTF work. What that means is that the guidelines will get massaged into a form (or forms) that are digestable by the target audience (or audiences), as determined by other players with the RDTF activity. What we have been tasked with are the guidelines themselves - not how they are presented. We perhaps should have made this clearer in the draft guidelines. In short, I don't think the document, as written, will be put directly in-front of anyone who doesn't go to the trouble of searching it out explicitly.

Secondly, there were quite a number of detailed comments on particular data formats, data models, vocabularies and so on. This is great and I'm hopeful that as a result we can either extend the list of examples given at various points in the guidelines (or, in some cases, drop back to not having examples but simply say, "do whatever is the emerging norm here in your community").

Thirdly, the were some concerns about what we meant by "open". As we tried to point out in the draft, we do not consider this to be our problem - it is for other activity in RDTF to try and work out what "open" means - we just felt the need to give that word a concrete definition, so that people could understand where we were coming from for the purposes of these guidelines.

Finally, there were some bigger concerns - these are the things that are taxing me right now - that broadly fell into two, related, camps. Firstly, that the step between the community formats approach and the RDF data approach is too large (though no-one really suggested what might go in the middle). And secondly, that we are missing a trick by not encouraging the assignment of 'http' URIs to resources as part of the community formats approach.

As it stands, we have, on the one hand, what one might call Limp Data (MARC records, XML files, CVS, EAD and the rest) and, on the other, Linked Data and all that entails, with a rather odd middle ground that we are calling RDF data (in the current guidelines).

I was half hoping that someone would simply suggest collapsing our RDF data and Linked Data approaches into one - on the basis that separating them into two is somewhat confusing (but as far as I can tell no-one did... OK, I'm doing it now!). That would leave a two-pronged approach - community formats and Linked Data - to which we could add a 'community formats with http URIs' middle ground. My gut feel is that there is some attraction in such an approach but I'm not sure how feasible it is given the characteristics of many existing community formats.

As part of his commentary around encouraging http URIs (building a 'better web' was how he phrased it), Owen Stephens suggested that every resource should have a corresponding web page. I don't disagree with this... well, hang on... actually I do (at least in part)! One of the problems faced by this work is the fundamental difference between the library world and museums and archives. The former is primarily dealing with non-unique resources (at the item level), the latter with unique resources. (I know that I'm simplifying things here but bear with me). Do I think that resource discovery will be improved if every academic library in the UK (or indeed in the world) creates an http URI for every book they hold at which they serve a human-readable copy of their catalogue record? No, I don't. If the JISC and RLUK really want to improve web-scale resource discovery of books in the library sector, they would be better off spending their money encouraging libraries to sign up to OCLC WorldCat and contributing their records there. (I'm guessing that this isn't a particular popular viewpoint in the UK - at least, I'm not sure that I've ever heard anyone else suggest it - but it seems to me that WorldCat represents a valuable shared service approach that will, in practice, be hard to beat in other ways.) Doing this would both improve resource discovery (e.g. thru Google) and provide a workable 'appropriate copy' solution (for books). Clearly, doing this wouldn't help build a more unified approach across the GLAM domains but, as at least one commenter pointed out, it's not clear that the current guidelines do either. Note: I do agree with Owen that every unique museum and archival resource should have an http URI and a web page.

All of which, as I say, leaves us with a headache in terms of how we take these guidelines forward. Ah well... such is life I guess.

February 03, 2011

Metadata guidelines for the UK RDTF - please comment

As promised last week, our draft metadata guidelines for the UK Resource Discovery Taskforce are now available for comment in JISCPress. The guidelines are intended to apply to UK libraries, museums and archives in the context of the JISC and RLUK Resource Discovery Taskforce activity.

The comment period will last two weeks from tomorrow and we have seeded JISCPress with a small number of questions (see below) about issues that we think are particularly worth addressing. Of course, we welcome comments on all aspects of the guidelines, not just where we have raised issues. (Note that you don't have to leave public comments in JISCPress if you don't want to - an email to me or Pete will suffice. Or you can leave a comment here.)

The guidelines recommend three approaches to exposing metadata (to be used individually or in combination), referred to as:

  1. the community formats approach;
  2. the RDF data approach;
  3. the Linked Data approach.

We've used words like 'must' and 'should' but it is worth noting that at this stage we are not in a position to say how these guidelines will be applied - if at all. Nor whether there will be any mechanisms for compliance put in place. On that basis, treat phrases like 'must do this' as meaning, 'you must do this for your metadata to comply with one or other approach as recommended by these guidelines' - no more, no less. I hope that's clear.

When we started this work, we we began by trying to think about functional requirements - always a good place to start. In this case however, that turned out not to make much sense. We are not starting from a green field here. Lots of metadata formats are already in use and we are not setting out with the intent of changing current cataloguing practice across libraries, museums and archives. What we can say is that:

  1. we have tried to keep as many people happy as possible (hence the three approaches), and
  2. we want to help libraries, museums and archives expose existing metadata (and new metadata created using existing practice) in ways that support the development of aggregator services and that integrate well with the web (of data).

As mentioned previously, the three approaches correspond roughly to the 3-star, 4-star and 5-star ratings in the W3C's Linked Data Star Ratings Scheme. To try and help characterise them, we prepared the following set of bullet points for a meeting of the RDTF Technical Advisory Group earlier this week:

The community data approach

  • the “give us what you’ve got” bit
  • share existing community formats (MARC, MODS, BibTeX, DC, SPECTRUM, EAD, XML, CSV, JSON, RSS, Atom, etc.) over RESTful HTTP or OAI-PMH
  • for RESTful HTTP, use sitemaps and robots.txt to advertise availability and GZip for compression
  • for CSV, give us a column called ‘label’ or ‘title’ so we’ve got something to display and a column called 'identifier' if you have them
  • provide separate records about separate resources
  • simples!

The RDF data approach

  • use RDF
  • model according to FRBR, CIDOC CRM or EAD and ORE where you can
  • re-use existing vocabularies where you can
  • assign URIs to everything of interest
  • make big buckets of RDF (e.g. as RDF/XML, N-Tuples, N-Quads or RDF/Atom) available for others to play with
  • use Semantic Sitemaps and the Vocabulary of Interlinked Datasets (VoID) to advertise availability of the buckets

The Linked Data approach

Dunno if that is a helpful summary but we look forward to your comments on the full draft. Do your worst!

For the record, the issues we are asking questions about mainly fall into the following areas:

  • is offering a choice of three approaches helpful?
  • for the community formats approach, are the example formats we list correct, are our recommendations around the use of CSV useful and are JSON and Atom significant enough that they should be treated more prominently?
  • does the suggestion to use FRBR and CIDOC CRM as the basis for modeling in RDF set the bar too high for libraries and museums?
  • where people are creating Linked Data, should we be recommending particular RDF datasets/vocabularies as the target of external links?
  • do we need to be more prescriptive about the ways that URIs are assigned and dereferenced?

Note that a printable version of the draft is also available from Google Docs.

January 26, 2011

Metadata guidelines for the UK Resource Discovery Taskforce

We (Pete and I) have been asked by the JISC and RLUK to develop some metadata guidelines for use by the UK Resource Discovery Taskforce as it rolls out its vision [PDF].

This turns out to be a non-trivial task. The vision covers libraries, museums and archives and is intended to:

focus on defining the requirements for the provision of a shared UK infrastructure for libraries, archives, museums and related resources to support education and research. The focus will be on catalogues/metadata that can assist in access to objects/resources. With a special reference to serials, books, archives/special collections, museum collections, digital repository content and other born digital content. This will interpret the shared UK infrastructure as part of global information provision.

(Taken from the Resource Discovery Taskforce Term of Reference)

The vision itself talks of a "collaborative, aggregated and integrated resource discovery and delivery framework" which implies an approach based on harvesting metadata (and other content) rather than cross-searching.

If the last 15 years or so have taught me anything, it's not to expect much coming together of metadata practice across those three sectors! Add to that a wide spectrum of attitudes to Linked Data and its potential value in this space, an unclear picture about the success of Europeana and its ESE [PDF] and EDM [PDF] metadata formats, the apparent success of somewhat "permissive" metadata-based initiatives such as Digital NZ and you are left with a a range of viewpoints from "Keep calm and carry on" thru to "Throw it all away and use Linked Data" and everything in between.

At this point in time, we are taking the view that Tim Berners-Lee's star rating system for linked open data provides a useful framework for this work. However, as I have indicated elsewhere, Mugging up on the linked open data star ratings, it is rather unhelpful that the definition of each of the stars seems to be somewhat up for grabs at the moment (more or less in line with the ongoing, and quite probably endless, debate about the centrality of RDF and SPARQL to Linked Data). On that basis, we will almost certainly have to provide our own definitions for the purposes of these guidelines. Note that using this star rating system does not mean that everything has to use RDF.

Anyway... all of that is currently our problem, so I won't burden you with it :-)

The real purpose of this post is simply to say that we hope to make a draft of our metadata guidelines available during next week (I'm not willing to commit to a specific day at this point in time!), at which point we hope that people will share their thoughts on what we've come up with. That said, time is reasonably tight so I don't expect to be able to give people more than a couple of weeks (at most) to comment.

November 03, 2010

Google support for GoodRelations

Google have announced support for the GoodRelations vocabulary for product and price information in Web pages, Product properties: GoodRelations and hProduct. This is primarily of interest to ecommerce sites but is more generally interesting because it is likely to lead to a significant rise in the amount of RDF flowing around the Web. It therefore potentially represents a significant step forward for the adoption of the Semantic Web and Linked Data.

Martin Hepp, the inventor of the GoodRelations vocabulary, has written about this development, Semantic SEO for Google with GoodRelations and RDFa, suggesting a slightly modified form of markup which is compatible with that adopted by Google but that is also "understood by ALL RDFa-aware search engines, shopping comparison sites, and mobile services".

September 06, 2010

On funding and sustainable services

I write this post with some trepidation, since I know that it will raise issues that are close to the hearts of many in the community but discussion on the jisc-repositories list following Steve Hitchcock's post a few days ago (which I posted in full here recently) has turned to the lessons that the withdrawl of JISC funding for the Intute service might teach us in terms of transitioning JISC- (or other centrally-) funded activities into self-sustaining services.

I'm reminded of a recent episode of the Dragon's Den on BBC TV where it emerged that the business idea being proposed for investment had survived thus far on European project funding. The dragons took a dim view of this, on the basis, I think, that such funding would only rarely result in a viable business because of a lack of exposure to 'real' market forces and the proposer was dispatched forthwith (the dragons clearly never having heard of Google! :-) ).

On the mailing list, views have been expressed that projects find it hard to turn into services because they attract the wrong kind of staff, or that the IPR situation is wrong, or that they don't get good external business advice. All valid points I'm sure. But I wonder if one could make the argument that it is the whole model of centralised project funding for activities that are intended to transition into viable, long-term, self-sustaining businesses that is part of the problem. (Note: I don't think this applies to projects that are funded purely in the pursuit of knowledge). By that I mean that such funding tends to skew the market in rather unhelpful ways, not just for the projects in question but for everyone else - ultimately in ways that make it hard for viable business models to emerge at all.

There are a number of reasons for this - reasons that really did not become apparent to me until I started working for an organisation that can only survive by spending all its time worrying about whether its business models are viable.

Firstly, centralised funding tends to mean that ideas are not subject to market forces early enough - not just not subjected, but market forces are not even considered by those proposing/evaluating the projects. Often we can barely get people to use the results of project funding when we give them away for free - imagine if we actually tried to charge people for them!? The primary question is not, 'can I get user X or institution Y to pay for this?' but 'can I get the JISC to pay for this?' which is a very different proposition.

Secondly, centralised funding tends to support people (often very clever people) who can then cherry-pick good ideas and develop them without any concern for sustainable business-models, and who subsequently may or may not be in a position to support them long term, but who thus prevent others, who might develop something more sustainable, from even getting started.

Thirdly, the centrally-funded model contributes to a wider 'free-at-the-point-of-use' mindset where people simply are not used to thinking in terms of 'how much is it really costing to do this?' and 'what would somebody actually be prepared to pay for this?' and where there is little incentive to undertake a cost/benefit analysis or prepare a proper business case. As I've mentioned here before, I've been on the receiving end of many proposals under the UCISA Award for Excellence programme that were explicitly asked to assess their costs and benefits but who chose to treat staff time at zero cost simply because those staff were in the employ of the institutions anyway.

Now... before you all shout at me, I don't think market forces are the be-all and end-all of this and I think there are plenty of situations where services, particularly infrastructural services, are better procured centrally than by going out to the market. This post is absolutely not a rant that everything funded by the JISC is necessarily pants - far from it.

That said, my personal view is that Intute did not fall into that class of infrastructural service and that it was rightly subjected to an analysis of whether its costs outweighed its benefits. I wasn't involved in that analysis, so I can't really comment on it - I'm sure there is a debate to be had about how the 'benefits' were assessed and measured. But my suspicion is that if one had asked every UK HE institution to pay a subscription to Intute not many would have been willing to do so - were that the case, I presume that Intute would be exploring that model right now? That, it seems to me, is the ultimate test of viability - or at least one of them. As I mentioned before, one of the lessons here is the speed with which we, as a community, can react to the environmental changes around us and how we deal with the fall-out - which is as much about how the viability of business models changes over time as it is about technology.

I certainly don't think there are any easy answers.

Comparing Yahoo Directory and the eLib Subject Gateways (the fore-runners of Intute), which emerged at around the same time and which attempted to meet a similar need (see Lorcan Dempsey's recent post, Curating the web ...), it's interesting that the Yahoo offering has proved to be longer lasting than the subject gateways, albeit in a form that is largely hidden from view, supported (I guess) by an advertising- and paid-for-listings- based model, a route that presumably wasn't/isn't considered appropriate or sufficient for an academic service?

Addendum (8 September 2010): Related to this post, and well worth reading, see Lorcan Dempsey's post from last year, Entrepreneurial skills are not given out with grant letters.

September 01, 2010

Lessons of Intute

Many years ago now, back when I worked for UKOLN, I spent part of my time working on the JISC-funded Intute service (and the Resource Discovery Network (RDN) that went before it), a manually created catalogue of high-quality Internet resources. It was therefore with some interest that I read a retrospective about the service in the July issue of Ariadne. My involvement was largely with the technology used to bring together a pre-existing and disparate set of eLib 'subject gateways' into a coherent whole. I was, I suppose, Intute's original technical architect, though I doubt if I was ever formally given that title. Almost inevitably, it was a role that led to my involvement in discussions both within the service and with our funders (and reviewers) at the time about the value (i.e. the benefits vs the costs) of such a service - conversations that were, from my point of view, always quite difficult because they involved challenging ourselves about the impact of our 'home grown' resource discovery services against those being built outside the education sector - notably, but not exclusively, by Google :-). 

Today, Steve Hitchcock of Southampton posted his thoughts on the lessons we should draw from the history of Intute. They were posted originally to the jisc-repositories mailing list. I repeat the message, with permission and in its entirety, here:

I just read the obituary of Intute, and its predecessor JISC services, in Ariadne with interest and some sadness, as will others who have been involved with JISC projects over this extended period. It rightly celebrates the achievements of the service, but it is also balanced in seeking to learn the lessons for where it is now.

We must be careful to avoid partial lessons, however. The USP of Intute was 'quality' in its selection of online content across the academic disciplines, but ultimately the quest for quality was also its downfall:

"Our unique selling point of human selection and generation of descriptions of Web sites was a costly model, and seemed somewhat at odds with the current trend for Web 2.0 technologies and free contribution on the Internet. The way forward was not clear, but developing a community-generated model seemed like the only way to go."

Unfortunately it can be hard for those responsible for defining and implementing quality to trust others to adhere to the same standards: "But where does the librarian and the expert fit in all of this? Are we grappling with new perceptions of trust and quality?" It seems that Intute could not unravel this issue of quality and trust of the wider contributor community. "The market research findings did, however, suggest that a quality-assurance process would be essential in order to maintain trust in the service". It is not alone, but it is not hard to spot examples of massively popular Web services that found ways to trust and exploit community.

The key to digital information services is volume and speed. If you have these then you have limitless opportunities to filter 'quality'. This is not to undermine quality, but to recognise that first we have to reengineer the information chain. Paul Ginsparg reengineered this chain in physics, but he saw early on that it would be necessary to rebuild the ivory towers:

"It is clear, however, that the architecture of the information data highways of the future will somehow have to reimplement the protective physical and social isolation currently enjoyed by ivory towers and research laboratories."

It was common at that time in 1994 to think that the content on the emerging Web was mostly rubbish and should be swept away to make space for quality assured content. A senior computer science professor said as much in IEEE Computer magazine, and as a naive new researcher I replied to say he was wrong and that speed changes everything.

Clearly we have volume of content across the Web; only now are we beginning to see the effect of speed with realtime information services.

If we are to salvage something from Intute, as seems to be the aim of the article, it must be to recognise the relations on the digital information axis between volume, speed and quality, not just the latter, even in the context of academic information services.

Steve Hitchcock

Steve's comments were made in the context of repositories but his final paragraph struck a chord with me more generally, in ways that I'm struggling to put into words.

My involvement with Intute ended some years ago and I can't comment on its recent history but, for me, there are also lessons in how we recognise, acknowledge and respond to changes in the digital environment beyond academia - changes that often have a much larger impact on our scholarly practices than those we initiate ourselves. And this is not a problem just for those of us working on developing the component services within our environment but for the funders of such activities.

August 24, 2010

Resource discovery revisited...

...revisited for me that is!

Last week I attended an invite-only meeting at the JISC offices in London, notionally entitled a "JISC IE Technical Review" but in reality a kind of technical advisory group for the JISC and RLUK Resource Discovery Taskforce Vision [PDF], about which the background blurb says:

The JISC and RLUK Resource Discovery Taskforce was formed to focus on defining the requirements for the provision of a shared UK resource discovery infrastructure to support research and learning, to which libraries, archives, museums and other resource providers can contribute open metadata for access and reuse.

The morning session felt slightly weird (to me), a strange time-warp back to the kinds of discussions we had a lot of as the UK moved from the eLib Programme, thru the DNER (briefly) into what became known (in the UK) as the JISC Information Environment - discussions about collections and aggregations and metadata harvesting and ... well, you get the idea.

In the afternoon we were split into breakout groups and I ended up in the one tasked with answering the question "how do we make better websites in the areas covered by the Resource Discovery Taskforce?", a slightly strange question now I look at it but one that was intended to stimulate some pragmatic discussion about what content providers might actually do.

Paul Walk has written up a general summary of the meeting - the remainder of this post focuses on the discussion in the 'Making better websites' afternoon breakout group and my more general thoughts.

Our group started from the principles of Linked Data - assign 'http' URIs to everything of interest, serve useful content (both human-readable and machine-processable (structured according to the RDF model)) at those URIs, and create lots of links between stuff (internal to particular collections, across collections and to other stuff). OK... we got slightly more detailed than that but it was a fairly straight-forward view that Linked Data would help and was the right direction to go in. (Actually, there was a strongly expressed view that simply creating 'http' URIs for everything and exposing human-readable content at those URIs would be a huge step forward).

Then we had a discussion about what the barriers to adoption might be - the problems of getting buy-in from vendors and senior management, the need to cope with a non-obvious business model (particularly in the current economic climate), the lack of technical expertise (not to mention semantic expertise) in parts of those sectors, the endless discussions that might take place about how to model the data in RDF, the general perception that Semantic Web is permanently just over the horizon and so on.

And, in response, we considered the kinds of steps that JISC (and its partners) might have to undertake to build any kind of political momentum around this idea.

To cut a long story short, we more-or-less convinced ourselves out of a purist Linked Data approach as a way forward, instead preferring a 4 layer model of adoption, with increasing levels of semantic richness and machine-processability at each stage:

  1. expose data openly in any format available (.csv files, HTML pages, MARC records, etc.)
  2. assign 'http' URIs to things of interest in the data, expose it in any format available (.csv files, HTML pages, etc.) and serve useful content at each URI
  3. assign 'http' URIs to things of interest in the data, expose it as XML and serve useful content at each URI
  4. assign 'http' URIs to things of interest in the data and expose Linked Data (as per the discussion above).

These would not be presented as steps to go thru (do 1, then 2, then 3, ...) but as alternatives with increasing levels of semantic value. Good practice guidance would encourage the adoption of option 4, laying out the benefits of such an approach, but the alternatives would provide lower barriers to adoption and offer a simpler 'sell' politically.

The heterogeneity of data being exposed would leave a significant implementation challenge for the aggregation services attempting to make use of it and the JISC (and partners) would have to fund some pretty convincing demonstrators of what might usefully be achieved.

One might characterise these approaches as '' (echoing '' but where 'glam' is short for 'galleries, libraries, archives and museums') and/or Digital UK (echoing the pragmatic approaches being successfully adopted by the Digital NZ activity in New Zealand).

Despite my reservations about the morning session, the day ended up being quite a useful discussion. That said, I remain somewhat uncomfortable with its outcomes. I'm a purest at heart and the 4 levels above are anything but pure. To make matters worse, I'm not even sure that they are pragmatic. The danger is that people will adopt only the lowest, least semantic, option and think they've done what they need to do - something that I think we are seeing some evidence of happening within 

Perhaps even more worryingly, having now stepped back from the immediate talking-points of the meeting itself, I'm not actually sure we are addressing a real user need here any more - the world is so different now than it was when we first started having conversations about exposing cultural heritage collections on the Web, particularly library collections - conversations that essentially pre-dated Google, Google Scholar, Amazon, WorldCat, CrossRef, ... the list goes on. Do people still get agitated by, for example, the 'book discovery' problem in the way they did way back then? I'm not sure... but I don't think I do. At the very least, the book 'discovery' problem has largely become an 'appropriate copy' problem - at least for most people? Well, actually, let's face it... for most people the book 'discovery' and 'appropriate copy' problems have been solved by Amazon!

I also find the co-location of libraries, museums and archives, in the context of this particular discussion, rather uncomfortable. If anything, this grouping serves only to prolong the discussion and put off any decision making?

Overall then, I left the meeting feeling somewhat bemused about where this current activity has come from and where it is likely to go.


July 21, 2010

Getting techie... what questions should we be asking of publishers?

The Licence Negotiation team here are thinking about the kinds of technical questions they should be asking publishers and other content providers as part of their negotiations with them. The aim isn't to embed the answers to those questions in contractual clauses - rather, it is to build up a useful knowledge base of surrounding information that may be useful to institutions and others who are thinking about taking up a particular agreement.

My 'starter for 10' set of questions goes like this:

  • Do you make any commitment to the persistence of the URLs for your published content? If so, please give details. Do you assign DOIs to your published content? Are you members of CrossRef?
  • Do you support a search API? If so, what standard(s) do you support?
  • Do you support a metadata harvesting API? If so, what standard(s) do you support?
  • Do you expose RSS and/or Atom feeds for your content? If so, please describe what feeds you offer?
  • Do you expose any form of Linked Data about your published content? If so, please give details.
  • Do you generate OpenURLs as part of your web interface? Do you have a documented means of linking to your content based on bibliographic metadata fields? If so, please give details.
  • Do you support SAML (Service Provider) as a means of controlling access to your content? If so, which version? Are you a member of the UK Access Management Federation? If you also support other methods of access control, please give details.
  • Do you grant permission for the preservation of your content using LOCKSS, CLOCKSS and/or PORTICO? If so, please give details.
  • Do you have a statement about your support for the Web Accessibility Initiative (WAI)? If so, please give details?

Does this look like a reasonable and sensible set of questions for us to be asking of publishers? What have I missed? Something about open access perhaps?

July 16, 2010

Finding e-books - a discovery to delivery problem

Some of you will know that we recently ran a quick survey of academic e-book usage in the UK - I hope to be able to report on the findings here shortly. One of the things that we didn't ask about in the survey but that has come up anecdotally in our discussions with librarians is the ease (or not) with which it is possible to find out if a particular e-book title is available.

A typical scenario goes like this. "Lecturer adds an entry for a physical book to a course reading list. Librarian checks the list and wants to know if there is an e-book edition of the book, in order to offer alternatives to the students on that course". Problemo. Having briefly asked around, it seems (somewhat surprisingly?) that there is no easy solution to this problem.

If we assume that the librarian in question knows the ISBN of the physical book, what can be done to try and ease the situation? Note that in asking this question I'm conveniently ignoring the looming, and potentially rather massive, issue around "what the hell is an e-book anyway?" and "how are we going to assign identifiers to them once we've worked out what they are?" :-). For some discussion around this see Eric Hellman's recent piece, What IS an eBook, anyway?

But, let's ignore that for now... we know that OCLC's xISBN service allows us to navigate different editions of the same book (I'm desperately trying not to drop into FRBR-speak here). Taking a quick look at the API documentation for xISBN yesterday, I noticed that the metadata returned for each ISBN can include both the fact that something is a 'Book' and that it is 'Digital' (form == 'BA' && form == 'DA') - that sounds like the working definition of an e-book to me (at least for the time being) - as well as listing the ISBNs for all the other editions/formats of the same book. So I knocked together a quick demonstrator. The result is e-Book Finder and you are welcome to have a play. To get you started, here are a couple of examples:

Of course, because e-Book Finder is based on xISBN, which is in turn based on WorldCat, you can only use it to find e-books that are listed in the catalogues of WorldCat member libraries (but I'm assuming that is a big enough set of libraries that the coverage is pretty good). Perhaps more importantly, it also only represents the first stage of the problem. It allows you to 'discover' that an e-book exists - but it doesn't get the thing 'delivered' to you.

Wouldn't it be nice if e-Book Finder could also answer questions like, "is this e-book covered by my existing institutional subscriptions?", "can I set up a new institutional subscription that would cover this e-book?" or simply "can I buy a one-off copy of this e-book?". It turns out that this is a pretty hard problem. My Licence Negotiation colleagues at Eduserv suggested doing some kind of search against myilibrary, dawsonera, Amazon, eBrary, eblib and SafariBooksOnline. The bad news is that (as far as I can tell), of those, only Amazon and SafariBooksOnline allow users to search their content before making them sign in and only Amazon offer an API. (I'm not sure why anyone would design a website that has the sole purpose of selling stuff such that people have to sign in before they can find out what is on offer, nor why that information isn't available in a openly machine-readable form but anyway...). So in this case, moving from discovery to delivery looks to be non-trivial. Shame. Even if each of these e-book 'aggregators' simply offered a list1 of the ISBNs of all the e-books they make available, it would be a step in the right direction.

On the other hand, maybe just pushing the question to the institutional OpenURL resolver would help answer these questions. Any suggestions for how things could be improved?

1. It's a list so that means RSS or Atom, right?

July 07, 2010

On federated access management, usability and discovery

A little over a week ago I attended a meeting in London organised by the JISC Collections team entitled From discovery to log-in and use: a workshop for publishers, content owners and service providers.

The meeting was targetted at academic publishers (and other service providers), of whom there were between 30 and 40 in the room. It started with presentations about two reports, the first by William Wong et al (Middlesex University), User Behaviour in Resource Discovery: Final Report, the second by Rhys Smith (Cardiff University), JISC Service Provider Interface Study. Both reports are worth reading, though, as I noted somewhat cheekily on Twitter prior to the meeting, if the JISC had paid more for the first one it might have been shorter!

Anyway... the eagle-eyed amongst you will have noticed that the two reports are somewhat different in scope and scale. Both talk about 'discovery' but the first uses that word in a very broad 'resource discovery' sense whilst the second uses it in the context of the 'discovery problem' as it applies to federated access management - i.e. the problem of how a 'service provider' knows which institutional login page to send the user to when they want to access their site. This difference in focus left me thinking that the day overall was a little out of balance.

For this blog post I don't intend to say anything more about 'resource discovery' in its wider sense, other than to note that Lorcan Dempsey has been writing some interesting stuff about this topic recently, that there are issues about SEO and how publishers of paid-for academic content can best interact with services like Google that could usefully be discussed somewhere (though they weren't discussed at this particular meeting), and that, in my humble opinion, any approach to resource discovery that assumes that institutions can dictate or control which service(s) the end-user is going to use to discover stuff is pretty much doomed from the start. On that basis, I'm not a big believer in library (or any other kind of) portals, nor in any architectural approach that assumes that a particular portal is what the user wants to use!

The two initial presentations were followed by a talk about the 'business case' for an 'EduID' brand - essentially a logo and/or button signifying to the user that they are about to undertake an 'academic federated login' (as opposed to an OpenID login, a Facebook Connect login, a Google login, or whatever else). Such a brand was one of the recommendations coming out of the Cardiff study. I fundamentally disagree with this approach (though I struggled to put my case across on the day). I'm not convinced that we have a 'branding' problem here and I'm worried that the way this work was presented makes it look as though the decision that we need a new 'brand' has already been taken.

During the ensuing discussion about the 'discovery problem' I mentioned the work of the Kantara Initiative and, in particular, the ULX group which is developing a series of recommendations about how the federated access management user experience should be presented to users. I think this group is coming up with a very sensible set of pragmatic recommendations and I think we need to collectively sit up and take some notice and/or get involved. Unfortunately, when I mentioned the initiative at the meeting, it appeared that the bulk of the publishers in the room were not aware of it.

To try and marshal my thoughts a little bit around the Kantara work I decided to try and implement a working demo based on their recommendations. I took as my starting point a fictitious academic service called EduStuff with a requirement to offer three login routes:

  • for UK university students and staff via the UK Federation,
  • for NHS staff via Athens, and
  • for other users via a local EduStuff login.

I'm assuming that this is a reasonably typical scenario for many academic publishers (with the exception of the UK-only targetting on the academic side of things, something I'll come back to later).

Note that this scenario is narrower than the scope of the Kantara ULX work, which includes things like Facebook Connect, Google, OpenID and so on, so I've had to interpret their recommendations somewhat, rather than implement them in their totality.

You can see the results on the demo site. Note that the site itself does nothing other than to provide a backdrop for demonstrating how the 'sign in' process might look - none of the other links work for example.

The process starts by clicking on the 'Sign in' link at the top right (as per the Kantara recommendations). This generates a pop-up 'sign in' box offering the three options. Institutional accounts are selected using a dynamic JQuery search interface which, once an institution has been selected, takes the user to their institutional login page. (My thanks to Mike Edwards at Eduserv for the original code for this). The NHS Athens option takes the user to an Athens login page. The EduStuff option goes to a fairly typical local login/register page, but one which also carries a warning about using one of the other two account types if that is more appropriate.

Whichever account type is chosen, the selection is remembered in a cookie so that future visits to the pop-up 'sign in' box can offer that as the default (again, as per Kantara).

Have a play and see what you think.

Ok, some thoughts from my perspective...

  • In the more general Kantara scenario, some options (Facebook, Google, OpenID, etc.) are presented using clickable buttons/icons. I haven't done this for my scenario because the text wording felt more helpful to me. If icons were to be used, for example if a publisher wanted to offer a Google-based login, then I would probably present the NHS Athens and EduStuff choices as icons as well.
  • You'll note that the word 'Athens' only appears next to the NHS option. I think that our Athens/OpenAthens branding should become largely invisible to users in the context of the UK Federation - or, to put it another way, one of our current usability problems is that publishers are still presenting Athens as an explicit 'sign in' option when they really do not need to so. In the context of the UK Federation, OpenAthens is just an implementation choice for SAML - users need be no more aware of it than they are of the fact that Apache is being used as the Web server. (The same can be said of Shibboleth of course). Part of our current problem is that we are highlighting the wrong brands - i.e. Shibboleth and OpenAthens/Athens rather than the institution - something that both the JISC and Eduserv have been guilty of encouraging in the past.
  • The institutional search box part of the demo is currently built on UK Federation metadata, so it only offers access to UK institutions. There is no reason why this interface couldn't deal with metadata from multiple federations. Indeed, I see no reason why it wouldn't scale to every institution in the world (with some sensible naming). So although the current demo is UK-specific, I think the approach adopted here can be expanded quite significantly.
  • On that basis, you'll note that there is no need in this interface for an EduID brand/button. Users need only concern themselves with the name of their institution - other brands become largely superficial, except where things like Google, Facebook, OpenID and so on are concerned.
  • I've presented only the front page for the EduStuff site. On the basis that we can't control how users discover stuff, i.e. we have to assume that users might arrive directly at any page of our site as the result of a Google search, the 'sign in' process has to be available on each and every page of the site.
  • Finally, the demo only deals with the usability of the first part of the process. It doesn't consider the usability of the institutional login screen, nor of what happens when the user arrives back at the publisher site after they have successfully (or otherwise) authenticated with their institution. I think there are probably significant usability issues at this point as well - for example, how to best indicate that the user is signed in - but I haven't addressed this as part of the current demo.

I'd be very interested in people's views on this work. It's at a very early stage - I haven't even presented it properly to other Eduserv staff yet - but we have some agreement (internally) that work in this area will likely be of value both to ourselves and our current customers and to the wider community. On that basis, I'm hopeful that we will do more work with this demo:

  • to make it more fully functional, i.e. to complete the round-trip back to the EduStuff site after successful authentication,
  • to make the 'sign in' pop-up into a re-usable 'widget' of some kind,
  • and to experiment with the usability of much larger lists of institutions, taken from multiple federations.

Whatever our conclusions, any results will be shared publicly.

Overall the day was very interesting. I'll leave you with my personal highlight... the point at which one of the (non-publisher) participants said (somewhat naively), "What would it take to make all this [publisher] content available for free? Then we wouldn't need to worry about authentication". Oh boy... there was a collective sharp intake of breath and you could almost hear the tumble-weed blowing for a minute there! :-)

Addendum (8 July 2010): in light of comments below I have re-worked my demo using a more icon-based approach. This is much more in line with the current Kantara ULX mockups (version 4) including the addition of a 'more options'/'less options' toggle on second and subsequent sign ins. Overall, it is, I think, rather better than my initial text-based approach. I stand by my assertion that an EduId button is not required in the 'sign in' process demonstrated here (irrespective of whether the icon-based or text-based approach is used). That said, I'd welcome views on how/where such a button would fit in.

May 05, 2010

The future of UK Dublin Core application profiles

I spent yesterday morning up at UKOLN (at the University of Bath) for a brief meeting about the future of JISC-funded Dublin Core application profile development in the UK.

I don't intend to report on the outcomes of the meeting here since it is not really my place to do so (I was just invited as an interested party and I assume that the outcomes of the meeting will be made public in due course). However, attending the meeting did make me think about some of the issues around the way application profiles have tended to be developed to date and these are perhaps worth sharing here.

By way of background, the JISC have been funding the development of a number of Dublin Core application profiles in areas such as scholarly works, images, time-based media, learning objects, GIS and research data over the last few years.  An application profile provides a model of some subset of the world of interest and an associated set of properties and controlled vocabularies that can be used to describe the entities in that model for the purposes of some application (or service) within a particular domain. The reference to Dublin Core implies conformance with the DCMI Abstract Model (which effectively just means use of the RDF model) and an inherent preference for the use of Dublin Core terms whenever possible.

The meeting was intended to help steer any future UK work in this area.

I think (note that this blog post is very much a personal view) that there are two key aspects of the DC application profile work to date that we need to think about.

Firstly, DC application profiles are often developed by a very small number of interested parties (sometimes just two or three people) and where engagement in the process by the wider community is quite hard to achieve. This isn't just a problem with the UK JISC-funded work on application profiles by the way. Almost all of the work undertaken within the DCMI community on application profiles suffers from the same problem - mailing lists and meetings with very little active engagement beyond a small core set of people.

Secondly, whilst the importance of enumerating the set of functional requirements that the application profile is intended to meet has not been underestimated, it is true to say that DC application profiles are often developed in the absence of an actual 'software application'. Again, this is also true of the application profile work being undertaken by the DCMI. What I mean here is that there is not a software developer actually trying to build something based on the application profile at the time it is being developed. This is somewhat odd (to say the least) given that they are called application profiles!

Taken together, these two issues mean that DC application profiles often take on a rather theoretical status - and an associated "wouldn't it be nice if" approach. The danger is a growth in the complexity of the application profile and a lack of any real business drivers for the work.

Speaking from the perspective of the Scholarly Works Application Profile (SWAP) (the only application profile for which I've been directly responsible), in which we adopted the use of FRBR, there was no question that we were working to a set of perceived functional requirements (e.g. "people need to be able to find the latest version of the current item"). However, we were not driven by the concrete needs of a software developer who was in the process of building something. We were in the situation where we could only assume that an application would be built at some point in the future (a UK repository search engine in our case). I think that the missing link to an actual application, with actual developers working on it, directly contributed to the lack of uptake of the resulting profile. There were other factors as well of course - the conceptual challenge of basing the work on FRBR and that fact that existing repository software was not RDF-ready for example - but I think that was the single biggest factor overall.

Oddly, I think JISC funding is somewhat to blame here because, in making funding available, JISC helps the community to side-step the part of the business decision-making that says, "what are the costs (in time and money) of developing, implementing and using this profile vs. the benefits (financial or otherwise) that result from its use?".

It is perhaps worth comparing current application profile work and other activities. Firstly, compare the progress of SWAP with the progress of the Common European Research Information Format (CERIF), about which the JISC recently reported:

EXRI-UK reviewed these approaches against higher education needs and recommended that CERIF should be the basis for the exchange of research information in the UK. CERIF is currently better able to encode the rich information required to communicate research information, and has the organisational backing of EuroCRIS, ensuring it is well-managed and sustainable.

I don't want to compare the merits of these two approaches at a technical level here. What is interesting however, is that if CERIF emerges as the mandated way in which research information is shared in the UK then there will be a significant financial driver to its adoption within systems in UK institutions. Research information drives a significant chunk of institutional funding which, in turn, drives compliance in various applications. If the UK research councils say, "thou shalt do CERIF", that is likely what institutions will do.  They'll have no real choice. SWAP has no such driver, financial or otherwise.

Secondly, compare the current development of Linked Data applications within the UK initiative with the current application profile work. Current government policy in the UK effectively says, 'thou shalt do Linked Data' but isn't really any more prescriptive. It encourages people to expose their data as Linked Data and to develop useful applications based on that data. Ignoring any discussion about whether Linked Data is a good thing or not, what has resulted is largely ground-up. Individual developers are building stuff and, in the process, are effectively developing their own 'application profiles' (though they don't call them that) as part of exposing/using the Linked Data. This approach results in real activity. But it also brings with it the danger of redundancy, in that every application developer may model their Linked Data differently, inventing their own RDF properties and so on as they see fit.

As Paul Walk noted at the meeting yesterday, at some stage there will be a huge clean-up task to make any widespread sense of the UK government-related Linked Data that is out there. Well, yes... there will. Conversely, there will be no clean up necessary with SWAP because nobody will have implemented it.

Which situation is better!? :-)

I think the issue here is partly to do with setting the framework at the right level. In trying to specify a particular set of application profiles, the JISC is setting the framework very tightly - not just saying, "you must use RDF" or "you must use Dublin Core" but saying "you must use Dublin Core in this particular way". On the other hand, the UK government have left the field of play much more open. The danger with the DC application profile route is lack of progress. The danger with the government approach is too little consistency.

So, what are the lessons here? The first, I think, is that it is important to lobby for your prefered technical solution at a policy level as well as at a technical level. If you believe that a Linked Data-compliant Dublin Core application profile is the best technical way of sharing research information in the UK then it is no good just making that argument to software developers and librarians. Decisions made by the research councils (in this case) will be binding irrespective of technical merit and will likely trump any decisions made by people on the ground.

The second is that we have to understand the business drivers for the adoption, or not, of our technical solutions rather better than we do currently. Who makes the decisions? Who has the money? What motivates the different parties? Again, technically beautiful solutions won't get adopted if the costs of adoption are perceived to outweigh the benefits, or if the people who hold the purse strings don't see any value in spending their money in that particular way, or if people simply don't get it.

Finally, I think we need to be careful that centralised, top-down, initiatives (particularly those with associated funding) don't distort the environment to such an extent that the 'real' drivers, both financial and user-demand, can be ignored in the short term, leading to unsustainable situations in the longer term. The trick is to pump-prime those things that the natural drivers will support in the long term - not always an easy thing to pull off.

April 08, 2010

Linked Data & Web Search Engines

I seem to have fallen into the habit of half-writing posts and then either putting them to one side because I'm don't feel entirely happy with them or because I get diverted into other more pressing things. This is one of several that I seem to have accumulated over the last few weeks, and which I've resolved to try to get out there....

A few weekends ago I spotted a brief exchange on Twitter between Andy and our colleague Mike Ellis on the impact of exposing Linked Data on Google search ranking. Their conclusion seemed to be that the impact was minimal. I think I'd question this assessment, and here I'll try to explain why - though in the absence of empirical evidence, I admit this is largely speculation on my part, a "hunch", if you like. I admit I almost hesitate to write this post at all, as I am far from an expert in "search-engine optimisation", and, tbh, I have something of an instinctive reaction against a notion that a high Google search ranking is the "be all and end all" :-) But I recognise it is something that many content providers care about.

In this post, I'm not considering the ways search engines might use the presence of structured data in the documents they index to enhance result sets (or make that data available to developers to provide such enhancements); rather, I'm thinking about the potential impact of the Linked Data approach on ranking.

It is widely recognised that one of the significant factors in Google's page ranking algorithm is the weighting it attaches to the number of links made to the page in question from other pages ("incoming links"). Beyond that, the recommendations of the Google Webmaster guidelines seem to be largely "common sense" principles for providing well-formed X/HTML, enabling access to your document set for Google's crawlers, and not resorting to techniques that attempt to "game" the algorithm.

Let's go back to Berners-Lee's principles for Linked Data:

  1. Use URIs as names for things

  2. Use HTTP URIs so that people can look up those names.

  3. When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL)

  4. Include links to other URIs so that they can discover more things.

The How to Publish Linked Data on the Web and the W3C Note on Cool URIs for the Semantic Web elaborate on some of the mechanics of providing Linked Data. Both of these sources make the point that to "provide useful information" means to provide data both in RDF format(s) and in human-readable forms.

So following those guidelines typically means that "exposing Linked Data" results in exposing a whole lot of new Web documents "about" the things featured in the dataset, in both RDF/XML (or another RDF format) and in XHTML/HTML - and indeed the use of XHTML+RDFa could meet both requirements in a single format. So this immediately increases what Leigh Dodds of Talis rather neatly refers to as the "surface area" of my pages which are available for Google to crawl and index.

The second aspect which is significant is that, by definition, Linked Data is about making links: I make links between items described in my own dataset, but I also make ("outgoing") links between those items and items described in other linked datasets made available by other parties elsewhere. And (hopefully!), at least in time, other people exposing Linked Data make ("incoming") links to items in my datasets.

And in the X/HTML pages at least, those are the very same hyperlinks that Google crawls and counts when calculating its pagerank.

The key point, I think, is that my pages are available, not just to other "Linked Data applications", but also for other people to reference, bookmark and make links to just as they do any page on any Web site. This is one of the points I was trying to highlight in my last post when I mentioned the BBC's Linked Data work: the pages generated as part of those initiatives are fairly seamlessly integrated within the collection of documents that make up the BBC Web site. They do not appear as some sort of separate "data area", something just for "client apps that want data", somehow "different from" the news pages or the IPlayer pages; on the contrary, they are linked to by those other pages, and the X/HTML pages are given a "look and feel" that emphasises their membership of this larger aggregation. And human readers of the BBC Web site encounter those pages in the course of routinely navigating the site.

Of course the key to increasing the rank of my pages in Google is whether other people actually make those links to pages I expose, and it may well be that for much of the data surfaced so far, such links are relatively small in number. But the Linked Data approach, and its emphasis on the use of URIs and links, helps me do my bit to make sure my resources are things "of (or in) the Web".

So I'd argue that the Linked Data approach is potentially rather a good fit with what we know of the way Google indexes and ranks pages - precisely because both approaches seek to "work with the grain of the Web". I'd stick my neck out and say that having a page about my event (project, idea, whatever) provides a rather better basis for making that information findable in Google than exposing that description only as the content of a particular row in an Excel spreadsheet, where it is difficult to reference as an individual target resource and where it is (typically at least) not a source of links to other resources.

As I was writing this, I saw a new post appear from Michael Hausenblas, in which he attempts to categorise some common formats and services according to what he calls their "Link Factor" ("the degree of how 'much' they are in the Web". And more recently, I noticed the appearance of a a post titled 10 Reasons Why News Organizations Should Use 'Linked Data' which, in its first two points, highlights the importance of Linked Data's use of hyperlinks and URIs to SEO - and points to the fact that the BBC's Wildlife Finder pages do tend to appear prominently in Google result sets.

Before I get carried away, I should add a few qualifiers, and note some issues which I can imagine may have some negative impact. And I should emphasise this is just my thinking out loud here - I think more work is necessary to examine the actual impact, if any.

  • Redirects: Many of the links in Linked Data are made between "things", rather than between the pages describing the things. And following the "Cool URIs" guidelines, these URIs would either be URIs with fragment identifiers ("hash URIs") or URIs for which an HTTP server responds with a 303 response providing the URI of a document describing the thing. For the first case, I think Google recognises these as links to the document with the URI obtained by stripping the fragment id; for the 303 case, I'm unsure about the impact of the use of the redirect on the ranking for the document which is the final target. (A related issue would be that some sources might cite the URI of the thing and other sources might cite the URI of the document describing the thing).
  • Synonyms: As the Publishing Linked Data tutorial highlights, one of the characteristics of Linked Data is that it often makes use of URI aliases, multiple URIs each referring to the same resource. If some users bookmark/cite URI A and some users bookmark/cite URI B, then that would result in a lower link-based ranking for each of the two pages describing the thing than if all users bookmarked/cited a single URI. To some extent, this is just part of the nature of the Web, and it applies similarly outside the Linked Data context, but the tendency to generate an increasing number of aliases is something which generates continued discussion in the LD community (see, for example, the recent thread on "annotation" on the public-lod mailing list generated in response to Leigh Dodds' and Ian Davis' recent Linked Data Patterns document (which I should add, from my very hasty skim reading so far, seems to provide an excellent combination of thoughtful discussion and clear practical suggestions.)).
  • "Caching"/"Resurfacing": As we are seeing Linked Data being deployed, we are seeing data aggregated by various agencies and resurfaced on the Web using new URIs. Similarly to the previous point, this may lead to a case where two users cite different URIs, with a corresponding reduction in the number of incoming links to any single document. I also note that Google's guidelines include the admonition: "Don't create multiple pages, subdomains, or domains with substantially duplicate content", which does make me wonder whether such resurfaced content may have a negative impact on ranking.
  • "Good XHTML": While links are important, they aren't the whole story, and attention still needs to be paid to ensuring that HTML pages generated by a Linked Data application follow the sort of general good practice for "good Web pages" described in the Google guidelines (provide well-structured XHTML, use title elements, use alt attributes, don't fill with irrelevant keywords etc etc)
  • Sitemaps: This is probably just a special case of the previous point, but Google emphasises the importance of using sitemaps to provide entry points for its crawlers. Although I'm aware of the Semantic Sitemap extension, I'm not sure whether the use of sitemaps is widespread in Linked Data deployments - though it is the sort of thing I'd expect to see happen as Linked Data moves further out of the preserve of the "research project" and towards more widespread deployment.
  • "Granularity": (I'm unsure whether this is a factor or not: I can imagine it might be, but it's probably not simple to assess exactly what the impact is.) How a provider decides to "chunk up" their descriptive data into documents might have an impact on the "density" of incoming links. If they expose a large number of documents each describing a single specific resource, does that result in each document receiving fewer incoming links than if they expose a smaller number of documents each describing several resources?
  • Integration: Although above I highlighted the BBC as an example of Linked Data being well-integrated into a "traditional" Web site, and so made highly visible to users of that Web site, I suspect this may - at the moment at least - be the exception rather than the rule. However, as with the previous point, this is something I'd expect to become more common.

Nevertheless, I still stand by my "hunch" that the LD approach is broadly "good" for ranking. I'm not claiming Linked Data is a panacea for search-engine optimisation, and I admit that some of what I'm suggesting here may be "more potential than actual". But I do believe the approach can make a positive contribution - and that is because both the Google ranking algorithm and Linked Data exploit the URI and the hyperlink: they "work with the grain of the Web".

January 31, 2010

Readability and linkability

In July last year I noted that the terminology around Linked Data was not necessarily as clear as we might wish it to be.  Via Twitter yesterday, I was reminded that my colleague, Mike Ellis, has a very nice presentation, Don't think websites, think data, in which he introduces the term MRD - Machine Readable Data.

It's worth a quick look if you have time:

We also used the 'machine-readable' phrase in the original DNER Technical Architecture, the work that went on to underpin the JISC Information Environment, though I think we went on to use both 'machine-understandable' and 'machine-processable' in later work (both even more of a mouthful), usually with reference to what we loosely called 'metadata'.  We also used 'm2m - machine to machine' a lot, a phrase introduced by Lorcan Dempsey I think.  Remember that this was back in 2001, well before the time when the idea of offering an open API had become as widespread as it is today.

All these terms suffer, it seems to me, from emphasising the 'readability' and 'processability' of data over its 'linkedness'. Linkedness is what makes the Web what it is. With hindsight, the major thing that our work on the JISC Information Environment got wrong was to play down the importance of the Web, in favour of a set of digital library standards that focused on sharing 'machine-readable' content for re-use by other bits of software.

Looking at things from the perspective of today, the terms 'Linked Data' and 'Web of Data' both play up the value in content being inter-linked as well as it being what we might call machine-readable.

For example, if we think about open access scholarly communication, the JISC Information Environment (in line with digital libraries more generally) promotes the sharing of content largely through the harvesting of simple DC metadata records, each of which typically contains a link to a PDF copy of the research paper, which, in turn, carries only human-readable citations to other papers.  The DC part of this is certainly MRD... but, overall, the result isn't very inter-linked or Web-like. How much better would it have been to focus some effort on getting more Web links between papers embedded into the papers themselves - using what we would now loosely call a 'micro format'?  One of the reasons I like some of the initiatives around the DOI (though I don't like the DOI much as a technology), CrossRef springs to mind, is that they potentially enable a world where we have the chance of real, solid, persistent Web links between scholarly papers.

RDF, of course, offers the possibility of machine-readability, machine-processable semantics, and links to other content - which is why it is so important and powerful and why initiatives like need to go beyond the CSV and XML files of this world (which some people argue are good enough) and get stuff converted into RDF form.

As an aside, DCMI have done some interesting work on Interoperability Levels for Dublin Core Metadata. While this work is somewhat specific to DC metadata I think it has some ideas that could be usefully translated into the more general language of the Semantic Web and Linked Data (and probably to the notions of the Web of Data and MRD).

Mike, I think, would probably argue that this is all the musing of a 'purist' and that purists should be ignored - and he might well be right.  I certainly agree with the main thrust of the presentation that we need to 'set our data free', that any form of MRD is better than no MRD at all, and that any API is better than no API.  But we also need to remember that it is fundamentally the hyperlink that has made the Web what it is and that those forms of MRD that will be of most value to us will be those, like RDF, that strongly promote the linkability of content, not just to other content but to concepts and people and places and everything else.

The labels 'Linked Data' and 'Web of Data' are both helpful in reminding us of that.

October 14, 2009

Open, social and linked - what do current Web trends tell us about the future of digital libraries?

About a month ago I travelled to Trento in Italy to speak at a Workshop on Advanced Technologies for Digital Libraries organised by the EU-funded CACOA project.

My talk was entitled "Open, social and linked - what do current Web trends tell us about the future of digital libraries?" and I've been holding off blogging about it or sharing my slides because I was hoping to create a slidecast of them. Well... I finally got round to it and here is the result:

Like any 'live' talk, there are bits where I don't get my point across quite as I would have liked but I've left things exactly as they came out when I recorded it. I particularly like my use of "these are all very bog standard... err... standards"! :-)

Towards the end, I refer to David White's 'visitors vs. residents' stuff, about which I note he has just published a video. Nice one.

Anyway... the talk captures a number of threads that I've been thinking and speaking about for the last while. I hope it is of interest.

October 07, 2009

What is "Simple Dublin Core"?

Over the last couple of weeks I've exchanged some thoughts, on Twitter and by email, with John Robertson of CETIS, on the topic of "Qualified Dublin Core", and as we ended up discussing a number of areas where it seems to me there is a good deal of confusion, I thought it might be worth my trying to distill them into a post here (well, it's actually turned into a couple of posts!).

I'm also participating in an effort by the DCMI Usage Board to modernise some of DCMI's core documentation, and I hope this can contribute to that work. However, at this point this is, I should emphasise, a personal view only, based on my own interpretation of historical developments, not all of which I was around to see at first hand, and should be treated accordingly.

The exchange began with a couple of posts from John on Twitter in which he expressed some frustration in getting to grips with the notion referred to as "Qualified Dublin Core", and its relationship to the concept of "DC Terms".

First, I think it's maybe worth taking a step back from the "Qualified DC" question, and looking at the other concepts John mentions in his first question: "the Dublin Core Metadata Element Set (DCMES)" and "Simple Dublin Core", and that's what I'll focus on in this post.

The Dublin Core Metadata Element Set (DCMES) is a collection of (what DCMI calls) "terms" - and it's a collection of "terms" of a single type, a collection of properties - each of which is identified by a URI beginning; the URIs are "in that namespace". Historically, DCMI referred to this set of properties as "elements".

Although I'm not sure it is explicitly stated anywhere, I think there is a policy that - at least barring any quite fundamental changes of approach by DCMI - no new terms will be added to that collection of fifteen terms; it is a "closed" set, its membership is fixed in number.

A quick aside: for completeness, I should emphasise that those fifteen properties have not been "deprecated" by DCMI. Although, as I'll discuss in the next post, a new set of properties has been created in the "DC terms" set of terms, the DCMES properties are still available for use in just the same way as the other terms owned by DCMI. The DCMES document says:

Implementers may freely choose to use these fifteen properties either in their legacy dc: variant (e.g., or in the dcterms: variant (e.g., depending on application requirements. The RDF schemas of the DCMI namespaces describe the subproperty relation of dcterms:creator to dc:creator for use by Semantic Web-aware applications. Over time, however, implementers are encouraged to use the semantically more precise dcterms: properties, as they more fully follow emerging notions of best practice for machine-processable metadata.

The intent behind labelling them as "legacy" is, as Tom Baker puts it, to "gently promote" the use of the more recently defined set of properties.

Perhaps the most significant characteristic of that set of terms is that it was created as a "functional" set, by which I mean that it was created with the notion that that set of fifteen properties could and would be used together in combination in the descriptions of resources. And I think this is reflected, for instance, in the fact that some of the "comments" provided for individual properties refer to other properties in that set (e.g. dc:subject/dc:coverage, dc:format/dc:type).

And there was particular emphasis placed on one "pattern" for the construction of descriptions using those fifteen properties, in which a description could contain statements referring only to those fifteen properties, all used with literal values, and any of those 15 properties could be referred to in multiple statements (or in none). In that pattern of usage, the fifteen properties were all "optional and repeatable", if you like. And that pattern is often referred to as "Simple Dublin Core".

Such a "pattern" is what today - if viewed from the perspective of the DCMI Abstract Model and the Singapore Framework - we would call a Description Set Profile (DSP).

So "Simple Dublin Core" might be conceptualised as a DSP designed, initially at least, for use within a very simple, general purpose DC Application Profile (DCAP), constructed to support some functions related to the discovery of a broad range of resources. That DSP specifies the following constraints:

  • A description set must contain exactly one description (Description Template: Minimum occurrence constraint = 1; Maximum occurrence constraint = 1)
  • That description may be of a resource of any type (Description Template: Resource class constraint: none (default))
  • For each statement in that description:
    • The property URI must be drawn from a list of the fifteeen URIs of the DCMES properties (Statement Template: Property list constraint: (the 15 URIs))
    • There must be at least one such statement; there may be many (Statement Template: Minimum occurrence constraint = 1; Maximum occurrence constraint = unbounded)
    • A literal value surrogate is required (Statement Template: Type constraint = literal)
    • Within that literal value surrogate, the use of a syntax encoding sceme URI is not permitted (Statement Template/Literal Value: Syntax Encoding Scheme Constraint = disallowed)

And this DSP represents a "pattern" that is quite widely deployed, perhaps most notably in the context of systems supporting the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), which requires that an OAI-PMH repository expose records using an XML format called oai_dc, which is essentially a serialisation format for this DSP. (There may be an argument that the "Simple DC" pattern has been overemphasised at the expense of other patterns, and as a result people have poured their effort into using that pattern when a different one might have been more appropriate for the task at hand, but that's a separate discussion!)

It seems to me that, historically, the association between the DCMES as a set of terms on the one hand and that particular pattern of usage of those terms on the other was so close that, at least in informal accounts, the distinction between the two was barely made at all. People tended to (and still do) use the terms "Dublin Core Metadata Element Set" and "Simple Dublin Core" interchangeably. So, for example, in the introduction to the Usage Guide, one finds comments like "Simple Dublin Core comprises fifteen elements" and "The Dublin Core basic element set is outlined in Section 4. Each element is optional and may be repeated." I'd go as far as to say that many uses of the generic term "Dublin Core", informal ones at least, are actually references to this one particular pattern of usage. (I think the glossary of the Usage Guide does try to establish the difference, referring to "Simple Dublin Core" as "The fifteen Dublin Core elements used without qualifiers, that is without element refinement or encoding schemes.")

The failure to distinguish more clearly between a set of terms and one particular pattern of usage of those terms has caused a good deal of confusion, and I think this will becomes more apparent when we consider the (rather more complex) case of "Qualified Dublin Core", as I'll do in the next post, and it's an area which I'm hoping will be addressed as part of the Usage Board review of documentation.

If you look at the definitions of the DCMES properties, in the human-readable document, and especially in the RDF Schema descriptions provided in the "namespace document", with the possible exceptions of the "cross-references" I mentioned above, those definitions don't formally say anything about using those terms together as a set, or about "optionality/repeatability": they just define the terms; they are silent about any particular "pattern of usage" of those terms.

So, such patterns of usage of a collection of terms exist distinct from the collection of terms. And it is possible to define multiple patterns of usage. multiple DSPs, referring to that same set of 15 properties. In addition to the "all optional/repeatable" pattern, I might find myself dealing with some set of resources which all have identifiers and all have names, and operations on those identifiers and names are important to my application, so I could define a pattern/DSP ("PeteJ's Basic DC" DSP) where I say all my descriptions must contain at least one statement referring to the dc:identifier property and at least one statement referring to the dc:title property, and the other thirteen properties are optional/repeatable, still all with literal values. Another implementer might find themselves dealing with some stuff where everything has a topic drawn from some specified SKOS concept scheme, so they define a pattern ("Fred's Easy DC" DSP) which says all their descriptions must contain at least one statement referring to the dc:subject property and they require the use, not of a literal, but of a value URI from that specific set of URIs. So now we have three three different DC Application Profiles, incorporating three different patterns for constructing description sets (three different DSPs) each referring to the same set of 15 properties.

It's also worth noting that the "Simple DC" pattern of usage, a single set of structural constraints, could be deployed in multiple DC Application Profiles, supporting different applications and containing different human-readable guidelines. (I was going to point to the document Using simple Dublin Core to describe eprints as an actual example of this, but having read that document again, I think strictly speaking it probably introduces additional structural constraints (i.e. introduces a different DSP), e.g. it requires that statements using the dc:type property refer to values drawn from a bounded set of literal values.)

The graphic below is an attempt to represent what I see as those relationships between the DCMES vocabulary, DSPs and DCAPs:


Finally, it's worth emphasising that the 15 properties of the DCMES, or indeed any subset of them - there is no requirement that a DSP refer to all, or indeed any, of the properties of the DCMES - , may be referred to in other DSPs in combination with other terms from other vocabularies, owned either by DCMI or by other parties.


The point that DCMI's concept of an "application profile" is not based either on the use of the DCMES properties in particular or on the "Simple DC" pattern is an important one. Designing a DC application profile does not require taking either the DCMES or the "Simple DC" pattern as a starting point; any set of properties, classes, vocabulary encoding schemes and syntax encoding schemes, owned by any agency, can be referenced. But that is rather leading me into the next post, where I'll consider the similar (but rather more messy) picture that emerges once we start talking about "Qualified DC".

August 20, 2009

What researchers think about data preservation and access

There's an interesting report in the current issues of Ariadne by Neil Beagrie, Robert Beagrie and Ian Rowlands, Research Data Preservation and Access: The Views of Researchers, fleshing out some of the data behind the UKRDS Report, which I blogged about a while back.

I have a minor quibble with the way the data has been presented in the report, in that it's not overly clear how the 179 respondents represented in Figure 1 have been split across the three broad areas (Sciences, Social Sciences, and Arts and Humanities) that appear in subsequent figures. One is left wondering how significant the number of responses in each of the 3 areas was?  I would have preferred to see Figure 1 organised in such a way that the 'departments and faculties' were grouped more obviously into the broad areas.

That aside, I think the report is well worth reading.  I'll just highlight what the authors perceive to be the emerging themes:

  • It is clear that different disciplines have different requirements and approaches to research data.
  • Current provision of facilities to encourage and ensure that researchers have data stores where they can deposit their valuable data for safe-keeping and for sharing, as appropriate, varies from discipline to discipline.
  • Local data management and preservation activity is very important with most data being held locally.
  • Expectations about the rate of increase in research data generated indicate not only higher data volumes but also an increase in different types of data and data generated by disciplines that have not until recently been producing volumes of digital output.
  • Significant gaps and areas of need remain to be addressed.

The Findings of the Scoping Study and Research Data Management Workshop (undertaken at the University of Oxford and part of the work that infomed the Ariadne article) provides an indication of the "top requirements for services to help [researchers] manage data more effectively":

  • Advice on practical issues related to managing data across their life cycle. This help would range from assistance in producing a data management/sharing plan; advice on best formats for data creation and options for storing and sharing data securely; to guidance on publishing and preserving these research data.
  • A secure and user-friendly solution that allows storage of large volume of data and sharing of these in a controlled fashion way allowing fine grain access control mechanisms.
  • A sustainable infrastructure that allows publication and long-term preservation of research data for those disciplines not currently served by domain specific services such as the UK Data Archive, NERC Data Centres, European Bioinformatics Institute and others.
  • Funding that could help address some of the departmental challenges to manage the research data that are being produced.

Pretty high level stuff so nothing particularly surprising there. It seems to me that some work drilling down into each of these areas might be quite useful.

July 20, 2009

On names

There's was a brief exchange of messages on the jisc-repositories mailing list a couple of weeks ago concerning the naming of authors in institutional repositories.  When I say naming, I really mean identifying because a name, as in a string of characters, doesn't guarantee any kind of uniqueness - even locally, let alone globally.

The thread started from a question about how to deal with the situation where one author writes under multiple names (is that a common scenario in academic writing?) but moved on to a more general discussion about how one might assign identifiers to people.

I quite liked Les Carr's suggestion:

Surely the appropriate way to go forward is for repositories to start by locally choosing a scheme for identifying individuals (I suggest coining a URI that is grounded in some aspect of the institution's processes). If we can export consistently referenced individuals, then global services can worry about "equivalence mechanisms" to collect together all the various forms of reference that.

This is the approach taken by the Resist Knowledgebase, which is the foundation for the (just started) dotAC JISC Rapid Innovation project.

(Note: I'm assuming that when Les wrote 'URI' he really meant 'http URI').

Two other pieces of current work seem relevant and were mentioned in the discussion. Firstly the JISC-funded Names project which is working on a pilot Names Authroity Service. Secondly, the RLG Networking Names report.  I might be misunderstanding the nature of these bits of work but both seem to me to be advocating rather centralised, registry-like, approaches. For example, both talk about centrally assigning identifiers to people.

As an aside, I'm constantly amazed by how many digital library initiatives end up looking and feeling like registries. It seems to be the DL way... metadata registries, metadata schema registries, service registries, collection registries. You name it and someone in a digital library will have built a registry for it.

May favoured view is that the Web is the registry. Assign identifiers at source, then aggregate appropriately if you need to work across stuff (as Les suggests above).  The <sameAs> service is a nice example of this:

The Web of Data has many equivalent URIs. This service helps you to find co-references between different data sets.

As Hugh Glaser says in a discussion about the service:

Our strong view is that the solution to the problem of having all these URIs is not to generate another one. And I would say that with services of this type around, there is no reason.

In thinking about some of the issues here I had cause to go back and re-read a really interesting interview by Martin Fenner with Geoffrey Bilder of CrossRef (from earlier this year).  Regular readers will know that I'm not the world's biggest fan of the DOI (on which CrossRef is based), partly for technical reasons and partly on governence grounds, but let's set that aside for the moment.  In describing CrossRef's "Contributor ID" project, Geoff makes the point that:

... “distributed” begets “centralized”. For every distributed service created, we’ve then had to create a centralized service to make it useable again (ICANN, Google, Pirate Bay, CrossRef, DOAJ, ticTocs, WorldCat, etc.). This gets us back to square one and makes me think the real issue is - how do you make the centralized system that eventually emerges accountable?

I think this is a fair point but I also think there is a very significant architectural difference between a centralised service that aggregates identifiers and other information from a distributed base of services, in order to provide some useful centralised function for example, vs. a centralised service that assigns identifiers which it then pushes out into the wider landscape. It seems to me that only the former makes sense in the context of the Web.

May 08, 2009

The Nature of OAI, identifiers and linked data

In a post on Nascent, Nature's blog on web technology and science, Tony Hammond writes that Nature now offer an OAI-PMH interface to articles from over 150 titles dating back to 1869.

Good stuff.

Records are available in two flavours - simple Dublin Core (as mandated by the protocol) and Prism Aggregator Message (PAM), a format that Nature also use to enhance their RSS feeds.  (Thanks to Scott Wilson and TicTocs for the Jopml listing).

Taking a quick look at their simple DC records (example) and their PAM records (example) I can't help but think that they've made a mistake in placing a doi: URI rather than an http: URI in the dc:identifier field.

Why does this matter?

Imagine you are a common-or-garden OAI aggregator.  You visit the Nature OAI-PMH interface and you request some records.  You don't understand the PAM format so you ask for simple DC.  So far, so good.  You harvest the requested records.  Wanting to present a clickable link to your end-users, you look to the dc:identifier field only to find a doi: URI:


If you understand the doi: URI scheme you are fine because you'll know how to convert it to something useful:

But if not, you are scuppered!  You'll just have to present the doi: URI to the end-user and let them work it out for themselves :-(

Much better for Nature to put the http: URI form in dc:identifier.  That way, any software that doesn't understand DOIs can simply present the http: URI as a clickable link (just like any other URL).  Any software that does understand DOIs, and that desperately wants to work with the doi: URI form, can do the conversion for itself trivially.

Of course, Nature could simply repeat the dc:identifier field and offer both the http: URI form and the doi: URI form side-by-side.  Unfortunately, this would run counter the the W3C recommendations not to mint multiple URIs for the same resource (section 2.3.1 of the Architecture of the World Wide Web):

A URI owner SHOULD NOT associate arbitrarily different URIs with the same resource.

On balance I see no value (indeed, I see some harm) in surfacing the non-HTTP forms of DOI:




both of which appear in the PAM record (somehwat redundantly?).

The http: URI form

is sufficient.  There is no technical reason why it should be perceived as a second-class form of the identifier (e.g. on persistence grounds).

I'm not suggesting that Nature gives up its use of DOIs - far from it.  Just that they present a single, useful and usable variant of each DOI, i.e. the http: URI form, whenever they surface them on the Web, rather than provide a mix of the three different forms currently in use.

This would be very much in line with recommended good practice for linked data:

  • Use URIs as names for things
  • Use HTTP URIs so that people can look up those names.
  • When someone looks up a URI, provide useful information.
  • Include links to other URIs. so that they can discover more things.

March 20, 2009

Unlocking Audio

I spent the first couple of days this week at the British Library in London, attending the Unlocking Audio 2 conference.  I was there primarily to give an invited talk on the second day.

You might notice that I didn't have a great deal to say about audio, other than to note that what strikes me as interesting about the newer ways in which I listen to music online (specifically and Spotify) is that they are both highly social (almost playful) in their approach and that they are very much of the Web (as opposed to just being 'on' the Web).

What do I mean by that last phrase?  Essentially, it's about an attitude.  It's about seeing being mashed as a virtue.  It's about an expectation that your content, URLs and APIs will be picked up by other people and re-used in ways you could never have foreseen.  Or, as Charles Leadbeater put it on the first day of the conference, it's about "being an ingredient".

I went on to talk about the JISC Information Environment (which is surprisingly(?) not that far off its 10th birthday if you count from the initiation of the DNER), using it as an example of digital library thinking more generally and suggesting where I think we have parted company with the mainstream Web (in a generally "not good" way).  I noted that while digital library folks can discuss identifiers forever (if you let them!) we generally don't think a great deal about identity.  And even where we do think about it, the approach is primarily one of, "who are you and what are you allowed to access?", whereas on the social Web identity is at least as much about, "this is me, this is who I know, and this is what I have contributed". 

I think that is a very significant difference - it's a fundamentally different world-view - and it underpins one critical aspect of the difference between, say, Shibboleth and OpenID.  In digital libraries we haven't tended to focus on the social activity that needs to grow around our content and (as I've said in the past) our institutional approach to repositories is a classic example of how this causes 'social networking' issues with our solutions.

I stole a lot of the ideas for this talk, not least Lorcan Dempsey's use of concentration and diffusion.  As an aside... on the first day of the conference, Charles Leadbeater introduced a beach analogy for the 'media' industries, suggesting that in the past the beach was full of a small number of large boulders and that everything had to happen through those.  What the social Web has done is to make the beach into a place where we can all throw our pebbles.  I quite like this analogy.  My one concern is that many of us do our pebble throwing in the context of large, highly concentrated services like Flickr, YouTube, Google and so on.  There are still boulders - just different ones?  Anyway... I ended with Dave White's notions of visitors vs. residents, suggesting that in the cultural heritage sector we have traditionally focused on building services for visitors but that we need to focus more on residents from now on.  I admit that I don't quite know what this means in practice... but it certainly feels to me like the right direction of travel.

I concluded by offering my thoughts on how I would approach something like the JISC IE if I was asked to do so again now.  My gut feeling is that I would try to stay much more mainstream and focus firmly on the basics, by which I mean adopting the principles of linked data (about which there is now a TED talk by Tim Berners-Lee), cool URIs and REST and focusing much more firmly on the social aspects of the environment (OpenID, OAuth, and so on).

Prior to giving my talk I attended a session about iTunesU and how it is being implemented at the University of Oxford.  I confess a strong dislike of iTunes (and iTunesU by implication) and it worries me that so many UK universities are seeing it as an appropriate way forward.  Yes, it has a lot of concentration (and the benefits that come from that) but its diffusion capabilities are very limited (i.e. it's a very closed system), resulting in the need to build parallel Web interfaces to the same content.  That feels very messy to me.  That said, it was an interesting session with more potential for debate than time allowed.  If nothing else, the adoption of systems about which people can get religious serves to get people talking/arguing.

Overall then, I thought it was an interesting conference.  I suspect that my contribution wasn't liked by everyone there - but I hope it added usefully to the debate.  My live-blogging notes from the two days are here and here.

March 03, 2009

What became of the JISC IE?

Having just done an impromptu, and very brief, 1:1 staff development session about Z39.50 and OpenURL for a colleague here at Eduserv, I was minded to take a quick look at the JISC Information Environment Technical Standards document. (I strongly suspect that the reason he was asking me about these standards, before going to a meeting with a potential client, was driven by the JISC IE work.)

As far as I can tell, the standards document hasn't been updated since I left UKOLN (more than 3 years ago). On that basis, one is tempted to conclude that the JISC IE has no relevance, at least in terms of laying out an appropriate framework of technical standards. Surely stuff must have changed significantly in the intervening years? There is no mention of Atom, REST, the Semantic Web, SWORD, OpenSocial, OpenID, OAuth, Google Sitemaps, OpenSearch, ... to name but a few.

Of course, I accept that this document could simply now be seen as irrelevant?  But, if so, why isn't it flagged as such?  It's sitting there with my name on it as though I'd checked it yesterday and the JISC-hosted Information Environment pages still link to that area as though it remains up to date.  This is somewhat frustrating, both for me as an individual and, more importantly, for people in the community trying to make sense of the available information.

Odd... what is the current status of the JISC IE, as a framework of technical standards?

February 11, 2009

Repository usability - take 2

OK... following my 'rant' yesterday about repository user-interface design generally (and, I suppose, the Edinburgh Research Archive in particular), Chris Rusbridge suggested I take a similar look at an repository and pointed to a research paper by Les Carr in the University of Southampton School of Electronics and Computer Science repository by way of example.  I'm happy to do so though I'm going to try and limit myself to a 10 minute survey of the kind I did yesterday.

The paper in question was originally published in The Computer Journal (Oxford University Press) and is available from though I don't have the necessary access rights to see the PDF that OUP make available.  (In passing, it's good to see that OUP have little or no clue about Cool URIs, resorting instead to the totally useless (in Web terms at least) DOI as text string, "doi:10.1093/comjnl/bxm067" as their means of identification :-( ).

Ecs The jump-off page for the article in the repository is at, a URL that, while it isn't too bad, could probably be better.  How about replacing 'eprints.ecs' by 'research' for example to mitigate against changes in repository content (things other than eprints) and organisational structure (the day Computer Science becomes a separate school).

The jump-off page itself is significantly better in usability terms than the one I looked at yesterday.  The page <title> is set correctly for a start.  Hurrah!  Further, the link to the PDF of the paper is near the top of the page and a mouse-over pop-up shows clearly what you are going to get when you follow the link.  I've heard people bemoaning the use of pop-ups like this in usability terms in the past but I have to say, in this case, I think it works quite well.  On the downside, the link text is just 'PDF' which is less informative than it should be.

Following the abstract a short list of information about the paper is presented.  Author names are linked (good) though for some reason keywords are not (bad).  I have no idea what a 'Performance indicator' is in this context, even less so the value "EZ~05~05~11".  Similarly I don't see what use the ID Code is and I don't know if Last Modified refers to the paper or the information about the paper.  On that basis, I would suggest some mouse-over help text to explain these terms to end-users like myself.

The 'Look up in Google Scholar' link fails to deliver any useful results, though I'm not sure if that is a fault on the part of Google Scholar or the repository?  In any case, a bit of Ajax that indicated how many results that link was going to return would be nice (note: I have no idea off the top of my head if it is possible to do that or not).

Each of the references towards the bottom of the page has a 'SEEK' button next to them (why uppercase?).  As with my comments yesterday, this is a button that acts like a link (from my perspective as the end-user) so it is not clear to me why it has been implemented in the way it has (though I'm guessing that it is to do with limitations in the way Paracite (the target of the link) has been implemented.  My gut feeling is that there is something unRESTful in the way this is working, though I could be wrong.  In any case, it seems to be using an HTTP POST request where a HTTP GET would be more appropriate?

There is no shortage of embedded metadata in the page, at least in terms of volume, though it is interesting that <meta name="DC.subject" ... > is provided whereas the far more useful <meta name="keywords" ... > is not.

The page also contains a large number of <link rel="alternate" ... > tags in the page header - matching the wide range of metadata formats available for manual export from the page (are end-users really interested in all this stuff?) - so many in fact, that I question how useful these could possibly be in any real-world machine-to-machine scenario.

Overall then, I think this is a pretty good HTML page in usability terms.  I don't know how far this is an "out of the box" installation or how much it has been customised but I suggest that it is something that other repository managers could usefully take a look at.

Usability and SEO don't centre around individual pages of course, so the kind of analysis that I've done here needs to be much broader in its reach, considering how the repository functions as a whole site and, ultimately, how the network of institutional repositories and related services (since that seems to be the architectural approach we have settled on) function in usability terms.

Once again, my fundamental point here is not about individual repositories.  My point is that I don't see the issues around "eprint repositories as a part of the Web" featuring high up the agenda of our discussions as a community (and I suggest the same is true of  learning object repositories), in part because we have allowed ourselves to get sidetracked by discussion of community-specific 'interoperability' solutions that we then tend to treat as some kind of magic bullet, rolling them out whenever someone questions one approach or another.

Even where usability and SEO are on the agenda (as appears to be the case here) It's not enough that individual repositories think about the issues, even if some or most make good decisions, because most end-users (i.e. researchers) need to work across multiple repositories (typically globally) and therefore we need the usability of the system as a whole to function correctly.  We therefore need to think about these issues as a community.

February 10, 2009

Repository usability

In his response to my previous post, Freedom, Google-juice and institutional mandates, Chris Rusbridge responded using one of his Ariadne articles as an illustrative example.

By way of, err... reward, I want to take a quick look (in what I'm going to broadly call 'usability' terms) at the way in which that article is handled by the Edinburgh Research Archive (ERA).  Note that I'm treating the ERA as an example here - I don't consider it to be significantly different to other institutional repositories and, on that basis, I assume that most of what I am going to say will also apply to other repository implementations.

Much of this is basic Web 101 stuff...

The original Ariadne article is at - an HTML document containing embedded links to related material in the References section (internally linked from the relevant passage in the text).  The version deposited into ERA is a 9 page PDF snapshot of the original article.  I assume that PDF has been used for preservation reasons, though I'm not sure.  Hypertext links in the original HTML continue to work in the PDF version.

So far, so good.  I would naturally tend to assume that the HTML version is more machine-readable than the PDF version and on that basis is 'better', though I admit that I can't provide solid evidence to back up that statement.

Era The repository 'jump-off' page for the article is at though the page itself tells us (in a human-readable way) that we should use for citation purposes.

So we already have 4 URLs for this article and no explicit machine-readable knowledge that they all identify the same resource.  Further, the URLs that 15 years of using a Web browser lead me to use most naturally (those of the jump-off page, the original Ariadne article or the PDF file) are not the one that the page asks me to use for citation purposes.  So, in Web usability terms, I would most naturally bookmark (e.g. using the wrong URL for this article and where different scholars choose to bookmark different URLs, services like are unlikely to be able to tell that they are referring to the same thing (recent experience of Google Scholar notwithstanding).

OK, so now let's look more closely at the jump-off page...

Firstly, what is the page title (as contained in the HTML <title> tag)?  Something useful like "Excuse Me... Some Digital Preservation Fallacies?".  No, it's "Edinburgh Research Archive : Item 1842/1476". Nice!? Again, if I bookmark this page in, that is the label is going to appear next to the URL, unless I manually edit it.

Secondly, what other metadata and/or micro-formats are embedded into this page?  All that nice rich Dublin Core metadata that is buried away inside the repository?  Nah.  Nothing.  A big fat zilch.  Not even any <meta name="keywords" ...> stuff.  I mean, come on.  The information is there on the page right in front of me... it's just not been marked up using even the most basic of HTML tags.  Most university Web site managers would get shot for turning out this kind of rubbish HTML.

Note I'm not asking for embedded Dublin Core metadata here - I'm asking for useful information to be embedded in useful (and machine-readable) ways where there are widely accepted conventions for how to to that.

So, let's look at those human-readable keywords again.  Why aren't they hyperlinked to all all other entries in ERA that use those keywords (in the way that Flickr and most other systems do with tags)?  Yes, the institutional repository architectural approach means that we'd only get to see other stuff in ERA, not all that useful I'll grant you, but it would be better than nothing.

Similarly, what about linking the author's name to all other entries by that author.  Ditto with the publisher's name.  Let's encourage a bit of browsing here shall we?  This is supposed to be about resource discovery after all!

So finally, let's look at the links on the page.  There at the bottom is a link labelled 'View/Open' which takes me to the PDF file - phew, the thing I'm actually looking for!  Not the most obvious spot on the page but I got there in the end.  Unfortunately, I assume that every single other item in ERA uses that exact same link text for the PDF (or other format) files.  Link text is supposed to indicate what is being linked to - it's a kind of metadata for goodness sake.

And then, right at the bottom of the page, there's a button marked "Show full item record".  I have no idea what that is but I'll click on it anyway.  Oh, it's what other services call "Additional information".  But why use an HTML form button to hide a plain old hypertext link?  Strange or what?

OK, I apologise... I've lapsed into sarcasm for effect.  But the fact remains that repository jump-off pages are potentially some of the most important Web pages exposed by universities (this is core research business after all) yet they are nearly always some of the worst examples of HTML to be found on the academic Web.  I can draw no other conclusion than that the Web is seen as tangential in this space.

I've taken 10 minutes to look at these pages... I don't doubt that there are issues that I've missed.  Clearly, if one took time to look around at different repositories one would find examples that were both better and worse (I'm honestly not picking on ERA here... it just happened to come to hand).  But in general, this stuff is atrocious - we can and should do better.

Freedom, Google-juice and institutional mandates

[Note: This entry was originally posted on the 9th Feb 2009 but has been updated in light of comments.]

An interesting thread has emerged on the American Scientist Open Access Forum based on the assertion that in Germany "freedom of research forbids mandating on university level" (i.e. that a mandate to deposit all research papers in an institutional repository (IR) would not be possible legally).  Now, I'm not familiar with the background to this assertion and I don't understand the legal basis on which it is made.  But it did cause me to think about why there might be an issue related to academic freedom caused by IR deposit mandates by funders or other bodies.

In responding to the assertion, Bernard Rentier says:

No researcher would complain (and consider it an infringement upon his/ her academic freedom to publish) if we mandated them to deposit reprints at the local library. It would be just another duty like they have many others. It would not be terribly useful, needless to say, but it would not cause an uproar. Qualitatively, nothing changes. Quantitatively, readership explodes.

Quite right. Except that the Web isn't like a library so the analogy isn't a good one.

If we ignore the rarefied, and largely useless, world of resource discovery based on the OAI-PMH and instead consider the real world of full-text indexing, link analysis and, well... yes, Google then there is a direct and negative impact of mandating a particular place of deposit. For every additional place that a research paper surfaces on the Web there is a likely reduction in the Google-juice associated with each instance caused by an overall diffusion of inbound links.

So, for example, every researcher who would naturally choose to surface their paper on the Web in a location other than their IR (because they have a vibrant central (discipline-based) repository (CR) for example) but who is forced by mandate to deposit a second copy in their local IR will probably see a negative impact on the Google-juice associated with their chosen location.

Now, I wouldn't argue that this is an issue of academic freedom per se, and I agree with Bernard Rentier (earlier in his response) that the freedom to "decide where to publish is perfectly safe" (in the traditional academic sense of the word 'publish'). However, in any modern understanding of 'to publish' (i.e. one that includes 'making available on the Web') then there is a compromise going on here.

The problem is that we continue to think about repositories as if they were 'part of a library', rather than as a 'true part of the fabric of the Web', a mindset that encourages us to try (and fail) to redefine the way the Web works (through the introduction of things like the OAI-PMH for example) and that leads us to write mandates that use words like 'deposit in a repository' (often without even defining what is meant by 'repository') rather than 'make openly available on the Web'.

In doing so I think we do ourselves, and the long term future of open access, a disservice.

Addendum (10 Feb 2009): In light of the comments so far (see below) I confess that I stand partially corrected.  It is clear that Google is able to join together multiple copies of research papers.  I'd love to know the heuristics they use to do this and I'd love to know how successful those heuristics are in the general case.  Nonetheless, on the basis that they are doing it, and on the assumption that in doing so they also combine the Google juice associated with each copy, I accept that my "dispersion of Google-juice" argument above is somewhat weakened.

There are other considerations however, not least the fact that the Web Architecture explicitly argues against URI aliases:

Good practice: Avoiding URI aliases
A URI owner SHOULD NOT associate arbitrarily different URIs with the same resource.

The reasons given align very closely to the ones I gave above, though couched in more generic language:

Although there are benefits (such as naming flexibility) to URI aliases, there are also costs. URI aliases are harmful when they divide the Web of related resources. A corollary of Metcalfe's Principle (the "network effect") is that the value of a given resource can be measured by the number and value of other resources in its network neighborhood, that is, the resources that link to it.

The problem with aliases is that if half of the neighborhood points to one URI for a given resource, and the other half points to a second, different URI for that same resource, the neighborhood is divided. Not only is the aliased resource undervalued because of this split, the entire neighborhood of resources loses value because of the missing second-order relationships that should have existed among the referring resources by virtue of their references to the aliased resource.

Now, I think that some of the discussions around linked data are pushing at the boundaries of this guidance, particularly in the area of non-information resources.  Nonetheless, I think this is an area in which we have to tread carefully.  I stand by my original statement that we do not treat scholarly papers as though they are part of the fabric of the Web - we do not link between them in the way we link between other Web pages.  In almost all respects we treat them as bits of paper that happen to have been digitised and the culprits are PDF, the OAI-PMH, an over-emphasis on preservation and a collective lack of imagination about the potential transformative effect of the Web on scholarly communication.  We are tampering at the edges and the result is a mess.

January 30, 2009

Surveying with voiD

Michael Hausenblas yesterday announced the availability of version 1.0 of the voiD specification. void specifies an RDF-based approach to the description of RDF datasets that have been constructed following the principles of linked data.

Although the emphasis is very much on those characteristics specific to a void:Dataset - and particularly the nature of links between datasets - this sort of approach reminded me of that taken in the area of collection-level description, an area which Andy and I both contributed to in the past, leading to work within DCMI on the development of the Dublin Core Collections Application Profile. - though of course that profile is much more generally scoped than voiD.

Michael describes the problem addressed by voiD in his article in a recent issue of Nodalities:

Now, the main challenge is: how can I, as someone who wants to build an application on top of linked data, find and select appropriate linked datasets? Note that there are two basic issues here: first, finding an appropriate dataset (discovery) then selecting one - that is, you have a bunch of possible candidates, which one is the ‘best suited’.

This reminded me of the much quoted (not least by me back when I was running round doing presentations as part of UKOLN's Collection Description Focus!) metaphor used by Michael Heaney in his An Analytical Model of Collections and their Catalogues, with reference to an academic researcher approaching the "landscape" of research collections:

The scholar surveying this landscape is looking for the high points. A high point represents an area where the potential for gleaning desired information by visiting that spot (physically or by remote means) is greater than that of other areas. To continue the analogy, the scholar is concerned at the initial survey to identify areas rather than specific features – to identify rainforest rather than to retrieve an analysis of the canopy fauna of the Amazon basin.

Judging by the response on the W3C public-lod mailing list, there's a considerable interest in voiD in the linked data community, and I look forward to seeing what sort of new services emerge using it.

January 22, 2009

Why can't I find a library book in my search engine?

There's a story in today's Guardian, Why you can't find a library book in your search engine, (seen online but I assume that it is also in the paper version) covering the ongoing situation around the licensing of OCLC WorldCat catalog records.  Rob Styles provides some of the background to this, OCLC, Record Usage, Copyright, Contracts and the Law, though, as he notes, he works for Talis which is one of the commercial organisations that stands to benefit from a change in OCLC's approach.

I don't want to comment in too much detail on this story since I freely admit to not having properly done my homework, but I will note that my default position on this kind of issue is that we (yes, all of us) are better off in those cases where data is able to be made available on an 'open' rather than 'proprietary' basis and I think this view of the world definitely applies in this case.

The Guardian story is somewhat simplistic, IMHO, not on the question of 'open' vs. 'closed' but on how easy it would be for such data, assuming that it was to be made openly available, to get into search engines (by which I assume the article really means Google?) in a meaningful way.  Flooding the Web with multiple copies of metadata about multiple copies of books is non-trivial to get right (just think of the issues around sensibly assigning 'http' URIs to this kind of stuff for example) such that link counting, ranking of books vs. other Web resources, and providing access to appropriate copies can be done sensibly.  There has to be some point of 'concentration' (to use Lorcan Dempsey's term) around which such things can happen - whether that is provided by Google, Amazon, Open Library, OCLC, Talis, the Library of Congress or someone else.  Too many points of concentration and you have a problem... or so it seems to me.

December 24, 2008

Finding eBook Neverland

Or "why publishers need to unlock more than their imagination".

At the JISC IE and e-Research Call briefing day last week John Smith of UKC mentioned that discovering the availability of eBook titles is way harder than it should be. The lack of any single point of aggregation of information about eBooks means that libraries are basically left manually searching/browsing multiple suppliers to see who has what.

I just took a very quick look at NetLibrary, Dawsonera, MyiLibrary and Books@Ovid, wondering what information I could find from each about a search API, RSS feed or anything vaguely machine-to-machine oriented.


Apologies if I missed something obvious.

I mean, come on guys... this is the bread and butter of the Web these days isn't it?  Throw us a frickin' bone :-).  I'm not asking you to make your eBook content openly available, just offer an interface that lets me write code to see what you have available without having to manually browse or search your Web pages.

December 18, 2008

JISC IE and e-Research Call briefing day

I attended the briefing day for the JISC's Information Environment and e-Research Call in London on Monday and my live-blogged notes are available on eFoundations LiveWire for anyone that is interested in my take on what was said.

Quite an interesting day overall but I was slightly surprised at the lack of name badges and a printed delegate list, especially given that this event brought together people from two previously separate areas of activity. Oh well, a delegate list is promised at some point.  I also sensed a certain lack of buzz around the event - I mean there's almost £11m being made available here, yet nobody seemed that excited about it, at least in comparison with the OER meeting held as part of the CETIS conference a few weeks back.  At that meeting there seemed to be a real sense that the money being made available was going to result in a real change of mindset within the community.  I accept that this is essentially second-phase money, building on top of what has gone before, but surely it should be generating a significant sense of momentum or something... shouldn't it?

A couple of people asked me why I was attending given that Eduserv isn't entitled to bid directly for this money and now that we're more commonly associated with giving grant money away rather than bidding for it ourselves.

The short answer is that this call is in an area that is of growing interest to Eduserv, not least because of the development effort we are putting into our new data centre capability.  It's also about us becoming better engaged with the community in this area.  So... what could we offer as part of a project team? Three things really: 

  • Firstly, we'd be very interested in talking to people about sustainable hosting models for services and content in the context of this call.
  • Secondly, software development effort, particularly around integration with Web 2.0 services.
  • Thirdly, significant expertise in both Semantic Web technologies (e.g. RDF, Dublin Core and ORE) and identity standards (e.g. Shibboleth and OpenID).

If you are interested in talking any of this thru further, please get in touch.

November 07, 2008

Some (more) thoughts on repositories

I attended a meeting of the JISC Repositories and Preservation Advisory Group (RPAG) in London a couple of weeks ago.  Part of my reason for attending was to respond (semi-formally) to the proposals being put forward by Rachel Heery in her update to the original Repositories Roadmap that we jointly authored back in April 2006.

It would be unfair (and inappropriate) for me to share any of the detail in my comments since the update isn't yet public (and I suppose may never be made so).  So other than saying that I think that, generally speaking, the update is a step in the right direction, what I want to do here is rehearse the points I made which are applicable to the repositories landscape as I see it more generally.  To be honest, I only had 5 minutes in which to make my comments in the meeting, so there wasn't a lot of room for detail in any case!

Broadly speaking, I think three points are worth making.  (With the exception of the first, these will come as no surprise to regular readers of this blog.)


There may well be some disagreement about this but it seems to me that the collection of material we are trying to put into institutional repositories of scholarly research publications is a reasonably well understood and measurable corpus.  It strikes me as odd therefore that the metrics we tend to use to measure progress in this space are very general and uninformative.  Numbers of institutions with a repository for example - or numbers of papers with full text.  We set targets for ourselves like, "a high percentage of newly published UK scholarly output [will be] made available on an open access basis" (a direct quote from the original roadmap).  We don't set targets like, "80% of newly published UK peer-reviewed research papers will be made available on an open access basis" - a more useful and concrete objective.

As a result, we have little or no real way of knowing if are actually making significant progress towards our goals.  We get a vague feel for what is happening but it is difficult to determine if we are really succeeding.

Clearly, I am ignoring learning object repositories and repositories of research data here because those areas are significantly harder, probably impossible, to measure in percentage terms.  In passing, I suggest that the issues around learning object repositories, certainly the softer issues like what motivates people to deposit, are so totally different from those around research repositories that it makes no sense to consider them in the same space anyway.

Even if the total number of published UK peer-reviewed research papers is indeed hard to determine, it seems to me that we ought to be able to reach some kind of suitable agreement about how we would estimate it for the purposes of repository metrics.  Or we could base our measurements on some agreed sub-set of all scholarly output - the peer-reviewed research papers submitted to the current RAE (or forthcoming REF) for example.

A glass half empty view of the world says that by giving ourselves concrete objectives we are setting ourselves up for failure.  Maybe... though I prefer the glass half full view that we are setting ourselves up for success.  Whatever... failure isn't really failure - it's just a convenient way of partitioning off those activities that aren't worth pursuing (for whatever reason) so that other things can be focused on more fully.  Without concrete metrics it is much harder to make those kinds of decisions.

The other issue around metrics is that if the goal is open access (which I think it is), as opposed to full repositories (which are just a means to an end) then our metrics should be couched in terms of that goal.  (Note that, for me at least, open access implies both good management and long-term preservation and that repositories are only one way of achieving that).

The bottom-line question is, "what does success in the repository space actually look like?".  My worry is that we are scared of the answers.  Perhaps the real problem here is that 'failure' isn't an option?

Executive summary: our success metrics around research publications should be based on a percentage of the newly published peer-reviewed literature (or some suitable subset thereof) being made available on an open access basis (irrespective of how that is achieved).

Emphasis on individuals

Across the board we are seeing a growing emphasis on the individual, on user-centricity and on personalisation (in its widest sense).  Personal Learning Environments, Personal Research Environments and the suite of 'open stack' standards around OpenID are good examples of this trend.  Yet in the repository space we still tend to focus most on institutional wants and needs.  I've characterised this in the past in terms of us needing to acknowledge and play to the real-world social networks adopted by researchers.  As long as our emphasis remains on the institution we are unlikely to bring much change to individual research practice.

Executive summary: we need to put the needs of individuals before the needs of institutions in terms of how we think about reaching open access nirvana.

Fit with the Web

I written and spoken a lot about this in the past and don't want to simply rehash old arguments.  That said, I think three things are worth emphasising:


Global discipline-based repositories are more successful at attracting content than institutional repositories.  I can say that with only minimal fear of contradiction because our metrics are so poor - see above :-).  This is no surprise.  It's exactly what I'd expect to see.  Successful services on the Web tend to be globally concentrated (as that term is defined by Lorcan Dempsey) because social networks tend not to follow regional or organisational boundaries any more.

Executive summary: we need to work out how to take advantage of global concentration more fully in the repository space.

Web architecture

Take three guiding documents - the Web Architecture itself, REST, and the principles of linked data.  Apply liberally to the content you have at hand - repository content in our case.  Sit back and relax. 

Executive summary: we need to treat repositories more like Web sites and less like repositories.

Resource discovery

On the Web, the discovery of textual material is based on full-text indexing and link analysis.  In repositories, it is based on metadata and pre-Web forms of citation.  One approach works, the other doesn't.  (Hint: I no longer believe in metadata as it is currently used in repositories).  Why the difference?  Because repositories of research publications are library-centric and the library world is paper-centric - oh, and there's the minor issue of a few hundred years of inertia to overcome.  That's the only explanation I can give anyway.  (And yes, since you ask... I was part of the recent movement that got us into this mess!). 

Executive summary: we need to 1) make sure that repository content is exposed to mainstream Web search engines in Web-friendly formats and 2) make academic citation more Web-friendly so that people can discovery repository content using everyday tools like Google.

Simple huh?!  No, thought not...

I realise that most of what I say above has been written (by me) on previous occasions in this blog.  I also strongly suspect that variants of this blog entry will continue to appear here for some time to come.

August 01, 2008

SEO and digital libraries

Lorcan Dempsey, SEO is part of our business, picks up on a post by John Wilkin, Our hidden digital libraries, concerning our collective inability to expose digital library content to search engines like Google effectively.

This is something I've touched on several times in recent presentations, particularly with reference to repositories, so I'm really pleased to see it getting some air-time.  This is our problem... we need to solve it!  We can't continue to blame search engines for not trying hard enough to get at and index our content - we need to try harder to expose it in Google-friendly ways.

I agree with John that doing this for many significant digital libraries may not be trivial (though, actually, in the case of your average institutional repository I think it comes pretty close) but it needs doing nonetheless.  As Lorcan says, we need to emphasise "'disclosure' as a new word in our service lexicon. We may not control the discovery process in many cases, so we should be increasingly concerned about effective disclosure to those discovery services. Effective disclosure has be be managed, whether it is about APIs, RSS feeds, support for inbound linking, exposure to search engines, ...".

July 18, 2008

Does metadata matter?

This is a 30 minute slidecast (using 130 slides), based on a seminar I gave to Eduserv staff yesterday lunchtime.  It tries to cover a broad sweep of history from library cataloguing, thru the Dublin Core, Web search engines, IEEE LOM, the Semantic Web, arXiv, institutional repositories and more.

It's not comprehensive - so it will probably be easy to pick holes in if you so choose - but how could it be in 30 minutes?!

The focus is ultimately on why Eduserv should be interested in 'metadata' (and surrounding areas), to a certain extent trying to justify why the Foundation continues to have a significant interest in this area.  To be honest, it's probably weakest in its conclusions about whether, or why, Eduserv should retain that interest in the context of the charitable services that we might offer to the higher education community.

Nonetheless, I hope it is of interest (and value) to people.  I'd be interested to know what you think.

As an aside, I found that the Slideshare slidecast editing facility was mostly pretty good (this is the first time I've used it), but that it seemed to struggle a little with the very large number of slides and the quickness of some of the transitions.

June 30, 2008

Article 2.0 contest from Elsevier

This is interesting...

We’ve worked hard to build the Article 2.0 dataset, and now we’re opening it up to developers via a simple, straightforward REST API. We will provide contestants with access to approximately 7,500 full-text XML scientific articles (including images) and challenge each contestant to be the publisher. In other words, each contestant will have complete freedom for how they would like to present the scientific research articles contained in the Article 2.0 dataset.

Elsevier have announced a competition entitled Article 2.0, asking entrants to build new services on top of a scientific article data that they are making available (though I must admit when I first saw the name I thought they might be asking people to experiment with what academic journal articles of the future might look like).

June 26, 2008

What makes a good tag?

Yonks ago (that's... like.. you know... quite a long time ago) I suggested to the mailing list that we needed an agreed set of tags for labelling UK universities and colleges in Web 2.0 tagging services.  I'd raised the issue because we had just agreed with John Kirriemuir that he would create a Flickr pool in order to collect images of UK HE and FE activity in Second Life, as part of the series of snapshots that we are funding, and we wanted a way that people could consistently tag images according to which institution they represented.

I don't recall the details of what I suggested at the time but I think it was to use tags of the form 'universityofbath' (based on the list of names used by the HERO service).  Whatever... the specifics aren't important.  What happened was that I got deluged by replies offering different and conflicting advice about what makes a good set of tags - from totally unique but not very memorable UCAS codes, thru DNS domain names (, to human-readable but ultimately rather long strings such as the form I'd originally suggested (with or without hyphens and/or using camel-case).

Some useful points came out of the discussion, like the fact that unique but incomprehensible tags based on codes of one kind or another aren't very useful because no-one would ever 1) think of searching for them, or 2) remember them.  Unfortunately, nothing approaching consensus was reached.

We had a brief but rather similar exchange on Twitter yesterday because Brian Kelly suggested that the JISC Emerge project had got their tag strategy wrong by using 'em0608' for their current online conference, largely (I think) on the basis that Americans might get confused as to whether it meant June 2008 or August 2006. I responded along the lines of, "who cares, a tag isn't meant to be parsed anyway", to which Brian, rightly, responded that parsability and memorability are intertwined.

To cut a long(ish) story short, two things have emerged (excuse the pun) from this exchange:

  • firstly, having a conversation in bursts of 140 characters isn't ideal - and is probably annoying for those people not interested in the discussion in the first place, and
  • secondly, there is still little consensus about what makes a good tag!

I suggested that tags (particularly in the context of Twitter) need to be relatively short, relatively unique and relatively memorable.  But as Brian noted, there is a significant tension between shortness and memorableness (is that a word?).  Further, Steven Warburton questioned the value of uniqueness in the context of a relatively short-term forum like Twitter (i.e. it probably doesn't matter too much if your tag gets re-used a year later because the Twitter context has moved on).  However, it's important to remember that tags get shared across all kinds of Web 2.0 services (, Flickr, blogs, YouTube, Slideshare and so on) in order that applications like Hashtags and Onetag can pull everything together and that persistence requirements in those other services may be very different than they are in Twitter.

David Harrison asked a practical question concerning an upcoming UCISA conference - what did we think of 'ucisa-usc2008' as a tag? (Though it subsequently turns out that he meant 'ucisa-usc08'.)

I said that I thought it was too long - 14 characters (you need to prefix the tag with a '#' in Twitter) is 10% of the available bandwidth in a Twitter tweet.  I think that's too wasteful.  I suggested dropping the hyphens and using something like 'ucisausc08' or 'uusc2008' as alternatives but Brian commented that the hyphens were important to improve the tag's 'recitability'.

I'm not totally convinced... though I concede that our use of 'efsym2008' for our symposium earlier this year may have had less impact than it might because people didn't find it easy to remember (either because they didn't know what the 'ef' and 'sym' bits meant or because they got confused about whether it was '2008' or '08').

Ho hum... as I say, and this is basically the whole point of this rather long-winded post, there doesn't seem to be much in the way of agreed best practices around what makes a good tag.  And perhaps that's right and proper - we are talking about user-generated content after all and, in the case of the tags for universities, folksonomies are supposed to grow organically rather than be prescribed (though this isn't true for meeting tags which necessarily have to be prescribed by the organisers in advance of the meeting).

FWIW (which probably isn't much given the apparent level of disagreement) my current feeling is that brevity trumps clarity (at least assuming a desire to use the tags in Twitter), which means that 2-digit years are better than 4, hyphens are usually superfluous, and other strings should be kept as short as possible - but, as always, I reserve the right to change my mind at any point in the future.

June 16, 2008

Web 2.0 and repositories - have we got our repository architecture right?

For the record... this is the presentation I gave at the Talis Xiphos meeting last week, though to be honest, with around 1000 Slideshare views in the first couple of days (presumably thanks to a blog entry by Lorcan Dempsey and it being 'featured' by the Slideshare team) I guess that most people who want to see it will have done so already:

Some of my more recent presentations have followed the trend towards a more "picture-rich, text-poor" style of presentation slides.  For this presentation, I went back towards a more text-centric approach - largely because that makes the presentation much more useful to those people who only get to 'see' it on Slideshare and it leads to a more useful slideshow transcript (as generated automatically by Slideshare).

As always, I had good intentions around turning it into a slidecast but it hasn't happened yet, and may never happen to be honest.  If it does, you'll be the first to know ;-) ...

After I'd finished the talk on the day there was some time for Q&A.  Carsten Ulrich (one of the other speakers) asked the opening question, saying something along the lines of, "Thanks for the presentation - I didn't understand a word you were saying until slide 11".  Well, it got a good laugh :-).  But the point was a serious one... Carsten admitted that he had never really understood the point of services like arXiv until I said it was about "making content available on the Web".

OK, it's a sample of one... but this endorses the point I was making in the early part of the talk - that the language we use around repositories simply does not make sense to ordinary people and that we need to try harder to speak their language.

April 15, 2008

IMLS Digital Collections & Content

Another somewhat belated post.... Andy and I both get occasional invitations to be members of advisory/steering groups for various programmes and projects operating in the areas in which we have an interest. I'm currently a member of the Advisory Group for the second phase of the Digital Collections and Content project which is funded by the Institute of Museum and Library Services and led by a team at the University of Illinois at Urbana-Champaign. Given the UK focus of the Foundation, it's probably slightly unusual for me to take on such a role for a US project, but it combines a number of our interests - repositories, resource discovery, metadata, the use of cultural heritage resources for learning and research, and I have also worked with some members of the project team in the past in the development of the Dublin Core Collections Application Profile.

The group met recently in Chicago, and although I wasn't able to attend the meeting in person, I managed to join in by phone for a couple of hours. One area in which the project seems to be doing some interesting work is in the relationships between collection-level description and item description, and in particular the use of algorithms/rules by which item-level metadata might be inferred from collection-level metadata.

The project is also exploring how collection-level metadata might be presented more effectively during searching, particularly to provide contextual information for individual items.

April 14, 2008

Open Repositories 2008

I spent a large part of last week the week before last (Tuesday, Wednesday & Friday) at the Open Repositories 2008 conference at the University of Southampton.

There were something around 400 delegates there, I think, which I guess is an indicator of the considerable current level of interest around the R-word. Interestingly, if I recall conference chair Les Carr's introductory summary of stats correctly, nearly a quarter of these had described themselves as "developers", so the repository sphere has become a locus for debate around technical issues, as well as the strategic, policy and organisational aspects. The JISC Common Repository Interfaces Group (CRIG) had a visible presence at the conference, thanks to the efforts of David Flanders and his comrades, centred largely around the "Repository Challenge" competition (won by Dave Tarrant, Ben O’Steen and Tim Brody with their "Mining with ORE" entry).

The higher than anticipated number of people did make for some rather crowded sessions at times. There was a long queue for registration, though that was compensated for by the fact that I came away from that process with exactly two small pieces of paper: a name badge inside an envelope on which were printed the login details or the wireless network. (With hindsight, I could probably have done with a one page schedule of what was on in which location - there probably was one which I missed picking up!) Conference bags (in a rather neat "vertical" style which my fashion-spotting companions reliably informed me was a "man bag") were available, but optional. (I was almost tempted, as I do sport such an accessory at weekends, and it was black rather than dayglo orange, but decided to resist on the grounds that there was a high probability of it ending up in the hotel wastepaper bin as I packed up to leave.) Nul points, however, to those advertisers who thought it was a good idea to litter every desktop surface in the crowded lecture theatre with their glossy propaganda, with the result that a good proportion of it ended up on the floor as (newly manbagged-up) delegates squeezed their way to their seats.

The opening keynote was by Peter Murray-Rust of the Unilever Centre for Molecular Informatics, University of Cambridge. With some technical glitches to contend with, which must have been quite daunting in the circumstances - Peter has posted a quick note on his view of the experience! "I have no idea what I said" :-)) - , Peter delivered a somewhat "non-linear" but always engaging and entertaining overview of the role of repositories for scientific data. He noted the very real problem that while ever increasing quantities of data are being generated, very little of it is being successfully captured, stored and made accessible to others. Peter emphasised that any attempt to capture this data effectively must fit in with the existing working practices of scientists, and must be perceived as supporting the primary aims of the scientist, rather than introducing new tasks which might be regarded as tangential to those aims. And the practices of those scientists may, in at least some areas of scientific research, be highly "locally focused" i.e. the scientists see their "allegiances" as primarily to a small team with whom data is shared - at least in the first instance, an approach categorised as "long tail science" (a term attributed to Peter's colleague Jim Downing). Peter supported his discussion with examples drawn from several different e-Chemistry projects and initiatives, including the impressive OSCAR-3 text mining software which extracts descriptions of chemical compounds from documents)

Most of the remainder of the Tuesday and Wednesday I spent in paper sessions. The presentation I enjoyed most was probably a presentation by Jane Hunter from the University of Queensland on the work of the HarvANA project on a distributed approach to annotation and tagging of resources from the Picture Australia collection (in the first instance at least - at the end, Jane whipped through a series of examples of applying the same techniques to other resources). Jane covered a model for annotation on tagging based on the W3C Annotea model, a technical architecture for gathering and merging distributed annotations/taggings (using OAI-PMH to harvest from targets at quite short time intervals (though those intervals could be extended if preferred/required)), browser-based plug-in tools to perform annotation/tagging, and also touched on the relationships between tagging and formally-defined ontologies. The HarvANA retrieval system currently uses an ontology to enhance tag-based retrieval - "ontology-based or ontology-directed folksonomy" - , but the tags provided could also contribute to the development/refinement of that ontology, "folksonomy-directed ontology". Although it was in many ways a repository-centric approach and Jane focused on the use of existing, long-established technologies, she also succeeded in placing repositories firmly in the context of the Web: as systems which enable us to expose collections of resources (and collections of descriptions of those resources), which then enter the Web of relationships with other resources managed and exposed by other systems - here, the collections of annotations exposed by the Annotea servers, but potentially other collections too.

At Wednesday lunch time, (once I managed to find the room!) I contributed to a short "birds of a feather" session co-ordinated by Rosemary Russell of UKOLN and Julie Allinson of the University of York on behalf of the Dublin Core Scholarly Communications Community. We focused mainly on the Scholarly Works Application Profile and its adoption of a FRBR-based model, and talked around the extension of that approach to other resource types which is under consideration in a number of sibling projects currently being funded by JISC. (Rather frustratingly for me, this meeting clashed with another BoF session on Linked Data which I would really have liked to attend!)

I should also mention the tremendously entertaining presentation by Johan Bollen of the Los Alamos National Laboratory on the research into usage metrics carried out by the MESUR project. Yes, I know, "tremendously entertaining" and "usage statistics" aren't the sort of phrases I expect to see used in close proximity either. Johan's base premise was, I think, that seeking to illustrate impact through blunt "popularity" measures was inadequate, and he drew a distinction between citation - the resources which people announce in public that they have read - and usage - the actual resources they have downloaded. Based on a huge dataset of usage statistics provided by a range of popular publishers and aggregators, he explored a variety of other metrics, comparing the (surprisingly similar) rankings of journals obtained via several of these metrics with the rankings provided by the citation-based Thomson impact factor. I'm not remotely qualified to comment on the appropriateness of Johan's choice of algorithms, but the fact that Johan kept a large audience engaged at the end of a very long day was a tribute to his skill as a presenter. (Though I'd still take issue with the Britney (popular but insubstantial?)/Big Star (low-selling but highly influential/lauded by the cognoscenti) opposition: nothing by Big Star can compare with the strutting majesty of "Toxic". No, not even "September Gurls".)

On the Friday, I attended the OAI ORE Information Day, but I'll make that the subject of a separate post.

All in all - give or take a few technical hiccups - it was a successful conference, I think (and thanks to Les and his team for their hard work) - perhaps more so in terms of the "networking" that took place around the formal sessions, and the general "buzz" there seemed to be around the place, than because of any ground-breaking presentations.

And yet, and yet... at the end of the week I did come away from some of the sessions with my niggling misgivings about the "repository-centric" nature of much of the activity I heard described slightly reinforced. Yes, I know: what did I expect to hear at a conference called "Open Repositories"?! :-) But I did feel an awful lot of the emphasis was on how "repository systems" communicate with each other (or how some other app communicates with one repository system and then with another repository system ) e.g. how can I "get something out" of your repository system and "put it into" my repository system, and so on. It seems to me that - at the technical level at least - we need to focus less on seeing repository systems as "specific" and "different" from other Web applications, and focus more on commonalities. Rather than concentrating on repository interfaces we should ensure that repository systems implement the uniform interface defined by the RESTful use of the HTTP protocol. And then we can shift our focus to our data, and to

  • the models or ontologies (like FRBR and the CIDOC Conceptual Reference Model, or even basic one-object-is-made-available-in-multiple-formats models) which condition/determine the sets of resources we expose on the Web, and see the use of those models as choices we make rather than something "technologically determined" ("that's just what insert-name-of-repository-software-app-of-choice does");
  • the practical implementation of formalisms like RDF which underpin the structure of our representations describing instances of the entities defined by those models, through the adoption of conventions such as those advocated by the Linked Data community

In this world, the focus shifts to "Open (Managed) Collections" (or even "Open Linked Collections"), collections of documents, datasets, images, of whatever resources we choose to model and expose to the world. And as a consumer of those resources  I (and, perhaps more to the point, my client applications) really don't need to know whether the system that manages and exposes those collections is a "repository" or a "content management system" or something else (or if the provider changes that system from one day to the next): they apply the same principles to interactions with those resources as they do to any other set of resources on the Web.

March 14, 2008

Yahoo search & the Semantic Web

There was a good deal of excitement yesterday at an announcement on the Yahoo! Search weblog that they will be introducing support in the Yahoo Search Monkey platform for indexing some data made available on the Web using Semantic Web standards or using some microformats. In yesterday's post, "Dublin Core" is mentioned as one of the vocabularies which will be supported; it also refers to support for both the W3C's RDFa and Ian Davis' Embedded/Embeddable RDF (Aside: I've been starting to explore RDFa recently and I'm quite excited about the potential, but that should be the topic of a separate post.)

A post by Micah Dubinko provides some further detail in an FAQ style.

It is worth bearing in mind the note of caution from Paul Miller that such an approach brings with it the challenges of dealing with malicious or mischievous attempts to spam rankings, and as I think Micah Dubinko's post makes clear, this is not going to be an aggregator of all the RDF data on the Web. But nevertheless it seems to represents a very significant development in terms of the use of metadata by a major Web search engine (after all the years I've spent having to break it to dismayed Dublin Core aficionados that the metadata from their HTML headers almost certainly wasn't going to be used by any of the global search engines, and unless they knew of an application that was going to index/harvest it, they might wish to consider whether the effort was worthwhile!) - and for the use of Semantic Web technologies in particular.

February 21, 2008

Linked Data (and repositories, again)

This is another one of those posts that started life in the form of various drafts which I didn't publish because I thought they weren't quite "finished", but then seemed to become slightly redundant because anything of interest had already been said by lots of other people who were rather more on the ball than I was. But as there seems to be a rapid growth of interest in this area at the moment, and as it ties in with some of the themes Andy highlights in his recent posts about his presentation at VALA 2008, I thought I'd make an effort to pull try to pull some of these fragments together.

If I'd got round to compiling my year-end Top 5 Technical Documents list for 2007 (whaddya mean, you don't have a year-end Top 5 Technical Documents list?), my number one would have been How to Publish Linked Data on the Web by Chris Bizer, Richard Cyganiak and Tom Heath.

In short, the document fleshes out the principles Tim Berners-Lee sketches in his Linked Data note - essentially the foundational principles for the Semantic Web. As Berners-Lee notes

The Semantic Web isn't just about putting data on the web. It is about making links, so that a person or machine can explore the web of data.  With linked data, when you have some of it, you can find other, related, data. (emphasis added)

And the key to realising this, argues Berners-Lee, lies in following four base rules:

  1. Use URIs as names for things.
  2. Use HTTP URIs so that people can look up those names.
  3. When someone looks up a URI, provide useful information.
  4. Include links to other URIs. so that they can discover more things.

Bizer, Cyganiak & Heath present linked data as a combination of key concepts from the Web Architecture on the one hand (including the TAG's resolution to the httpRange-14 issue) and the RDF data model on the other, and distill them into a form which is on the one hand clear and concise, and on the other backed up by effective, practical guidelines for their application. While many of those guidelines are available in some form elsewhere (e.g. in TAG findings or in notes such as Cool URIs...), it's extremely helpful to have these ideas collated and presented in a very practically focused style.

As an aside, in the course of assembling those guidelines, they suggest that some of those principles might benefit from some qualification, in particular the use of URI aliases, which the Web Architecture document suggests are best avoided. For the authors,

URI aliases are common on the Web of Data, as it can not realistically be expected that all information providers agree on the same URIs to identify a non-information resources. URI aliases provide an important social function to the Web of Data as they are dereferenced to different descriptions of the same non-information resource and thus allow different views and opinions to be expressed. (emphasis added)

I'm prompted to mention Linked Data now in part by Andy's emphasis on Web Architecture and Semantic Web technologies, but also by a post by Mike Bergman a couple of weeks ago, reflecting on the growth in the quantity of data now available following the principles and conventions recommended by the Bizer, Cyganiak & Heath paper. In his post, Bergman includes a copy of a graphic from Richard Cyganiak providing a "birds-eye view "of the Linked Data landscape, and highlighting the principal sources by domain or provider.

"What's wrong with that picture?", as they say. I was struck (but not really surprised) by the absence - with the exception of the University of Southampton's Department of Electronics & Computer Science - of any of the data about researchers and their outputs that is being captured and exposed on the Web by the many "repository" systems of various hues within the UK education sector. While in at least some cases institutions (or trans-institutional communities) are having a modicum of success in capturing that data, it seems to me that the ways in which it is typically made available to other applications mean that it is less visible and less usable than it might be.

Or, to borrow an expression used by Paul Miller of Talis in a post  on Nodalities, we need to think about how to make sure our repository systems are not simply "on the Web" but firmly "of the Web" - and the practices of the growing Linked Data community, it seems to me, provide a firm foundation for doing that.

February 13, 2008

Repositories thru the looking glass

P1050338 I spent last week in Melbourne, Australia at the VALA 2008 Conference - my first trip over to Australia and one that I thoroughly enjoyed.  Many thanks to all those locals and non-locals that made me feel so welcome.

I was there, first and foremost, to deliver the opening keynote, using it as a useful opportunity to think and speak about repositories (useful to me at least - you'll have to ask others that were present as to whether it was useful for anyone else).

It strikes me that repositories are of interest not just to those librarians in the academic sector who have direct responsibility for the development and delivery of repository services.  Rather they represent a microcosm of the wider library landscape - a useful case study in the way the Web is evolving, particularly as manifest through Web 2.0 and social networking, and what impact those changes have on the future of libraries, their spaces and their services.

My keynote attempted to touch on many of the issues in this area - issues around the future of metadata standards and library cataloguing practice, issues around ownership, authority and responsibility, issues around the impact of user-generated content, issues around Web 2.0, the Web architecture and the Semantic Web, issues around individual vs. institutional vs. national, vs. international approaches to service provision.

In speaking first I allowed myself the luxury of being a little provocative and, as far as I can tell from subsequent discussion, that approach was well received.  Almost inevitably, I was probably a little too technical for some of the audience.  I'm a techie at heart and a firm believer that it is not possible to form a coherent strategic view in this area without having a good understanding of the underlying technology.  But perhaps I am also a little too keen to inflict my world-view on others. My apologies to anyone who felt lost or confused.

I won't repeat my whole presentation here.  My slides are available from Slideshare and a written paper will become available on the VALA Web site as soon as I get round to sending it to the conference organisers!

I can sum up my talk in three fairly simple bullet points:

  • Firstly, that our current preoccupation with the building and filling of 'repositories' (particularly 'institutional repositories') rather than the act of surfacing scholarly material on the Web means that we are focusing on the means rather than the end (open access).  Worse, we are doing so using language that is not intuitive to the very scholars whose practice we want to influence.
  • Secondly, that our focus on the 'institution' as the home of repository services is not aligned with the social networks used by scholars, meaning that we will find it very difficult to build tools that are compelling to those people we want to use them.  As a result, we resort to mandates and other forms of coercion in recognition that we have not, so far, built services that people actually want to use.  We have promoted the needs of institutions over the needs of individuals.  Instead, we need to focus on building and/or using global scholarly social networks based on global repository services.  Somewhat oddly, ArXiv (a social repository that predates the Web let alone Web 2.0) provides us with a good model, especially when combined with features from more recent Web 2.0 services such as Slideshare.
  • Finally, that the 'service oriented' approaches that we have tended to adopt in standards like the OAI-PMH, SRW/SRU and OpenURL sit uncomfortably with the 'resource oriented' approach of the Web architecture and the Semantic Web.  We need to recognise the importance of REST as an architectural style and adopt a 'resource oriented' approach at the technical level when building services.

I'm pretty sure that this last point caused some confusion and is something that Pete or I need to return to in future blog entries.  Suffice to say at this point that adopting a 'resource oriented' approach at the technical level does not mean that one is not interested in 'services' at the business or function level.

[Image: artwork outside the State Library of Victoria]

February 06, 2008

Google, Social Graphs, privacy & the Web

This has already received a fair amount of coverage elsewhere (TechcrunchDanny Ayers, Read-Write Web (1), Joshua Porter (1), Read-Write Web (2), Joshua Porter (2), to pick just a few) but I thought it was worth providing a quick pointer. Last week Google announced the availability of what they are calling their Social Graph API.

The YouTube video by Brad Fitzpatrick provides a good overview:

This is a Google-provided service which offers a (service-specific) query interface to a dataset that is generated by crawling data publicly available on the Web in the form of:

Result sets are returned in the form of JSON documents.

On the technical side, I have seen a few critical comments (see discussion on Semantic Web Interest Group IRC channel) around some points of respecting Web architecture principles (e.g. the conflation of (URIs for) people and (URIs for) documents (see the draft W3C TAG finding Dereferencing HTTP URIs) and what looks like the introduction of an unnecessary new URI scheme (see the draft W3C TAG finding URNs, Namespaces and RegIstries)). And some concerns are also voiced about introducing dependency on a centralised Google-provided service - though of course the data is created and held independently and other providers could aggregate that data and offer similar services, even using the same interface (though whether they will be able to do so as effectively as Google can, given their experience in this area, and/or attract the user base which a Google service inevitably will, remains to be seen). And of course there are the usual issues of spamming and trust and the significance of reciprocation: who says "PeteJ is friends with XYZ" and what does XYZ have to say about that?

Overall, however, I think the approach of such a high-profile provider exposing data gathered from distributed, devolved, openly available sources on the Web, rather than from the database of a single social networking service, is being seen as a significant development.

There are some thoughtful voices of caution, however. In a comment to Joshua Porter's first post listed above, Thomas Vanderwal notes

I am quite excited about this in a positive manner. I do have great trepidation as this is exactly the tool social engineering hackers have been hoping for and working toward.


The Google SocialGraph API is exposing everybody who has not thought through their privacy or exposing of their connections.

And in particular, a post by Danah Boyd encourages us to reflect on the social, political and ethical implications of aggregating this data and facilitating access to that aggregation in this way, and reminds us that as individuals we live within a set of power relationships which mean that some are more vulnerable than others to the use of such technologies:

Being socially exposed is AOK when you hold a lot of privilege, when people cannot hold meaningful power over you, or when you can route around such efforts. Such is the life of most of the tech geeks living in Silicon Valley. But I spend all of my time with teenagers, one of the most vulnerable populations because of their lack of agency (let alone rights). Teens are notorious for self-exposure, but they want to do so in a controlled fashion. Self-exposure is critical for the coming of age process - it's how we get a sense of who we are, how others perceive us, and how we fit into the world. We exposure during that time period in order to understand where the edges are. But we don't expose to be put at true risk. Forced exposure puts this population at a much greater risk, if only because their content is always taken out of context. Failure to expose them is not a matter of security through obscurity... it's about only being visible in context.

Even if - as Google take pains to emphasise is the case - the individual data sources are already "public", the merging of data sources, and the change of the context in which information is presented can be significant.

The opposing view is perhaps most vividly expressed in Tim O'Reilly's comment:

The counter-argument is that all this data is available anyway, and that by making it more visible, we raise people's awareness and ultimately their behavior. I'm in the latter camp. It's a lot like the evolutionary value of pain. Search creates feedback loops that allow us to learn from and modify our behavior. A false sense of security helps bad actors more than tools that make information more visible.

One of my tests for whether a Web 2.0 innovation is "good", despite the potential for abuse, is whether it makes us smarter.

I left this post half-finished at this point last night feeling very uneasy with what I perceived as an undertone of almost Darwinian "ruthlessness" in the O'Reilly position, but at the same time struggling to articulate an alternative that I was really convinced of.

So I was delighted this morning when, on opening up my Bloglines feeds, I found an excellent post by Dan Brickley which I think reflects some of the ambivalence I was feeling ("The end of privacy by obscurity should not mean the death of privacy. Privacy is not dead, and we will not get over it. But it does need to be understood in the context of the public record"), and, really, I can only recommend that you read the post in full because I think it's a very sensitive, measured contribution to the debate, based on Dan's direct experience of the issues arising from the deployment of these technologies over several years working on FOAF.

And, far from sitting on the fence, Dan concludes with very practical recommendations for action:

  • Best practice codes for those who expose, and those who aggregate, social Web data
  • Improved media literacy education for those who are unwittingly exposing too much of themselves online
  • Technology development around decentralised, non-public record communication and community tools (eg. via Jabber/XMPP)

Google's announcement of this API has certainly brought both the technical and the social issues to the attention of a wider audience, and sparked some important debate, and perhaps that in itself is a significant contribution in an area where the landscape suddenly seems to be shifting very quickly indeed.

And if I can unashamedly take the opportunity to make a another plug for the activities of the Foundation, I'm sure there's plenty of food for thought here for anyone considering a proposal to the current Eduserv Research Grants call :-)

January 30, 2008

Learning Materials & FRBR

JISC is currently funding a study, conducted by Phil Barker of JISC CETIS, to survey the requirements for a metadata application profile for learning materials held by digital repositories. Yesterday Phil posted an update on work to date, including a pointer to a (draft) document titled Learning Materials Application Profile Pre-draft Domain Model which 'suggests a "straw man" domain model for use during the project which, hopefully, will prove useful in the analysis of the metadata requirements'.

The document outlines two models: the first is of the operations applied to a learning object (based on the OAIS model) and the second is a (very outline) entity-relational model for a learning resource - which is based on a subset of the Functional Requirements for the Bibliographic Record (FRBR) model. As far as I can recall, this is the first time I've seen the FRBR model applied to the learning object space - though of course at least some of the resources which are considered "learning resources" are also described as bibliographic resources, and I think at least some, if not many, of the functions to be supported by "learning object metadata" are analogous to those to be supported by bibliographic metadata.

I do have some quibbles with the model in the current draft. Without a fuller description of the functions to be supported, it's difficult to assess whether it meets those requirements - though  I recognise that, as I think the opening comment I cited above indicates, there's an element of "chicken and egg" involved in this process: you need to have at least an outline set of entity types before you can start talking about operations on instances of those types. Clearly a FRBR-based approach should facilitate interoperability between learning object repositories and systems based on FRBR or on FRBR-derivatives like the Eprints/Scholarly Works Application Profile (SWAP). I have to admit the way "Context" is modelled at present doesn't look quite right to me, and I'm not sure about the approach of collapsing the concepts of an individual agency and a class of agents into a single "Agent" entity type in the model. (For me the distinguishing characteristic of what the SWAP calls an "Agent" is that, while it encompasses both individuals and groups, an "Agent" is something which acts as a unit, and I'm not sure that applies in the same way to the intended audience for a resource.) The other aspect I was wondering about is the potential requirement to model whole-part relationships, which, AFAICT, are excluded from the current draft version. FRBR supports a range of variant whole-part relations between instances of the principal FRBR entity types, although in the case of the SWAP, I don't think any of them were used.

But I'm getting ahead of myself here really - and probably ending up sounding more negative than I intend! I think it's a positive development to see members of the "learning metadata community" exploring - critically - the usefulness of a model emerging from the library community. I need to read the draft more carefully and formulate my thoughts more coherently, but I'll be trying to send some comments to Phil.

January 17, 2008

Flickr Commons

Via a tweet by @briankelly I discovered Flickr Commons, a collaboration between the Library of Congress and Flickr to "give you a taste of the hidden treasures in the huge Library of Congress collection" and to demonstrate "how your input of a tag or two can make the collection even richer".  There are more formal announcements here and here.

Brian's initial tweet generated a mini Twitter discussion (something that some people say Twitter isn't supposed to be used for though I tend to disagree).  The general consensus seemed to be that using the resources and tools of the private sector to widen access to public collections makes perfect sense, provided ownership of the data is retained - i.e. in this case it is OK because Flickr isn't Facebook! :-)  There are certainly some very, very obvious benefits in terms of visibility of content, size of audience, quality of user experience, and so on.

On that basis alone, this is a very interesting development and one that I'm sure many parts of the cultural heritage sector will be keeping a close eye on.  Congratulations to the Library of Congress and Flickr for getting their fingers out and doing something to bring these worlds together!  I'm guessing that the two collections that have been made available via Flickr so far are part of the American Memory collection - I haven't checked.  I'm also guessing that, like much of that collection, these images are effectively in the public domain?

As I've said before, what is frustrating for those of us in the UK about this development is that it is much harder to see this kind of thing happening here, where so many of our cultural collections are locked behind restrictive 'personal', 'educational' use licences.

Operating a hand drill at Vultee-Nashville, woman is working on a It'll be fascinating to see what kinds of tags people add.  The Flickr policy statement - "Any Flickr member is able to add tags or comment on these collections. If you're a dork about it, shame on you. This is for the good of humanity, dude!!" is short and to the point.  Like it!

I took a quick browse around the 1930s-40s in Color collection/set.  Here's a nice image (see right), now tagged with 'bandana', a word not in the original catalogue record as far as I can tell.  From there it is possible to navigate to other images in the collection with the same tag - there are three at the time of writing.  OK, so this isn't a earth-shattering example of user-generated content but you get the idea, and bandana researchers all over the world might well be hugely grateful to have three more resources at their disposal! :-)

It will also be interesting to see the kind of comments that people leave.  Hopefully we'll get beyond the use of 'wow' and 'awesome'!  Wouldn't it be great to see comments by the people (or their families or colleagues) in the photos.

Final thought... we've been making the point here for a while that Flickr is a repository and that the Flickr experience is a useful benchmark when we think about how repositories should look and feel - I think this kind of development makes that even more obvious.

January 16, 2008

Generation G

As Paul Walk notes, coincidence is a wonderful thing.  In this case, the coincidence is the JISC's publication of a report entitled "Information Behaviour of the Researcher of the Future" (PDF only) following hot on the heels of the debate around whether Google and the Internet should be blamed for students' lack of critical skills when evaluating online resources.

The report, in part, analyses the myths and realities around the google generation, though it actually goes much further than this, providing a very valuable overview of how researchers of the future (those currently in their school or pre-school years) might reasonably be expected to "access and interact with digital resources in five to ten years' time".  Overall, the report seems to indicate that there is little evidence to suggest that there is much generational impact on our information skill and research behaviours:

Whether or not our young people really have lower levels of traditional information skills than before, we are simply not in a position to know. However, the stakes are much higher now in an educational setting where `self-directed learning’ is the norm. We urgently need to find out.


Our overall conclusion is that much writing on the topic of this report overestimates the impact of ICTs on the young and underestimates its effect on older generations. A much greater sense of balance is needed.

Or as the JISC press release puts it:

research-behaviour traits that are commonly associated with younger users – impatience in search and navigation, and zero tolerance for any delay in satisfying their information needs – are now the norm for all age-groups, from younger pupils and undergraduates through to professors

The message is pretty clear.  Information skills are increasingly important and teaching them at university level appears to be shutting the stable door after the horse has bolted.  There is some evidence that to be effective, information skills need to be developed during the formative school years.  Interestingly, to me as a parent at least, is the evidence from the US that indicates that when "the top and bottom quartiles of students - as defined by their information literacy skills - are compared, it emerges that the top quartile report a much higher incidence of exposure to basic library skills from their parents, in the school library, classroom or public library in their earlier years".

The report ends by enumerating sets of implications for information experts, research libraries, policy makers, and ultimately all of us.  Well worth reading.

December 04, 2007

ACAP unveiled

Funny... I've been on the mailing list for a long, long time but there's been very little traffic for the last while (like 5 years or so) to the point that I'd kinda forgotten I was on it.  Just recently it has popped back into life with the announcement of the Automated Content Access Protocol (or ACAP):

Following a successful year-long pilot project, ACAP (Automated Content Access Protocol) has been devised by publishers in collaboration with search engines to revolutionise the creation, dissemination, use, and protection of copyright-protected content on the worldwide web.

Danny Sullivan, over at Search Engine Land, explains some of the background to this development.  It is clear that this initiative was born out of a certain amount of publisher mistrust about what search engines are doing with their content - something that makes the strap line, "unlocking content for all" a bit of a misnomer.  There's an emphasis on explicitly granting permission and an attempt to move away from the current default of assuming that everything is open to indexing.

Given that none of the big search engines currently support it, one is tempted to react with a big "huh!?".  I guess it's a case of wait and see.  Maybe this will turn into robots.txt 2.0, maybe it won't... but I think that decision lies with the search engines rather than with the publishers who initiated the exercise.  As Danny puts it:

So has the entire ACAP project been a waste of time, or as Andy Beal's great headline put it when ACAP was announced last year, Publishers to Spend Half Million Dollars on a Robots.txt File? That still makes me laugh.

No, I'd say not. I think it's been very useful that some group has diligently and carefully tried to explore the issues, and having ACAP lurking at the very least gives the search engines themselves a kick in the butt to work on better standards. Plus, ACAP provides some groundwork they may want to use. Personally, I doubt ACAP will become Robots.txt 2.0 -- but I suspect elements of ACAP will flow into that new version or a successor.

November 20, 2007

Semantic structures for teaching and learning

I'm attending the JISC CETIS conference at Aston University in Birmingham over the next couple of days.  One of the sessions that I've chosen to attend is on the use of the semantic Web in elearning, Semantic Structures for Teaching and Learning.  A couple of days ago a copy of all the position papers by the various session speakers came thru for people to read - hey, I didn't realise I was actually going to have to do some work for this conference! :-)

The papers made interesting reading, all essentially addressing the question of why the semantic Web hasn't had as much impact on elearning as we might have hoped it would a few years back, all taking a variety of viewpoints and perspectives.

Reading them got me thinking...

Some readers will know that I have given a fair few years of my recent career to metadata and the semantic Web, and to Dublin Core in particular.  I've now stepped back from that a little, partly to allow me to focus on other stuff... but partly out of frustration with the lack of impact that these kinds of developments seem to be having.

Let's consider the area of resource discovery for a moment, since that is probably what comes to mind first and foremost when people talk about semantic Web technologies.  Further, let's break the world into three classes of people - those who have content to make available, those who want to discover and use the content provided by others, and those that are building tools to put the first two groups in touch with each other.  Clearly the are significant overlaps between these groups and I realise that I'm simplifying things significantly but bear with me for a second.

The first group is primarily interested in the effective disclosure and use of their content.  They will do whatever they need to do to ensure that their content gets discovered by people in the second group, choosing tools supplied by the third group that they deem to be most effective and balancing the costs of their exposure-related efforts against the benefits of what they are likely enable in terms of resource discovery.  Clearly, one of the significant criteria in determining which tools are 'effective' has to do with critical mass (how many people in the second group are using the tool being evaluated).

It's perhaps worth noting that sometimes things go a bit haywire.  People in the first group put large amounts of effort into activities related to resource discovery where there is little or no evidence of tools being provided by the third group to take advantage of it.  Embedding Dublin Core metadata into HTML Web pages strikes me as an example of this - at least in some cases.  I'm not quite clear why this happens, but suspect that it has something to do with policy drivers taking precedence over the natural selection of what works or doesn't.

People in the second group want to discover stuff and are therefore primarily interested in the use of tools developed by the third group that they feel are most useful.  Their choices will be based on what they perceive to work best for resource discovery, balanced against other factors such as usability.  Again, critical mass is important - tools need to be comprehensive (within a particular area) to be deemed effective.

The third group need users from the other groups to use their tools - they want to build up a user-base.  The business drivers for why they want to do this might vary (ad revenue, subscription income, preparing for the sale of the business as a whole, kudos, etc.), but, often, that is the bottom line.  They will therefore work with the first group to ensure that users in the second group get what they want.

Now, when I use the phrase 'work with' I don't mean in a formal business arrangement kind of way - as a member of the first group, I don't 'work with', say, Google in that sense.  But I do work within the framework given to me by Google (or whoever) to ensure that my content gets discovered.  I'll optimise my content according to agreed best-practices for search-engine optimisation.  I'll add my content to and similar tools in order to improve its Google-juice.  I'll add a Google site map to my site.  And so on and so forth...

I'll do this because I know that Google has the attention of people in the second group.  The benefits in terms of resource discovery of working within the Google framework outweigh the costs of what I have to do to take part.  In truth, the costs are relatively small and the benefits relatively large.

Overall, one ends up with a loosely coupled cooperative system where the rules of engagement between the different parties are fairly informal, are of mutual benefit, evolve according to natural selection, and are endorsed by agreed conventions (sometimes turning into standards) around best-practice.

I've made this argument largely in terms of resource discovery tools and services but I suspect that the same can be said of technologies and other service areas.  The reasons people adopt, say, RSS have to do with low cost of implementation, high benefit, critical mass and so on.  Again, there is a natural selection aspect at play here.

So, what about the Semantic Web?  Well, it suffers from a classic chicken and egg problem.  Not enough content is exposed by members of the first group in a form suitable for members of the third group to develop effective tools for members of the second group.  Because the tools don't exist, the potential benefits of 'semantic' approaches aren't fully realised.  Members of the second group don't use the tools because they aren't felt to be good or comprehensive enough.  As a result, members of the first group perceive the costs of exposing richer Semantic Web data to outweigh any possible benefits because of lack of critical mass.

Can we break out of this cycle?  I don't know.  I would hope so... and Eduserv continue to put work into Semantic Web technologies such as the Dublin Core on the basis that we will.  On the other hand, I've felt that way for a number of years and it hasn't happened yet!  In rounding up the position papers in her blog, Lorna Campbell quotes David Millard, University of Southampton:

the Semantic Web hasn’t failed, it just hasn’t succeeded enough.

That's one way of looking at it I suppose and it's probably a reasonable view for now.  That said, I'm not convinced that it is a position that can reasonably be adopted forever and, with reference to my earlier use of the phrase "natural selection" it hardly makes one think of the survival of the fittest!?

What do I conclude from this?  Nothing earth shattering I'm afraid.  Simply that for semantic approaches to succeed they will need to be low cost to implement, of high value, and adopted by a critical mass of parties in all parts of the system.  I suspect that means we need to focus our semantic attention on things that aren't already well catered for by the very clever but essentially brute-force approaches across large amounts of low-semantic Web data that work well for us now... i.e. there's no point in simply inventing a semantic Web version of what Google can already do for us.  One of the potential problems with activities based on the Dublin Core is that one gets the impression that is what people are trying to do.

Again, I'm not trying to argue against the semantic Web, metadata, Dublin Core or other semantic approaches here... just suggesting that we need to be clearer about where their strengths lie and how they most effectively fit into the overall picture of services on the Web.



eFoundations is powered by TypePad