March 25, 2011

RDTF metadata guidelines - next steps

A few weeks ago I blogged about the work that Pete and I have been doing on metadata guidelines as part of the JISC/RLUK Resource Discovery Task Force (see "RDTF metadata guidelines - Limp Data or Linked Data?").

In discussion with the JISC we have agreed to complete our current work in this area by:

  • delivering a single summary document of the consultation process around the current draft guidelines, incorporating the original document and all the comments made using the JISCpress site during the comment period; and
  • suggesting some recommendations about any resulting changes that we would like to see made to the draft guidelines.

For the former, a summary view of the consultation is now available. It's not 100% perfect (because the links between the comments and individual paragraphs are not retained in the summary) but I think it is good enough to offer a useful overview of the draft and the comments in a single piece of text. Furthermore, the production of this summary was automated (by post-processing the export 'dump' from WordPress), so the good news is that a similar view can be obtained for any future (or indeed past) JISCpress consultations.
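
As a rough illustration of that kind of post-processing (a sketch only, not the script that was actually used), the Python below walks a WordPress export file and lists each section followed by its comments. It assumes the standard WXR export layout; the version suffix of the 'wp' namespace differs between WordPress releases, and the sample data and the function name 'summarise' are made up for the example.

```python
import xml.etree.ElementTree as ET

# Namespaces used by WordPress export (WXR) files. The version suffix of the
# wp namespace varies between WordPress releases (assumption: 1.2 here).
NS = {
    "content": "http://purl.org/rss/1.0/modules/content/",
    "wp": "http://wordpress.org/export/1.2/",
}

# Tiny made-up export, standing in for a real dump.
SAMPLE = """<rss version="2.0"
    xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:wp="http://wordpress.org/export/1.2/">
  <channel>
    <item>
      <title>Introduction</title>
      <content:encoded>Draft text of the section.</content:encoded>
      <wp:comment>
        <wp:comment_author>A. Reader</wp:comment_author>
        <wp:comment_content>Looks reasonable.</wp:comment_content>
      </wp:comment>
    </item>
  </channel>
</rss>"""


def summarise(wxr_xml):
    """Return one plain-text summary: each section followed by its comments."""
    root = ET.fromstring(wxr_xml)
    lines = []
    for item in root.iter("item"):
        lines.append(item.findtext("title", default="(untitled)"))
        lines.append(item.findtext("content:encoded", default="", namespaces=NS))
        for comment in item.findall("wp:comment", NS):
            author = comment.findtext("wp:comment_author", default="?", namespaces=NS)
            text = comment.findtext("wp:comment_content", default="", namespaces=NS)
            lines.append(f"  [{author}] {text}")
    return "\n".join(lines)


summary = summarise(SAMPLE)
print(summary)
```

Like the real summary, this attaches comments to whole sections and does not keep the links between comments and individual paragraphs.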

For the latter, this blog post forms our recommendations.

As noted previously, there were 196 comments during the comment period (which is not bad!), many of which were quite detailed in terms of particular data models, formats and so on. On the basis that we do not know quite what form any guidelines might take from here on (that is now the responsibility of the RDTF team at MIMAS I think), it doesn't seem sensible to dig into the details too much. Rather, we will make some comments on the overall shape of the document and suggest some areas where we think it might be useful for JISC and RLUK to undertake additional work.

You may recall that our original draft proposed three approaches to exposing metadata, which we referred to as the community formats approach, the RDF data approach and the Linked Data approach. In light of comments (particularly those from Owen Stephens and Paul Walk) we have been putting some thought into how the shape of the whole document might be better conceptualised. The result is the following four-quadrant model:

Like any simple conceptualisation, there is some fuzziness in this but we think it's a useful way of thinking about the space.

Traditionally (in the world of libraries, museums and archives at least), most sharing of metadata has happened in the bottom-left quadrant - exchanging bulk files of MARC records for example. And, to an extent, this approach continues now, even outside those sectors. Look at the large amount of 'open data' that is shared as CSV files on sites like data.gov.uk for example. Broadly speaking, this is what we referred to as the community formats approach (though I think our inclusion of the OAI-PMH in that area probably muddied the waters a little - see below).

One can argue that moving left to right across the quadrants offers semantically richer metadata in a 'small pieces, loosely joined' kind of way (though this quickly becomes a religious argument with no obvious point of closure! :-) ) and that moving bottom to top offers the ability to work with individual item descriptions rather than whole collections of them - and that, in particular, it allows for the assignment of 'http' URIs to those descriptions and the dereferencing of those URIs to serve them.

Our three approaches covered the bottom-left, bottom-right and top-right quadrants. The web, at least in the sense of serving HTML pages about things of interest in libraries, museums and archives, sits in the top-left quadrant (though any trend towards embedded RDFa in HTML pages moves us firmly towards the top-right).

Interestingly, especially in light of the RDTF mantra to "build better websites", our guidelines managed to miss that quadrant. In their comments, Owen and Paul argued that moving from bottom to top is more important than moving left to right - and, on balance, we tend to agree.

So, what does this mean in terms of our recommendations?

We think that the guidelines need to cover all four quadrants and that, in particular, much greater emphasis needs to be placed on the top-left quadrant. Any guidance needs to be explicit that the 'http' URIs assigned to descriptions served on the web are not URIs for the things being described; that, typically, multiple descriptions may be served for the things being described (an HTML page and an XML document for example, each of which will have separate URIs) and that mechanisms such as '<link rel="alternate" ...>' can be used to tie them together; and that Google sitemaps (on the left) and semantic sitemaps (on the right) can be used to guide robots to the descriptions (either individually or in collections).
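
To make that a little more concrete, here is a minimal Python sketch of two of those mechanisms: a link element (using the HTML link type 'alternate') that ties an HTML description to an XML one, and a plain sitemap listing both description URIs. The example.org URIs and the function names are hypothetical, and semantic sitemap extensions are not covered.

```python
import xml.etree.ElementTree as ET
from html import escape

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def alternate_link(href, media_type):
    """Build the <link> element an HTML description would carry in its <head>."""
    return '<link rel="alternate" type="%s" href="%s">' % (
        escape(media_type, quote=True), escape(href, quote=True))


def sitemap(description_uris):
    """Build a sitemap listing the URI of each served description."""
    ET.register_namespace("", SITEMAP_NS)
    urlset = ET.Element("{%s}urlset" % SITEMAP_NS)
    for uri in description_uris:
        url = ET.SubElement(urlset, "{%s}url" % SITEMAP_NS)
        ET.SubElement(url, "{%s}loc" % SITEMAP_NS).text = uri
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical URIs: an HTML page and an XML document describing the same thing.
html_uri = "http://example.org/doc/item/123.html"
xml_uri = "http://example.org/doc/item/123.xml"

link = alternate_link(xml_uri, "application/xml")
sitemap_xml = sitemap([html_uri, xml_uri])
print(link)
print(sitemap_xml)
```

Note that both URIs here identify descriptions; neither is a URI for the item being described.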

Which leaves the issue of the OAI-PMH. In a sense, this sits alongside the top-left quadrant - which is why, I think, it didn't fit particularly well with our previous three approaches. If you think about a typical repository for example, it is making descriptions of the content it holds available as HTML 'splash' pages (sometimes with embedded links to descriptions in other formats). In that sense it is functioning in top-left, "page per thing", mode. What the OAI-PMH does is to give you a protocol mechanism for getting at those descriptions in a way that is useful for harvesting.
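
For readers less familiar with the protocol, the Python sketch below shows the basic shape of a harvest: a ListRecords request, followed by further requests using the resumptionToken returned with each page. The base URL and the sample response are invented, nothing is fetched over the network, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL for the first or a follow-up page."""
    if resumption_token:
        # resumptionToken is an exclusive argument: it replaces the others.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_page(response_xml):
    """Return (record identifiers, resumption token or None) for one response."""
    root = ET.fromstring(response_xml)
    ids = [e.text for e in root.iterfind(".//oai:header/oai:identifier", OAI_NS)]
    token = root.findtext(".//oai:resumptionToken", default="", namespaces=OAI_NS)
    return ids, (token.strip() or None)


# Made-up repository address and response, for illustration only.
BASE = "http://example.org/oai"
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:1</identifier></header></record>
    <record><header><identifier>oai:example.org:2</identifier></header></record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

first_url = list_records_url(BASE)
ids, token = parse_page(SAMPLE)
next_url = list_records_url(BASE, resumption_token=token)
print(first_url, ids, next_url)
```

A real harvester would repeat the last two steps until no resumptionToken comes back.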

Several people noted that Atom and RSS might be used as an alternative to both sitemaps and the OAI-PMH, and we agree - though it may be that some additional work is needed to specify the exact mechanisms for doing so.

There were some comments on our suggestion to follow the UK government guidelines on assigning URIs. On reflection, we think it would make more sense to recommend only the W3C guidelines on Cool URIs for the Semantic Web, particularly on the separation of things from the descriptions of things. We also suggest that it may be sensible to fund (or find) more work in this area, making specific recommendations around persistent URIs (for both things and their descriptions).
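
One pattern described in the Cool URIs document is to answer a request for the URI of a thing with a 303 redirect to the URI of a description of it. The Python sketch below illustrates the idea as a pure function. The '/id/' and '/doc/' path layout is a hypothetical choice for the example, and the handling of the Accept header is deliberately simplified to an exact match.

```python
# Hypothetical URI layout: /id/... names the thing, /doc/... names descriptions.
DESCRIPTION_FORMATS = {
    "text/html": ".html",
    "application/rdf+xml": ".rdf",
}


def respond(path, accept="text/html"):
    """Return (status, location) for a request, following the 303 pattern."""
    if path.startswith("/id/"):
        # The thing itself cannot be served, so redirect to a description of it.
        suffix = DESCRIPTION_FORMATS.get(accept, ".html")
        return 303, "/doc/" + path[len("/id/"):] + suffix
    if path.startswith("/doc/"):
        # Descriptions are ordinary documents and are served directly.
        return 200, path
    return 404, None


print(respond("/id/item/123"))
print(respond("/id/item/123", "application/rdf+xml"))
print(respond("/doc/item/123.html"))
```

The point of the split is that the thing and each of its descriptions keep separate, stable URIs.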

Finally, there were a lot of comments on the draft guidelines about our suggested models and formats - notably on FRBR, with many commenters suggesting that this was premature given significant discussion around FRBR elsewhere. We think it would make sense to separate out any guidance on conceptual models and associated vocabularies, probably (again) as a separate piece of work.

To summarise then, we suggest:

  • that the four-quadrant model above is used to frame the guidelines - we think all four quadrants are useful, and that there should probably be some guidance on each area;
  • that specific guidance be developed for serving an HTML page description per 'thing' of interest (possibly with associated, and linked, alternative formats such as XML);
  • that guidance be developed (or found) about how to sensibly assign persistent 'http' URIs to everything of interest (including both things and descriptions of things);
  • that the definition of 'open' needs more work (particularly in the context of whether commercial use is allowed) but that this needs to be sensitive to not stirring up IPR-worries in those domains where they are less of a concern currently;
  • that mechanisms for making statements of provenance, licensing and versioning be developed where RDF triples are being made available (possibly in collaboration with Europeana work); and
  • that a fuller list of relevant models that might be adopted, the relationships between them, and any vocabularies commonly associated with them be maintained separately from the guidelines themselves (I'm trying desperately not to use the 'registry' word here!).


I don't suppose you can write a quick tutorial on how you produced that overview document? Very useful!

A couple of thoughts... both related to the diagram above.

Firstly, we talk about 'collections of descriptions' but we don't really talk much about 'collection descriptions', which in the past have been used in those cases where it was too costly (or there was some other impediment) to create a collection of item-level descriptions. I don't know what to make of this... just noting it.

Secondly, in discussion about the diagram yesterday at the RDTF Management Board meeting, Owen Stephens noted that there is a third dimension of "closedness vs openness" running across the two current axes. So, for example, he noted that many aggregation activities currently (The Archives Hub and COPAC for example) use the same technology as might be used in the 'bulk download' quadrant but do so in a closed way (at least as far as making the records available from the provider is concerned). He wondered what impact transitioning such closed models to an open model might have... and further what impact moving to the 'page per thing' quadrant might have. Both interesting questions I think.

Our original draft implicitly suggested an across-and-up movement thru the quadrants, from 'bulk transfer' to 'Linked Data'. Looking at it now, my guess is that an up-and-across movement is a more likely path of adoption, quite possibly stopping after the up bit! :-)

Anyway, listening to the discussion yesterday, I'm happy that the quadrants offer a useful way of framing discussion in this area. If nothing else about the RDTF metadata guidelines work that we've done, this much at least seems potentially helpful.

Re collection descriptions, a couple of thoughts on two different aspects:

- For the "RDF Data" and "Linked Data" cases we do talk about using VoID to describe the dataset. So this amounts to providing a "collection-level description" of the collection of descriptions (if you see what I mean!). And I guess in principle one could, maybe should, specify something similar for the other two cases using e.g. the DC Collections Application Profile (or something similar).

- I agree we don't explicitly address the point about the descriptions/records themselves being "collection-level descriptions", though having said that, we've probably "given ourselves enough wiggle room" to include this case when we say (in the "community formats" section): "For the purposes of these guidelines, a significant resource is one that is likely to be of interest to end-users (etc)" - and for archival metadata and formats like EAD, it's almost certainly the case that some, if not all, of the things described are "collections" of some sort.
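
For anyone who hasn't seen one, the dataset description mentioned in the first point above might look something like the output of this Python sketch. It is illustrative only: the URIs, title and function name are made up, the choice of properties is an assumption rather than anything the guidelines specify, and no escaping is applied to the title.

```python
def void_description(dataset_uri, title, dump_uri, licence_uri):
    """Return a small Turtle document describing a dataset with VoID."""
    return "\n".join([
        "@prefix void: <http://rdfs.org/ns/void#> .",
        "@prefix dcterms: <http://purl.org/dc/terms/> .",
        "",
        "<%s> a void:Dataset ;" % dataset_uri,
        '    dcterms:title "%s" ;' % title,
        "    dcterms:license <%s> ;" % licence_uri,
        "    void:dataDump <%s> ." % dump_uri,
    ])


# Hypothetical values, for illustration only.
turtle = void_description(
    "http://example.org/dataset/catalogue",
    "Example catalogue records",
    "http://example.org/dumps/catalogue.nt",
    "http://example.org/licence",
)
print(turtle)
```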

