RDTF metadata guidelines - next steps
A few weeks ago I blogged about the work that Pete and I have been doing on metadata guidelines as part of the JISC/RLUK Resource Discovery Task Force, RDTF metadata guidelines - Limp Data or Linked Data?.
In discussion with the JISC we have agreed to complete our current work in this area by:
- delivering a single summary document of the consultation process around the current draft guidelines, incorporating the original document and all the comments made using the JISCpress site during the comment period; and
- suggesting some recommendations about any resulting changes that we would like to see made to the draft guidelines.
For the former, a summary view of the consultation is now available. It's not 100% perfect (because the links between the comments and individual paragraphs are not retained in the summary) but I think it is good enough to offer a useful overview of the draft and the comments in a single piece of text. Furthermore, the production of this summary was automated (by post-processing the export 'dump' from Wordpress), so the good news is that a similar view can be obtained for any future (or indeed past) JISCpress consultations.
For the latter, this blog post forms our recommendations.
As noted previously, there were 196 comments during the comment period (which is not bad!), many of which were quite detailed in terms of particular data models, formats and so on. On the basis that we do not know quite what form any guidelines might take from here on (that is now the responsibility of the RDTF team at MIMAS I think), it doesn't seem sensible to dig into the details too much. Rather, we will make some comments on the overall shape of the document and suggest some areas where we think it might be useful for JISC and RLUK to undertake additional work.
You may recall that our original draft proposed three approaches to exposing metadata, which we refered to as the community formats approach, the RDF data approach and the Linked Data approach. In light of comments (particularly those from Owen Stephens and Paul Walk) we have been putting some thought into how the shape of the whole document might be better conceptualised. The result is the following four-quadrant model:
Traditionally (in the world of library, museum and archives at least), most sharing of metadata has happened in the bottom-left quadrant - exchanging bulk files of MARC records for example. And, to an extent, this approach continues now, even outside those sectors. Look at the large amount of 'open data' that is shared as CSV files on sites like data.gov.uk for example. Broadly speaking, this is what we refered to as the community formats approach (though I think our inclusion of the OAI-PMH in that area probably muddied the waters a little - see below).
One can argue that moving left to right across the quadrants offers semantically richer metadata in a 'small pieces, loosely joined' kind of way (though this quickly becomes a religious argument with no obvious point of closure! :-) ) and that moving bottom to top offers the ability to work with individual item descriptions rather than whole collections of them - and that, in particular, it allows for the assignment of 'https' URIs to those descriptions and the dereferencing of those URIs to serve them.
Our three approaches covered the bottom-left, bottom-right and top-right quadrants. The web, at least in the sense of serving HTML pages about things of interest in libraries, museums and archives, sits in the top-left quadrant (though any trend towards embedded RDFa in HTML pages moves us firmly towards the top-right).
Interestingly, especially in light of the RDTF mantra to "build better websites", our guidelines managed to miss that quadrant. In their comments, Owen and Paul argued that moving from bottom to top is more important than moving left to right - and, on balance, we tend to agree.
So, what does this mean in terms of our recommendations?
We think that the guidelines need to cover all four quadrants and that, in particular, much greater emphasis needs to be placed on the top-left quadrant. Any guidance needs to be explicit that the 'https' URIs assigned to descriptions served on the web are not URIs for the things being described; that, typically, multiple descriptions may be served for the things being described (an HTML page and an XML document for example, each of which will have separate URIs) and that mechanisms such as '<link rel="alternative" ...>' can be used to tie them together; and that Google sitemaps (on the left) and semantic sitemaps (on the right) can be used to guide robots to the descriptions (either individually or in collections).
Which leaves the issue of the OAI-PMH. In a sense, this sits along-side the top-left quadrant - which is why, I think, it didn't fit particularly well with our previous three approaches. If you think about a typical repository for example, it is making descriptions of the content it holds available as HTML 'splash' pages (sometimes with embedded links to descriptions in other formats). In that sense it is functioning in top-left, "page per thing", mode. What the OAI-PMH does is to give you a protocol mechanism for getting at that those descriptions in a way that is useful for harvesting.
Several people noted that Atom and RSS might be used as an alternative to both sitemaps and the OAI-PMH, and we agree - though it may be that some additional work is needed to specify the exact mechanisms for doing so.
There were some comments on our suggestion to follow the UK government guidelines on assigning URIs. On reflection, we think it would make more sense to recommend only the W3C guidelines on Cool URIs for the Semantic Web, particularly on the separation of things from the desriptions of things, suggesting that it may be sensible to fund (or find) more work in this area making specific recommendations around persistent URIs (for both things and their descriptions).
Finally, there were a lot of comments on the draft guidelines about our suggested models and formats - notably on FRBR, with many commenters suggesting that this was premature given significant discussion around FRBR elsewhere. We think it would make sense to separate out any guidance on conceptual models and associated vocabularies, probably (again) as a separate piece of work.
To summarise then, we suggest:
- that the four-quadrant model above is used to frame the guidelines - we think all four quadrants are useful, and that there should probably be some guidance on each area;
- that specific guidance be developed for serving an HTML page description per 'thing' of interest (possibly with associated, and linked, alternative formats such as XML);
- that guidance be developed (or found) about how to sensibly assign persistent 'https' URIs to everything of interest (including both things and descriptions of things);
- that the definition of 'open' needs more work (particularly in the context of whether commercial use is allowed) but that this needs to be sensitive to not stirring up IPR-worries in those domains where they are less of a concern currently;
- that mechanisms for making statement of provenance, licensing and versioning be developed where RDF triples are being made available (possibly in collaboration with Europeana work); and
- that a fuller list of relevant models that might be adopted, the relationships between them, and any vocabularies commonly associated with them be maintained separately from the guidelines themselves (I'm trying desperately not to use the 'registry' word here!).