
February 25, 2011

RDTF metadata guidelines - Limp Data or Linked Data?

Having just finished reading thru the 196 comments we received on the draft metadata guidelines for the UK RDTF I'm now in the process of wondering where we go next. We (Pete and I) have relatively little effort to take this work forward (a little less than 5 days to be precise) so it's not clear to me how best to use that effort to get something useful out for both RDTF and the wider community.

By the way... many thanks to everyone who took the time to comment. There are some great contributions and, if nothing else, the combination of the draft and the comments form a useful snapshot of the current state of issues around library, museum and archival metadata in the context of the web.

Here's my brief take on what the comments are saying...

Firstly, there were several comments asking about the target audience for the guidelines and whether, as written, they will be meaningful to... well... anyone I guess! It's worth pointing out that my understanding is that any guidelines we come up with thru the current process will be taken forward as part of other RDTF work. What that means is that the guidelines will get massaged into a form (or forms) that are digestible by the target audience (or audiences), as determined by other players within the RDTF activity. What we have been tasked with are the guidelines themselves - not how they are presented. We perhaps should have made this clearer in the draft guidelines. In short, I don't think the document, as written, will be put directly in front of anyone who doesn't go to the trouble of searching it out explicitly.

Secondly, there were quite a number of detailed comments on particular data formats, data models, vocabularies and so on. This is great and I'm hopeful that as a result we can either extend the list of examples given at various points in the guidelines or, in some cases, drop back to not having examples and simply say, "do whatever is the emerging norm here in your community".

Thirdly, there were some concerns about what we meant by "open". As we tried to point out in the draft, we do not consider this to be our problem - it is for other activity in RDTF to try and work out what "open" means - we just felt the need to give that word a concrete definition, so that people could understand where we were coming from for the purposes of these guidelines.

Finally, there were some bigger concerns - these are the things that are taxing me right now - that broadly fell into two, related, camps. Firstly, that the step between the community formats approach and the RDF data approach is too large (though no-one really suggested what might go in the middle). And secondly, that we are missing a trick by not encouraging the assignment of 'http' URIs to resources as part of the community formats approach.

As it stands, we have, on the one hand, what one might call Limp Data (MARC records, XML files, CSV, EAD and the rest) and, on the other, Linked Data and all that entails, with a rather odd middle ground that we are calling RDF data (in the current guidelines).

I was half hoping that someone would simply suggest collapsing our RDF data and Linked Data approaches into one - on the basis that separating them into two is somewhat confusing (but as far as I can tell no-one did... OK, I'm doing it now!). That would leave a two-pronged approach - community formats and Linked Data - to which we could add a 'community formats with http URIs' middle ground. My gut feel is that there is some attraction in such an approach but I'm not sure how feasible it is given the characteristics of many existing community formats.
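To make the 'community formats with http URIs' middle ground concrete, here is a minimal sketch of what it might involve: taking a record in an existing community format (represented here as a plain dict, standing in for a MARC or CSV record), minting a persistent http URI for it, and optionally expressing a few of its fields as RDF triples. The base URI, field names and vocabulary choices are all illustrative assumptions, not anything the guidelines prescribe.

```python
# Hypothetical URI space for a library's records - not a real service.
BASE = "http://library.example.ac.uk/id/book/"

# A record in some existing "community format", flattened to a dict
# for the purposes of this sketch.
records = [
    {"id": "b1234",
     "title": "Harry Potter and the Philosopher's Stone",
     "creator": "Rowling, J. K.",
     "isbn": "9780747532699"},
]

def mint_uri(record):
    """Mint a persistent http URI for a record from its local identifier."""
    return BASE + record["id"]

def to_turtle(record):
    """Express a few fields of the record as Turtle triples about its URI.

    Dublin Core Terms and BIBO are used here purely as plausible examples
    of community vocabularies.
    """
    uri = mint_uri(record)
    return "\n".join([
        f"<{uri}> <http://purl.org/dc/terms/title> \"{record['title']}\" .",
        f"<{uri}> <http://purl.org/dc/terms/creator> \"{record['creator']}\" .",
        f"<{uri}> <http://purl.org/ontology/bibo/isbn13> \"{record['isbn']}\" .",
    ])

print(to_turtle(records[0]))
```

The point of the sketch is that the middle step adds only the URI-minting discipline on top of whatever format the community already uses; the Turtle output is an optional extra, not a migration to full Linked Data.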

As part of his commentary around encouraging http URIs (building a 'better web' was how he phrased it), Owen Stephens suggested that every resource should have a corresponding web page. I don't disagree with this... well, hang on... actually I do (at least in part)! One of the problems faced by this work is the fundamental difference between the library world and museums and archives. The former is primarily dealing with non-unique resources (at the item level), the latter with unique resources. (I know that I'm simplifying things here but bear with me). Do I think that resource discovery will be improved if every academic library in the UK (or indeed in the world) creates an http URI for every book they hold at which they serve a human-readable copy of their catalogue record? No, I don't. If the JISC and RLUK really want to improve web-scale resource discovery of books in the library sector, they would be better off spending their money encouraging libraries to sign up to OCLC WorldCat and contributing their records there. (I'm guessing that this isn't a particularly popular viewpoint in the UK - at least, I'm not sure that I've ever heard anyone else suggest it - but it seems to me that WorldCat represents a valuable shared service approach that will, in practice, be hard to beat in other ways.) Doing this would both improve resource discovery (e.g. thru Google) and provide a workable 'appropriate copy' solution (for books). Clearly, doing this wouldn't help build a more unified approach across the GLAM domains but, as at least one commenter pointed out, it's not clear that the current guidelines do either. Note: I do agree with Owen that every unique museum and archival resource should have an http URI and a web page.

All of which, as I say, leaves us with a headache in terms of how we take these guidelines forward. Ah well... such is life I guess.


Sorry these comments are rather delayed, but following a discussion on Twitter with @lukask and @tillk, I think it is worth capturing some counterpoints here to your points of disagreement with me :)

In fairness to me (of course!) I didn't just suggest 'one webpage per resource' but four things which I think should be taken together:

  • An html document per record published

  • Each metadata record published should have a simple, persistent URI (i.e. the URL of the html document at least)

  • Each html document representing a metadata record should include a structured representation of the metadata

  • These documents/records should be crawlable by web bots/spiders such as the Googlebot
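The four conditions above can be illustrated in one small sketch: a page whose URL doubles as the record's persistent URI, with the metadata embedded in the markup (RDFa-style attributes are used here as one plausible way of meeting the 'structured representation' condition). Every URI, name and field in this example is invented for illustration; it is not how any particular catalogue actually does it.

```python
from string import Template

# A minimal HTML page template for one catalogue record. The RDFa-style
# about/property attributes carry the structured metadata; the page itself
# is plain, crawlable HTML.
PAGE = Template("""<!DOCTYPE html>
<html xmlns:dc="http://purl.org/dc/terms/">
<head><title>$title</title></head>
<body>
  <!-- the page's own URL doubles as the record's persistent URI -->
  <div about="$uri">
    <h1 property="dc:title">$title</h1>
    <p>By <span property="dc:creator">$creator</span></p>
  </div>
</body>
</html>""")

def record_page(uri, title, creator):
    """Render one record as a self-contained, crawlable HTML page."""
    return PAGE.substitute(uri=uri, title=title, creator=creator)

html = record_page(
    "http://library.example.ac.uk/id/book/b1234",   # hypothetical URI
    "Harry Potter and the Philosopher's Stone",
    "Rowling, J. K.",
)
print(html)
```

Nothing beyond ordinary HTML is needed for the crawlability condition - the page just has to be linked from somewhere a robot can reach and not blocked by robots.txt.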

To look at a specific example, the University of Huddersfield provide persistent html pages for each book in their catalogue, and these are crawled by Google (and presumably others). This means that if I search Google for huddersfield harry potter library I find a book in the University Library in the top 10 search hits. However, if I do a similar search for most other university towns/cities I don't see library catalogue records in the top 10 (e.g. bath harry potter library).

I'd argue that being in Google results (other search engines are available) increases discoverability....

You say "If the JISC and RLUK really want to improve web-scale resource discovery of books in the library sector, they would be better off spending their money encouraging libraries to sign up to OCLC WorldCat and contributing their records there"

I definitely agree that getting library records into WorldCat is a good idea and will increase discoverability. The question for me is not whether we should do this, but the mechanism by which we do this. Rather than libraries contributing records to WorldCat, we should be looking at how libraries publish their records publicly, and how any aggregator (whether WorldCat or some other) can find and aggregate these records. For me this is a fundamental of the RDTF vision - Publish and Aggregate.
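The 'Publish and Aggregate' pattern described above can be sketched as a toy flow: each library publishes its records at stable URLs of its own, and an aggregator (WorldCat-like, but purely hypothetical here) fetches and merges them, deduplicating non-unique items - books keyed by ISBN in this sketch - while remembering which libraries hold each one. All URLs and data are made up for illustration.

```python
# Records as each library might publish them at a stable URL of its own.
# In a real flow the aggregator would fetch these over HTTP; here they are
# inlined so the sketch is self-contained.
published = {
    "http://lib-a.example.ac.uk/records.json": [
        {"isbn": "9780747532699", "title": "Harry Potter and the Philosopher's Stone", "holder": "Library A"},
    ],
    "http://lib-b.example.ac.uk/records.json": [
        {"isbn": "9780747532699", "title": "Harry Potter and the Philosopher's Stone", "holder": "Library B"},
    ],
}

def aggregate(sources):
    """Merge per-library records into one index keyed by ISBN.

    Non-unique items collapse to a single entry, with a list of the
    libraries that hold a copy - a crude 'appropriate copy' answer.
    """
    index = {}
    for url, records in sources.items():
        for rec in records:
            entry = index.setdefault(rec["isbn"],
                                     {"title": rec["title"], "holders": []})
            entry["holders"].append(rec["holder"])
    return index

catalogue = aggregate(published)
print(catalogue["9780747532699"]["holders"])
```

The design point is the one made in the comment: the libraries' only obligation is to publish; any aggregator, WorldCat or otherwise, can then do the finding and merging.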

One further note then I'll stop...

I'd argue that publishing an html page with a structured representation of the resource is simply a special case of what you described in the guidelines as the 'Community Formats' approach (as long as you also fulfilled a few extra conditions) - do you agree and if not, why not?

You say: "I was half hoping that someone would simply suggest collapsing our RDF data and Linked Data approaches into one".

Actually, I did, but maybe not bluntly enough: "I’m not convinced that the distinction between “RDF data” and “Linked Data” is going to add practical value to implementers."




eFoundations is powered by TypePad