« January 2011 | Main | March 2011 »

February 25, 2011

RDTF metadata guidelines - Limp Data or Linked Data?

Having just finished reading thru the 196 comments we received on the draft metadata guidelines for the UK RDTF I'm now in the process of wondering where we go next. We (Pete and I) have relatively little effort to take this work forward (a little less than 5 days to be precise) so it's not clear to me how best we use that effort to get something useful out for both RDTF and the wider community.

By the way... many thanks to everyone who took the time to comment. There are some great contributions and, if nothing else, the combination of the draft and the comments form a useful snapshot of the current state of issues around library, museum and archival metadata in the context of the web.

Here's my brief take on what the comments are saying...

Firstly, there were several comments asking about the target audience for the guidelines and whether, as written, they will be meaningful to... well... anyone I guess! It's worth pointing out that my understanding is that any guidelines we come up with thru the current process will be taken forward as part of other RDTF work. What that means is that the guidelines will get massaged into a form (or forms) that are digestable by the target audience (or audiences), as determined by other players with the RDTF activity. What we have been tasked with are the guidelines themselves - not how they are presented. We perhaps should have made this clearer in the draft guidelines. In short, I don't think the document, as written, will be put directly in-front of anyone who doesn't go to the trouble of searching it out explicitly.

Secondly, there were quite a number of detailed comments on particular data formats, data models, vocabularies and so on. This is great and I'm hopeful that as a result we can either extend the list of examples given at various points in the guidelines (or, in some cases, drop back to not having examples but simply say, "do whatever is the emerging norm here in your community").

Thirdly, the were some concerns about what we meant by "open". As we tried to point out in the draft, we do not consider this to be our problem - it is for other activity in RDTF to try and work out what "open" means - we just felt the need to give that word a concrete definition, so that people could understand where we were coming from for the purposes of these guidelines.

Finally, there were some bigger concerns - these are the things that are taxing me right now - that broadly fell into two, related, camps. Firstly, that the step between the community formats approach and the RDF data approach is too large (though no-one really suggested what might go in the middle). And secondly, that we are missing a trick by not encouraging the assignment of 'http' URIs to resources as part of the community formats approach.

As it stands, we have, on the one hand, what one might call Limp Data (MARC records, XML files, CVS, EAD and the rest) and, on the other, Linked Data and all that entails, with a rather odd middle ground that we are calling RDF data (in the current guidelines).

I was half hoping that someone would simply suggest collapsing our RDF data and Linked Data approaches into one - on the basis that separating them into two is somewhat confusing (but as far as I can tell no-one did... OK, I'm doing it now!). That would leave a two-pronged approach - community formats and Linked Data - to which we could add a 'community formats with http URIs' middle ground. My gut feel is that there is some attraction in such an approach but I'm not sure how feasible it is given the characteristics of many existing community formats.

As part of his commentary around encouraging http URIs (building a 'better web' was how he phrased it), Owen Stephens suggested that every resource should have a corresponding web page. I don't disagree with this... well, hang on... actually I do (at least in part)! One of the problems faced by this work is the fundamental difference between the library world and museums and archives. The former is primarily dealing with non-unique resources (at the item level), the latter with unique resources. (I know that I'm simplifying things here but bear with me). Do I think that resource discovery will be improved if every academic library in the UK (or indeed in the world) creates an http URI for every book they hold at which they serve a human-readable copy of their catalogue record? No, I don't. If the JISC and RLUK really want to improve web-scale resource discovery of books in the library sector, they would be better off spending their money encouraging libraries to sign up to OCLC WorldCat and contributing their records there. (I'm guessing that this isn't a particular popular viewpoint in the UK - at least, I'm not sure that I've ever heard anyone else suggest it - but it seems to me that WorldCat represents a valuable shared service approach that will, in practice, be hard to beat in other ways.) Doing this would both improve resource discovery (e.g. thru Google) and provide a workable 'appropriate copy' solution (for books). Clearly, doing this wouldn't help build a more unified approach across the GLAM domains but, as at least one commenter pointed out, it's not clear that the current guidelines do either. Note: I do agree with Owen that every unique museum and archival resource should have an http URI and a web page.

All of which, as I say, leaves us with a headache in terms of how we take these guidelines forward. Ah well... such is life I guess.

February 22, 2011

UCISA cloud computing event

I attended UCISA's Cloud Computing Seminar last week, a pretty good event overall though, like many 'cloud' events, there was quite a mix of IaaS (e.g. Amazon Web Services (AWS)) and SaaS (e.g. Google Apps) presentations so it sometimes felt like the programme was jumping around a bit. There is no doubt that the 'cloud' is generating a lot of interest at the moment, which is gratifying since it is also the topic of our annual Eduserv Symposium this year (May 12th in London).

Phil Richards, of Loughborough, talked about their partnership with Logicalis, building what he called a 'hybrid cloud' comprising both an on-campus virtualisation infrastructure and some in-the-cloud burst capacity (based on Logicalis' Cooperative Cloud Service). He seemed to make the point that research and teaching, being two of the key differentiators between universities, are inappropriate for outsourcing to the cloud. Well, yes, I tend to agree - but that doesn't mean that the compute infrastructure on which those things are built can't be outsourced? With the exception of some very specialised cases, I doubt that many people choose their place of research or learning based on the size of its data centre?

And, as David Wallom of OeRC pointed out in his talk about the FleSSR project, outsourcing to cloud infrastructure is already happening, albeit in a rather ad hoc and bottom-up way. He suggested that most institutions (certainly research intensive institutions) will probably have around 200-300 researchers that are already using AWS (or equivalent) for some aspects of their research. So, for at least some researchers, the decision to use cloud infrastructure has already been taken, often on the back of a personal credit card! The problem for universities is that it is happening in an unmanaged (and in some senses unmanagable) way.

Given the announcement of UMF funding a couple of weeks ago, which includes a pilot "virtual server infrastructure (a 'cloud')" hosted by us, and given our involvement in the FleSSR project, we now fall rather squarely into the camp of those people thinking about building shared 'cloud' infrastructure services for the education sector. Understanding both the needs of those individual researchers who are currently choosing to go to Amazon and those of university IT Services who likely have more strategic 'virtualisation' issues in mind brings, I suspect, some interesting tensions, not least around business models (which was the topic of Matt Johnson's talk), pricing models and, ultimately, sustainability.

Interestingly, JANET's new role as a broker for the "procurement of shared virtual servers and data centre capacity" (as part of the UMF funding announcement) got positive support on a couple of occassions during the day, with the speakers from UCD saying that they'd like to see a similar service being set up in Ireland.

So-called 'cloud bursting' was also refered to several times during the day as being an attractive option. This approach, like that adopted by Loughborough, retains virtualisation/compute and storage capacity in-house but uses the cloud to meet demand when it exceeds local capacity ('bursting'). This is also the architectural approach being investigated by the FleSSR project. What is not clear to me, when we view the UK HE community as a whole, is the extent to which this kind of approach is able to achieve such significant overall cost savings when compared to a more whole-hearted 'push everything out to a shared cloud provider' model, nor the ease with which such cloud-bursting services can become sustainable.

From our perspective, the issues around business models, costing and sustainability are taxing our minds at least as much as the nuts and bolts of building the infrastructure as we consider the future of both FleSSR and our UMF pilot. More anon...

SALDA

As I've mentioned here before, I'm contributing to a project called LOCAH, funded by the JISC under its 02/10 call Deposit of research outputs and Exposing digital content for education and research (JISCexpo), working with MIMAS and UKOLN on making available bibliographic and archival metadata as linked data.

Partly as a result of that work, I was approached by Chris Keene from the University of Sussex to be part of a bid they were preparing to another recent JISC call, 15/10: Infrastructure for education and research programme, under the "Infrastructure for resource discovery" strand, which seeks to implement some of the approaches outlined by the Resource Discovery Task Force.

The proposal was to make available metadata records from the Mass Observation Archive, data currently managed in a CALM archival management system, as linked data. I'm pleased to report that the bid was successful, and the project, Sussex Archive Linked Data Application (SALDA), has been funded. It's a short project, running for six months from now (February) to July 2011. There's a brief description on the JISC Web site here, a SALDA project blog has just been launched, and the project manager Karen Watson provides more details of the planned work in her initial post there.

I'm looking forward to working with the Sussex team to adapt and extend some of the work we've done with LOCAH for a new context. I expect most information will appear over on the SALDA blog, but I'll try to post the occcasional update on progress here, particularly on any aspects of general interest.

February 11, 2011

Term-based thesauri and SKOS (Part 1)

I'm currently doing a piece of work on representing a thesaurus as linked data. I'm working on the basis that the output will make use of the SKOS model/RDF vocabulary. Some of the questions I'm pondering are probably SKOS Frequently Asked Questions, but I thought it was worth working through my proposed solution and some of the questions I'm pondering here, partly just to document my own thought processes and partly in the hope that SKOS implementers with more experience than me might provide some feedback or pointers.

SKOS adopts a "concept-based" approach (i.e. the primary focus is on the description of "concepts" and the relationships between them); the source thesaurus uses a "term-based" approach based on the ISO 2788 standard. I found the following sources provided helpful summaries of the differences between these two approaches:

Briefly (and simplifying slightly for the purposes of this discussion - I've omitted discussion of documentary attributes like "scope notes"), in the term-based model (and here I'm dealing only with the case of a simple monolingual thesaurus):

  • the only entities considered are "terms" (words or sequences of words)
  • terms can be semantically equivalent, each expressing a single concept, in which case they are distinguished as "preferred terms"/descriptors or "non-preferred terms"/non-descriptors, using USE (non-preferred to preferred) and UF (use for) (preferred to non-preferred) relations. Each non-preferred term is related to a single preferred term; a preferred term may have many non-preferred terms
  • a preferred term can be related by ("hierarchical") BT (broader term) and NT (narrower term) relations to indicate that it is more specific in meaning, or more general in meaning, than another term
  • a preferred term can also be related by an ("associative") RT (related term) relation to a second term, where there is some other relationship which may be useful for retrieval (e.g. overlapping meanings)

In the SKOS (concept-based) model:

  • the primary entities considered are "concepts", "units of thought"
  • a concept is associated with labels, of which at most one (in the monolingual case) is the preferred label, the others alternative labels or hidden labels
  • a concept can be related by "has broader", "has narrower" and "has related" relations to other concepts

The concept-based model, then, is explicit that the thesaurus "contains" two distinct types of thing: concepts and labels. In the base SKOS model, the labels are modelled as RDF literals, so the expression of relationships between labels is not supported. SKOS-XL provides an extension to SKOS which models labels as RDF resources in their own right, which can be identified, described and linked.

To represent a term-based thesaurus in SKOS, then, it is necessary to make a mapping from the term-based model with its term-term relationships to the concept-based model with its concept-label and concept-concept relationships. (An alternative approach would be to represent the thesaurus in RDF using a model/vocabulary which expresses directly the "native" "term-based" model of the source thesaurus. I'm working on the basis that using the SKOS/concept-based approach will facilitate interoperability with other thesauri/vocabularies published on the Web.

Core Mapping

The key principle underlying the mapping is the notion that, in the term-based approach, a preferred term and its multiple related non-preferred terms are acting as labels for a single concept.

Each set of terms related by USE/UF relations in the term-based approach, then, is mapped to a single SKOS concept, with the single "preferred term" becoming the "preferred label" for that concept and the (possibly multiple) "non-preferred terms" each becoming "alternative labels" for that same concept.

Consider the following example, where the term "Political violence" is a preferred term for "Civil violence" and "Violent unrest", and has a broader term "Violence" and narrower term "Terrorism":


Civil violence
USE Political violence

Political violence
UF Civil violence
UF Violent protest
BT Violence
NT Terrrorism

Terrorism
BT Political violence

Violence
NT Political violence

Violent protest
USE Political violence

(Leaving aside for a moment the question of how the URIs might be generated) an SKOS representation takes the form:


@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix con: <http://example.org/id/concept/> .

con:C2 a skos:Concept;
       skos:prefLabel "Political violence"@en;
       skos:altLabel "Civil violence"@en;
       skos:altLabel "Violent protest"@en;
       skos:broader con:C4;
       skos:narrower con:C3 .

con:C3 a skos:Concept;
       skos:prefLabel "Terrorism"@en;
       skos:broader con:C2 .

con:C4 a skos:Concept;
       skos:prefLabel "Violence"@en;
       skos:narrower con:C2 .

A graphical representation perhaps makes the links clearer (to keep things simple, I've omitted the arcs and nodes corresponding to the rdf:type triples here):

Fig1

One of the consequences of this approach is that some of the "direct" relationships between terms are reflected in SKOS as "indirect" relationships. In the term-based model, a non-preferred term is linked to a preferred term by a simple "USE" relationship. In the SKOS example, to find the "preferred term" for the term "Violent protest", one finds the node for the concept of to which it is linked via the skos:altLabel property, and then locates the literal which is linked to that concept via an skos:prefLabel property.

SKOS-XL and terms as Labels

As mentioned above, the SKOS specification defines an optional extension to SKOS which supports the representation of labels as resources in their own right, typically alongside the simple literal representation.

Using this extension, then the example above might be expressed as:


@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix con: <http://example.org/id/concept/> .
@prefix term: <http://example.org/id/term/> .

term:T1 a skosxl:Label;
        skosxl:literalForm "Civil violence"@en.

con:C2 a skos:Concept;
       skos:prefLabel "Political violence"@en;
       skosxl:prefLabel term:T2;
       skos:altLabel "Civil violence"@en;
       skosxl:altLabel term:T1;
       skos:altLabel "Violent protest"@en;
       skosxl:altLabel term:T5;
       skos:broader con:C4;
       skos:narrower con:C3 .

term:T2 a skosxl:Label;
        skosxl:literalForm "Political violence"@en.

con:C3 a skos:Concept;
       skos:prefLabel "Terrorism"@en;
       skosxl:prefLabel term:T3;
       skos:broader con:C2 .

term:T3 a skosxl:Label;
        skosxl:literalForm "Terrorism"@en.
       
con:C4 a skos:Concept;
       skos:prefLabel "Violence"@en;
       skosxl:prefLabel term:C4;
       skos:narrower con:C2 .

term:T4 a skosxl:Label;
        skosxl:literalForm "Violence"@en.

term:T5 a skosxl:Label;
        skosxl:literalForm "Violent protest"@en.

And in graphical form (as above with the rdf:type triples omitted):

Fig2

That may be rather a lot to take in, so below I've shown a subgraph, including only the labels related to a single concept:

Fig3

Using the SKOS-XL extension in this way does introduce some additional complexity. On the other hand:

  1. it introduces a set of resources (of type skosxl:Label) that have a one-to-one correspondence to the terms in the source thesaurus, so perhaps makes the mapping between the two models more explicit and easier for a human reader to understand.
  2. it makes the labels themselves URI-identifiable resources which can be referenced in this data and in data created by others. So, it becomes possible to make assertions about relationships between the labels, or between labels and other things.

Coincidentally, as I was writing this post, I note that Bob duCharme has posted on the use of SKOS-XL for providing metadata "about" the individual labels, distinct from the concepts. So I might add a triple to indicate, say, the date on which a particular label was created. I don't think there is an immediate requirement for that in the case I'm working on, but there may be in the future.

Another possible use case is the ability for other parties to make links specifically to a label, rather than to the concept which it labels. A document could be "tagged", for example, by association with a particular label from a particular thesaurus, rather than just with the plain literal.

Relationships between Labels

The SKOS-XL vocabulary defines only a single generic relationship between labels (the skosxl:labelRelation property) with the expectation that implementers define subproperties to handle more specific relationship types.

The example given of such a relationship in the SKOS spec is that of a case where one label is an acronym for another label.

One thing I wondered about was introducing properties here to reflect the USE/UF relations in the term-based model e.g. in the above example:


@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix con: <http://example.org/id/concept/> .
@prefix term: <http://example.org/id/term/> .
@prefix ex: <http://example.org/prop/> .

term:T1 a skosxl:Label;
        skosxl:literalForm "Civil violence"@en;
        ex:use term:T2 .

term:T2 a skosxl:Label;
        skosxl:literalForm "Political violence"@en;
        ex:useFor term:T1 ;
        ex:useFor term:T5 .
       
term:T5 a skosxl:Label;
        skosxl:literalForm "Violent protest"@en;
        ex:use term:T2 .

This wouldn't really add any new information; rather it just provides a "shortcut" for the information that is already present in the form of the "indirect" preferred label/alternative label relationships, as noted above.

I hesitate slightly about adding this data, partly on the grounds that it seems slightly redundant, but also on the grounds that this seems like a rather different "category" of relationship from the "is acronym for" relationship. This point is illustrated in figure 4 in the SWAD-E Review of Thesaurus Work document, which divides the concept-based model into three layers:

  • the (upper in the diagram) "conceptual" layer, with the broader/narrower/related relations between concepts
  • the (lower) "lexical" layer, with terms/labels and the relations between them
  • and a middle "indication"/"implication" layer for the preferred/non-preferred relations between concepts and terms/labels

In that diagram, the example lexical relationships exist independently of the preferred/non-preferred relationships e.g. "AIDS" is an abbreviation for "Acquired Immunodeficiency Syndrome", regardless of which is considered the preferred term in the thesaurus. With the use/useFor relationships here, this would not be the case; they would indeed vary with the preferred/non-preferred relationships.

So, having thought initially it might be useful to include these relationships, I'm becoming more rather less sure.

And without them, I'm not sure whether, for this case, the use of the SKOS-XL would be justified - though it may well be, for the other purposes I mentioned above. So, any thoughts on practice in this area would be welcome.

URI Patterns

Following the linked data principles, I want to assign http URIs for all the resources, so that descriptions of those resources can be provided using HTTP. If we go with the SKOS-XL approach, that means assigning http URIs for both concepts and labels.

The input data includes, for each term within the thesaurus, an identifier code, unique within the thesaurus, and which remains stable over time, i.e. as changes are made to the thesaurus a single term remains associated with the same code. (I'll explore the question of how the thesaurus changes over time in a follow-up post.) So in fact the example above looks something like:


Civil violence
USE Political violence
TNR 1

Political violence
UF Civil violence
UF Violent protest
BT Violence
NT Terrrorism
TNR 2

Terrorism
BT Political violence
TNR 3

Violence
NT Political violence
TNR 4

Violent protest
USE Political violence
TNR 5

The proposal is to use that identifier as the basis of URIs, for both labels/terms and concepts:

  • Term URI Pattern: http://example.org/id/term/T{termid}
  • Concept URI Pattern: http://example.org/id/concept/C{termid}

The generation of label/term URIs, then, is straightforward, as each term (with identifier code) in the input data maps to a single label/term in the SKOS RDF data. The concept case is a little more complicated. As I noted above, the mapping between the term-based model and the concept-based model means that there is a many-to-one relationship between terms and concepts: several terms are related to a single concept, one as preferred label, others as alternative labels. In my first cut at this at least (more on this in a follow-up post), I've generated the URI of the concept based on the identifier code associated with the preferred/descriptor term.

So in my example above, the three terms "Civil violence, "Political violence", and "Violent protest" are mapped to labels for a single concept, and the URI of that concept is constructed from the identifier code of the preferred/descriptor term ("Political violence").

Summary

I've tried to cover the basic approach I'm taking to generating the SKOS/SKOS-XL RDF data from the term-based thesaurus input. In this post, I've focused only on the thesaurus as a static entity. In a follow-up post, I'll look in some detail at the changes which can occur over time, and examine any implications for the choices made here, particularly for our choices of URI patterns.

February 03, 2011

Metadata guidelines for the UK RDTF - please comment

As promised last week, our draft metadata guidelines for the UK Resource Discovery Taskforce are now available for comment in JISCPress. The guidelines are intended to apply to UK libraries, museums and archives in the context of the JISC and RLUK Resource Discovery Taskforce activity.

The comment period will last two weeks from tomorrow and we have seeded JISCPress with a small number of questions (see below) about issues that we think are particularly worth addressing. Of course, we welcome comments on all aspects of the guidelines, not just where we have raised issues. (Note that you don't have to leave public comments in JISCPress if you don't want to - an email to me or Pete will suffice. Or you can leave a comment here.)

The guidelines recommend three approaches to exposing metadata (to be used individually or in combination), referred to as:

  1. the community formats approach;
  2. the RDF data approach;
  3. the Linked Data approach.

We've used words like 'must' and 'should' but it is worth noting that at this stage we are not in a position to say how these guidelines will be applied - if at all. Nor whether there will be any mechanisms for compliance put in place. On that basis, treat phrases like 'must do this' as meaning, 'you must do this for your metadata to comply with one or other approach as recommended by these guidelines' - no more, no less. I hope that's clear.

When we started this work, we we began by trying to think about functional requirements - always a good place to start. In this case however, that turned out not to make much sense. We are not starting from a green field here. Lots of metadata formats are already in use and we are not setting out with the intent of changing current cataloguing practice across libraries, museums and archives. What we can say is that:

  1. we have tried to keep as many people happy as possible (hence the three approaches), and
  2. we want to help libraries, museums and archives expose existing metadata (and new metadata created using existing practice) in ways that support the development of aggregator services and that integrate well with the web (of data).

As mentioned previously, the three approaches correspond roughly to the 3-star, 4-star and 5-star ratings in the W3C's Linked Data Star Ratings Scheme. To try and help characterise them, we prepared the following set of bullet points for a meeting of the RDTF Technical Advisory Group earlier this week:

The community data approach

  • the “give us what you’ve got” bit
  • share existing community formats (MARC, MODS, BibTeX, DC, SPECTRUM, EAD, XML, CSV, JSON, RSS, Atom, etc.) over RESTful HTTP or OAI-PMH
  • for RESTful HTTP, use sitemaps and robots.txt to advertise availability and GZip for compression
  • for CSV, give us a column called ‘label’ or ‘title’ so we’ve got something to display and a column called 'identifier' if you have them
  • provide separate records about separate resources
  • simples!

The RDF data approach

  • use RDF
  • model according to FRBR, CIDOC CRM or EAD and ORE where you can
  • re-use existing vocabularies where you can
  • assign URIs to everything of interest
  • make big buckets of RDF (e.g. as RDF/XML, N-Tuples, N-Quads or RDF/Atom) available for others to play with
  • use Semantic Sitemaps and the Vocabulary of Interlinked Datasets (VoID) to advertise availability of the buckets

The Linked Data approach

Dunno if that is a helpful summary but we look forward to your comments on the full draft. Do your worst!

For the record, the issues we are asking questions about mainly fall into the following areas:

  • is offering a choice of three approaches helpful?
  • for the community formats approach, are the example formats we list correct, are our recommendations around the use of CSV useful and are JSON and Atom significant enough that they should be treated more prominently?
  • does the suggestion to use FRBR and CIDOC CRM as the basis for modeling in RDF set the bar too high for libraries and museums?
  • where people are creating Linked Data, should we be recommending particular RDF datasets/vocabularies as the target of external links?
  • do we need to be more prescriptive about the ways that URIs are assigned and dereferenced?

Note that a printable version of the draft is also available from Google Docs.

About

Search

Loading
eFoundations is powered by TypePad