March 03, 2012

Moving on

As some of you may have heard by now, yesterday was my last day working at Eduserv. In a few weeks, I'll be taking up a post at Cambridge University Library, working on metadata for their Digital Library.

This feels like a fantastic opportunity, and I'm very excited about the move. I hope it will allow me to apply some of my existing knowledge and skills and also gain some experience in new areas - and to contribute to the development of a high quality resource. I very much enjoyed meeting the team there, and the library's digital collections are superb - as I think the set of materials already available shows. It's not often I get to prepare for a job interview by listening to Melvyn Bragg on Radio 4 talking about Isaac Newton's accounts of his experiments (not always for the faint-hearted!).

I'm sure it's no secret that, as Eduserv's focus has changed, the divergence from my own experience and interests has become more marked - probably starting right back with the demise of the Eduserv Foundation, but becoming particularly apparent over the last 12-18 months, with the strong shift of emphasis towards the provision of cloud services, and with the rest of my Research & Innovation Group colleagues now working pretty much exclusively in this area.

This move will also mark the end of a period of over 11 years working alongside Andy, first at UKOLN and then at Eduserv. We haven't worked closely together on a project for a while now - the last piece of work we did jointly was probably authoring the JISC RDTF/Discovery metadata guidelines. But on the occasions we did collaborate, I always valued and enjoyed the experience, and I like to think we complemented each other and made a good team.

I'll miss Andy's clear-sightedness and common sense approach - though I imagine I'll still be firing off late night emails saying, "OK, here's the thing: I'm a bit stuck here. I could take solution X, or I could take solution Y. What do you think?". I'm also very grateful for the support I've received for my work and ideas, from Andy, and also from the RIG team leader, Matt Johnson. I wish them and the rest of the team all the best with their future work.

I've had so much work to do over the last few weeks and have been working ridiculous (even by my standards!) hours. Sitting here at home looking out at a sunny spring morning in Bristol, I feel it's the first time in several weeks I've been able to catch my breath, and really start to look forward.

Andy and I haven't had a chance to discuss what this change means for this blog, but I dare say we will manage that in the next few days and something will appear here.

Meanwhile, I'm also trying to take this opportunity to reorganise various bits of my personal Web presence (like I need more to do when I have to tie up lots of things and organise a move across the country...). I'm not sure how that is all going to end up, but in the short term the best places to find me are probably on Twitter, identi.ca or Diaspora.

January 10, 2012

Introducing Bricolage

My last post here (two months ago - yikes, must do better...) was an appeal to get in touch from anyone who might be interested in my contributing to a project under the JISC call 16/11: JISC Digital infrastructure programme. I'm pleased to say that Jasper Tredgold of ILRT at the University of Bristol contacted me about a proposal he was putting together for a project called Bricolage, with the prospect of my doing some consultancy. The proposal was to work with the University Library's Special Collections department and the University Geology Museum to make available metadata for two collections - the archive of Penguin Ltd. and the specimen collection of the Geology Museum - as Linked Open Data.

And just before I went off for the Christmas break, Jasper let me know that the proposal had been accepted and the project was being funded. I'm very pleased to have another opportunity to carry on applying some of what I've learned in the other JISC-funded projects I've contributed to recently, and also to explore some new categories of data. It's also nice to be working with a local university - I worked on a few projects with folks from ILRT during my time at UKOLN, and from a selfish perspective I look forward to project meetings which involve a twenty-minute walk up the hill for me rather than a 7am start and a three or four hour train journey!

The project will start in February and run through to July. I'm sure there'll be a project blog once we get going and I'll add a URI here when it is available.

November 04, 2011

JISC Digital Infrastructure programme call

JISC currently has a call, 16/11: JISC Digital infrastructure programme, open for project proposals in a number of "strands"/areas, including the following:

  • Resource Discovery: "This programme area supports the implementation of the resource discovery taskforce vision by funding higher education libraries archives and museums to make open metadata about their collections available in a sustainable way. The aim of this programme area is to develop more flexible, efficient and effective ways to support resource discovery and to make essential resources more visible and usable for research and learning."

This strand advances the work of the UK Discovery initiative, and is similar to the "Infrastructure for Resource Discovery" strand of the JISC 15/10 call under which the SALDA project (in which I worked with the University of Sussex Library on the Mass Observation Archive data) was funded. There is funding in this strand for up to ten projects, at between £25,000 and £75,000 each.

First, I should say this is a great opportunity to explore this area of work and I think we're fortunate that JISC is able to fund this sort of activity. A few particular things I noticed about the current call:

  • a priority for "tools and techniques that can be used by other institutions"
  • a focus on unique resources/collections not duplicated elsewhere
  • should build on lessons of earlier projects, but must avoid duplication/reinvention
  • a particular mention of "exploring the value of integrating structured data into webpages using microformats, microdata, RDFa and similar technologies" as an area in scope
  • an emphasis on sharing the experience/lessons learned: "The lessons learned by projects funded under this call are expected to be as important as the open metadata produced. All projects should build sharing of lessons into their plans. All project reporting will be managed by a project blog. Bidders should commit to sharing the lessons they learn via a blog"

Re that last point, as I've said before, one of the things I most enjoyed about the SALDA and LOCAH projects was the sense that we were interested in sharing the ideas as well as getting the data out there.

I'm conscious the clock is ticking towards the submission deadline, and I should have posted this earlier, but if anyone reading is considering a proposal and thinks that I could make a useful contribution, I'd be interested to hear from you. My particular areas of experience/interest are around Linked Data, and are probably best reflected by the posts I made on the LOCAH and SALDA blogs, i.e. data modelling, URI pattern design, identification/selection of useful RDF vocabularies, identification of potential relationships with things described in other datasets, construction of queries using SPARQL, etc. I do have some familiarity with RDFa, rather less with microdata and microformats. I'm not a software developer, but I can do a little bit of XSLT (and maybe enough PHP to be dangerous, i.e. to hack together rather flaky basic demonstrators). And I'm not a technical architect, but I did get some experience of working with triple stores in those recent projects.

My recent work has been mainly with archival metadata, and I'd be particularly interested in further work which complements that. I'm conscious of the constraint in the call of not repeating earlier work, so I don't think "reapplying" the sort of EAD to RDF work I did with LOCAH and SALDA would fit the bill. (I'd love to do something around the event/narrative/storytelling angle that I wrote about recently here, for example.) Having said that, I certainly don't mean to limit myself to archival data. Anyway, if you think I might be able to help, please do get in touch (pete.johnston@eduserv.org.uk).

October 05, 2011

Storytelling, archives and linked data

Yesterday on Twitter I saw Owen Stephens (@ostephens) post a reference to a presentation titled "Every Story has a Beginning", by Tim Sherratt (@wragge), "digital historian, web developer and cultural data hacker" from Canberra, Australia.

The presentation content is available here, and the text of the talk is here. I think you really need to read the text in one window and click through the presentation in another. I found it fascinating, and pretty inspiring, from several perspectives.

First, I enjoyed the form of the presentation itself. The content is built up incrementally on the screen, with an engaging element of "dynamism" but kept simple enough to avoid the sort of vertiginous barrage that seems to characterise the few Prezi presentations I've witnessed. And perhaps most important of all, the presentation itself is very much "a thing of the Web": many of the images are hyperlinked through to the "live" resources pictured, providing not only a record of "provenance" for the examples, but a direct gateway into the data sources themselves, allowing people to explore the broader context of those individual records or fragments or visualisations.

Second, it provides some compelling examples of how digitised historical materials and data extracted or derived from them can be brought together in new combinations and used to uncover and (re)tell stories - and stories not just of the "famous", the influential and powerful, but of ordinary people whose life events were captured in historical records of various forms. (Aside: Kate Bagnall has some thoughtful posts looking at some of the ethical issues of making people who were "invisible" "visible").

Finally, what really intrigued me from the technical perspective was that - if I understand correctly - the presentation is being driven by a set of RDF data. (Tim said on Twitter he'd post more explaining some of the mechanics of what he has done, and I admit I'm jumping the gun somewhat in writing this post, so I apologise for any misinterpretations.) In his presentation, Tim says:

What we need is a data framework that sits beneath the text, identifying people, dates and places, and defining relationships between them and our documentary sources. A framework that computers could understand and interpret, so that if they saw something they knew was a placename they could head off and look for other people associated with that place. Instead of just presenting our research we’d be creating a whole series of points of connection, discovery and aggregation.

Sounds a bit far-fetched? Well it’s not. We have it already — it’s called the Semantic Web.

The Semantic Web exposes the structures that are implicit in our web pages and our texts in ways that computers can understand. The Linked Data movement takes the basic ideas of the Semantic Web and turns them into a collaborative activity. You share vocabularies, so that other people (and computers) know when you’re talking about the same sorts of things. You share identifiers, so that other people (and computers) know that you’re talking about a specific person, place, object or whatever.

Linked Data is Storytelling 101 for computers. It doesn’t have the full richness, complexity and nuance that we invest in our narratives, but it does at least help computers to fit all the bits together in meaningful ways. And if we talk nice to them, then they can apply their newly-acquired interpretative skills to the things that they’re already good at — like searching, aggregating, or generating the sorts of big pictures that enable us to explore the contexts of our stories.

So, if we look at the RDF data for Tim's presentation, it includes "descriptions" of many different "things", including people, like Alexander Kelley, the subject of his first "story" (to save space, I've skipped the prefix declarations in these snippets but I hope they convey the sense of the data):

story:kelley a foaf1:Person ;
     bio:death story:kelley_death ;
     bio:event
         story:kelley_cremation,
         story:kelley_discharge,
         story:kelley_enlistment,
         story:kelley_reunion,
         story:kelley_wounded_1,
         story:kelley_wounded_2 ;
     foaf1:familyName "Kelley"@en-US ;
     foaf1:givenName "Alexander"@en-US ;
     foaf1:isPrimaryTopicOf story:kelley_moa ;
     foaf1:name "Alexander Dewar Kelley"@en-US ;
     foaf1:page 
       <http://discontents.com.au/shoebox/every-story-has-a-beginning> . 

There is data about events in his life:

story:kelley_discharge a bio:Event ;
     rdfs:label 
       "Discharged from the Australian Imperial Force."@en-US ;
     dc:date "1918-11-22"@en-US . 

story:kelley_enlistment a bio:Event ;
     rdfs:label 
       "Enlistment in the Australian Imperial Force for 
        service in the First World War."@en-US ;
     dc:date "1916-01-22"@en-US . 
     
story:kelley_ww1_service a bio:Interval ;
     bio:concludingEvent story:kelley_discharge ;
     bio:initiatingEvent story:kelley_enlistment ;
     foaf1:isPrimaryTopicOf story:kelley_ww1_record . 

and about the archival materials that record/describe those events:

story:kelley_ww1_record a locah:ArchivalResource ;
     locah:accessProvidedBy 
       <http://dbpedia.org/resource/National_Archives_of_Australia> ;
     dc:identifier "B2455, KELLEY ALEXANDER DEWAR"@en-US ;
     bibo:uri 
       "http://www.aa.gov.au/cgi-bin/Search?O=I&Number=7336927"@en-US . 

The presentation itself, the conference at which it was presented, various projects and researchers mentioned - all of these are also "things" described in the data.

I'd be interested in hearing more about how this data was created, the extent to which it was possible to extract the description of people, events, archival resources etc directly from existing data sources and the extent to which it was necessary to "hand craft" parts of it.

But I get very excited when I think about the potential in this sort of area if (when!?) we do have the data for historical records available as linked data (and available under open licences that support its free use).

Imagine having a "story building tool" which enables a "narrator" to visit a linked data page provided by the National Archives of Australia or the Archives Hub or one of the other projects Tim refers to, and to select and "intelligently clip" a chunk of data which you can then arrange into the "story" you are constructing - in much the way that bookmarklets for tools like Tumblr and Posterous enable you to do for text and images now. That "clipped chunk of data" could include a description of a person and some of their life events and metadata about digitised archival resources, including URIs of images - as in Tim's examples. You might follow pointers to other datasets from which additional data could be pulled. You might supplement the "clipped" data with your own commentary. Then imagine doing the same with data from the BBC describing a TV programme or radio broadcast related to the same person or events, or with data from a research repository describing papers about the person or events. The tool could generate some "provenance data" for each "chunk" saying "this subgraph was part of that graph over there, which was created by agency A on date D" in much the way that the blogging bookmarklets provide backlinks to their sources.
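As a rough sketch (the ex: terms and URIs here are purely hypothetical, and real provenance vocabularies would offer richer options), the "provenance data" attached to a clipped chunk might look something like:

@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix ex: <http://example.org/clips/> .

# Hypothetical: "this subgraph was part of that graph over there,
# which was created by agency A on date D"
ex:chunk1 dcterms:source ex:graphX .
ex:graphX 
        dcterms:creator <http://example.org/id/agency/A> ;
        dcterms:created "2011-10-05" .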

And the same components might be reorganised, or recombined with others, to tell different stories, or variants of the same story.

Now, yes, I'm sure there are some thorny issues to grapple with here, and coming up with an interface that balances usability and the required degree of control may be a challenge - so maybe I'm getting carried away with my enthusiasm, but it doesn't seem to be an entirely unreasonable proposition.

I think it's important here that, as Tim emphasises towards the end of his text, it is the human "narrator", not an algorithm, who decides on the structure of the story and selects its components and brings them into (possibly new) relationships with each other.

I'm aware that there's other work in this area of narratives and stories, particularly from some of the people at the BBC, but I admit I haven't been able to keep up with it in detail. See e.g. Tristan Ferne on "The Mythology Engine" and Michael Smethurst's thoughtful "Storytellin'".

For me, Tim's very concrete examples made the potential of these approaches seem very real. They suggest a vision of Linked Data not as one more institutional "output", but as a basis for individual and collective creativity and empowerment, for the (re)telling of stories that have been at least partly concealed - stories which may even challenge the "dominant" stories told by the powerful. It seems all too infrequent these days that I come across something that reminds me why I bothered getting interested in metadata and data on the Web in the first place: Tim's presentation was one of those things.

September 19, 2011

Things & their conceptualisations: SKOS, foaf:focus & modelling choices

I thought I'd try to write up some thoughts around an issue which I've come across in a few different contexts recently, and which as a shorthand I sometimes think of as "the foaf:focus question". It was prompted mainly by:

  • my work on modelling the Archives Hub data during the LOCAH project, and looking at datasets to which we wanted to make links, such as VIAF;
  • looking at the data model for the recent Linked Open BNB dataset from the British Library, and how some Dublin Core properties were being used, and some email discussions I had around that;
  • a recent message by Dan Brickley to the foaf-dev mailing list, explaining how the design of some FOAF properties was conditioned by the context at the time, and reflecting on how that context had changed, and what the implications of those changes might be.

Rather by chance on Friday evening, just as I was about to try to tie up what had become a rather long and rambling post, I noticed a conversation on Twitter, initiated by John Goodwin (@gothwin), which I think addressed the broader issue I'd been circling around without quite pinning down: the use of different "modelling styles" and the issues which arise as a result when we try to link or merge data.

After much chopping and changing, the post adopts a very roughly "chronological" approach. The initial parts cover areas and activities that I wasn't directly involved in at the time, so I am providing my own retrospective interpretation based on my reading of the sources rather than a first-hand account, and I apologise in advance for any omissions or misrepresentations.

FOAF and "interests"

The FOAF Project was an initiative launched by a community of Semantic Web enthusiasts back in 2000, which explored the use of the - then newly emerging - Resource Description Framework specification to express information about individuals, their contact details and their interests and projects - the sort of information that was typically presented on a "personal home page" - and also some of the practical considerations in providing and consuming such information on the Web as RDF. The principal "deliverable" of the project is the Friend of a Friend (FOAF) RDF vocabulary, which continues to evolve and is now very widely used.

As Dan Brickley notes in his recent post, some of the FOAF properties may seem slightly "unwieldy" to a newcomer looking at them from the perspective of 2011, in part because their design was shaped by the context of how the Web was being used at the time of their creation, perhaps nine or ten years ago. At that point, as Dan notes, while there were URIs available for Web documents, and a growing recognition of the importance of maintaining the stability of those URIs, the use of http URIs to identify things other than documents was much less widely adopted, and we often lacked stable URIs for those things of other types that we wanted to "talk about".

One of the use cases covered by FOAF is to express the "interests" of an individual - where that "interest" might be an idea, a place, a person, an event or series of events, anything at all. To work around the possible absence of a URI for that thing, FOAF adopted a convention of "indirection" in some of its early properties. So, for example, the foaf:interest property expresses a relation, not between agent and idea/place/person (etc), but between agent and document: it says "this agent is interested in the topic of this page - the thing the page is 'about'" - where that might be anything at all. Using this convention, the topic itself is not explicitly referenced, so the question of its URI does not arise.

So, for example, to express an interest in the Napoleonic Wars, one might make use of the URI of the Wikipedia page 'about' that thing, and say:

@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix person: <http://example.org/id/person/> .

person:fred 
        foaf:interest
          <http://en.wikipedia.org/wiki/Napoleonic_Wars> .

A second property, foaf:topic_interest, does allow for the expression of that "direct" relationship, linking the agent to the thing of interest - which again might be anything at all. (I'm not sure whether these two properties were created at the same time or whether one preceded the other). Even in the absence of URIs for concepts and people and places, RDF allows for the use of "blank nodes" to refer to such things.

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix person: <http://example.org/id/person/> .

person:fred 
        foaf:topic_interest [ rdfs:label "Napoleonic Wars"@en ] .

However, a blank node is limited in its scope as an identifier to the graph within which it is found: if Fred provides a blank node for the notion of the Napoleonic Wars in his graph and Freda provides a blank node for an interest in her graph, I can't tell (from that information alone) whether those two nodes are references to the same thing or to two different things (e.g. the historical event and a book about the event). Again, historically, one solution to this problem was to introduce the URI of a document, together with some inferencing based on OWL:

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix person: <http://example.org/id/person/> .

person:fred 
        foaf:topic_interest 
          [ rdfs:label "Napoleonic Wars"@en ;
            foaf:isPrimaryTopicOf 
              <http://en.wikipedia.org/wiki/Napoleonic_Wars> ] .

According to the FOAF documentation for foaf:isPrimaryTopicOf:

The isPrimaryTopicOf property is inverse functional: for any document that is the value of this property, there is at most one thing in the world that is the primary topic of that document. This is useful, as it allows for data merging

i.e. if Freda's graph says:

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix person: <http://example.org/id/person/> .

person:freda 
        foaf:topic_interest 
          [ rdfs:label "Napoleonic Wars"@en ;
            foaf:isPrimaryTopicOf 
              <http://en.wikipedia.org/wiki/Napoleonic_Wars> ] .

then my application can conclude that they are indeed both interested in the same thing - though that does depend on that application having some "built-in knowledge" of the OWL inferencing rules (or access to another service which does).
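The "built-in knowledge" required here amounts to the FOAF vocabulary's own declaration that the property is inverse functional - a single triple which licenses an OWL reasoner to conclude that the two blank nodes denote one and the same resource:

@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

# From the FOAF vocabulary: any two resources sharing a value for an
# inverse functional property must be the same resource
foaf:isPrimaryTopicOf a owl:InverseFunctionalProperty .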

http URIs for Things

The "httpRange-14 resolution" by the W3C Technical Architecture Group, the publication of the W3C Note on Cool URIs for the Semantic Web, the adoption of the principles of Linked Data and the emergence of a large number of datasets based on those principles have, of course, changed the landscape considerably, and the use of http URIs to identify things other than documents has become commonplace - even if there remain concerns about the practical challenges of implementing some of the recommended techniques.

So, now DBpedia assigns a distinct http URI for the thing the Wikipedia page http://en.wikipedia.org/wiki/Napoleonic_Wars "is about", and provides a description of that thing in which it says (amongst other things):

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

<http://dbpedia.org/resource/Napoleonic_Wars> 
        rdfs:label "Napoleonic Wars"@en ;
        a <http://dbpedia.org/ontology/Event> ,
          <http://dbpedia.org/ontology/MilitaryConflict> ,
          <http://umbel.org/umbel/rc/Event> ,
          <http://umbel.org/umbel/rc/ConflictEvent> ;
        foaf:page 
          <http://en.wikipedia.org/wiki/Napoleonic_Wars> .

i.e. that thing, the topic of the page, is an event, a military conflict etc.

We could substitute this new DBpedia URI for the blank nodes in our foaf:topic_interest data above:

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix person: <http://example.org/id/person/> .

person:fred 
        foaf:topic_interest 
          <http://dbpedia.org/resource/Napoleonic_Wars> .

<http://dbpedia.org/resource/Napoleonic_Wars>
        rdfs:label "Napoleonic Wars"@en ;
        foaf:isPrimaryTopicOf 
          <http://en.wikipedia.org/wiki/Napoleonic_Wars> .

and in Freda's graph:

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix person: <http://example.org/id/person/> .

person:freda 
        foaf:topic_interest
          <http://dbpedia.org/resource/Napoleonic_Wars> .

<http://dbpedia.org/resource/Napoleonic_Wars>
        rdfs:label "Napoleonic Wars"@en ;
        foaf:isPrimaryTopicOf 
          <http://en.wikipedia.org/wiki/Napoleonic_Wars> .

When those two graphs are merged, the use of the common URI now makes it trivial to determine that Fred and Freda share the same interest.
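Collapsing the duplicated triples about the DBpedia resource, the merged graph reduces to:

@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix person: <http://example.org/id/person/> .

# The shared interest is now visible directly, with no inference required
person:fred  foaf:topic_interest <http://dbpedia.org/resource/Napoleonic_Wars> .
person:freda foaf:topic_interest <http://dbpedia.org/resource/Napoleonic_Wars> .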

Concept Schemes, SKOS and Document Metadata

The other factor Dan mentions in his message was the emergence of the Simple Knowledge Organisation System (SKOS) RDF vocabulary, which after a long evolution became a W3C Recommendation in 2009.

SKOS is designed to provide an RDF representation of the various flavours of "knowledge organisation systems" and "controlled vocabularies" which information managers have traditionally used to organise information about various resources (books in libraries, objects in museums etc etc etc).

The core class in SKOS is that of the concept (skos:Concept). Each concept can be labelled with one or more names; documented with notes of various types; grouped into collections; related to other concepts through relationships such as "broader"/"narrower"/"related"; or mapped to concepts in other concept schemes.

The Library of Congress has published several library thesauri/classification schemes/controlled vocabularies as SKOS RDF data, including the Library of Congress Subject Headings, which includes a concept named "Napoleonic Wars, 1800-1815" with the URI http://id.loc.gov/authorities/subjects/sh85089767 (this is a subset of the actual data provided):

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix lcsh: <http://id.loc.gov/authorities/subjects/> .

lcsh:sh85089767
        a skos:Concept ;
        rdfs:label "Napoleonic Wars, 1800-1815"@en ;
        skos:prefLabel "Napoleonic Wars, 1800-1815"@en ;
        skos:altLabel "Napoleonic Wars, 1800-1814"@en ;
        skos:broader lcsh:sh85045703 ;
        skos:narrower lcsh:sh85144863 ;
        skos:inScheme 
          <http://id.loc.gov/authorities/subjects> .

A metadata creator coming from a bibliographic background and providing Dublin Core-based metadata for the Wikipedia page http://en.wikipedia.org/wiki/Napoleonic_Wars might well use this concept URI to provide the "subject" of that page:

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix lcsh: <http://id.loc.gov/authorities/subjects/> .

<http://en.wikipedia.org/wiki/Napoleonic_Wars>
        a foaf:Document ;
        rdfs:label "Napoleonic Wars"@en ;
        dcterms:subject lcsh:sh85089767 .

And indeed the concept URI could (I think) also be used with the foaf:topic or foaf:primaryTopic properties:

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix lcsh: <http://id.loc.gov/authorities/subjects/> .

<http://en.wikipedia.org/wiki/Napoleonic_Wars>
        a foaf:Document ;
        rdfs:label "Napoleonic Wars"@en ;
        foaf:topic lcsh:sh85089767 ;
        foaf:primaryTopic lcsh:sh85089767 .

Note that all three of these properties (dcterms:subject, foaf:topic, and foaf:primaryTopic) are defined in such a way that they are not limited to being used with concepts. We've seen this above where the DBpedia data includes:

@prefix foaf: <http://xmlns.com/foaf/0.1/> .

<http://dbpedia.org/resource/Napoleonic_Wars> 
        foaf:page 
          <http://en.wikipedia.org/wiki/Napoleonic_Wars> .

which (since foaf:page and foaf:topic are inverse properties) implies:

@prefix foaf: <http://xmlns.com/foaf/0.1/> .

<http://en.wikipedia.org/wiki/Napoleonic_Wars> 
        foaf:topic 
          <http://dbpedia.org/resource/Napoleonic_Wars> .

The same is true of the dcterms:subject property. Although traditionally the Dublin Core community has highlighted the use of formal classification schemes like LCSH, and it may well be true that the dcterms:subject property is often used to link things to concepts, it is not limited to taking concepts as values, and one could also link the document to the event using dcterms:subject:

@prefix dcterms: <http://purl.org/dc/terms/> .

<http://en.wikipedia.org/wiki/Napoleonic_Wars> 
        dcterms:subject 
          <http://dbpedia.org/resource/Napoleonic_Wars> .

I should acknowledge here that some in the Dublin Core community might disagree with my last example above, and argue that the values of dcterms:subject should be concepts. I think my position is backed up by the current DCMI documentation, and particularly by the fact that when ranges were assigned to the DCMI Terms properties in 2008, the DCMI Usage Board did not specify a range for the dcterms:subject property - i.e. the intention is that dcterms:subject may link to a resource of any type.

I also note in passing that when DCMI created the DCMI Abstract Model in its attempt to reflect the "classical view" of Dublin Core (perhaps best expressed in Tom Baker's "A Grammar of Dublin Core") in an RDF-based model, the notion of a "Vocabulary Encoding Scheme" was defined as a set of things of any type, not specifically as a set of concepts.

Things and their Conceptualisations: foaf:focus

The SKOS approach and the SKOS Concept class introduce a new sort of "indirection" from our "things-in-the-world". As Dan puts it in a message to the W3C public-esw-thes list:

a SKOS "butterflies" concept is a social and technological artifact designed to help interconnect descriptions of butterflies, documents (and data) about butterflies, and people with interest or expertise relating to butterflies. I'm quite consciously avoiding saying what a "butterflies" concept in SKOS "refers to", because theories of reference are hard to choose between. Instead, I prefer to talk about why we bother building SKOS and what we hope can be achieved by it.

So, although both the DBpedia URI http://dbpedia.org/resource/Napoleonic_Wars and the Library of Congress URI http://id.loc.gov/authorities/subjects/sh85089767 may be used in the triple patterns shown above, those two URIs identify two different resources - and both of them are distinct from the Wikipedia page which we cited in the examples back at the very start of this post.

i.e. we now have three separate URIs identifying three separate resources:

  1. a Wikipedia page, a document, created and modified by Wikipedia contributors between 2002 and the present, identified by the URI http://en.wikipedia.org/wiki/Napoleonic_Wars
  2. the Napoleonic Wars as an event taking place between 1800 and 1815, something with a duration in time, which occurred in physical locations, and in which human beings participated, identified by the DBpedia URI http://dbpedia.org/resource/Napoleonic_Wars
  3. a "conceptualisation of" the Napoleonic Wars, "a social and technological artifact designed to help interconnect", an "abstraction" created by the editors of LCSH for the purposes of classifying works; it has "semantic" relationships to other concepts, and is identified by the Library of Congress URI http://id.loc.gov/authorities/subjects/sh85089767

As we've seen, properties like dcterms:subject, foaf:topic/foaf:page and foaf:primaryTopic/foaf:isPrimaryTopicOf provide the vocabulary to express the relationships between the first and second of these resources, and between the first and third. But what about the relationship between the second and third, between the "thing in the world" and its conceptualisation in a classification scheme? Or to make the issue concrete, what happens if, in their "interest" graphs, Fred cites the DBpedia event URI and Frida cites the LCSH concept URI? How do we establish that their interests are indeed related? Can a publisher of a SKOS concept scheme indicate a relationship between a concept and a "conceptualised thing" (person, place, event etc)?
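To make the Fred/Frida scenario concrete, their two "interest" graphs might look something like the following sketch (the people URIs are invented for illustration, and the choice of foaf:topic_interest is just one plausible option):

```turtle
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

# Fred cites the DBpedia URI for the event itself
<http://example.org/people/fred>
        foaf:topic_interest
          <http://dbpedia.org/resource/Napoleonic_Wars> .

# Frida cites the LCSH concept URI
<http://example.org/people/frida>
        foaf:topic_interest
          <http://id.loc.gov/authorities/subjects/sh85089767> .
```

Since the two object URIs identify different resources, a naive query for people with a shared interest will not bring Fred and Frida together unless some link between the event and the concept is also available.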

Dan provides a rather neat diagram illustrating the issue, using the example of Ronald Reagan. The arcs Dan labels "it" represent this "missing" (at the time the diagram was drawn) relationship type/property.

The resolution was to create a new property in the FOAF vocabulary, called foaf:focus. (See this page on the FOAF Project Wiki for some of the discussion of its name.)

The FOAF vocabulary specification says of the property:

The focus property relates a conceptualisation of something to the thing itself. Specifically, it is designed for use with W3C's SKOS vocabulary, to help indicate specific individual things (typically people, places, artifacts) that are mentioned in different SKOS schemes (eg. thesauri).

W3C SKOS is based around collections of linked 'concepts', which indicate topics, subject areas and categories. In SKOS, properties of a skos:Concept are properties of the conceptualization (see 2005 discussion for details); for example administrative and record-keeping metadata. Two schemes might have an entry for the same individual; the foaf:focus property can be used to indicate the thing in the world that they both focus on. Many SKOS concepts don't work this way; broad topical areas and subject categories don't typically correspond to some particular entity. However, in cases when they do, it is useful to link both subject-oriented and thing-oriented information via foaf:focus.

It's worth emphasising the point made in the penultimate sentence: not all concepts "have a focus"; some concepts are "just concepts" (poetry, slavery, conscientious objection, anarchism etc etc etc).
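Applying foaf:focus to the Napoleonic Wars example from earlier, a minimal sketch of the pattern would look like this (I've omitted the labels and other properties the concept would carry in practice):

```turtle
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

# the LCSH conceptualisation, linked to the event itself
<http://id.loc.gov/authorities/subjects/sh85089767>
        a skos:Concept ;
        foaf:focus
          <http://dbpedia.org/resource/Napoleonic_Wars> .
```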

Dan summarises how he sees the new property being used in a message to the W3C public-esw-thes list:

The addition of foaf:focus is intended as a modest and pragmatic bridge between SKOS-based descriptions of topics, and other more entity-centric RDF descriptions. When a SKOS Concept stands for a person or agent, FOAF and its extensions are directly applicable; however we expect foaf:focus to also be used with places, events and other identifiable entities that are covered both by SKOS vocabularies as well as by factual datasets like wikipedia/dbpedia and Freebase.

A single "thing in the world" may be "the focus of" multiple concepts: e.g. several different library classification schemes may include concepts for the Napoleonic Wars or Ronald Reagan or Paris. Even within a single scheme, it may be that there are multiple concepts each reflecting different facets or aspects of a single entity.

VIAF

This aspect of the relationship between conceptualisation and "thing in the world" is illustrated in VIAF, the Virtual International Authority File, a service provided by OCLC. VIAF aggregates library "authority records" from multiple library "name authority files" maintained mainly by national libraries. Each record provides a "preferred form" of the name of a person or corporate entity, and multiple "alternate forms" - though that preferred form may vary from one file to the next. VIAF analyses and collates the aggregated data to establish which records refer to the same person or corporate entity, and presents the results as Linked Data.

Jeff Young of OCLC summarises the VIAF model in a post on the Outgoing blog. The post actually describes the transition between an earlier, slightly more complex model and the current model. For the purposes of this discussion, the thing to look at is the "Example (After)" graphic at the top right of the post (direct link).

Consider Dan's example of Ronald Reagan, identified by the VIAF URI http://viaf.org/viaf/76321889. The RDF description provided shows that there are eleven concepts linked to the person resource by a foaf:focus link. Each of those concepts (I think!) corresponds to a record in an authority file harvested by VIAF. Each concept has a preferred label (the preferred form of the name in that authority file) and may have a number of alternate labels. An abridged version of the VIAF description is below:

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> . 

<http://viaf.org/viaf/76321889> a foaf:Person ;
        foaf:name "Reagan, Ronald" ,
                  "Reagan, Ronald W." , 
                  "Reagan, Ronald, 1911-2004" , 
                  "Reagan, Ronald W. 1911-2004" , 
                  "Reagan, Ronald Wilson 1911-2004" , 
                  "Reagan, Elvis 1911-2004" ;
# plus various other names!
        owl:sameAs 
          <http://d-nb.info/gnd/118598724> ,
          <http://dbpedia.org/resource/Ronald_Reagan> , 
          <http://libris.kb.se/resource/auth/237204> , 
          <http://www.idref.fr/027091775/id> .

<http://viaf.org/viaf/sourceID/BNE%7CXX1025345#skos:Concept>
        a skos:Concept ;
        skos:prefLabel "Reagan, Ronald, 1911-2004" ;
        skos:altLabel "Reagan, Elvis 1911-2004" , 
                      "Reagan, Ronald W. 1911-2004", 
                      "Reagan, Ronald Wilson 1911-2004" ;
        skos:inScheme 
          <http://viaf.org/authorityScheme/BNE> ;
        foaf:focus 
          <http://viaf.org/viaf/76321889> .

<http://viaf.org/viaf/sourceID/BNF%7C11921304#skos:Concept>
        a skos:Concept ;
        skos:prefLabel "Reagan, Ronald, 1911-2004" ;
        skos:altLabel "Reagan, Ronald Wilson 1911-2004" ;
        skos:inScheme 
          <http://viaf.org/authorityScheme/BNF> ;
        foaf:focus 
          <http://viaf.org/viaf/76321889> .

# plus nine other concepts

(There really is an alternate label of "Reagan, Elvis 1911-2004" in the actual data!)

[Figure 1: graph representation of the VIAF example]

Note that the owl:sameAs links here are between the VIAF person resource and person resources in external datasets.

LOCAH and index terms

My own first engagement with foaf:focus came during the LOCAH project. In deciding how to represent the content of the Archives Hub EAD documents as RDF, we had to decide how to model the use of "index terms" provided using the EAD <controlaccess> element. Within the Hub EAD documents, those index terms are names of one of the following categories of resource:

  • Concepts
  • Persons
  • Families
  • Organisations
  • Places
  • Genres or Forms
  • Functions

The names are sometimes (but not always) drawn from some sort of "controlled list", which is also named in the data. In other cases, they are constructed using some specified set of rules, again named in the data.

For some of these categories (Concepts, Genres/Forms, Functions), the "thing" named is simply a concept, an abstraction; for others (Persons, Families, Organisations and Places), there is a second "non-abstract" entity "out there in the world". And for this second case, we chose to represent the two distinct things, each with their own distinct URI, linked by a triple using the foaf:focus property.
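So for, say, a person named in a <controlaccess> element, the LOCAH pattern looks something like the following sketch (the URIs and labels here are invented for illustration; they are not actual Archives Hub URIs):

```turtle
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

# the concept derived from the index term entry
<http://example.org/id/concept/person/smithjohn>
        a skos:Concept ;
        skos:prefLabel "Smith, John, 1800-1870" ;
        foaf:focus
          <http://example.org/id/person/smithjohn> .

# the person "out there in the world"
<http://example.org/id/person/smithjohn>
        a foaf:Person ;
        foaf:name "John Smith" .
```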

The LOCAH data model is illustrated in the diagram in this post. The Concept entity type is in the lower centre; directly below are four boxes for the related "conceptualised" types (Person, Family, Organisation and Place), each linked from the Concept by the foaf:focus property.

As in the VIAF case, for a single person/family/organisation/place, there may be multiple distinct "conceptualisations", reflecting the fact that different data providers have referred to the same "thing in the world" by citing entries from different "authority files". The nature of the process by which the LOCAH RDF data is generated - the EAD documents are processed on a "document by document" basis - means that in this case, multiple URIs for the person are generated, and the "reconciliation" of these URIs as co-references to a single entity is performed as a subsequent step.

The British Library Linked Data

The British Library recently announced the release of Linked Open BNB, a new Linked Data dataset covering a subset of the British National Bibliography. The approach taken is described in a post by Richard Wallis of Talis, the consultants who worked with the BL in preparing the dataset.

The data model for the BNB data shows quite extensive use of the Concept-foaf:focus-Thing pattern.

For the "subjects" of Bibliographic Resources, a dcterms:subject link is made to a skos:Concept, reflecting an authority file or classification scheme entry, which is in turn linked using foaf:focus to a Person, Family, Organisation or Place. In other cases, the Bibliographic Resource is linked directly to the "thing-in-the-world", and a corresponding Concept is also provided, linked to the "thing-in-the-world" using foaf:focus. This is the case for languages, for persons as creators of or contributors to the bibliographic resource, and for the Dublin Core "spatial coverage" property.

So for the example http://bnb.data.bl.uk/id/resource/009436036 (again, this is a very stripped-down version of the actual data):

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix wgs84_pos: <http://www.w3.org/2003/01/geo/wgs84_pos#> .

<http://bnb.data.bl.uk/id/resource/009436036>
        a dcterms:BibliographicResource ;
        dcterms:creator 
          <http://bnb.data.bl.uk/id/person/KingGRD%28GeoffreyRD%29> ; 
        dcterms:subject
          <http://bnb.data.bl.uk/id/concept/place/lcsh/Aden%28Yemen%29> ;
        dcterms:spatial
          <http://bnb.data.bl.uk/id/place/Aden%28Yemen%29> ;
        dcterms:language 
          <http://lexvo.org/id/iso639-3/eng> .

<http://bnb.data.bl.uk/id/concept/place/lcsh/Aden%28Yemen%29> 
        a skos:Concept ;
        foaf:focus 
          <http://bnb.data.bl.uk/id/place/Aden%28Yemen%29> .

<http://bnb.data.bl.uk/id/place/Aden%28Yemen%29>        
        a dcterms:Location, wgs84_pos:SpatialThing .

In this case, the subject-concept and the spatial coverage-place happen to be linked to each other, but the point I wanted to illustrate was that the object of the dcterms:subject triple is the URI of a concept, and is the subject of an "outgoing" foaf:focus link to a "thing", but the object of the dcterms:spatial triple is the URI of a location/place, and is the object of an "incoming" foaf:focus link from a concept.

The two cases are perhaps best illustrated using the graph representation:

[Figure 2: graph representation of the two BNB patterns]

Authority Files, Concepts, foaf:focus and Dublin Core

As discussed above, the dcterms:subject property is defined in such a way that it can be used to link to a thing of any type, although there may be a preference amongst some implementers to use dcterms:subject to link only to concepts.

For the other four properties I highlighted in the BL model, DCMI specifies an rdfs:range for the properties:

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dcterms: <http://purl.org/dc/terms/> .

dcterms:creator rdfs:range dcterms:Agent .
dcterms:contributor rdfs:range dcterms:Agent .
dcterms:spatial rdfs:range dcterms:Location .
dcterms:language rdfs:range dcterms:LinguisticSystem .

Those three classes (dcterms:Agent, dcterms:LinguisticSystem, dcterms:Location) are described using RDFS, and although there is no formal statement that they are disjoint from the class skos:Concept, I think their human-readable descriptions, taken together with those of the properties, carry a fairly strong suggestion that instances of these classes are the "things in the world" rather than their conceptualisations - certainly for the agent and location cases at least. A dcterms:Agent is "A resource that acts or has the power to act", which a concept cannot; a dcterms:Location is "A spatial region or named place", which again seems distinct from a concept.

The case of dcterms:LinguisticSystem, "A system of signs, symbols, sounds, gestures, or rules used in communication", seems a bit less clear, as this is itself a "conceptual thing", but I think one can argue that the actual linguistic system used by a community of speakers is distinct from the "conceptualisation" of it created within a classification scheme.

And this is, I think, reflected in the patterns used in the BL data model.

As I noted earlier, the Library of Congress has published SKOS representations of a number of controlled vocabularies, including the ISO 639-1 and ISO 639-2 language code lists and the MARC lists of languages, countries and geographic areas.

In each case, the entries/members of the vocabularies are modelled as instances of skos:Concept. Following the argument I just constructed above, then, to use these vocabularies with the dcterms:spatial and dcterms:language properties, strictly speaking, one should adopt the patterns used in the BL model, where the concept URI is not the direct object, but is linked to a thing (location, linguistic system) URI by a foaf:focus link.
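Under that reading, the "strict" pattern for using, say, the MARC countries vocabulary with dcterms:spatial would look something like this (the document and place URIs are invented for illustration):

```turtle
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

# the document is linked directly to the place...
<http://example.org/doc/1234>
        a dcterms:BibliographicResource ;
        dcterms:spatial
          <http://example.org/place/spain> .

# ...and the Library of Congress concept is linked
# to that same place by foaf:focus
<http://id.loc.gov/vocabulary/countries/sp>
        a skos:Concept ;
        foaf:focus
          <http://example.org/place/spain> .

<http://example.org/place/spain>
        a dcterms:Location .
```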

Finally, I'd draw attention to lexvo.org, which also provides Linked Data representations for ISO639-3 languages and ISO 3166-1 / UN M.49 geographical regions. In contrast to the Library of Congress SKOS representations, lexvo.org models "the things themselves", the languages and geographical regions, e.g. (again, a subset of the actual data)

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix lvont: <http://lexvo.org/ontology#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .

<http://lexvo.org/id/iso639-3/spa>
        a lvont:Language ;
        rdfs:label "Spanish"@en ;
        lvont:usedIn 
          <http://lexvo.org/id/iso3166/ES> ;
        owl:sameAs 
          <http://dbpedia.org/resource/Spanish_language> .
        
<http://lexvo.org/id/iso3166/ES>
        a lvont:GeographicRegion ;
        rdfs:label "Spain"@en ;
        lvont:memberOf 
          <http://lexvo.org/id/un_m49/039> ;
        owl:sameAs 
          <http://sws.geonames.org/2510769> .
        
<http://lexvo.org/id/un_m49/039>
        a lvont:GeographicRegion ;
        rdfs:label "Southern Europe"@en ;
        lvont:hasMember 
          <http://lexvo.org/id/iso3166/ES> ;
        lvont:memberOf 
          <http://lexvo.org/id/un_m49/150> .
 
<http://lexvo.org/id/un_m49/150>
        a lvont:GeographicRegion ;
        rdfs:label "Europe"@en ;
        lvont:hasMember 
          <http://lexvo.org/id/un_m49/039> ;
        lvont:memberOf 
          <http://lexvo.org/id/un_m49/001> .

In the BL data one finds lexvo.org language URIs used as objects of dcterms:language (as we do in the LOCAH data too).

The lexvo.org language URIs are the subjects of properties such as lvont:usedIn, linking a language to a place where it is used or spoken, and of owl:sameAs triples linking to language URIs in other datasets. And the geographic region URIs are the subjects of properties such as lvont:memberOf, linking one region to another region of which it is part, and of owl:sameAs triples linking to place URIs in other datasets.

Compare this with the relationships between the SKOS-based "conceptualisations of languages" and "conceptualisations of geographic regions" in the Library of Congress dataset (again, a subset of the actual data):

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .

<http://id.loc.gov/vocabulary/iso639-2/spa>
        a skos:Concept ;
        skos:prefLabel "Spanish | Castilian"@en ;
        skos:altLabel "Spanish"@en ,
                      "Castilian"@en ;
        skos:note "Bibliographic Code"@en ;
        skos:exactMatch 
          <http://id.loc.gov/vocabulary/languages/spa> ,
          <http://id.loc.gov/vocabulary/iso639-1/es> ;
        skos:inScheme 
          <http://id.loc.gov/vocabulary/iso639-2> .
                        
<http://id.loc.gov/vocabulary/countries/sp>
        a skos:Concept ;
        skos:prefLabel "Spain"@en ;
        skos:altLabel "Balearic Islands"@en ,
                      "Canary Islands"@en ;
        skos:exactMatch 
          <http://id.loc.gov/vocabulary/geographicAreas/e-sp> ;
        skos:broadMatch 
          <http://id.loc.gov/vocabulary/geographicAreas/e> ;
        skos:inScheme 
          <http://id.loc.gov/vocabulary/countries> .
        
<http://id.loc.gov/vocabulary/geographicAreas/e-sp>
        a skos:Concept ;        
        skos:prefLabel "Spain"@en ;
        skos:exactMatch 
          <http://id.loc.gov/vocabulary/countries/sp> ;
        skos:broader 
          <http://id.loc.gov/vocabulary/geographicAreas/e> ;
        skos:inScheme 
          <http://id.loc.gov/vocabulary/geographicAreas> .

<http://id.loc.gov/vocabulary/geographicAreas/e>
        a skos:Concept ;        
        skos:prefLabel "Europe"@en ;
        skos:narrower 
          <http://id.loc.gov/vocabulary/geographicAreas/e-sp> ;
        skos:narrowMatch 
          <http://id.loc.gov/vocabulary/countries/sp> ;
        skos:inScheme 
          <http://id.loc.gov/vocabulary/geographicAreas> .
        

Here the types of relationships involved are the SKOS "semantic relations" and "mapping properties" between concepts: e.g. Spain-as-concept has-broader-concept Europe-as-concept, and so on.

Conclusions

If anyone has read this far, I can imagine it is not without some rolling of eyes at the pedantic distinctions I seem to be unpicking!

Given my understanding of SKOS, FOAF and Dublin Core, I do think the designers of the BL data have done a sterling job in trying to "get it right", carefully observing the way terms have been described by their owners and seeking to use those terms in ways that are consistent with those descriptions.

At the same time, I admit I can well imagine that to many Dublin Core implementers who see the Dublin Core properties as providing a relatively simple approach, this will seem over-complicated.

And I rather expect that we will see uses of the dcterms:language and dcterms:spatial properties which simply link directly to the concept:

@prefix dcterms: <http://purl.org/dc/terms/> .

<http://example.org/doc/1234>
        a dcterms:BibliographicResource ;
        dcterms:spatial 
          <http://id.loc.gov/vocabulary/countries/sp> ;
        dcterms:language 
          <http://id.loc.gov/vocabulary/iso639-2/spa> .

I think this is particularly likely for the case of dcterms:spatial as it is perceived as "similar" to dcterms:subject - amongst other things, it covers "aboutness" in the case where the topic is a place - particularly since, as in the BL example above, the concept URI may be used with dcterms:subject in the very same graph.

Returning to the recent post by Dan which in part prompted me to start this post, he makes a similar point, focusing on the FOAF properties with which I introduced the post. He recognises that "in some contexts having careful names for all these inter-relations is very important" but suggests

We should consider making foaf:interest vaguer, such that any of those three are acceptable. Publishers aren't going to go looking up obscure distinctions between foaf:interest and foaf:topic_interest and that other pattern using SKOS ... they just want to mark that this is a relationship between someone and a URI characterising one of their interests.

So I suggest we become more tolerant about the URIs acceptable as values of foaf:interest

(See the message in full for Dan's examples.)

Part of me is tempted to suggest that similar reasoning might be applied to the Dublin Core Terms properties, i.e. that consideration might be given to "relaxing" the range of dcterms:spatial to allow for the use of either the place or the concept, to resolve the dilemma that, with the current design, the patterns are different for dcterms:subject and dcterms:spatial. But where do we stop? Do we also relax dcterms:language? I really don't know. And I think I'd be quite uneasy making the suggestion for the "agent" properties.

The fundamental issue here is that the thesaurus-based approach introduces a layer of abstraction - Dan's "social and technological artifact[s] designed to help interconnect descriptions" - and that is reflected in SKOS's notion of the skos:Concept. Outside the "knowledge organisation" community, however, many RDFS/OWL modellers "model the world directly": they consider the language as system used by a community of speakers (as modelled by lexvo.org), the place as thing in space (as modelled by Geonames), and so on. Constructs such as foaf:focus help us bridge the two approaches - but not without some complexity.

On Twitter on Friday evening, I noticed a (slightly provocative!) comment from John Goodwin (@gothwin) which I think is related:

coming to the conclusion that #SKOS is being waaay over used #linkeddata

It attracted some sympathetic responses, e.g. from Rob Styles (@mmmmmmrob) 1, 2:

@gothwin @juansequeda if you're not publishing an existing subject heading scheme on the cheap then SKOS is the wrong tool.

@johnlsheridan @juansequeda @gothwin Paris is not "narrower than" France, it's "capital of". That's the problem.

which I think echoes my examples above of Spain and Europe.

And from Leigh Dodds (@ldodds): 1, 2:

@gothwin that's all it was designed for: converting one specific class of datasets into RDF. It's just been mistaken for a modelling tool

@juansequeda @gothwin @kendall only use #skos if you're CONVERTING a taxonomy. Otherwise, just model the domain

These comments struck home with me, and I think I may have made exactly this sort of mistake in some other work I've been doing recently, and I need to revisit it, to check whether what I've done is really necessary or useful. As John suggests, I think sometimes I reach for SKOS as a tool for something that has a "controlled list" feel to it without thinking hard enough whether it is really the appropriate tool for the task at hand.

Having said that, I do also think we will need to manage the "mixed" cases - particularly in datasets coming from communities such as libraries where the use of thesauri is commonplace, and a growing number are available as SKOS - where we end up needing the sort of bridge which foaf:focus provides, so some of this complexity may be unavoidable.

In any case, I think it's an area where some guidance/best practice notes - with lots of examples! - would be very helpful.

August 15, 2011

Two ends and one start

The end of July formally marked the end of two projects I've been contributing to recently: LOCAH and SALDA.

There are still some things to finish for LOCAH, particularly the publication of the Copac data.

A few (not terribly original or profound) thoughts, prompted mainly by my experience of working with the Archives Hub data:

  • Sometimes data is "clean" and consistent and "normalised", but more often than not it has content in inconsistent forms or is incomplete: it is "messy". Data aggregated from multiple sources over a long period of time is probably going to be messier. (With an XML format like EAD, with what I think of as its somewhat "hybrid" document/data character, the potential for messiness increases).
  • Doing things with data requires some sort of processing by software, and while there are off-the-shelf apps and various configurable tools and frameworks that can provide some functions, some element of software development is usually required.
  • It may be relatively easy to identify in advance the major tasks where developer effort is required, and to plan for that, but sometimes there are additional tasks which it's more difficult to anticipate; rather they emerge as you attempt some of the other processes (and I think messy data probably makes that more likely).
  • Even with developers on board, development work has to be specified, and that is a task in itself, and one that can be quite complex and time-consuming (all the more so if you find yourself trying to think through and describe what to do with a range of different inputs from a messy data source).

It's worth emphasising that most of the above is not specific to generating Linked Data: it applies to data in all sorts of formats, and it applies to all sorts of tasks, whether you're building a plain old Web site or exposing RSS feeds or creating some other Web application to do something with the data.

Sort of related to all of the above, I was reminded that my limited programming skills often leave me in the position where I can identify what needs doing but I'm not able to "make it happen", and that is something I'd like to try to change. I can "get by" in XSLT, and I can do a bit of PHP, but I'd like to get to the point where I can do more of the simpler data manipulation tasks and/or pull together simple presentation prototypes.

I've enjoyed working on both the projects. I'm pleased that we've managed to deliver some Linked Data datasets, and I'm particularly pleased to have been able to contribute to doing this for archival description data, as it's a field in which I worked quite a long time ago.

Both projects gave me opportunities to learn by actually doing stuff, rather than by reading about it or prodding at other people's data. And perhaps more than anything, I've enjoyed the experience of working together as a group. We had a lot of discussion and exchange of ideas, and I'd like to think that this process was valuable in itself. In an early post on the SALDA blog, Karen Watson noted:

it is perhaps not the end result that is most important but the journey to get there. What we hope to achieve with SALDA is skills and knowledge to make our catalogues Linked Data and use those skills and that knowledge to inform decisions about whether it would be beneficial to make all our data Linked Data.

Of course it wasn't all plain sailing, and particularly near the start there were probably times when we ran up against differences of perception and terminology. Jane Stevenson has written about some of these issues from her point of view on the LOCAH blog (e.g. here, here and here). As the projects progressed, I think we moved closer towards more of a shared understanding - and I think that is a valuable "output" in its own right, even if it is one which it may be rather hard to "measure".

So, a big thank you to everyone I worked with on both projects.

Looking forward, I'm very pleased to be able to say that Jane prepared a proposal for some further work with the Archives Hub data, under JISC's Capital Funded Service Enhancements initiative, and that the bid has been successful; I'll be contributing some work to that project as a consultant. The project is called "Linking Lives" and is focused on providing interfaces for researchers to explore the data (as well as making any enhancements/extensions to the data and the "back-end" processes required to enable that). More on that work to come once we get moving.

Finally, as I'm sure many of you are aware, JISC recently issued some new calls for projects, including call 13/11 for projects "to develop services that aim to make content from museums, libraries and archives more visible and to develop innovative ways to reuse those resources". If there are any institutions out there who are considering proposals and think that I could make a useful contribution (despite my lamenting my limitations above, I hope my posts on the LOCAH and SALDA blogs give an idea of the sorts of areas I can contribute to!), please do get in touch.

May 11, 2011

LOCAH releases Linked Archives Hub dataset

The LOCAH project, one of the two JISC-funded projects to which I've been contributing, this week announced the availability of an initial batch of data derived from a small subset of the Archives Hub EAD data as linked data. The homepage for what we have called the "Linked Archives Hub" dataset is http://data.archiveshub.ac.uk/

The project manager, Adrian Stevenson of UKOLN, provides an overview of the dataset, and yesterday I published a post which provides a few example SPARQL queries.

I'm very pleased we've got this data "out there": it feels as if we've been at the point of it being "nearly ready" for a few weeks now, but a combination of ironing out various technical wrinkles (I really must remember to look at pages in Internet Explorer now and again) and short working weeks/public holidays held things up a little. It is very much a "first draft": we have additional work planned on making more links with other resources, and there are various things which could be improved (and it seems to be one of those universal laws that as soon as you open something up, you spot more glitches...). But I hope it's enough to demonstrate the approach we are taking to the data, and to provide a small testbed that people can poke at and experiment with.

(If you have any specific comments on content of the LOCAH dataset, it's probably better to post them over on the LOCAH blog where other members of the project team can see and respond to them).

March 28, 2011

Waiter, my resource discovery glass is half-empty

Old joke...

A snail goes into a pub and says to the barman, "I've just been mugged by two tortoises". The barman looks a bit shocked and says, "Oh no, that's terrible. What happened?".

The snail responds, "I dunno, it all happened so fast".

Sorry!

I had a bit of a glass half-empty moment last week, listening to the two presentations in the afternoon session of the ESRC Resource Discovery workshop, the first by Joy Palmer about the MIMAS-led Resource Discovery Taskforce Management Framework and the second by Lucy Bell about the UKDA resource discovery project. Not that there was anything wrong with either presentation. But it struck me that they both used phrases that felt very familiar in the context of resource discovery in the cultural heritage and education space over the last 10 years or so (probably longer) - "content locked in sectoral silos", "the need to work across multiple websites, each with their own search idiosyncrasies", "the need to learn and understand multiple vocabularies", and so on.

In a moment of panic I said words to the effect of, "We're all doomed. Nothing has changed in the last 10 years. We're going round in circles here". Clearly rubbish... and, looking at the two presentations now, it's not clear why I reached that particular conclusion anyway. I asked the room why this time round would be different, compared with previous work on initiatives like the UK JISC Information Environment, and got various responses about community readiness, political will, better stakeholder engagement and what not. I mean, for sure, lots of things have changed in the last 10 years - I'm not even sure the alphabet contained the three letters A, P and I back then, and the whole environment is clearly very different - but it is also true that some aspects of the fundamental problem remain largely unchanged. Yes, there are a lot more cultural heritage, scientific and educational resources out there (being made available from within those sectors), but it's not always clear the extent to which that stuff is better joined up, open and discoverable than it was at the turn of the century.

There is a glass half-full view of the resource discovery world, and I try to hold onto it most of the time, but it certainly helps to drink from the Google water fountain! Hence the need for initiatives like the UK Resource Discovery Task Force to emphasise the 'build better websites' approach. We're talking about cultural change here, and cultural change takes time. Or rather, the perceived rate of cultural change tends to be relative to the beholder.

March 25, 2011

RDTF metadata guidelines - next steps

A few weeks ago I blogged about the work that Pete and I have been doing on metadata guidelines as part of the JISC/RLUK Resource Discovery Task Force, RDTF metadata guidelines - Limp Data or Linked Data?.

In discussion with the JISC we have agreed to complete our current work in this area by:

  • delivering a single summary document of the consultation process around the current draft guidelines, incorporating the original document and all the comments made using the JISCpress site during the comment period; and
  • suggesting some recommendations about any resulting changes that we would like to see made to the draft guidelines.

For the former, a summary view of the consultation is now available. It's not 100% perfect (because the links between the comments and individual paragraphs are not retained in the summary) but I think it is good enough to offer a useful overview of the draft and the comments in a single piece of text. Furthermore, the production of this summary was automated (by post-processing the export 'dump' from Wordpress), so the good news is that a similar view can be obtained for any future (or indeed past) JISCpress consultations.

For the latter, this blog post forms our recommendations.

As noted previously, there were 196 comments during the comment period (which is not bad!), many of which were quite detailed in terms of particular data models, formats and so on. On the basis that we do not know quite what form any guidelines might take from here on (that is now the responsibility of the RDTF team at MIMAS I think), it doesn't seem sensible to dig into the details too much. Rather, we will make some comments on the overall shape of the document and suggest some areas where we think it might be useful for JISC and RLUK to undertake additional work.

You may recall that our original draft proposed three approaches to exposing metadata, which we referred to as the community formats approach, the RDF data approach and the Linked Data approach. In light of comments (particularly those from Owen Stephens and Paul Walk) we have been putting some thought into how the shape of the whole document might be better conceptualised. The result is the following four-quadrant model:

[Figure: the four-quadrant model]
Like any simple conceptualisation, there is some fuzziness in this but we think it's a useful way of thinking about the space.

Traditionally (in the world of libraries, museums and archives at least), most sharing of metadata has happened in the bottom-left quadrant - exchanging bulk files of MARC records for example. And, to an extent, this approach continues now, even outside those sectors. Look at the large amount of 'open data' that is shared as CSV files on sites like data.gov.uk for example. Broadly speaking, this is what we referred to as the community formats approach (though I think our inclusion of the OAI-PMH in that area probably muddied the waters a little - see below).

One can argue that moving left to right across the quadrants offers semantically richer metadata in a 'small pieces, loosely joined' kind of way (though this quickly becomes a religious argument with no obvious point of closure! :-) ) and that moving bottom to top offers the ability to work with individual item descriptions rather than whole collections of them - and that, in particular, it allows for the assignment of 'http' URIs to those descriptions and the dereferencing of those URIs to serve them.

Our three approaches covered the bottom-left, bottom-right and top-right quadrants. The web, at least in the sense of serving HTML pages about things of interest in libraries, museums and archives, sits in the top-left quadrant (though any trend towards embedded RDFa in HTML pages moves us firmly towards the top-right).

Interestingly, especially in light of the RDTF mantra to "build better websites", our guidelines managed to miss that quadrant. In their comments, Owen and Paul argued that moving from bottom to top is more important than moving left to right - and, on balance, we tend to agree.

So, what does this mean in terms of our recommendations?

We think that the guidelines need to cover all four quadrants and that, in particular, much greater emphasis needs to be placed on the top-left quadrant. Any guidance needs to be explicit that the 'http' URIs assigned to descriptions served on the web are not URIs for the things being described; that, typically, multiple descriptions may be served for the things being described (an HTML page and an XML document for example, each of which will have separate URIs) and that mechanisms such as '<link rel="alternate" ...>' can be used to tie them together; and that Google sitemaps (on the left) and semantic sitemaps (on the right) can be used to guide robots to the descriptions (either individually or in collections).
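As a hypothetical illustration of the point about tying multiple descriptions together (the URIs and helper name here are invented for the sketch), a page-per-thing site might generate its alternate-format links along these lines:

```python
# Sketch (invented URIs): for one described thing, emit the
# <link rel="alternate"> elements that tie the HTML description
# to its other-format descriptions.
def alternate_links(current_type, descriptions):
    """descriptions maps media type -> description URI."""
    return [
        f'<link rel="alternate" type="{mtype}" href="{uri}">'
        for mtype, uri in sorted(descriptions.items())
        if mtype != current_type
    ]

descriptions = {
    "text/html": "http://example.org/doc/thing/1",
    "application/rdf+xml": "http://example.org/doc/thing/1.rdf",
}

# From the HTML page, link to the RDF/XML description.
for link in alternate_links("text/html", descriptions):
    print(link)
```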

Which leaves the issue of the OAI-PMH. In a sense, this sits alongside the top-left quadrant - which is why, I think, it didn't fit particularly well with our previous three approaches. If you think about a typical repository for example, it is making descriptions of the content it holds available as HTML 'splash' pages (sometimes with embedded links to descriptions in other formats). In that sense it is functioning in top-left, "page per thing", mode. What the OAI-PMH does is to give you a protocol mechanism for getting at those descriptions in a way that is useful for harvesting.

Several people noted that Atom and RSS might be used as an alternative to both sitemaps and the OAI-PMH, and we agree - though it may be that some additional work is needed to specify the exact mechanisms for doing so.

There were some comments on our suggestion to follow the UK government guidelines on assigning URIs. On reflection, we think it would make more sense to recommend only the W3C guidelines on Cool URIs for the Semantic Web, particularly on the separation of things from the descriptions of things, suggesting that it may be sensible to fund (or find) more work in this area making specific recommendations around persistent URIs (for both things and their descriptions).

Finally, there were a lot of comments on the draft guidelines about our suggested models and formats - notably on FRBR, with many commenters suggesting that this was premature given significant discussion around FRBR elsewhere. We think it would make sense to separate out any guidance on conceptual models and associated vocabularies, probably (again) as a separate piece of work.

To summarise then, we suggest:

  • that the four-quadrant model above is used to frame the guidelines - we think all four quadrants are useful, and that there should probably be some guidance on each area;
  • that specific guidance be developed for serving an HTML page description per 'thing' of interest (possibly with associated, and linked, alternative formats such as XML);
  • that guidance be developed (or found) about how to sensibly assign persistent 'http' URIs to everything of interest (including both things and descriptions of things);
  • that the definition of 'open' needs more work (particularly in the context of whether commercial use is allowed) but that this needs to be sensitive to not stirring up IPR-worries in those domains where they are less of a concern currently;
  • that mechanisms for making statements of provenance, licensing and versioning be developed where RDF triples are being made available (possibly in collaboration with Europeana work); and
  • that a fuller list of relevant models that might be adopted, the relationships between them, and any vocabularies commonly associated with them be maintained separately from the guidelines themselves (I'm trying desperately not to use the 'registry' word here!).

March 10, 2011

Term-based thesauri and SKOS (Part 4): Change over time (ii)

This is the fourth in a series of posts (previously: part 1, part 2, part 3) on making a thesaurus available as linked data using the SKOS and SKOS-XL RDF vocabularies. In the previous post, I examined some of the ways the thesaurus can change over time, and problems that arose with my proposed mapping to RDF. Here I'll outline one potential solution to those problems.

The last three cases I described in the previous post, where an existing preferred term loses that status and is "relegated" to a non-preferred term, all present a problem for my suggested simple mapping, because the URI for a concept disappears from the generated RDF graph - and this creates a conflict with the principles of URI stability and reliability I advocated at the start of that post.

My first thoughts on a solution circled around generating concept URIs, not just for the preferred term, but also for all the non-preferred terms, and using owl:sameAs (or skos:exactMatch?) to indicate that the concept URIs derived from the terms associated with a single preferred term were synonyms, i.e. each of them identified the same concept. That way the change from preferred term to non-preferred term would not result in the loss of a concept URI. But the proliferation of URIs here feels fundamentally flawed - the problem is not one that is solved by having multiple URIs for a single concept; the issue is the persistence of a single URI. Introducing the multiple URIs also seems like a recipe for a lot of practical difficulties in managing the impact of changes on external applications, particularly if URIs which were once synonyms cease to be so.

After some searching, I found a couple of useful pages on the W3C wiki: some notes on versioning (which as far as I know didn't make it into the final SKOS specifications) and particularly this page on "Concept Evolution" in SKOS. The latter is rather more a collection of pointers than the concrete set of examples and guidelines I was hoping for, but one of those pointers is to a thread on the W3C public-esw-thes mailing list, starting with this message from Rob Tice, which I think describes (in his point 2) exactly the situation I'm dealing with in the problem cases in the previous post:

How should we identify and manage change between revisions of concept schemes as this 'seems' to result in imprecision. e.g. a concept 'a' is currently in thes 'A' and only has a preferred label. A new revision of thes 'A' is published and what was concept 'a' is now a non preferred concept and thus becomes simply a non preferred label for a new concept 'b'.

It seems to me that this operation loses some of the semantic meaning of the change as all references to the concept id of 'concept a' would be lost as it now is only a non preferred label of a different concept with a different id (concept 'b').

The suggested approach emerging from that discussion has two elements:

  1. A notion that a concept can be marked as "deprecated" (using e.g. a "status" property with a value of "deprecated" or a "deprecated" property with Boolean (yes/no) values) or as being "valid" or "applicable" only for a specified bounded period of time (see the messages from Johan De Smedt and from Margarita Sini)
  2. Such a "deprecated" concept can be the subject of a "replaced by" relationship linking it to the "preferred term" concept (see the message from Michael Panzer)

The application of these two elements in combination is illustrated in this example by Joachim Neubert (again, I think, addressing the same scenario).

I wasn't aware of the owl:deprecated property before, but as far as I can tell, it would be appropriate for this case.

Joachim's message highlights the question of what to do about skos:prefLabel/skosxl:prefLabel or skos:altLabel/skosxl:altLabel properties for the deprecated concept. In the term-based thesaurus, the term has become a non-preferred term for another term: in the SKOS model, the term is now the alternate label for a different concept, and the preferred label for no concept. So on that basis, I'm inclined to follow Joachim's suggestion that the deprecated concept should be the subject of neither skos:prefLabel/skosxl:prefLabel nor skos:altLabel/skosxl:altLabel properties, though it could, as Joachim's example shows, retain an rdfs:label property. And similarly it is no longer the subject or object of semantic relationships.

I did wonder about the option of introducing a set of properties, parallel to the SKOS ones, to indicate those former relationships, e.g. ex:hadPrefLabel, ex:hadAltLabel, ex:hadRelated, ex:hadBroader, ex:hadNarrower, essentially as documentation. But I'm really not sure how useful this is: the semantic relationships in which those other target concepts are involved may themselves change. And I suppose in principle (though it seems unlikely in practice) a single concept may itself go through several status changes (e.g. from active to deprecated to active to deprecated) and accrue various different "former" relationships in the course of that. If this level of information is required, then I think it probably has to be provided using some other approach - like the use of a set of date-stamped graphs/documents that reflect the state of a concept at a point in time.

So applying Joachim's approach to Case 8 from the examples in the previous post, where the current preferred term "Political violence" is to become a non-preferred term for "Collective violence", we end up with the concept con:C2 as a "deprecated" concept with a Dublin Core dcterms:isReplacedBy relationship to concept con:C6 (and the inverse from con:C6 to con:C2):

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix con: <http://example.org/id/concept/polthes/> .
@prefix term: <http://example.org/id/term/polthes/> .

con:C2 a skos:Concept;
       rdfs:label "Political violence"@en;
       owl:deprecated "true"^^xsd:boolean;
       dcterms:isReplacedBy con:C6 .
       
term:T2 a skosxl:Label;
        skosxl:literalForm "Political violence"@en.

con:C6 a skos:Concept;
       skos:prefLabel "Collective violence"@en;
       skosxl:prefLabel term:T6;
       skos:altLabel "Political violence"@en;
       skosxl:altLabel term:T2; 
       dcterms:replaces con:C2 .       

term:T6 a skosxl:Label;
        skosxl:literalForm "Collective violence"@en.

Using this approach then, the full output graph for Case 8 would be as follows (the highlighting indicates the difference between this graph and that for Case 8 in the previous post):

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix con: <http://example.org/id/concept/polthes/> .
@prefix term: <http://example.org/id/term/polthes/> .

term:T1 a skosxl:Label;
        skosxl:literalForm "Civil violence"@en.
        
con:C6 a skos:Concept;
       skos:prefLabel "Collective violence"@en;
       skosxl:prefLabel term:T6;
       skos:altLabel "Political violence"@en;
       skosxl:altLabel term:T2;
       skos:altLabel "Civil violence"@en;
       skosxl:altLabel term:T1;
       skos:altLabel "Violent protest"@en;
       skosxl:altLabel term:T5;
       skos:broader con:C4;
       skos:narrower con:C3; 
       dcterms:replaces con:C2 .       

term:T6 a skosxl:Label;
        skosxl:literalForm "Collective violence"@en.

con:C2 a skos:Concept;
       rdfs:label "Political violence"@en;
       owl:deprecated "true"^^xsd:boolean;
       dcterms:isReplacedBy con:C6 .
       
term:T2 a skosxl:Label;
        skosxl:literalForm "Political violence"@en.

con:C3 a skos:Concept;
       skos:prefLabel "Terrorism"@en;
       skosxl:prefLabel term:T3;
       skos:broader con:C6 .

term:T3 a skosxl:Label;
        skosxl:literalForm "Terrorism"@en.
       
con:C4 a skos:Concept;
       skos:prefLabel "Violence"@en;
       skosxl:prefLabel term:T4;
       skos:narrower con:C6 .

term:T4 a skosxl:Label;
        skosxl:literalForm "Violence"@en.

term:T5 a skosxl:Label;
        skosxl:literalForm "Violent protest"@en.

Now our graph retains the URI con:C2 and provides a description of that resource as a "deprecated concept".
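To see what this buys a consumer of the data, here is a small sketch (not from the original post; triples are modelled as plain tuples rather than real RDF) of resolving a possibly-deprecated concept URI to its current replacement via the dcterms:isReplacedBy link:

```python
# Sketch: a client holding an old concept URI follows isReplacedBy
# links to reach the current concept. CURIEs stand in for full URIs.
IS_REPLACED_BY = "dcterms:isReplacedBy"
DEPRECATED = "owl:deprecated"

graph = {
    ("con:C2", DEPRECATED, "true"),
    ("con:C2", IS_REPLACED_BY, "con:C6"),
    ("con:C6", "skos:prefLabel", "Collective violence"),
}

def resolve(uri, graph):
    """Follow isReplacedBy links from a deprecated concept to the
    current concept; guard against (unlikely) replacement cycles."""
    seen = set()
    while (uri, DEPRECATED, "true") in graph and uri not in seen:
        seen.add(uri)
        replacements = [o for s, p, o in graph
                        if s == uri and p == IS_REPLACED_BY]
        if not replacements:
            break
        uri = replacements[0]
    return uri

print(resolve("con:C2", graph))  # con:C6
```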

And for Case 9 (again the highlighting indicates the difference from the initial graph for Case 9):

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix con: <http://example.org/id/concept/polthes/> .
@prefix term: <http://example.org/id/term/polthes/> .

term:T1 a skosxl:Label;
        skosxl:literalForm "Civil violence"@en.

con:C1 a skos:Concept;
       skos:prefLabel "Civil violence"@en;
       skosxl:prefLabel term:T1;
       skos:altLabel "Political violence"@en;
       skosxl:altLabel term:T2;
       skos:altLabel "Violent protest"@en;
       skosxl:altLabel term:T5;
       skos:broader con:C4;
       skos:narrower con:C3; 
       dcterms:replaces con:C2 .       
        
con:C6 a skos:Concept;
       skos:prefLabel "Collective violence"@en;
       skosxl:prefLabel term:T6.

term:T6 a skosxl:Label;
        skosxl:literalForm "Collective violence"@en.

con:C2 a skos:Concept;
       rdfs:label "Political violence"@en;
       owl:deprecated "true"^^xsd:boolean;
       dcterms:isReplacedBy con:C1 .

term:T2 a skosxl:Label;
        skosxl:literalForm "Political violence"@en.

con:C3 a skos:Concept;
       skos:prefLabel "Terrorism"@en;
       skosxl:prefLabel term:T3;
       skos:broader con:C1 .

term:T3 a skosxl:Label;
        skosxl:literalForm "Terrorism"@en.
       
con:C4 a skos:Concept;
       skos:prefLabel "Violence"@en;
       skosxl:prefLabel term:T4;
       skos:narrower con:C1 .

term:T4 a skosxl:Label;
        skosxl:literalForm "Violence"@en.

term:T5 a skosxl:Label;
        skosxl:literalForm "Violent protest"@en.

In the (unlikely?) event that the (previously preferred) non-preferred term is once again restored to the state of preferred term, then the concept con:C2 loses its deprecated status and the dcterms:isReplacedBy relationship, and acquires skos:prefLabel/skos:altLabel properties as normal.

Generating these graphs does, however, imply a change to the process of generating the RDF representation. As I noted at the start of the previous post, my first cut at this was based on being able to process a snapshot of the thesaurus "stand-alone" without knowledge of previous versions. But the capacity to detect deprecated concepts depends on knowledge of the current state of the thesaurus, i.e. when the transformation process encounters a non-preferred term x, it needs to behave differently depending on whether:

  1. concept con:Cx exists in the current thesaurus dataset (as either an "active" or "deprecated" concept), in which case a "deprecated concept" con:Cx should be output, as well as term:Tx (as alternate label for some other concept, con:Cy); or
  2. concept con:Cx does not exist in the current thesaurus dataset, in which case only term:Tx (as alternate label for a concept con:Cy) is required

I think that test has to be made against the current RDF thesaurus dataset rather than against the previous XML snapshot, as the "deprecation" may have taken place several snapshots ago.
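The branching described in the two cases above might look like this in outline (a sketch with invented names; the real transform operates on XML and RDF rather than plain strings):

```python
# Sketch of the test described above: when the transform meets a
# non-preferred term x, it checks the *current* RDF dataset for an
# existing concept URI con:Cx (active or deprecated).
def handle_non_preferred(term_number, current_concept_uris):
    """Return the resources to output for non-preferred term x.

    current_concept_uris: set of concept URIs present in the current
    published RDF dataset.
    """
    concept_uri = f"con:C{term_number}"
    label_uri = f"term:T{term_number}"
    if concept_uri in current_concept_uris:
        # Case 1: the term was once preferred, so its concept URI is
        # kept around as a "deprecated concept" alongside the label.
        return [("deprecated-concept", concept_uri), ("label", label_uri)]
    # Case 2: the term was never preferred; only a label is needed.
    return [("label", label_uri)]

print(handle_non_preferred(2, {"con:C2", "con:C6"}))
print(handle_non_preferred(5, {"con:C2", "con:C6"}))
```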

I have to admit this does make the transformation process rather more complicated than I had hoped. The only alternative would be if it were somehow possible to distinguish the "deprecation" case from the "static" non-preferred term case from the input data alone, but as far as I know this isn't possible.

Summary

The previous post highlighted that for one particular category of change, where an existing preferred term is "relegated" to the status of a non-preferred term, the results of the suggested simple mapping into SKOS had problematic consequences.

Based on some investigation of how others approach similar scenarios (and here I should note I'm very grateful to the contributors to the wiki page on concept evolution and to those discussions linked from it, as I was struggling to see clearly how to deal with these scenarios), I've sketched above an approach to representing a concept which has been "deprecated", or is no longer applicable, and is replaced by another concept. I'm sure it isn't the only way of addressing the problem, but it seems a reasonable one to try.

I think this creates new challenges for implementing this approach in the transformation process and I need to work on that to test it, but I think it is achievable. But I would also be very grateful for any comments, particularly if there are gaping holes in this which I haven't spotted!

Term-based thesauri and SKOS (Part 3): Change over time (i)

This is the third in a series of posts (previously: part 1, part 2) on making a thesaurus available as linked data using the SKOS and SKOS-XL RDF vocabularies. In this post, I'll examine some of the ways the thesaurus can change over time, and how such changes are reflected when applying the mapping I described earlier.

A note on "workflow"

In the case I'm working on, the term-based thesaurus is managed in a purpose-built application, from which a snapshot is exported (as an XML document) at regular intervals. These XML documents are the inputs to a transformation process which generates an SKOS/SKOS-XL RDF version, to be exposed as linked data.

Currently at least, each "run" of that transformation operates on a single snapshot of the thesaurus "stand-alone" i.e. the transform process has no "knowledge" of the previous snapshot, and the expectation is that the output generated from processing will replace the output of the previous run (either in full, or through a process of establishing the differences and then removing some triples and adding others). This "stand-alone" approach may be something I have to revisit.

The mapping

To summarise the transformation described in the previous post, a single preferred term and its set of zero or more non-preferred terms are treated as labels for a single concept. For each such set:

  • a single SKOS concept is created with a URI based on the term number of the preferred term
  • the concept is related to the literal form of the preferred term by an skos:prefLabel property
  • an SKOS-XL label is created with a URI based on the term number of the preferred term
  • the label is related to the literal form of the preferred term by an skosxl:literalForm property
  • the concept is related to the label by an skosxl:prefLabel property
  • the "hierarchical" (broader term, narrower term) and "associative" (related term) relationships between preferred terms are represented as "semantic" relationships between concepts
  • And for each non-preferred term in the set
    • the concept is related to the literal form of the non-preferred term by an skos:altLabel property
    • an SKOS-XL label is created with a URI based on the term number of the non-preferred term
    • the label is related to the literal form of the non-preferred term by an skosxl:literalForm property
    • the concept is related to the label by an skosxl:altLabel property
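The steps above can be sketched as follows (a hypothetical helper, with triples as plain tuples rather than real RDF, and CURIEs standing in for full URIs):

```python
# Sketch of the mapping summarised above: one preferred term plus its
# non-preferred terms become one skos:Concept with SKOS and SKOS-XL
# labels, emitted as (subject, predicate, object) tuples.
def map_term_set(pref, non_pref):
    """pref: (term_number, literal); non_pref: list of the same."""
    pnum, plit = pref
    c, t = f"con:C{pnum}", f"term:T{pnum}"
    triples = [
        (c, "rdf:type", "skos:Concept"),
        (c, "skos:prefLabel", plit),
        (t, "rdf:type", "skosxl:Label"),
        (t, "skosxl:literalForm", plit),
        (c, "skosxl:prefLabel", t),
    ]
    for num, lit in non_pref:
        alt = f"term:T{num}"
        triples += [
            (c, "skos:altLabel", lit),
            (alt, "rdf:type", "skosxl:Label"),
            (alt, "skosxl:literalForm", lit),
            (c, "skosxl:altLabel", alt),
        ]
    return triples

triples = map_term_set((2, "Political violence"),
                       [(1, "Civil violence"), (5, "Violent protest")])
```

(Semantic relationships between preferred terms would be emitted in a similar way, as skos:broader/skos:narrower/skos:related triples between the concept URIs.)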

In the discussion below, I'll take the following "snapshot" of a notional thesaurus - it's another version of the example used in the previous posts, extended with an additional preferred term - as a starting point:

Civil violence
USE Political violence
TNR 1

Collective violence
TNR 6

Political violence
UF Civil violence
UF Violent protest
BT Violence
NT Terrorism
TNR 2

Terrorism
BT Political violence
TNR 3

Violence
NT Political violence
TNR 4

Violent protest
USE Political violence
TNR 5

Using the mapping above, it is represented as follows in RDF using SKOS/SKOS-XL:

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix con: <http://example.org/id/concept/polthes/> .
@prefix term: <http://example.org/id/term/polthes/> .

term:T1 a skosxl:Label;
        skosxl:literalForm "Civil violence"@en.
        
con:C6 a skos:Concept;
       skos:prefLabel "Collective violence"@en;
       skosxl:prefLabel term:T6.

term:T6 a skosxl:Label;
        skosxl:literalForm "Collective violence"@en.

con:C2 a skos:Concept;
       skos:prefLabel "Political violence"@en;
       skosxl:prefLabel term:T2;
       skos:altLabel "Civil violence"@en;
       skosxl:altLabel term:T1;
       skos:altLabel "Violent protest"@en;
       skosxl:altLabel term:T5;
       skos:broader con:C4;
       skos:narrower con:C3 .

term:T2 a skosxl:Label;
        skosxl:literalForm "Political violence"@en.

con:C3 a skos:Concept;
       skos:prefLabel "Terrorism"@en;
       skosxl:prefLabel term:T3;
       skos:broader con:C2 .

term:T3 a skosxl:Label;
        skosxl:literalForm "Terrorism"@en.
       
con:C4 a skos:Concept;
       skos:prefLabel "Violence"@en;
       skosxl:prefLabel term:T4;
       skos:narrower con:C2 .

term:T4 a skosxl:Label;
        skosxl:literalForm "Violence"@en.

term:T5 a skosxl:Label;
        skosxl:literalForm "Violent protest"@en.
[Figure: diagram of the RDF graph above]

"Versioning" and change over time

Once our resource URIs are generated and published, they will be used/cited by other agencies in their data - in other linked data datasets, in other thesauri, or in simple Web documents which reference terms or concepts using those URIs. From the linked data perspective, it is important that once generated and published the resource URIs, which will be http: URIs, remain stable and reliable. I'm using the terms "stable" and "reliable" as they are used by Henry Thompson and Jonathan Rees in their note Guidelines for Web-based naming, which I've found very helpful in breaking down the various aspects of what we tend to call "persistence". And for "stability", I'm thinking particularly of what they call "resource stability". So:

  • once a URI is created, we should continue to use that URI to denote/identify the same resource
  • it should continue to be possible to obtain some information "about" the identified resource using the HTTP protocol - though that information obtained may change over time

For our particular case, the requirement is only that the "current version" of the thesaurus is available at any point in time, i.e. for each concept and for each term/label, at any point in time, it is necessary to serve only a description of the current state of that resource.

So, in my previous post, I mentioned that the Cabinet Office guidelines Designing URI Sets for the UK Public Sector allow for the case of creating a set of "date-stamped" document URIs, to provide variant descriptions of a resource at different points in time. I don't think that is required for this case, so for each term and concept, we'll have a URI for that "thing", a "Document URI" for a "generic document" (current) description of that thing, and "Representation URIs" for each "specific document" in a particular format.

The formats provided will include a human-readable HTML version, an RDF/XML version and possibly other RDF formats. Over time, additional formats can be added as required through the addition of new "Representation URIs".

My primary focus here is the changes to the thesaurus content. Over time, various changes are possible. New terms may be added, and the relationships between terms may change. Terms are not deleted from the thesaurus, however.

The most common type of change is the "promotion" of an existing non-preferred term to the status of a preferred term, but all of the following types of change can occur, even if some are infrequent:

  1. Addition of new semantic relationships between existing preferred terms
  2. Removal of existing semantic relationships between existing preferred terms
  3. Addition of new preferred terms
  4. Addition of new non-preferred terms (for existing preferred terms)
  5. An existing non-preferred term becomes a new preferred term
  6. An existing non-preferred term becomes a non-preferred term for a different existing preferred term
  7. An existing non-preferred term becomes a non-preferred term for a newly-added preferred term
  8. An existing preferred term becomes a non-preferred term for another existing preferred term
  9. An existing preferred term becomes a non-preferred term for a term which is currently a non-preferred term for it (and vice versa)
  10. An existing preferred term becomes a non-preferred term for a newly added preferred term

Below, I'll try to walk through an example of each of those changes in turn, starting from the example thesaurus above, showing the results using the mapping suggested above, and examining any issues which arise.

Case 1: Addition of new semantic relationship

The addition of new broader term (BT), narrower term (NT) or related term (RT) relationships is straightforward, as it involves only the creation of additional assertions of relationships between concepts, using the skos:broader, skos:narrower or skos:related properties, not the creation of new resources.

So if the example above is extended to add a BT relation between the "Collective violence" (term no 6) and "Violence" (term no 4) terms (and the inverse NT relation):

Civil violence
USE Political violence
TNR 1

Collective violence
BT Violence
TNR 6

Political violence
UF Civil violence
UF Violent protest
BT Violence
NT Terrorism
TNR 2

Terrorism
BT Political violence
TNR 3

Violence
NT Political violence
NT Collective violence
TNR 4

Violent protest
USE Political violence
TNR 5

resulting in the RDF graph

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix con: <http://example.org/id/concept/polthes/> .
@prefix term: <http://example.org/id/term/polthes/> .

term:T1 a skosxl:Label;
        skosxl:literalForm "Civil violence"@en.
        
con:C6 a skos:Concept;
       skos:prefLabel "Collective violence"@en;
       skosxl:prefLabel term:T6;
       skos:broader con:C4 .

term:T6 a skosxl:Label;
        skosxl:literalForm "Collective violence"@en.

con:C2 a skos:Concept;
       skos:prefLabel "Political violence"@en;
       skosxl:prefLabel term:T2;
       skos:altLabel "Civil violence"@en;
       skosxl:altLabel term:T1;
       skos:altLabel "Violent protest"@en;
       skosxl:altLabel term:T5;
       skos:broader con:C4;
       skos:narrower con:C3 .

term:T2 a skosxl:Label;
        skosxl:literalForm "Political violence"@en.

con:C3 a skos:Concept;
       skos:prefLabel "Terrorism"@en;
       skosxl:prefLabel term:T3;
       skos:broader con:C2 .

term:T3 a skosxl:Label;
        skosxl:literalForm "Terrorism"@en.
       
con:C4 a skos:Concept;
       skos:prefLabel "Violence"@en;
       skosxl:prefLabel term:T4;
       skos:narrower con:C2 ;
       skos:narrower con:C6 .

term:T4 a skosxl:Label;
        skosxl:literalForm "Violence"@en.

term:T5 a skosxl:Label;
        skosxl:literalForm "Violent protest"@en.

i.e. two new triples are added to the RDF graph

con:C6 skos:broader con:C4 .
con:C4 skos:narrower con:C6 .

The addition of the triples means that, from a linked data perspective, the graphs served as descriptions of the resources con:C6 and con:C4 change. They each include one additional triple for the concise bounded description case; two triples for the symmetric bounded description case (see the previous post for the discussion of different forms of bounded description). So the contents of the representations of documents http://example.org/doc/concept/polthes/C4 and http://example.org/doc/concept/polthes/C6 change - but no new resources are generated, and no new URIs required.
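Treating each run's output as a set of triples makes this kind of change easy to compute; a minimal sketch (tuples standing in for RDF triples, with only the relevant subset of the graphs shown):

```python
# Sketch: the difference between two snapshot outputs as set
# operations, mirroring Case 1 above, where two broader/narrower
# triples are added and nothing is removed.
old = {
    ("con:C6", "skos:prefLabel", "Collective violence"),
    ("con:C4", "skos:narrower", "con:C2"),
}
new = old | {
    ("con:C6", "skos:broader", "con:C4"),
    ("con:C4", "skos:narrower", "con:C6"),
}
added, removed = new - old, old - new
print(sorted(added))   # the two new broader/narrower triples
print(sorted(removed)) # empty for Case 1
```

A replace-in-full update would simply publish `new`; a differential update would remove `removed` and add `added`.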

Case 2: Removal of existing semantic relationship

The removal of existing broader term (BT), narrower term (NT) or related term (RT) relationships is similarly straightforward, as it involves only the deletion of assertions of relationships between concepts, using the skos:broader, skos:narrower or skos:related properties, without the removal of existing resources.

I won't write out a full example for this case; imagine the previous example simply reverting to its initial state.

Again, from a linked data perspective, the graphs served as descriptions of the resources con:C6 and con:C4 change, with each containing one triple less for the CBD case or two triples less for the SCBD case, but we still have the same set of term URIs and concept URIs.

Case 3: Addition of new preferred terms

The addition of a new preferred term is again a matter of extending the graph with new information, though in this case some new URIs are also introduced.

Suppose a new preferred term "Revolution" (term no 7) is added to our initial example:

Civil violence
USE Political violence
TNR 1

Collective violence
TNR 6

Political violence
UF Civil violence
UF Violent protest
BT Violence
NT Terrorism
TNR 2

Revolution
TNR 7

Terrorism
BT Political violence
TNR 3

Violence
NT Political violence
TNR 4

Violent protest
USE Political violence
TNR 5

resulting in the following graph:

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix con: <http://example.org/id/concept/polthes/> .
@prefix term: <http://example.org/id/term/polthes/> .

term:T1 a skosxl:Label;
        skosxl:literalForm "Civil violence"@en.
        
con:C6 a skos:Concept;
       skos:prefLabel "Collective violence"@en;
       skosxl:prefLabel term:T6.

term:T6 a skosxl:Label;
        skosxl:literalForm "Collective violence"@en.

con:C2 a skos:Concept;
       skos:prefLabel "Political violence"@en;
       skosxl:prefLabel term:T2;
       skos:altLabel "Civil violence"@en;
       skosxl:altLabel term:T1;
       skos:altLabel "Violent protest"@en;
       skosxl:altLabel term:T5;
       skos:broader con:C4;
       skos:narrower con:C3 .

term:T2 a skosxl:Label;
        skosxl:literalForm "Political violence"@en.

con:C7 a skos:Concept;
       skos:prefLabel "Revolution"@en;
       skosxl:prefLabel term:T7 .

term:T7 a skosxl:Label;
        skosxl:literalForm "Revolution"@en.
        
con:C3 a skos:Concept;
       skos:prefLabel "Terrorism"@en;
       skosxl:prefLabel term:T3;
       skos:broader con:C2 .

term:T3 a skosxl:Label;
        skosxl:literalForm "Terrorism"@en.
       
con:C4 a skos:Concept;
       skos:prefLabel "Violence"@en;
       skosxl:prefLabel term:T4;
       skos:narrower con:C2 .

term:T4 a skosxl:Label;
        skosxl:literalForm "Violence"@en.

term:T5 a skosxl:Label;
        skosxl:literalForm "Violent protest"@en.

The following triples are added:

con:C7 a skos:Concept;
       skos:prefLabel "Revolution"@en;
       skosxl:prefLabel term:T7.

term:T7 a skosxl:Label;
        skosxl:literalForm "Revolution"@en.

The RDF representation now includes an additional concept and an additional label. So there are two new resources, with new URIs (con:C7 and term:T7), and a corresponding set of new Document URIs and Representation URIs for the descriptions of those resources.

It is quite probable that the addition of a new preferred term is accompanied by the assertion of semantic relationships with other existing preferred terms. This is equivalent to following this step with a second step of the type shown in Case 1.

Case 4: Addition of new non-preferred term (for existing preferred term)

The addition of a new non-preferred term is, again, a matter of adding new information, and new URIs.

Suppose a new term "Assault" (term no 8) is added as a new non-preferred term for "Violence" (term no 4):

Assault
USE Violence
TNR 8

Civil violence
USE Political violence
TNR 1

Collective violence
TNR 6

Political violence
UF Civil violence
UF Violent protest
BT Violence
NT Terrorism
TNR 2

Terrorism
BT Political violence
TNR 3

Violence
UF Assault
NT Political violence
TNR 4

Violent protest
USE Political violence
TNR 5

which results in the graph

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix con: <http://example.org/id/concept/polthes/> .
@prefix term: <http://example.org/id/term/polthes/> .

term:T8 a skosxl:Label;
        skosxl:literalForm "Assault"@en.

term:T1 a skosxl:Label;
        skosxl:literalForm "Civil violence"@en.
        
con:C6 a skos:Concept;
       skos:prefLabel "Collective violence"@en;
       skosxl:prefLabel term:T6.

term:T6 a skosxl:Label;
        skosxl:literalForm "Collective violence"@en.

con:C2 a skos:Concept;
       skos:prefLabel "Political violence"@en;
       skosxl:prefLabel term:T2;
       skos:altLabel "Civil violence"@en;
       skosxl:altLabel term:T1;
       skos:altLabel "Violent protest"@en;
       skosxl:altLabel term:T5;
       skos:broader con:C4;
       skos:narrower con:C3 .

term:T2 a skosxl:Label;
        skosxl:literalForm "Political violence"@en.

con:C3 a skos:Concept;
       skos:prefLabel "Terrorism"@en;
       skosxl:prefLabel term:T3;
       skos:broader con:C2 .

term:T3 a skosxl:Label;
        skosxl:literalForm "Terrorism"@en.
       
con:C4 a skos:Concept;
       skos:prefLabel "Violence"@en;
       skosxl:prefLabel term:T4;
       skos:altLabel "Assault"@en;
       skosxl:altLabel term:T8;
       skos:narrower con:C2 .

term:T4 a skosxl:Label;
        skosxl:literalForm "Violence"@en.

term:T5 a skosxl:Label;
        skosxl:literalForm "Violent protest"@en.

i.e. the following triples are added:

term:T8 a skosxl:Label;
        skosxl:literalForm "Assault"@en.

con:C4 skos:altLabel "Assault"@en;
       skosxl:altLabel term:T8 .

So from a linked data perspective, there is a new resource with a new URI (term:T8) (and its own new description with a new Document URI), and the existing URI con:C4 is the subject of two new triples: a skos:altLabel for the literal, and a skosxl:altLabel link to the new label. The graph served as the description of that existing resource therefore changes to include these additional triples.

Case 5: Existing non-preferred term becomes new preferred term

Suppose the existing term "Civil violence", initially a non-preferred term for "Political violence", is "promoted" and made a preferred term in its own right:

Civil violence
BT Violence
TNR 1

Collective violence
TNR 6

Political violence
UF Violent protest
BT Violence
NT Terrorism
TNR 2

Terrorism
BT Political violence
TNR 3

Violence
NT Civil violence
NT Political violence
TNR 4

Violent protest
USE Political violence
TNR 5

resulting in the following graph

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix con: <http://example.org/id/concept/polthes/> .
@prefix term: <http://example.org/id/term/polthes/> .

con:C1 a skos:Concept;
       skos:prefLabel "Civil violence"@en;
       skosxl:prefLabel term:T1;
       skos:broader con:C4 .

term:T1 a skosxl:Label;
        skosxl:literalForm "Civil violence"@en.
        
con:C6 a skos:Concept;
       skos:prefLabel "Collective violence"@en;
       skosxl:prefLabel term:T6.

term:T6 a skosxl:Label;
        skosxl:literalForm "Collective violence"@en.

con:C2 a skos:Concept;
       skos:prefLabel "Political violence"@en;
       skosxl:prefLabel term:T2;
       skos:altLabel "Violent protest"@en;
       skosxl:altLabel term:T5;
       skos:broader con:C4;
       skos:narrower con:C3 .

term:T2 a skosxl:Label;
        skosxl:literalForm "Political violence"@en.

con:C3 a skos:Concept;
       skos:prefLabel "Terrorism"@en;
       skosxl:prefLabel term:T3;
       skos:broader con:C2 .

term:T3 a skosxl:Label;
        skosxl:literalForm "Terrorism"@en.
       
con:C4 a skos:Concept;
       skos:prefLabel "Violence"@en;
       skosxl:prefLabel term:T4;
       skos:narrower con:C2;
       skos:narrower con:C1 .

term:T4 a skosxl:Label;
        skosxl:literalForm "Violence"@en.

term:T5 a skosxl:Label;
        skosxl:literalForm "Violent protest"@en.

For this case, the following new triples are added

con:C1 a skos:Concept;
       skos:prefLabel "Civil violence"@en;
       skosxl:prefLabel term:T1;
       skos:broader con:C4 .

con:C4 skos:narrower con:C1 .

and also the following existing triples are removed

con:C2 skos:altLabel "Civil violence"@en;
       skosxl:altLabel term:T1 .

So from a linked data perspective, there is a new resource with a new URI (con:C1) (and its own new description with a new Document URI), and the graphs served as descriptions of the existing resources con:C2 and con:C4 both change: the former loses the skos:altLabel and skosxl:altLabel triples and the latter gains a new skos:narrower triple. If symmetric bounded descriptions are used, the description of term:T1 changes too.

Case 6: Existing non-preferred term becomes non-preferred term for a different existing preferred term

Suppose we decide that "Civil violence", initially a non-preferred term for "Political violence", is to become a non-preferred term for "Collective violence".

Civil violence
USE Collective violence
TNR 1

Collective violence
UF Civil violence
TNR 6

Political violence
UF Violent protest
BT Violence
NT Terrorism
TNR 2

Terrorism
BT Political violence
TNR 3

Violence
NT Political violence
TNR 4

Violent protest
USE Political violence
TNR 5

This generates the following graph:

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix con: <http://example.org/id/concept/polthes/> .
@prefix term: <http://example.org/id/term/polthes/> .

term:T1 a skosxl:Label;
        skosxl:literalForm "Civil violence"@en.
        
con:C6 a skos:Concept;
       skos:prefLabel "Collective violence"@en;
       skosxl:prefLabel term:T6;
       skos:altLabel "Civil violence"@en;
       skosxl:altLabel term:T1.

term:T6 a skosxl:Label;
        skosxl:literalForm "Collective violence"@en.

con:C2 a skos:Concept;
       skos:prefLabel "Political violence"@en;
       skosxl:prefLabel term:T2;
       skos:altLabel "Violent protest"@en;
       skosxl:altLabel term:T5;
       skos:broader con:C4;
       skos:narrower con:C3 .

term:T2 a skosxl:Label;
        skosxl:literalForm "Political violence"@en.

con:C3 a skos:Concept;
       skos:prefLabel "Terrorism"@en;
       skosxl:prefLabel term:T3;
       skos:broader con:C2 .

term:T3 a skosxl:Label;
        skosxl:literalForm "Terrorism"@en.
       
con:C4 a skos:Concept;
       skos:prefLabel "Violence"@en;
       skosxl:prefLabel term:T4;
       skos:narrower con:C2 .

term:T4 a skosxl:Label;
        skosxl:literalForm "Violence"@en.

term:T5 a skosxl:Label;
        skosxl:literalForm "Violent protest"@en.

For this case, the following new triples are added

con:C6 skos:altLabel "Civil violence"@en;
       skosxl:altLabel term:T1.

and also the following existing triples are removed

con:C2 skos:altLabel "Civil violence"@en;
       skosxl:altLabel term:T1 .

The graphs served as descriptions of the existing resources con:C2 and con:C6 both change: the former loses the skos:altLabel and skosxl:altLabel triples and the latter gains skos:altLabel and skosxl:altLabel triples. If symmetric bounded descriptions are used then the description of term:T1 also changes.

Case 7: Existing non-preferred term becomes non-preferred term for a newly added preferred term

I think this case is just a combination of Case 3 (addition of new preferred term) and Case 6 (existing non-preferred term becomes non-preferred term for a different existing preferred term) in sequence. We've seen above that these changes can be made without problems, so the "composite" case should be OK too, and I won't bother working through a full example here.

Case 8: An existing preferred term becomes a non-preferred term for another existing preferred term

Suppose the current preferred term "Political violence" is to be "relegated" to become a non-preferred term for "Collective violence", with the latter becoming the participant in hierarchical relations previously involving the former. (I appreciate that these two terms probably don't constitute a great example, but let’s suppose it works, for the sake of the discussion!)

Civil violence
USE Collective violence
TNR 1

Collective violence
UF Civil violence
UF Political violence
UF Violent protest
BT Violence
NT Terrorism
TNR 6

Political violence
USE Collective violence
TNR 2

Terrorism
BT Collective violence
TNR 3

Violence
NT Collective violence
TNR 4

Violent protest
USE Collective violence
TNR 5

This maps to the rather substantially changed RDF graph

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix con: <http://example.org/id/concept/polthes/> .
@prefix term: <http://example.org/id/term/polthes/> .

term:T1 a skosxl:Label;
        skosxl:literalForm "Civil violence"@en.
        
con:C6 a skos:Concept;
       skos:prefLabel "Collective violence"@en;
       skosxl:prefLabel term:T6;
       skos:altLabel "Political violence"@en;
       skosxl:altLabel term:T2;
       skos:altLabel "Civil violence"@en;
       skosxl:altLabel term:T1;
       skos:altLabel "Violent protest"@en;
       skosxl:altLabel term:T5;
       skos:broader con:C4;
       skos:narrower con:C3 .

term:T6 a skosxl:Label;
        skosxl:literalForm "Collective violence"@en.

term:T2 a skosxl:Label;
        skosxl:literalForm "Political violence"@en.

con:C3 a skos:Concept;
       skos:prefLabel "Terrorism"@en;
       skosxl:prefLabel term:T3;
       skos:broader con:C6 .

term:T3 a skosxl:Label;
        skosxl:literalForm "Terrorism"@en.
       
con:C4 a skos:Concept;
       skos:prefLabel "Violence"@en;
       skosxl:prefLabel term:T4;
       skos:narrower con:C6 .

term:T4 a skosxl:Label;
        skosxl:literalForm "Violence"@en.

term:T5 a skosxl:Label;
        skosxl:literalForm "Violent protest"@en.

The following RDF triples have been added

con:C6 skos:altLabel "Political violence"@en;
       skosxl:altLabel term:T2;
       skos:altLabel "Civil violence"@en;
       skosxl:altLabel term:T1;
       skos:altLabel "Violent protest"@en;
       skosxl:altLabel term:T5;
       skos:broader con:C4;
       skos:narrower con:C3 .

con:C3 skos:broader con:C6 .

con:C4 skos:narrower con:C6 .

And the following RDF triples have been removed

con:C2 a skos:Concept;
       skos:prefLabel "Political violence"@en;
       skosxl:prefLabel term:T2;
       skos:altLabel "Civil violence"@en;
       skosxl:altLabel term:T1;
       skos:altLabel "Violent protest"@en;
       skosxl:altLabel term:T5;
       skos:broader con:C4;
       skos:narrower con:C3 .

con:C3 skos:broader con:C2 .

con:C4 skos:narrower con:C2 .

So the graphs served as descriptions of the concepts con:C3 and con:C4 change (with the removal of a triple and the addition of a new one); and that for concept con:C6 changes with the addition of several triples.

So far, so good.

However, the URI con:C2 has now completely disappeared from the graph. If this new graph simply replaces the previous graph, then there will be no description available for resource con:C2.

Case 9: An existing preferred term becomes a non-preferred term for a term which is currently a non-preferred term for it (and vice versa)

Suppose that the current non-preferred term "Civil violence" is to become preferred to "Political violence", and the latter is to become a non-preferred term for the former - both "relegation" and "promotion" taking place together, if you like.

Civil violence
UF Political violence
UF Violent protest
BT Violence
NT Terrorism
TNR 1

Collective violence
TNR 6

Political violence
USE Civil violence
TNR 2

Terrorism
BT Civil violence
TNR 3

Violence
NT Civil violence
TNR 4

Violent protest
USE Civil violence
TNR 5

resulting in the following graph

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix con: <http://example.org/id/concept/polthes/> .
@prefix term: <http://example.org/id/term/polthes/> .

term:T1 a skosxl:Label;
        skosxl:literalForm "Civil violence"@en.

con:C1 a skos:Concept;
       skos:prefLabel "Civil violence"@en;
       skosxl:prefLabel term:T1;
       skos:altLabel "Political violence"@en;
       skosxl:altLabel term:T2;
       skos:altLabel "Violent protest"@en;
       skosxl:altLabel term:T5;
       skos:broader con:C4;
       skos:narrower con:C3 .
        
con:C6 a skos:Concept;
       skos:prefLabel "Collective violence"@en;
       skosxl:prefLabel term:T6.

term:T6 a skosxl:Label;
        skosxl:literalForm "Collective violence"@en.

term:T2 a skosxl:Label;
        skosxl:literalForm "Political violence"@en.

con:C3 a skos:Concept;
       skos:prefLabel "Terrorism"@en;
       skosxl:prefLabel term:T3;
       skos:broader con:C1 .

term:T3 a skosxl:Label;
        skosxl:literalForm "Terrorism"@en.
       
con:C4 a skos:Concept;
       skos:prefLabel "Violence"@en;
       skosxl:prefLabel term:T4;
       skos:narrower con:C1 .

term:T4 a skosxl:Label;
        skosxl:literalForm "Violence"@en.

term:T5 a skosxl:Label;
        skosxl:literalForm "Violent protest"@en.

The following RDF triples have been added

con:C1 a skos:Concept;
       skos:prefLabel "Civil violence"@en;
       skosxl:prefLabel term:T1;
       skos:altLabel "Political violence"@en;
       skosxl:altLabel term:T2;
       skos:altLabel "Violent protest"@en;
       skosxl:altLabel term:T5;
       skos:broader con:C4;
       skos:narrower con:C3 .
       
con:C3 skos:broader con:C1 .

con:C4 skos:narrower con:C1 .

And the following RDF triples have been removed

con:C2 a skos:Concept;
       skos:prefLabel "Political violence"@en;
       skosxl:prefLabel term:T2;
       skos:altLabel "Civil violence"@en;
       skosxl:altLabel term:T1;
       skos:altLabel "Violent protest"@en;
       skosxl:altLabel term:T5;
       skos:broader con:C4;
       skos:narrower con:C3 .

con:C3 skos:broader con:C2 .

con:C4 skos:narrower con:C2 .

The outcome here is similar to that of the previous case.

The graphs served as descriptions of the concepts con:C3 and con:C4 change (with the removal of one triple and the addition of a new one). A new concept con:C1 is created. But again, the URI con:C2 has completely disappeared from the graph, with the same consequence: no description will be available for it.

Case 10: An existing preferred term becomes a non-preferred term for a newly added preferred term

I think this case is just a combination of Case 3 (addition of new preferred term) and Case 8 (existing preferred term becomes a non-preferred term for another existing preferred term) in sequence.

The same problem will arise with the URI of the existing concept disappearing from the new output graph.

Summary

I've walked through in detail the different types of changes which can occur to the content of the thesaurus. This highlighted that for one particular category of change, where an existing preferred term is "relegated" to the status of a non-preferred term, exemplified by my cases 8, 9 and 10 above, the results of the suggested simple mapping into SKOS had problematic consequences: the URI for a concept disappears from the generated RDF graph - and this creates a conflict with the principles of URI stability and reliability I advocated at the start of this post.
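The problematic cases can be caught mechanically: if the new graph is generated as a wholesale replacement for the old one, a simple comparison of the URIs used in each will flag any identifier that has vanished. A minimal sketch (my illustration, not part of any actual tooling, with graphs again modelled as sets of tuples):

```python
# Sketch of a check for the problem summarised above: any URI in our
# namespace that appears somewhere in the old graph but nowhere in the
# new one will no longer have a description served for it.

def vanished_uris(old_graph, new_graph, prefix):
    """URIs under the given namespace prefix present in the old graph
    but absent from the new graph."""
    def uris(graph):
        return {x for t in graph for x in t if x.startswith(prefix)}
    return uris(old_graph) - uris(new_graph)

CON = "http://example.org/id/concept/polthes/"
old = {(CON + "C2", "skos:broader", CON + "C4"),
       (CON + "C4", "skos:narrower", CON + "C2")}
# Case 8: con:C2 is "relegated" and drops out of the output entirely
new = {(CON + "C6", "skos:broader", CON + "C4"),
       (CON + "C4", "skos:narrower", CON + "C6")}

print(vanished_uris(old, new, CON))
# {'http://example.org/id/concept/polthes/C2'}
```

A non-empty result signals exactly the conflict with URI stability described above.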

In the next post, I'll suggest one way of (I hope!) addressing this problem.

March 01, 2011

Term-based thesauri and SKOS (Part 2): Linked Data

In my previous post on this topic I outlined how I was approaching making a thesaurus available using the SKOS and SKOS-XL RDF vocabularies. In that post I focused on:

  • how the thesaurus content is modelled using a "concept-based" approach - what are the "things of interest", their attributes, and the relationships between them;
  • how those things (concepts, terms/labels) are named/identified using http URIs;
  • how those things can be "described" using the simple "triple" statement model of RDF, and using the SKOS and SKOS-XL RDF vocabularies;
  • an example of how an expression of the thesaurus using the term-based model is mapped or transformed into a SKOS RDF expression

What I didn't really address in that post is how that resulting RDF data is made available and accessed on the Web - which is more specifically where the "Linked Data" principles articulated by Tim Berners-Lee come into play.

(A good deal of the content of this post is probably familiar stuff for those of you already working with Linked Data, but I thought it was worth documenting it both to fill out the picture of some of the "design choices" to be made in this particular example, and to provide some more background to others less familiar with the approaches.)

Linked Data, URIs, things, documents and HTTP

The use of http URIs as identifiers provides two features:

  • a global naming system, and a set of processes by which authority for assigning names can be delegated/distributed;
  • through the HTTP protocol, a well understood and widely deployed mechanism for providing access to information "about", or descriptions of, the things identified by those URIs (in our case, the concepts and terms/labels).

As a user/consumer of an http URI, given only the URI, I can "look up" that URI using the HTTP protocol, i.e. I can provide it to a tool (like a Web browser) and that tool can issue a request to obtain a description of the thing identified by the URI. And conversely as the owner/provider of a URI, I can configure my server to respond to such requests with a document providing a description of the thing.

And the HTTP protocol incorporates features which we can use to "optimise" this process. For example, the "content negotiation" feature allows a client to specify a preference for the format in which it wishes to receive data, and allows a server to select - from amongst the several it may have available - the format which it determines is most appropriate for the client. In the terminology of the Web Architecture, the description can have multiple "representations", each of which can vary by format (or by other criteria). In the context of Linked Data, this technique is typically used to support the provision of document representations in formats suitable for a human reader (HTML, XHTML) and in one or more RDF syntaxes (usually, at least, RDF/XML). (The emergence of the RDFa syntax, which enables the embedding of RDF data in HTML/XHTML documents, and the growing support for RDFa in RDF tools, offers the possibility, in principle at least, of a single format serving both purposes.)
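The selection logic on the server side can be sketched very roughly as follows (a deliberately simplified illustration - real servers implement negotiation far more completely, including q-values and wildcards, and the representation filenames are just this post's examples):

```python
# Minimal sketch of server-side content negotiation, assuming the server
# holds an HTML and an RDF/XML representation of each description document.

AVAILABLE = {
    "text/html": "doc.html",
    "application/rdf+xml": "doc.rdf",
}

def choose_representation(accept_header, default="text/html"):
    """Pick the first media type in the Accept header that the server can
    supply (q-values and wildcards ignored for brevity)."""
    for part in accept_header.split(","):
        media_type = part.split(";")[0].strip()
        if media_type in AVAILABLE:
            return AVAILABLE[media_type]
    return AVAILABLE[default]

print(choose_representation("application/rdf+xml,text/html;q=0.9"))  # doc.rdf
print(choose_representation("image/png"))                            # doc.html
```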

The widespread use of the HTTP protocol and tools that support it mean that these techniques are widely available (in theory at least; experience suggests that in practice the ability (or authority) to set up things like HTTP redirects (more below) can create something of a barrier). It also means that the "Web of Linked Data" is part of the existing Web of documents that we are accustomed to navigating using a Web browser.

One of the principles underpinning RDF's use of URIs as names is that we should try to avoid ambiguity in our use of those names, i.e. we should use different URIs for different things, and avoid using the same URI as a name for two different things. One of the issues I've slightly glossed over in the last few paragraphs is the distinction between a "thing" and a document describing that thing as two different resources. After all, if I provide a page describing the Mona Lisa, both the page and the Mona Lisa have creators, creation dates, terms of use, but they have different creators, creation dates and terms of use. And if I want to provide such information in RDF, then I need to take care to avoid confusing the two objects - by using two different URIs, one for my document and one for the painting, and citing those URIs in my RDF statements accordingly.

However, as I emphasise above, we also want to be in a position where, given only a "thing URI", I can obtain a document describing that thing: I shouldn't need in advance information about a second URI, the URI of a document about that thing.

The W3C Note Cool URIs for the Semantic Web describes some possible approaches to addressing this issue, broadly using two different techniques:

  • the use of URIs containing "fragment identifiers" ("hash URIs") (i.e. URIs of the form http://example.org/doc/123#thing). In this case, the "fragment identifier" part of the URI is always "trimmed" from the URI when the client makes a request to the server, and this allows the use of the URI with fragment identifier as "thing URI", leaving the trimmed URI without fragment id as a document URI.
  • the use of a convention of HTTP "redirects". In this case, when a server receives a request for a URI which it "knows" is a "thing URI" rather than a document URI, it returns a response which provides a second URI as a source of information about the thing, and the client then sends a second request for that second URI. Formally, the initial response uses the HTTP "303 See Other" status code, which sometimes leads to these colloquially being called "303 URIs", even though there's nothing special about the URIs themselves.

I'm conscious that I'm skipping over some of the details here; for a more detailed description, particularly of the "flow" of the interactions involved, and some consideration of the pros and cons of the two approaches, see Cool URIs for the Semantic Web.
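To make the redirect convention concrete, here is a rough sketch of the server-side decision, using the /id/ and /doc/ URI patterns adopted later in this post (illustration only - a real deployment would do this in the web server configuration, and error handling is omitted):

```python
# Sketch of the "303 redirect" convention: a request for a "thing URI"
# under /id/ gets a 303 response pointing at the corresponding document
# URI under /doc/; a request for the document URI itself gets a 200.

def respond(path):
    """Return (status, location) for a request path on example.org."""
    if path.startswith("/id/"):
        # Thing URI: redirect the client to the describing document
        return 303, "/doc/" + path[len("/id/"):]
    if path.startswith("/doc/"):
        return 200, path
    return 404, None

print(respond("/id/concept/polthes/C2"))   # (303, '/doc/concept/polthes/C2')
print(respond("/doc/concept/polthes/C2"))  # (200, '/doc/concept/polthes/C2')
```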

URI Sets for the UK Public Sector

The Cool URIs note focuses mainly on the patterns of "interaction" for handling the two approaches to moving from "thing URI" to document URI. Its examples include example URIs, but the exact form of those URIs is intended to be illustrative rather than prescriptive. I think it's important to note that in the redirect case, it is the server's notification to the client of the second URI that provides the client with that information. There is no technical requirement for a structural similarity in the forms of the "thing URI" and the document URI, and consumers of the URIs should rely on the information provided to them by the server rather than making assumptions about URI structure.

Having said that, the use of a shared, consistent set of URI patterns within a community can provide some useful "cues" to human readers of those URIs. It can also simplify the work of data providers - for example by facilitating the use of similar HTTP server configurations or the reuse of scripts/tools for serving "Linked Data" documents. With this (and other factors such as URI stability) in mind, the UK Cabinet Office has provided a set of guidelines, Designing URI Sets for the UK Public Sector which build on the W3C Cool URIs note, but offer more specific guidance, particularly on the design of URIs.

For the purposes of this discussion, of particular interest is the document's specification (in the "Definitions, frameworks and principles" section) of several distinct "types of URI", or perhaps more accurately, URIs for different categories of resource, and (in the "The path structure for URIs" section) of suggested structural patterns for each:

  • Identifier URIs (what I have been calling above "thing URIs") name "real-world things" and should use patterns like:
    • http://{domain}/{concept}/{reference}#id or
    • http://{domain}/id/{concept}/{reference}
    where:
    • {concept} is "a word or string to capture the essence of the real-world 'Thing' that the set names e.g. school". (This is roughly what I think of as the name of a "resource type" - note that this is a more generic use of the word "concept" than that of the SKOS concept)
    • {reference} is "a string that is used by the set publisher to identify an individual instance of concept".
    The document allows for the use of a hierarchy of concept-reference pairs in a single URI if appropriate, so for a specific class within a specific school, the path might be /id/school/123/class/5
  • Document URIs name the documents that provide information about, descriptions of, "real-world things". The suggested pattern is
    • http://{domain}/doc/{concept}/{reference}
    These documents are, I think, what Berners-Lee calls Generic Resources. For each such document, multiple representations may be available, each in a different format, and each of those "more specific" concrete formats may be available as a separate resource in its own right (see "Representation URIs" below). If descriptions vary over time, and those variants are to be exposed, then a series of "date-stamped" URIs can be used, with the pattern
    • http://{domain}/doc/{concept}/{reference}/{yyyy-mm-dd}
  • Representation URIs name a document in a specific format. The suggested pattern is
    • http://{domain}/doc/{concept}/{reference}/{doc.file-extension}
    This can also be applied to a date-stamped version:
    • http://{domain}/doc/{concept}/{reference}/{yyyy-mm-dd}/{doc.file-extension}

The guidelines also distinguish a category of "Ontology URIs" which use the pattern http://{domain}/def/{concept}. I had interpreted "Ontology URIs" as applying to the identification of classes and properties, and I was treating the terms/concepts of a thesaurus as "conceptual things" which would fall under the /id/ case. But I do notice that in an example in which she refers to these guidelines, Jeni Tennison uses the /def/ pattern for a SKOS example. I don't think it's really much of an issue - and pretty much all the other points I discuss apply anyway - but any advice on this point would be appreciated.

So, applying these general rules for the thesaurus case, where, as I discussed in the previous post, the primary types of thing of interest in our SKOS-modelled thesaurus are "concepts" and "terms":

  • Term URI Pattern: http://example.org/id/term/T{termid}
  • Concept URI Pattern: http://example.org/id/concept/C{termid}

However, if we bear in mind that within the URI space of the example.org domain, we're likely to want to represent, and coin URIs for the components of, multiple thesauri, and our "termid" references (drawn from the term numbers in the input) are unique only within the scope of a single thesaurus, then we should include some sort of thesaurus-specific component in the path to "qualify" those term numbers. Let's use the token "polthes" for this example:

  • Term URI Pattern: http://example.org/id/term/{schemename}/T{termid}
    Example: http://example.org/id/term/polthes/T2
  • Concept URI Pattern: http://example.org/id/concept/{schemename}/C{termid}
    Example: http://example.org/id/concept/polthes/C2

We should also include a URI for the thesaurus as a whole. The SKOS model provides a generic class of "concept scheme" to cover aggregations of concepts:

  • Concept Scheme URI Pattern: http://example.org/id/concept-scheme/{schemename}
    Example: http://example.org/id/concept-scheme/polthes

where each concept and term in the thesaurus is linked to this concept scheme by a triple using the skos:inScheme property. (I omitted this from the example in the previous post so that it was easier to focus on the concept-term and concept-concept relationships, and to try to keep the already rather complex diagrams slightly readable!)
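The patterns above can be captured in a few lines (a sketch built directly on the example patterns in this post - the example.org domain and "polthes" token are, of course, just placeholders):

```python
# Sketch: generate the Identifier URIs for a thesaurus's terms, concepts
# and concept scheme from the scheme name and term number, following the
# non-hierarchical /id/ patterns adopted above.

BASE = "http://example.org"

def term_uri(scheme, termid):
    return f"{BASE}/id/term/{scheme}/T{termid}"

def concept_uri(scheme, termid):
    # Concepts are numbered by the term number of their preferred term
    return f"{BASE}/id/concept/{scheme}/C{termid}"

def concept_scheme_uri(scheme):
    return f"{BASE}/id/concept-scheme/{scheme}"

print(term_uri("polthes", 2))         # http://example.org/id/term/polthes/T2
print(concept_uri("polthes", 2))      # http://example.org/id/concept/polthes/C2
print(concept_scheme_uri("polthes"))  # http://example.org/id/concept-scheme/polthes
```

Centralising the patterns like this also makes it trivial to switch to the hierarchical alternative discussed next, should that prove preferable.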

Aside: An alternative for the concept and term URI patterns would be to use the "hierarchical concept-reference" approach and use patterns like:

  • Term URI Pattern: http://example.org/id/concept-scheme/{schemename}/term/T{termid}
    Example: http://example.org/id/concept-scheme/polthes/term/T2
  • Concept URI Pattern: http://example.org/id/concept-scheme/{schemename}/concept/C{termid}
    Example: http://example.org/id/concept-scheme/polthes/concept/C2

My only slight misgiving about this approach is that (bearing in mind that strictly speaking the URIs should be treated as opaque and such information obtained from the descriptions provided by the server) in the (non-hierarchical) form I suggested initially, the string indicating the resource type ("concept", "term") is fairly clear to the human reader from its position following the "/id/" component in the URI (e.g. http://example.org/id/concept/polthes/C2). But with the hierarchical form, it perhaps becomes slightly less clear (e.g. http://example.org/id/concept-scheme/polthes/concept/C2). But that is a minor gripe, and really the hierarchical form would serve just as well. For the remainder of this document, in the examples, I'll continue with the initial non-hierarchical pattern I suggested above, but it may be something to revisit if the hierarchical form is more in line with the intent - and current usage - of the guidelines. (So again, comments are welcome on this point.)

For each of these "Identifier URIs", there should be a corresponding "Document URI" naming a document describing the thing, and following the /doc/ pattern:

  • Description of Concept Scheme: http://example.org/doc/concept-scheme/polthes
  • Description of Term: http://example.org/doc/term/polthes/T{termid}
  • Description of Concept: http://example.org/doc/concept/polthes/C{termid}

And for each format in which the description is available, a corresponding "Representation URI":

  • Description of Concept Scheme (HTML): http://example.org/doc/concept-scheme/polthes/doc.html
  • Description of Concept Scheme (RDF/XML): http://example.org/doc/concept-scheme/polthes/doc.rdf
  • Description of Concept (HTML): http://example.org/doc/concept/polthes/C{termid}/doc.html
  • Description of Concept (RDF/XML): http://example.org/doc/concept/polthes/C{termid}/doc.rdf
  • Description of Term (HTML): http://example.org/doc/term/polthes/T{termid}/doc.html
  • Description of Term (RDF/XML): http://example.org/doc/term/polthes/T{termid}/doc.rdf
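For what it's worth, the way these three layers of URI typically fit together at the HTTP level (an /id/ URI 303-redirects to the corresponding /doc/ URI, and content negotiation on the /doc/ URI selects a representation URI) can be sketched roughly as follows - the function names and the very crude Accept-header matching are mine, purely for illustration:

```python
# Illustrative sketch of the usual "303 redirect" routing for these URIs.
# A real server would do proper content negotiation; this is a toy.

def doc_uri_for(id_uri):
    # the /id/ "thing URI" 303-redirects to the /doc/ "document URI"
    return id_uri.replace("/id/", "/doc/", 1)

def representation_uri(doc_uri, accept):
    # pick a format-specific "representation URI" from the Accept header
    ext = "rdf" if "rdf" in accept else "html"
    return f"{doc_uri}/doc.{ext}"

id_uri = "http://example.org/id/concept/polthes/C2"
doc = doc_uri_for(id_uri)  # target of the 303 redirect
print(doc)
print(representation_uri(doc, "application/rdf+xml"))
```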

Descriptions and "boundedness"

The three documents I've mentioned so far (Berners-Lee's Linked Data Design Issues note, the W3C Cool URIs document, and the Cabinet Office URI patterns document) don't have a great deal to say on the topic of the content of the document which is returned as a description of a "thing". This is discussed briefly in the "Linked Data Tutorial" document by Chris Bizer, Richard Cyganiak and Tom Heath, How to Publish Linked Data on the Web.

In principle at least, it is quite possible to provide a single document which describes several resources. This approach has been quite common in association with the use of "hash URIs" in a pattern where a number of "thing URIs" differ only by fragment identifier, and share the same "non-fragment" part (http://example.org/school#1, http://example.org/school#2, ... http://example.org/school#99), and a number of common ontologies make use of this sort of approach. One consequence is that a client interested only in a single resource always retrieves the full set of descriptions. If my thesaurus really did consist only of the half-dozen concepts and terms I described in the example in my previous post, retrieving a document describing them all would probably not be a problem, but for the "real world" case where there are several thousand terms involved, it would represent a significant overhead if every request results in the transfer of several megabytes of data.

Generally, the approach taken is for the data provider to generate some set of "useful information" "about" the requested resource - though saying that rather begs the question of what constitutes "useful" (and whether there is a single answer to that question that is applicable across different datasets dealing with different resource types). Typically the generation of a description is based on some set of rules which, for a specified node in the dataset RDF graph (a specified "thing URI"), selects a "nearby" subgraph of the graph, representing a "bounded description" made up of triples/statements "about" the thing itself and maybe also "about" closely related resources.

Various algorithms for generating such descriptions are possible and I don't intend to attempt any sort of rigorous analysis or comparison of them here - for further discussion see e.g. Patrick Stickler's CBD - Concise Bounded Description or Bounded Descriptions in RDF from the Talis Platform wiki. But there is one aspect which I think is worth mentioning in the context of the thesaurus example. One of the key differences between the algorithms used to generate descriptions is how they treat the "directionality" of arcs in the RDF graph, i.e. whether they base the description only on arcs "outbound from" the selected node (RDF triples with that URI as subject), or whether they take into account both arcs "outbound from" and "inbound to" the node (triples with the URI as either subject or object).

That probably sounds like a very abstract point, and the significance is perhaps best illustrated through a concrete example. Let's take the graph for the example from my previous post (tweaked to use the slightly amended URI patterns above - I've continued to leave out the concept scheme links to keep things simple) and suppose this is the dataset to which I'm applying the rules.

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix con: <http://example.org/id/concept/polthes/> .
@prefix term: <http://example.org/id/term/polthes/> .

term:T1 a skosxl:Label;
        skosxl:literalForm "Civil violence"@en.

con:C2 a skos:Concept;
       skos:prefLabel "Political violence"@en;
       skosxl:prefLabel term:T2;
       skos:altLabel "Civil violence"@en;
       skosxl:altLabel term:T1;
       skos:altLabel "Violent protest"@en;
       skosxl:altLabel term:T5;
       skos:broader con:C4;
       skos:narrower con:C3 .

term:T2 a skosxl:Label;
        skosxl:literalForm "Political violence"@en.

con:C3 a skos:Concept;
       skos:prefLabel "Terrorism"@en;
       skosxl:prefLabel term:T3;
       skos:broader con:C2 .

term:T3 a skosxl:Label;
        skosxl:literalForm "Terrorism"@en.
       
con:C4 a skos:Concept;
       skos:prefLabel "Violence"@en;
       skosxl:prefLabel term:T4;
       skos:narrower con:C2 .

term:T4 a skosxl:Label;
        skosxl:literalForm "Violence"@en.

term:T5 a skosxl:Label;
        skosxl:literalForm "Violent protest"@en.

And in graphical form (as before with the rdf:type triples omitted):

Fig1

(In the figures below, I've tried to represent the idea that a subgraph is being selected by "fading out" the parts which aren't selected, and leaving the selected part fully visible. I hope the images are sufficiently clear for this to be effective!)

Let's first take the approach known as the "concise bounded description (CBD)" - formally defined here, but essentially based on "outbound" links. For the concept C2 (http://example.org/id/concept/polthes/C2), the CBD would consist of the following subgraph (i.e. the document http://example.org/doc/concept/polthes/C2 would contain this data):

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix con: <http://example.org/id/concept/polthes/> .
@prefix term: <http://example.org/id/term/polthes/> .

con:C2 a skos:Concept;
       skos:prefLabel "Political violence"@en;
       skosxl:prefLabel term:T2;
       skos:altLabel "Civil violence"@en;
       skosxl:altLabel term:T1;
       skos:altLabel "Violent protest"@en;
       skosxl:altLabel term:T5;
       skos:broader con:C4;
       skos:narrower con:C3 .
Fig2

For the term T2 (http://example.org/id/term/polthes/T2), corresponding to the "preferred label" (i.e. the document http://example.org/doc/term/polthes/T2 would contain):

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix con: <http://example.org/id/concept/polthes/> .
@prefix term: <http://example.org/id/term/polthes/> .

term:T2 a skosxl:Label;
        skosxl:literalForm "Political violence"@en.
Fig3

For the term T5 (http://example.org/id/term/polthes/T5), corresponding to the "alternate label" (i.e. the document http://example.org/doc/term/polthes/T5 would contain):

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix con: <http://example.org/id/concept/polthes/> .
@prefix term: <http://example.org/id/term/polthes/> .

term:T5 a skosxl:Label;
        skosxl:literalForm "Violent protest"@en.
Fig4

Note that for the two terms, the "concise bounded description" is quite minimal (though remember I've simplified it a bit): in particular, it does not include the relationship between the term and the concept. This is because, using the SKOS-XL vocabulary, that relationship is expressed as a triple in which the concept URI is the subject and the term URI is the object - an "inbound arc" to the term URI in the graph - which the CBD approach does not take into account when constructing the description of the term.

But the fact that the relationship is represented only in this way - a link from concept to term, without an inverse term to concept link - is arguably slightly arbitrary.

An alternative approach, the "symmetric bounded description", seeks to address this sort of issue by taking into account both "outbound" and "inbound" arcs. For the same three cases, it produces the following results:

Concept C2:

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix con: <http://example.org/id/concept/polthes/> .
@prefix term: <http://example.org/id/term/polthes/> .

con:C2 a skos:Concept;
       skos:prefLabel "Political violence"@en;
       skosxl:prefLabel term:T2;
       skos:altLabel "Civil violence"@en;
       skosxl:altLabel term:T1;
       skos:altLabel "Violent protest"@en;
       skosxl:altLabel term:T5;
       skos:broader con:C4;
       skos:narrower con:C3 .

con:C3 skos:broader con:C2 .

con:C4 skos:narrower con:C2 .
Fig5

Term T2:

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix con: <http://example.org/id/concept/polthes/> .
@prefix term: <http://example.org/id/term/polthes/> .

term:T2 a skosxl:Label;
        skosxl:literalForm "Political violence"@en.

con:C2 skosxl:prefLabel term:T2 .
Fig6

Term T5:

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix con: <http://example.org/id/concept/polthes/> .
@prefix term: <http://example.org/id/term/polthes/> .

term:T5 a skosxl:Label;
        skosxl:literalForm "Violent protest"@en.

con:C2 skosxl:altLabel term:T5 .
Fig7

For the concept case, the difference is relatively minor (for the skos:broader and skos:narrower relationships, the inverse triples are now also included). But for the term cases, the relationship between concept and term is now included.

So (rather long-windedly, I fear!), I'm trying to illustrate that it's worth thinking a little bit about the content of descriptions and how they work as "stand-alone" documents (albeit linked to others). And for this dataset, I think there's an argument that generating "symmetric" descriptions that include inbound links as well as outbound ones probably results in more "useful information" for the consumer of the data.
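The difference between the two selection rules can be sketched in a few lines of Python over a toy triple list (a simplified subset of the graph above; a real implementation would of course work over an RDF store, and the full CBD definition also follows blank nodes, which I've ignored here):

```python
# Toy illustration of "outbound only" (CBD-style) vs "outbound + inbound"
# (symmetric) description selection, over (subject, predicate, object) tuples.
# This is a simplified subset of the example graph, not real tooling.

triples = [
    ("con:C2", "skosxl:prefLabel", "term:T2"),
    ("con:C2", "skosxl:altLabel", "term:T5"),
    ("con:C2", "skos:broader", "con:C4"),
    ("con:C3", "skos:broader", "con:C2"),
    ("term:T5", "skosxl:literalForm", "Violent protest"),
]

def cbd(node):
    # outbound arcs only: triples with the node as subject
    return [t for t in triples if t[0] == node]

def symmetric(node):
    # outbound and inbound arcs: node as subject or object
    return [t for t in triples if t[0] == node or t[2] == node]

# for the label term:T5, only the symmetric form includes the
# concept-to-term link (the "inbound" skosxl:altLabel arc)
print(cbd("term:T5"))
print(symmetric("term:T5"))
```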

(Again, I'm simplifying things slightly here to illustrate the point: I've omitted type information and the links to indicate concept scheme membership. Typically the descriptions might (depending on the algorithm) include labels for related resources mentioned, rather than just the URIs, and would include some metadata about the document - its publisher, last modification date, licence/rights information, a link to the dataset of which it is a member, and so on.)

Summary

What I've tried to do in this post is expand on some of the "linked data"-specific aspects of the project, and to examine some of the design choices to be made in applying some of those general rules to this particular case, shaped both by external factors (like the Cabinet Office guidelines on URIs) and by characteristics of the data itself (like the directionality of links made using SKOS-XL). In the next post, I'll move on, as promised, to the questions of how the data changes over time, and any implications of that.

February 25, 2011

RDTF metadata guidelines - Limp Data or Linked Data?

Having just finished reading thru the 196 comments we received on the draft metadata guidelines for the UK RDTF I'm now in the process of wondering where we go next. We (Pete and I) have relatively little effort to take this work forward (a little less than 5 days to be precise) so it's not clear to me how best to use that effort to get something useful out for both RDTF and the wider community.

By the way... many thanks to everyone who took the time to comment. There are some great contributions and, if nothing else, the combination of the draft and the comments form a useful snapshot of the current state of issues around library, museum and archival metadata in the context of the web.

Here's my brief take on what the comments are saying...

Firstly, there were several comments asking about the target audience for the guidelines and whether, as written, they will be meaningful to... well... anyone I guess! It's worth pointing out that my understanding is that any guidelines we come up with thru the current process will be taken forward as part of other RDTF work. What that means is that the guidelines will get massaged into a form (or forms) that are digestible by the target audience (or audiences), as determined by other players within the RDTF activity. What we have been tasked with are the guidelines themselves - not how they are presented. We perhaps should have made this clearer in the draft guidelines. In short, I don't think the document, as written, will be put directly in front of anyone who doesn't go to the trouble of searching it out explicitly.

Secondly, there were quite a number of detailed comments on particular data formats, data models, vocabularies and so on. This is great and I'm hopeful that as a result we can either extend the list of examples given at various points in the guidelines or, in some cases, drop back to not having examples and simply say, "do whatever is the emerging norm here in your community".

Thirdly, there were some concerns about what we meant by "open". As we tried to point out in the draft, we do not consider this to be our problem - it is for other activity in RDTF to try and work out what "open" means - we just felt the need to give that word a concrete definition, so that people could understand where we were coming from for the purposes of these guidelines.

Finally, there were some bigger concerns - these are the things that are taxing me right now - that broadly fell into two, related, camps. Firstly, that the step between the community formats approach and the RDF data approach is too large (though no-one really suggested what might go in the middle). And secondly, that we are missing a trick by not encouraging the assignment of 'http' URIs to resources as part of the community formats approach.

As it stands, we have, on the one hand, what one might call Limp Data (MARC records, XML files, CSV, EAD and the rest) and, on the other, Linked Data and all that entails, with a rather odd middle ground that we are calling RDF data (in the current guidelines).

I was half hoping that someone would simply suggest collapsing our RDF data and Linked Data approaches into one - on the basis that separating them into two is somewhat confusing (but as far as I can tell no-one did... OK, I'm doing it now!). That would leave a two-pronged approach - community formats and Linked Data - to which we could add a 'community formats with http URIs' middle ground. My gut feel is that there is some attraction in such an approach but I'm not sure how feasible it is given the characteristics of many existing community formats.

As part of his commentary around encouraging http URIs (building a 'better web' was how he phrased it), Owen Stephens suggested that every resource should have a corresponding web page. I don't disagree with this... well, hang on... actually I do (at least in part)! One of the problems faced by this work is the fundamental difference between the library world and museums and archives. The former is primarily dealing with non-unique resources (at the item level), the latter with unique resources. (I know that I'm simplifying things here but bear with me). Do I think that resource discovery will be improved if every academic library in the UK (or indeed in the world) creates an http URI for every book they hold at which they serve a human-readable copy of their catalogue record? No, I don't. If the JISC and RLUK really want to improve web-scale resource discovery of books in the library sector, they would be better off spending their money encouraging libraries to sign up to OCLC WorldCat and contributing their records there. (I'm guessing that this isn't a particularly popular viewpoint in the UK - at least, I'm not sure that I've ever heard anyone else suggest it - but it seems to me that WorldCat represents a valuable shared service approach that will, in practice, be hard to beat in other ways.) Doing this would both improve resource discovery (e.g. thru Google) and provide a workable 'appropriate copy' solution (for books). Clearly, doing this wouldn't help build a more unified approach across the GLAM domains but, as at least one commenter pointed out, it's not clear that the current guidelines do either. Note: I do agree with Owen that every unique museum and archival resource should have an http URI and a web page.

All of which, as I say, leaves us with a headache in terms of how we take these guidelines forward. Ah well... such is life I guess.

February 22, 2011

SALDA

As I've mentioned here before, I'm contributing to a project called LOCAH, funded by the JISC under its 02/10 call Deposit of research outputs and Exposing digital content for education and research (JISCexpo), working with MIMAS and UKOLN on making available bibliographic and archival metadata as linked data.

Partly as a result of that work, I was approached by Chris Keene from the University of Sussex to be part of a bid they were preparing to another recent JISC call, 15/10: Infrastructure for education and research programme, under the "Infrastructure for resource discovery" strand, which seeks to implement some of the approaches outlined by the Resource Discovery Task Force.

The proposal was to make available metadata records from the Mass Observation Archive, data currently managed in a CALM archival management system, as linked data. I'm pleased to report that the bid was successful, and the project, Sussex Archive Linked Data Application (SALDA), has been funded. It's a short project, running for six months from now (February) to July 2011. There's a brief description on the JISC Web site here, a SALDA project blog has just been launched, and the project manager Karen Watson provides more details of the planned work in her initial post there.

I'm looking forward to working with the Sussex team to adapt and extend some of the work we've done with LOCAH for a new context. I expect most information will appear over on the SALDA blog, but I'll try to post the occasional update on progress here, particularly on any aspects of general interest.

February 11, 2011

Term-based thesauri and SKOS (Part 1)

I'm currently doing a piece of work on representing a thesaurus as linked data. I'm working on the basis that the output will make use of the SKOS model/RDF vocabulary. Some of the questions I'm pondering are probably SKOS Frequently Asked Questions, but I thought it was worth working through my proposed solution here, partly just to document my own thought processes and partly in the hope that SKOS implementers with more experience than me might provide some feedback or pointers.

SKOS adopts a "concept-based" approach (i.e. the primary focus is on the description of "concepts" and the relationships between them); the source thesaurus uses a "term-based" approach based on the ISO 2788 standard. Several sources provide helpful summaries of the differences between these two approaches.

Briefly (and simplifying slightly for the purposes of this discussion - I've omitted discussion of documentary attributes like "scope notes"), in the term-based model (and here I'm dealing only with the case of a simple monolingual thesaurus):

  • the only entities considered are "terms" (words or sequences of words)
  • terms can be semantically equivalent, each expressing a single concept, in which case they are distinguished as "preferred terms"/descriptors or "non-preferred terms"/non-descriptors, using USE (non-preferred to preferred) and UF ("use for", preferred to non-preferred) relations. Each non-preferred term is related to a single preferred term; a preferred term may have many non-preferred terms
  • a preferred term can be related by ("hierarchical") BT (broader term) and NT (narrower term) relations to indicate that it is more specific in meaning, or more general in meaning, than another term
  • a preferred term can also be related by an ("associative") RT (related term) relation to a second term, where there is some other relationship which may be useful for retrieval (e.g. overlapping meanings)

In the SKOS (concept-based) model:

  • the primary entities considered are "concepts", "units of thought"
  • a concept is associated with labels, of which at most one (in the monolingual case) is the preferred label, the others alternative labels or hidden labels
  • a concept can be related by "has broader", "has narrower" and "has related" relations to other concepts

The concept-based model, then, is explicit that the thesaurus "contains" two distinct types of thing: concepts and labels. In the base SKOS model, the labels are modelled as RDF literals, so the expression of relationships between labels is not supported. SKOS-XL provides an extension to SKOS which models labels as RDF resources in their own right, which can be identified, described and linked.

To represent a term-based thesaurus in SKOS, then, it is necessary to make a mapping from the term-based model with its term-term relationships to the concept-based model with its concept-label and concept-concept relationships. (An alternative approach would be to represent the thesaurus in RDF using a model/vocabulary which expresses directly the "native" "term-based" model of the source thesaurus. I'm working on the basis that using the SKOS/concept-based approach will facilitate interoperability with other thesauri/vocabularies published on the Web.)

Core Mapping

The key principle underlying the mapping is the notion that, in the term-based approach, a preferred term and its multiple related non-preferred terms are acting as labels for a single concept.

Each set of terms related by USE/UF relations in the term-based approach, then, is mapped to a single SKOS concept, with the single "preferred term" becoming the "preferred label" for that concept and the (possibly multiple) "non-preferred terms" each becoming "alternative labels" for that same concept.
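That grouping step can be sketched in a few lines of Python - the input structure and function name are purely illustrative, not part of any SKOS tooling:

```python
# Sketch of the core mapping: each cluster of terms related by USE/UF
# becomes one concept, the preferred term its prefLabel and the
# non-preferred terms its altLabels. The input structure is illustrative:
# each term maps to the preferred term it USEs (None = itself preferred).

terms = {
    "Civil violence": "Political violence",
    "Political violence": None,
    "Violent protest": "Political violence",
    "Terrorism": None,
}

def to_concepts(terms):
    concepts = {}
    for term, use in terms.items():
        preferred = use or term
        entry = concepts.setdefault(preferred, {"prefLabel": preferred, "altLabel": []})
        if use is not None:
            entry["altLabel"].append(term)
    return concepts

print(to_concepts(terms)["Political violence"])
```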

Consider the following example, where the term "Political violence" is a preferred term for "Civil violence" and "Violent protest", and has a broader term "Violence" and narrower term "Terrorism":


Civil violence
USE Political violence

Political violence
UF Civil violence
UF Violent protest
BT Violence
NT Terrorism

Terrorism
BT Political violence

Violence
NT Political violence

Violent protest
USE Political violence

(Leaving aside for a moment the question of how the URIs might be generated) an SKOS representation takes the form:


@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix con: <http://example.org/id/concept/> .

con:C2 a skos:Concept;
       skos:prefLabel "Political violence"@en;
       skos:altLabel "Civil violence"@en;
       skos:altLabel "Violent protest"@en;
       skos:broader con:C4;
       skos:narrower con:C3 .

con:C3 a skos:Concept;
       skos:prefLabel "Terrorism"@en;
       skos:broader con:C2 .

con:C4 a skos:Concept;
       skos:prefLabel "Violence"@en;
       skos:narrower con:C2 .

A graphical representation perhaps makes the links clearer (to keep things simple, I've omitted the arcs and nodes corresponding to the rdf:type triples here):

Fig1

One of the consequences of this approach is that some of the "direct" relationships between terms are reflected in SKOS as "indirect" relationships. In the term-based model, a non-preferred term is linked to a preferred term by a simple "USE" relationship. In the SKOS example, to find the "preferred term" for the term "Violent protest", one finds the node for the concept to which it is linked via the skos:altLabel property, and then locates the literal which is linked to that concept via an skos:prefLabel property.
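The two-step lookup can be sketched like this (the dictionary-based "graph" is just for illustration - in practice this would be a SPARQL query or a traversal over the RDF graph):

```python
# Illustration of the indirect lookup: from a non-preferred label, find
# the concept that has it as an altLabel, then read off its prefLabel.
# The hard-coded data mirrors the example in the post.

concepts = {
    "con:C2": {
        "prefLabel": "Political violence",
        "altLabel": ["Civil violence", "Violent protest"],
    },
}

def preferred_for(label):
    for concept in concepts.values():
        if label in concept["altLabel"]:
            return concept["prefLabel"]
    return None

print(preferred_for("Violent protest"))
```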

SKOS-XL and terms as Labels

As mentioned above, the SKOS specification defines an optional extension to SKOS which supports the representation of labels as resources in their own right, typically alongside the simple literal representation.

Using this extension, the example above might be expressed as:


@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix con: <http://example.org/id/concept/> .
@prefix term: <http://example.org/id/term/> .

term:T1 a skosxl:Label;
        skosxl:literalForm "Civil violence"@en.

con:C2 a skos:Concept;
       skos:prefLabel "Political violence"@en;
       skosxl:prefLabel term:T2;
       skos:altLabel "Civil violence"@en;
       skosxl:altLabel term:T1;
       skos:altLabel "Violent protest"@en;
       skosxl:altLabel term:T5;
       skos:broader con:C4;
       skos:narrower con:C3 .

term:T2 a skosxl:Label;
        skosxl:literalForm "Political violence"@en.

con:C3 a skos:Concept;
       skos:prefLabel "Terrorism"@en;
       skosxl:prefLabel term:T3;
       skos:broader con:C2 .

term:T3 a skosxl:Label;
        skosxl:literalForm "Terrorism"@en.
       
con:C4 a skos:Concept;
       skos:prefLabel "Violence"@en;
       skosxl:prefLabel term:T4;
       skos:narrower con:C2 .

term:T4 a skosxl:Label;
        skosxl:literalForm "Violence"@en.

term:T5 a skosxl:Label;
        skosxl:literalForm "Violent protest"@en.

And in graphical form (as above with the rdf:type triples omitted):

Fig2

That may be rather a lot to take in, so below I've shown a subgraph, including only the labels related to a single concept:

Fig3

Using the SKOS-XL extension in this way does introduce some additional complexity. On the other hand:

  1. it introduces a set of resources (of type skosxl:Label) that have a one-to-one correspondence to the terms in the source thesaurus, so perhaps makes the mapping between the two models more explicit and easier for a human reader to understand.
  2. it makes the labels themselves URI-identifiable resources which can be referenced in this data and in data created by others. So, it becomes possible to make assertions about relationships between the labels, or between labels and other things.

Coincidentally, as I was writing this post, I noticed that Bob duCharme has posted on the use of SKOS-XL for providing metadata "about" the individual labels, distinct from the concepts. So I might add a triple to indicate, say, the date on which a particular label was created. I don't think there is an immediate requirement for that in the case I'm working on, but there may be in the future.

Another possible use case is the ability for other parties to make links specifically to a label, rather than to the concept which it labels. A document could be "tagged", for example, by association with a particular label from a particular thesaurus, rather than just with the plain literal.

Relationships between Labels

The SKOS-XL vocabulary defines only a single generic relationship between labels (the skosxl:labelRelation property) with the expectation that implementers define subproperties to handle more specific relationship types.

The example given of such a relationship in the SKOS spec is that of a case where one label is an acronym for another label.

One thing I wondered about was introducing properties here to reflect the USE/UF relations in the term-based model e.g. in the above example:


@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix con: <http://example.org/id/concept/> .
@prefix term: <http://example.org/id/term/> .
@prefix ex: <http://example.org/prop/> .

term:T1 a skosxl:Label;
        skosxl:literalForm "Civil violence"@en;
        ex:use term:T2 .

term:T2 a skosxl:Label;
        skosxl:literalForm "Political violence"@en;
        ex:useFor term:T1 ;
        ex:useFor term:T5 .
       
term:T5 a skosxl:Label;
        skosxl:literalForm "Violent protest"@en;
        ex:use term:T2 .

This wouldn't really add any new information; rather it just provides a "shortcut" for the information that is already present in the form of the "indirect" preferred label/alternative label relationships, as noted above.
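Deriving the shortcut triples from the existing prefLabel/altLabel data is mechanical, which is part of why they feel redundant. Roughly (using my hypothetical ex:use/ex:useFor property names from the example above):

```python
# Sketch of deriving the (redundant) use/useFor shortcut triples from the
# skosxl:prefLabel / skosxl:altLabel data. ex:use and ex:useFor are the
# hypothetical properties from the example, not part of SKOS-XL itself.

pref = {"con:C2": "term:T2"}              # concept -> skosxl:prefLabel
alt = {"con:C2": ["term:T1", "term:T5"]}  # concept -> skosxl:altLabel(s)

def shortcut_triples():
    out = []
    for concept, preferred in pref.items():
        for nonpreferred in alt.get(concept, []):
            out.append((nonpreferred, "ex:use", preferred))
            out.append((preferred, "ex:useFor", nonpreferred))
    return out

for t in shortcut_triples():
    print(t)
```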

I hesitate slightly about adding this data, partly on the grounds that it seems slightly redundant, but also on the grounds that this seems like a rather different "category" of relationship from the "is acronym for" relationship. This point is illustrated in figure 4 in the SWAD-E Review of Thesaurus Work document, which divides the concept-based model into three layers:

  • the (upper in the diagram) "conceptual" layer, with the broader/narrower/related relations between concepts
  • the (lower) "lexical" layer, with terms/labels and the relations between them
  • and a middle "indication"/"implication" layer for the preferred/non-preferred relations between concepts and terms/labels

In that diagram, the example lexical relationships exist independently of the preferred/non-preferred relationships e.g. "AIDS" is an abbreviation for "Acquired Immunodeficiency Syndrome", regardless of which is considered the preferred term in the thesaurus. With the use/useFor relationships here, this would not be the case; they would indeed vary with the preferred/non-preferred relationships.

So, having thought initially it might be useful to include these relationships, I'm now becoming rather less sure.

And without them, I'm not sure whether, for this case, the use of SKOS-XL would be justified - though it may well be, for the other purposes I mentioned above. So, any thoughts on practice in this area would be welcome.

URI Patterns

Following the linked data principles, I want to assign http URIs for all the resources, so that descriptions of those resources can be provided using HTTP. If we go with the SKOS-XL approach, that means assigning http URIs for both concepts and labels.

The input data includes, for each term within the thesaurus, an identifier code, unique within the thesaurus, and which remains stable over time, i.e. as changes are made to the thesaurus a single term remains associated with the same code. (I'll explore the question of how the thesaurus changes over time in a follow-up post.) So in fact the example above looks something like:


Civil violence
USE Political violence
TNR 1

Political violence
UF Civil violence
UF Violent protest
BT Violence
NT Terrorism
TNR 2

Terrorism
BT Political violence
TNR 3

Violence
NT Political violence
TNR 4

Violent protest
USE Political violence
TNR 5

The proposal is to use that identifier as the basis of URIs, for both labels/terms and concepts:

  • Term URI Pattern: http://example.org/id/term/T{termid}
  • Concept URI Pattern: http://example.org/id/concept/C{termid}

The generation of label/term URIs, then, is straightforward, as each term (with identifier code) in the input data maps to a single label/term in the SKOS RDF data. The concept case is a little more complicated. As I noted above, the mapping between the term-based model and the concept-based model means that there is a many-to-one relationship between terms and concepts: several terms are related to a single concept, one as preferred label, others as alternative labels. In my first cut at this at least (more on this in a follow-up post), I've generated the URI of the concept based on the identifier code associated with the preferred/descriptor term.

So in my example above, the three terms "Civil violence", "Political violence", and "Violent protest" are mapped to labels for a single concept, and the URI of that concept is constructed from the identifier code of the preferred/descriptor term ("Political violence").
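The mapping just described can be sketched in a few lines of (stdlib) Python. The record structure below is a simplified stand-in for the real thesaurus export, and the example.org patterns follow the ones above:

```python
# Illustrative sketch of the URI-generation step: term URIs come straight
# from the term's own code; concept URIs come from the code of the
# preferred/descriptor term, so several terms map to one concept.
TERM_URI = "http://example.org/id/term/T%s"
CONCEPT_URI = "http://example.org/id/concept/C%s"

# term name -> (its code, code of its descriptor via USE, or None if it
# is itself a descriptor). Simplified stand-in for the real input format.
terms = {
    "Civil violence":     ("1", "2"),
    "Political violence": ("2", None),
    "Terrorism":          ("3", None),
    "Violence":           ("4", None),
    "Violent protest":    ("5", "2"),
}

def term_uri(name):
    return TERM_URI % terms[name][0]

def concept_uri(name):
    code, use = terms[name]
    # Non-preferred terms take the concept URI of their descriptor.
    return CONCEPT_URI % (use if use else code)
```

So "Civil violence" and "Violent protest" both yield the concept URI built from code 2, the code of "Political violence".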

Summary

I've tried to cover the basic approach I'm taking to generating the SKOS/SKOS-XL RDF data from the term-based thesaurus input. In this post, I've focused only on the thesaurus as a static entity. In a follow-up post, I'll look in some detail at the changes which can occur over time, and examine any implications for the choices made here, particularly the URI patterns.

February 03, 2011

Metadata guidelines for the UK RDTF - please comment

As promised last week, our draft metadata guidelines for the UK Resource Discovery Taskforce are now available for comment in JISCPress. The guidelines are intended to apply to UK libraries, museums and archives in the context of the JISC and RLUK Resource Discovery Taskforce activity.

The comment period will last two weeks from tomorrow and we have seeded JISCPress with a small number of questions (see below) about issues that we think are particularly worth addressing. Of course, we welcome comments on all aspects of the guidelines, not just where we have raised issues. (Note that you don't have to leave public comments in JISCPress if you don't want to - an email to me or Pete will suffice. Or you can leave a comment here.)

The guidelines recommend three approaches to exposing metadata (to be used individually or in combination), referred to as:

  1. the community formats approach;
  2. the RDF data approach;
  3. the Linked Data approach.

We've used words like 'must' and 'should' but it is worth noting that at this stage we are not in a position to say how these guidelines will be applied - if at all - nor whether any mechanisms for compliance will be put in place. On that basis, treat phrases like 'must do this' as meaning, 'you must do this for your metadata to comply with one or other approach as recommended by these guidelines' - no more, no less. I hope that's clear.

When we started this work, we began by trying to think about functional requirements - always a good place to start. In this case, however, that turned out not to make much sense. We are not starting from a green field here. Lots of metadata formats are already in use and we are not setting out with the intent of changing current cataloguing practice across libraries, museums and archives. What we can say is that:

  1. we have tried to keep as many people happy as possible (hence the three approaches), and
  2. we want to help libraries, museums and archives expose existing metadata (and new metadata created using existing practice) in ways that support the development of aggregator services and that integrate well with the web (of data).

As mentioned previously, the three approaches correspond roughly to the 3-star, 4-star and 5-star ratings in the W3C's Linked Data Star Ratings Scheme. To try and help characterise them, we prepared the following set of bullet points for a meeting of the RDTF Technical Advisory Group earlier this week:

The community formats approach

  • the “give us what you’ve got” bit
  • share existing community formats (MARC, MODS, BibTeX, DC, SPECTRUM, EAD, XML, CSV, JSON, RSS, Atom, etc.) over RESTful HTTP or OAI-PMH
  • for RESTful HTTP, use sitemaps and robots.txt to advertise availability and GZip for compression
  • for CSV, give us a column called ‘label’ or ‘title’ so we’ve got something to display and a column called 'identifier' if you have them
  • provide separate records about separate resources
  • simples!
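As a concrete (and entirely illustrative) reading of the CSV recommendation, an aggregator might consume such a file along these lines - falling back from 'label' to 'title', and treating 'identifier' as optional. The sample rows and column handling are invented; only the column names come from the guideline text:

```python
# Sketch of a consumer of the recommended CSV shape: one row per resource,
# a 'label' (or 'title') column to display, an optional 'identifier' column.
import csv
import io

sample = io.StringIO(
    "identifier,label\n"
    "rec-001,Isaac Newton notebooks\n"
    "rec-002,Board of Longitude papers\n"
)

def read_records(fh):
    reader = csv.DictReader(fh)
    # Prefer 'label'; fall back to 'title' as the display column.
    label_col = "label" if "label" in reader.fieldnames else "title"
    return [(row.get("identifier"), row[label_col]) for row in reader]

records = read_records(sample)
```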

The RDF data approach

  • use RDF
  • model according to FRBR, CIDOC CRM or EAD and ORE where you can
  • re-use existing vocabularies where you can
  • assign URIs to everything of interest
  • make big buckets of RDF (e.g. as RDF/XML, N-Tuples, N-Quads or RDF/Atom) available for others to play with
  • use Semantic Sitemaps and the Vocabulary of Interlinked Datasets (VoID) to advertise availability of the buckets
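By way of illustration, a minimal VoID description advertising one such bucket might look like the Turtle below (built here as a Python string so it's easy to adapt); the dataset URI and dump location are invented:

```python
# Hypothetical VoID description of a "big bucket" of RDF. The dataset URI
# and dump URL are illustrative; void: and dcterms: are the real namespaces.
void_desc = "\n".join([
    "@prefix void: <http://rdfs.org/ns/void#> .",
    "@prefix dcterms: <http://purl.org/dc/terms/> .",
    "",
    "<http://example.org/id/dataset/catalogue> a void:Dataset ;",
    '    dcterms:title "Example catalogue as RDF" ;',
    "    void:dataDump <http://example.org/dumps/catalogue.nt> .",
])
```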

The Linked Data approach

Dunno if that is a helpful summary but we look forward to your comments on the full draft. Do your worst!

For the record, the issues we are asking questions about mainly fall into the following areas:

  • is offering a choice of three approaches helpful?
  • for the community formats approach, are the example formats we list correct, are our recommendations around the use of CSV useful and are JSON and Atom significant enough that they should be treated more prominently?
  • does the suggestion to use FRBR and CIDOC CRM as the basis for modelling in RDF set the bar too high for libraries and museums?
  • where people are creating Linked Data, should we be recommending particular RDF datasets/vocabularies as the target of external links?
  • do we need to be more prescriptive about the ways that URIs are assigned and dereferenced?

Note that a printable version of the draft is also available from Google Docs.

January 26, 2011

Metadata guidelines for the UK Resource Discovery Taskforce

We (Pete and I) have been asked by the JISC and RLUK to develop some metadata guidelines for use by the UK Resource Discovery Taskforce as it rolls out its vision [PDF].

This turns out to be a non-trivial task. The vision covers libraries, museums and archives and is intended to:

focus on defining the requirements for the provision of a shared UK infrastructure for libraries, archives, museums and related resources to support education and research. The focus will be on catalogues/metadata that can assist in access to objects/resources. With a special reference to serials, books, archives/special collections, museum collections, digital repository content and other born digital content. This will interpret the shared UK infrastructure as part of global information provision.

(Taken from the Resource Discovery Taskforce Terms of Reference)

The vision itself talks of a "collaborative, aggregated and integrated resource discovery and delivery framework" which implies an approach based on harvesting metadata (and other content) rather than cross-searching.

If the last 15 years or so have taught me anything, it's not to expect much coming together of metadata practice across those three sectors! Add to that a wide spectrum of attitudes to Linked Data and its potential value in this space, an unclear picture about the success of Europeana and its ESE [PDF] and EDM [PDF] metadata formats, the apparent success of somewhat "permissive" metadata-based initiatives such as Digital NZ and you are left with a range of viewpoints from "Keep calm and carry on" thru to "Throw it all away and use Linked Data" and everything in between.

At this point in time, we are taking the view that Tim Berners-Lee's star rating system for linked open data provides a useful framework for this work. However, as I have indicated elsewhere (Mugging up on the linked open data star ratings), it is rather unhelpful that the definition of each of the stars seems to be somewhat up for grabs at the moment (more or less in line with the ongoing, and quite probably endless, debate about the centrality of RDF and SPARQL to Linked Data). On that basis, we will almost certainly have to provide our own definitions for the purposes of these guidelines. Note that using this star rating system does not mean that everything has to use RDF.

Anyway... all of that is currently our problem, so I won't burden you with it :-)

The real purpose of this post is simply to say that we hope to make a draft of our metadata guidelines available during next week (I'm not willing to commit to a specific day at this point in time!), at which point we hope that people will share their thoughts on what we've come up with. That said, time is reasonably tight so I don't expect to be able to give people more than a couple of weeks (at most) to comment.

November 29, 2010

Still here & some news from LOCAH

Contrary to appearances, I haven't completely abandoned eFoundations, but recently I've mostly been working on the JISC-funded LOCAH project which I mentioned here a while ago, and my recent scribblings have mostly been over on the project blog.

LOCAH is working on making available some data from the Archives Hub (a collection of archival finding-aids i.e. metadata about archival collections and their constituent items) and from Copac (a "union catalogue" of bibliographic metadata from major research & specialist libraries) as linked data.

So far, I've mostly been working with the EAD data, with Jane Stevenson and Bethan Ruddock from the Archives Hub. I've posted a few pieces on the LOCAH blog, on the high-level architecture/workflow, on the model for the archival description data (also here), and most recently on the URI patterns we're using for the archival data.

I've got an implementation of this as an XSLT transform that reads an EAD XML document and outputs RDF/XML, and have uploaded the results of applying that to a small subset of data to a Talis Platform instance. We're still ironing out some glitches but there'll be information about that on the project blog coming in the not too distant future.
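The real transform is XSLT, but a much-simplified Python stand-in gives the flavour of the mapping: read an EAD document, pull out an identifier and a title, emit a triple. The cut-down EAD snippet (no namespaces), the URI pattern and the property choice are all illustrative rather than what the project actually uses:

```python
# Toy stand-in for the EAD-to-RDF step: extract <eadid> and <unittitle>
# from a (heavily simplified, namespace-free) EAD document and emit
# N-Triples. URI pattern and predicate are illustrative only.
import xml.etree.ElementTree as ET

ead = """<ead>
  <eadheader><eadid>gb123-abc</eadid></eadheader>
  <archdesc><did><unittitle>Papers of A. Person</unittitle></did></archdesc>
</ead>"""

root = ET.fromstring(ead)
eadid = root.findtext(".//eadid")
title = root.findtext(".//unittitle")

subject = "http://example.org/id/findingaid/%s" % eadid
triples = [
    '<%s> <http://purl.org/dc/terms/title> "%s" .' % (subject, title),
]
```

A real version would, of course, handle the EAD namespace, the nested component hierarchy, and the escaping rules for literals - which is exactly the sort of "real world data" wrangling described above.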

On a personal note, I'm quite enjoying the project. It gives me a chance to sit down and try to actually apply some of the principles that I read about and talk about, and I'm working through some of the challenges of "real world" data, with all its variability and quirks. I worked in special collections and archives for a few years back in the 1990s, when the institutions where I was working were really just starting to explore the potential of the Web, so it's interesting to see how things have changed (or not! :-)), and to see the impact of and interest in some current technological (and other) trends within those communities. It also gives me a concrete incentive to explore the use of tools (like the Talis Platform) that I've been aware of but have only really tinkered with: my efforts in that space inevitably bring me face to face with the limited scope of my development skills, though it's also nice to find that the availability of a growing range of tools has enabled me to get some results even with my rather stumbling efforts.

It's also an opportunity for me to discuss the "linked data" approach with the archivists and librarians within the project - in very concrete ways based on actual data - and to try to answer their questions and to understand what aspects are perceived as difficult or complex - or just different from existing approaches and practices.

So while some of my work necessarily involves me getting my head down and analysing input data or hacking away at XSLT or prodding datasets with SPARQL queries, I've been doing my best to discuss the principles behind what I'm doing with Jane and Bethan as I go along, and they in turn have reflected on some of the challenges as they perceive them in posts like Jane's here.

One of the project's tasks is to:

Explore and report on the opportunities and barriers in making content structured and exposed on the Web for discovery and use. Such opportunities and barriers may coalesce around licensing implications, trust, provenance, sustainability and usability.

I think we're trying to take a broad view of this aspect of the project, so that it extends not just to the "technical" challenges in cranking out data and how we address them, but also incorporates some of these "softer" elements of how we, as individuals with backgrounds in different "communities", with different practices and experiences and perspectives, share our ideas, get to grips with some of the concepts and terminology and so on. Where are the "pain points" that cause confusion in this particular context? Which means of explaining or illustrating things work, and which don't? What (if any!) is the value of the "linked data" approach for this sort of data? How is that best demonstrated? What are the implications, if any, for information management practices within this community? It may not be the case that SPARQL becomes a required element of archivists' training any time soon, but having these conversations, and reflecting on them, is, I think, an important part of the LOCAH experience.

October 26, 2010

Attending meetings remotely - a cautionary tale

In these times of financial constraints and environmental concerns, attending meetings remotely (particularly those held overseas) is becoming increasingly common. Such was the case, for me, at 7pm UK time on Friday night last week... I should have been eating pizza in front of the TV with my family but instead was messing about with Skype, my house phone, IRC and Twitter in an attempt to join the joint meeting of the DC Architecture Forum and the W3C Library Linked Data Incubator Group (described in Pete's recent post, The DCMI Abstract Model in 2010).

The meeting started with Tom Baker summarising the history and current state of the DCMI Abstract Model (the DCAM) - a tad long perhaps but basically a sound introduction and overview. Unfortunately, my Skype connection dropped a couple of times during his talk (I have no idea why) and I resorted to using my house phone instead - using the W3C bridge in Bristol. This proved more stable but some combination of my phone and the microphone positioning in the meeting meant that sound, particularly from speakers in the audience, was rather poor.

By the time we got to the meat of the discussion about the future of the DCAM I was struggling to keep up :-(. I made a rather garbled contribution, trying to summarise my views on the possible ways forward but emphasising that all the interesting possibilities had the same end-game - that DCMI would stop using the language of its own world-view, the DCAM, and would instead work within the more widely accepted language of the RDF model and Linked Data - and that the options were really about how best we get there, rather than about where we want to go.

Unfortunately, this is a view that is open to some confusion because the DCAM itself uses the RDF model. So when I say that we should stop using the DCAM and start using RDF and Linked Data it's not like saying that we should stop using model A and start using model B. Rather, it's a case of carrying on with the current model (i.e. the RDF model) but documenting it and talking about it using the same language as everyone else, thus joining forces with more active communities elsewhere rather than silo'ing ourselves on the DC mailing lists by having a separate way of talking.

So, anyway, I don't know how well I put my point of view across - one of the problems of being remote is that the only feedback you get is from the person taking minutes, in this case in the W3C IRC channel:

18:50:56 [andypowe11] ok, i'd like to speak at some point

18:52:34 [markva] andypowe11: options 2b, 3 and 4: all work to RDF, which is where we want to get to

18:52:55 [markva] ... which of these is better to get to that end game, wrt time available

18:53:23 [markva] ... 4 seems not ideal, but less effort

18:54:01 [markva] ... lean to 3; 2b has political value by taking along community; but 3 better given time

Stu Weibel spoke after me - a rather animated contribution (or so it seemed from afar). No problem with that... DCMI could probably do with a bit more animation to be honest. I understood him to be saying that we should adopt the Web model and that Linked Data offered us a useful chance to re-align ourselves with other Web activity. As I say, I was struggling to hear properly, so I may have mis-understood him completely. I glanced at the IRC minutes:

18:54:56 [markva] Stu Weibel: frustrated; no productive outcomes all these years

18:55:10 [markva] ... adopt Web as the model

18:55:37 [markva] ... nobody understands DCAM

18:56:03 [markva] ... W3C published architecture document after actual implementation

18:56:45 [markva] ... revive effort: develop reference software; easily drop in data, generate linked data 

I responded positively, trying to make it clear that I was struggling to hear and that I may have mis-interpreted him but noting the reference to 'linked data', which I'd heard as 'Linked Data':

18:57:12 [markva] andypowe11: support Stu 

The minute is factually correct - I did support Stu - but in an 'economical with the truth' kind of way because I only really supported what I thought I'd heard him say - and quite possibly not what he actually said! With hindsight, I wonder if the minute-taker's use of 'linked data' (lower-case) actually reflected some subtlety in what Stu said that I didn't really pick up on at the time. If nothing else, this exchange highlights for me the potential problems caused by those who want to distinguish 'linked data' (lower-case) from 'Linked Data' (upper-case) - there is no mixed-case in conversation, particularly not where it is carried out over a bad phone connection.

So anyway... the meeting moved on to other things and, feeling somewhat frustrated by the whole experience, I dropped off the call and found my cold pizza.

My point here is not about DCMI at all, though I still have no real idea whether I was right or wrong to agree with Stu. My gut feeling is that I probably agreed with some of what he said and disagreed with the rest - and the lesson, for me, is that I should be more careful before opening my mouth! My point is really about the practical difficulties of engaging in quite challenging intellectual debates in the uneven environment of a hybrid meeting where some people are f2f in the same room and others are remote. Or, to mis-quote William Gibson:

The future of hybrid events is here, it's just not evenly distributed yet.

:-(

(Note: none of this is intended to be critical of the minute-taker for the meeting who actually seems to have done a fantastic job of capturing a complex exchange of views in what must have been a difficult environment).

October 19, 2010

The DCMI Abstract Model in 2010

The Dublin Core Metadata Initiative's 2010 conference, DC-2010, takes place this week in Pittsburgh. I won't be attending, but Tom Baker and I have been working on a paper, A review of the DCMI Abstract Model with scenarios for its future for the meeting of the DCMI Architecture Forum - actually, a joint meeting with the W3C Library Linked Data Incubator Group.

This is a two-part meeting, the first part looking at the position of the DCMI Abstract Model in 2010, five years on from its becoming a DCMI Recommendation, from the perspective of a new context in which the emergence of the "Linked Data" approach has brought a wider understanding and take-up of the RDF model.

The second part of the meeting looks at the question of what the DCMI community calls "application profiles", descriptions of "structural patterns" within data, and "validation" against such patterns. Within the DCMI context, work in this area has built on the DCAM, in the form of the draft Description Set Profile specification. But, as I've mentioned before, there is interest in this topic within some sectors of the "Linked Data" community.

Our paper tries to outline the historical factors which led to the development of the DCAM, to take stock of the current position, and suggest a number of possible paths forward. The aim is to provide a starting point for discussions at the face-to-face meeting, and the suggestions for ways forward are not intended to be an exhaustive list, but we felt it was important to have some concrete choices on the table:

  1. DCMI carries on developing DCAM as before, including developing the DSP specification and concrete syntaxes based on DCAM
  2. DCMI develops a "DCAM 2" specification (initial rough draft here), simplified and better aligned with RDF, and with a cleaner separation of syntax and semantics, and either:
    1. develops the DSP specification and concrete syntaxes based on that revised model; or
    2. treats "DCAM 2" as a clarification and a transitional step towards promoting the RDF model and RDF abstract syntax
  3. DCMI deprecates the DCAM and henceforth promotes the RDF abstract syntax (and examines the question of "structural constraints" within this framework)
  4. DCMI does nothing to change the statuses of existing DCAM-related specifications

For my own part, in 2010, I do rather tend to look at the DCAM as an artefact "of its time". The DCAM was created during a period when the DCMI community was between two world views: one, which I tend to think of as a "classical view", reflected in Tom's 2000 D-Lib article "A Grammar of Dublin Core", and based on the use of "appropriate literals" - character strings - as values; and a second based on the RDF model, emphasising the use of URIs as global names and supported by a formal semantics. In developing the DCAM, we tried to do two things:

  • To provide a formalisation of that "classical" view, the "DCMI community" metadata model, if you like: in 2003, DCMI had "a typology of terms" but little consensus on the nature of the data structure(s) in which those terms were referenced.
  • To provide a "bridge" between that "classical" model and the RDF model, through the use of RDF concepts, and the provision of a mapping to the RDF abstract syntax in Expressing Dublin Core metadata using the Resource Description Framework (RDF).

If I'm honest, I think we've had limited success in these immediate aims. In creating the DCAM "description set model" we may have achieved the former in theory, but in practice people coming to the DCAM from a "classical Dublin Core" viewpoint found that model complicated, and difficult to reconcile with their own conceptualisations. So as a "community model" I suspect the "buy-in" from that community isn't as high as we might like to imagine! People coming to the Dublin Core vocabularies with some familiarity with the (much simpler) RDF model, on the other hand, were confused by, and/or didn't see the need for, the description set model. And a third (and perhaps larger still) constituency was engaged primarily in the use of XML-based metadata schemas (like MODS), with little or no notion of an abstract syntax distinct from the XML syntax itself.

However, I think the existence of the DCAM has perhaps provided some more positive outcomes in other areas.

First, I think the very existence of the DCAM helped advance discussions around comparing metadata standards from different communities, particularly in the initiatives championed by Mikael Nilsson in comparing Dublin Core and the IEEE Learning Object Metadata standard, by drawing attention to the importance of articulating the "abstract models" in use in standards when making such comparisons and when trying to establish conditions for "interoperability" between applications based on them. (This work is nicely summarised in a paper for the ProLEARN project Harmonization of Metadata Standards).

Second, while implementation of the Description Set Profile specification itself has been limited, it has provided a focus for exploring the question of describing structural patterns and performing structural validation, based not on concrete syntaxes and on e.g. XML schema technologies, but on the abstract syntax. A recent thread on the Library Linked Data Incubator Group mailing list, starting with Mikael Nilsson's post, provides a very interesting discussion of current thinking, and this area will be the focus of the second part of the Pittsburgh meeting.

And the Singapore Framework's separation of "vocabulary" from patterns for, or constraints on, the use of that vocabulary - leaving aside for a moment the actual techniques for realising that distinction - has received some attention as a general basis for metadata schema development (see, for example, the comments by Scott Wilson in his contribution to the recent JISC CETIS meeting on interoperability standards).

Finally, it's probably stating the obvious that any choice of path forward needs to take into account that DCMI, like many similar organisations, finds itself in an environment in which resources, both human and financial, are extremely limited. Many individuals who devoted time and energy to DCMI activities in the past have shifted their energy to other areas - and while I continue to maintain some engagement with DCMI, mainly through the vocabulary management activity of the Usage Board, I include myself in this category. Many of the DCMI "community" mailing lists show little sign of activity, and what few postings there are seem to receive little response. And some organisations which in the past supported staff to work in this area are choosing to focus their resources elsewhere.

Against this background, more than ever, it seems to me, it is important for DCMI not to try to tackle problems in isolation, but rather to (re)align its approaches firmly with those of the Semantic Web community, to capitalise on the momentum - and the availability of tools, expertise and experience (and good old enthusiasm!) - being generated by the wider take-up of the "Linked Data" approach, and to explore solutions to what might appear to be "DC-specific" problems (but probably aren't) within that broader community. The fact that the Architecture meeting in Pittsburgh is a joint one seems like a good first step in this direction.

September 17, 2010

On the length and winding nature of roads

I attended, and spoke at, the ISKO Linked Data - the future of knowledge organization on the Web event in London earlier this week. My talk was intended to have a kind of "what 10 years of working with the Dublin Core community has taught me about the challenges facing Linked Data" theme but probably came across more like "all librarians are stupid and stuck in the past". Oh well... apologies if I offended anyone in the audience :-).

Here are my slides:

They will hopefully have the audio added to them in due course - in the meantime, a modicum of explanation is probably helpful.

My fundamental point was that if we see Linked Data as the future of knowledge organization on the Web (the theme of the day), then we have to see Linked Data as the future of the Web, and (at the risk of kicking off a heated debate) that means that we have to see RDF as the future of the Web. RDF has been on the go for a long time (more than 10 years), a fact that requires some analysis and explanation - it certainly doesn't strike me as having been successful over that period in the way that other things have been successful. I think that Linked Data proponents have to be able to explain why that is the case rather better than simply saying that there was too much emphasis on AI in the early days, which seemed to be the main reason provided during this particular event.

My other contention was that the experiences of the Dublin Core community might provide some hints at where some of the challenges lie. DC, historically, has had a rather librarian-centric make-up. It arose from a view that the Internet could be manually catalogued, for example in a similar way to that taken to catalogue books, and that those catalogue records would be shipped between software applications for the purposes of providing discovery services. The notion of the 'record' has thus been quite central to the DC community.

The metadata 'elements' (what we now call properties) used to make up those records were semantically quite broad - the DC community used to talk about '15 fuzzy buckets' for example. As an aside, in researching the slides for my talk I discovered that the term fuzzy bucket now refers to an item of headgear, meaning that the DC community could quite literally stick its collective head in a fuzzy bucket and forget about the rest of the world :-). But I digress... these broad semantics (take a look at the definition of dcterms:coverage if you don't believe me) were seen as a feature, particularly in the early days of DC... but they become something of a problem when you try to transition those elements into well crafted semantic web vocabularies, with domains, ranges and the rest of it.

Couple that with an inherent preference for "strings" vs. "things", i.e. a reluctance to use URIs to identify the entities at the value end of a property relationship - indeed, couple it with a distinct scepticism about the use of 'http' URIs for anything other than locating Web pages - and a large dose of relatively 'flat' and/or fuzzy modelling and you have an environment which isn't exactly a breeding ground for semantic web fundamentalism.

When we worked on the original DCMI Abstract Model, part of the intention was to come up with something that made sense to the DC community in their terms, whilst still being basically the RDF model and, thus, compatible with the Semantic Web. In the end, we alienated both sides - librarians (and others) saying it was still too complex and the RDF-crowd bemused as to why we needed anything other than the RDF model.

Oh well :-(.

I should note that a couple of things have emerged from that work that are valuable I think. Firstly, the notion of the 'record', and the importance of the 'record' as a mechanism for understanding provenance. Or, in RDF terms, the notion of bounded graphs. And, secondly, the notion of applying constraints to such bounded graphs - something that the DC community refers to as Application Profiles.

On the basis of the above background, I argued that some of the challenges for Linked Data lie in convincing people:

  • about the value of an open world model - open not just in the sense that data may be found anywhere on the Web, but also in the sense that the Web democratises expertise in a 'here comes everybody' kind of way.
  • that 'http' URIs can serve as true identifiers, of anything (web resources, real-world objects and conceptual stuff).
  • and that modelling is both hard and important. Martin Hepp, who spoke about GoodRelations just before me (his was my favourite talk of the day), indicated that the model that underpins his work has taken 10 years or so to emerge. That doesn't surprise me. (One of the things I've been thinking about since giving the talk is the extent to which 'models build communities', rather than the other way round - but perhaps I should save that as the topic of a future post).

There are other challenges as well - overcoming the general scepticism around RDF for example - but these things are what specifically struck me from working with the DC community.

I ended my talk by reading a couple of paragraphs from Chris Gutteridge's excellent blog post from earlier this month, The Modeller, which seemed to go down very well.

As to the rest of the day... it was pretty good overall. Perhaps a tad too long - the panel session at the end (which took us up to about 7pm as far as I recall) could easily have been dropped. Ian Mulvany of Mendeley has a nice write up of all the presentations so I won't say much more here. My main concern with events like this is that they struggle to draw a proper distinction between the value of stuff being 'open', the value of stuff being 'linked', and the value of stuff being exposed using RDF. The first two are obvious - the last less so. Linked Data (for me) implies all three... yet the examples of applications that are typically shown during these kinds of events don't really show the value of the RDFness of the data. Don't get me wrong - they are usually very compelling examples in their own right but usually it's a case of 'this was built on Linked Data, therefore Linked Data is wonderful' without really making a proper case as to why.

August 24, 2010

Resource discovery revisited...

...revisited for me that is!

Last week I attended an invite-only meeting at the JISC offices in London, notionally entitled a "JISC IE Technical Review" but in reality a kind of technical advisory group for the JISC and RLUK Resource Discovery Taskforce Vision [PDF], about which the background blurb says:

The JISC and RLUK Resource Discovery Taskforce was formed to focus on defining the requirements for the provision of a shared UK resource discovery infrastructure to support research and learning, to which libraries, archives, museums and other resource providers can contribute open metadata for access and reuse.

The morning session felt slightly weird (to me), a strange time-warp back to the kinds of discussions we had a lot of as the UK moved from the eLib Programme, thru the DNER (briefly) into what became known (in the UK) as the JISC Information Environment - discussions about collections and aggregations and metadata harvesting and ... well, you get the idea.

In the afternoon we were split into breakout groups and I ended up in the one tasked with answering the question "how do we make better websites in the areas covered by the Resource Discovery Taskforce?", a slightly strange question now I look at it but one that was intended to stimulate some pragmatic discussion about what content providers might actually do.

Paul Walk has written up a general summary of the meeting - the remainder of this post focuses on the discussion in the 'Making better websites' afternoon breakout group and my more general thoughts.

Our group started from the principles of Linked Data - assign 'http' URIs to everything of interest, serve useful content (both human-readable and machine-processable (structured according to the RDF model)) at those URIs, and create lots of links between stuff (internal to particular collections, across collections and to other stuff). OK... we got slightly more detailed than that but it was a fairly straightforward view that Linked Data would help and was the right direction to go in. (Actually, there was a strongly expressed view that simply creating 'http' URIs for everything and exposing human-readable content at those URIs would be a huge step forward).
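
As a rough illustration of the pattern our group had in mind (all the paths and media types here are illustrative, not a specification): a URI for a "thing" 303-redirects to a document URI, and the document URI is content-negotiated into a human-readable or a machine-processable representation.

```python
# Sketch of the Linked Data pattern discussed above. The /id/ and /doc/
# path conventions are purely illustrative.

def respond(path, accept):
    """Return a (status, location_or_content_type) pair for a request."""
    if path.startswith("/id/"):
        # The URI names a real-world thing, not a document:
        # 303-redirect to the corresponding document URI.
        return (303, path.replace("/id/", "/doc/", 1))
    if path.startswith("/doc/"):
        # Serve whichever representation the client asked for.
        if "application/rdf+xml" in accept:
            return (200, "application/rdf+xml")
        return (200, "text/html")
    return (404, None)
```

So a request for `/id/painting/42` is redirected to `/doc/painting/42`, which then yields HTML for a browser or RDF/XML for a Linked Data client.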

Then we had a discussion about what the barriers to adoption might be - the problems of getting buy-in from vendors and senior management, the need to cope with a non-obvious business model (particularly in the current economic climate), the lack of technical expertise (not to mention semantic expertise) in parts of those sectors, the endless discussions that might take place about how to model the data in RDF, the general perception that Semantic Web is permanently just over the horizon and so on.

And, in response, we considered the kinds of steps that JISC (and its partners) might have to undertake to build any kind of political momentum around this idea.

To cut a long story short, we more or less talked ourselves out of a purist Linked Data approach as a way forward, instead preferring a 4-layer model of adoption, with increasing levels of semantic richness and machine-processability at each stage:

  1. expose data openly in any format available (.csv files, HTML pages, MARC records, etc.)
  2. assign 'http' URIs to things of interest in the data, expose it in any format available (.csv files, HTML pages, etc.) and serve useful content at each URI
  3. assign 'http' URIs to things of interest in the data, expose it as XML and serve useful content at each URI
  4. assign 'http' URIs to things of interest in the data and expose Linked Data (as per the discussion above).

These would not be presented as steps to go thru (do 1, then 2, then 3, ...) but as alternatives with increasing levels of semantic value. Good practice guidance would encourage the adoption of option 4, laying out the benefits of such an approach, but the alternatives would provide lower barriers to adoption and offer a simpler 'sell' politically.

The heterogeneity of data being exposed would leave a significant implementation challenge for the aggregation services attempting to make use of it and the JISC (and partners) would have to fund some pretty convincing demonstrators of what might usefully be achieved.

One might characterise these approaches as 'data.glam.uk' (echoing 'data.gov.uk' but where 'glam' is short for 'galleries, libraries, archives and museums') and/or Digital UK (echoing the pragmatic approaches being successfully adopted by the Digital NZ activity in New Zealand).

Despite my reservations about the morning session, the day ended up being quite a useful discussion. That said, I remain somewhat uncomfortable with its outcomes. I'm a purist at heart and the 4 levels above are anything but pure. To make matters worse, I'm not even sure that they are pragmatic. The danger is that people will adopt only the lowest, least semantic, option and think they've done what they need to do - something that I think we are seeing some evidence of happening within data.gov.uk.

Perhaps even more worryingly, having now stepped back from the immediate talking-points of the meeting itself, I'm not actually sure we are addressing a real user need here any more - the world is so different now than it was when we first started having conversations about exposing cultural heritage collections on the Web, particularly library collections - conversations that essentially pre-dated Google, Google Scholar, Amazon, WorldCat, CrossRef, ... the list goes on. Do people still get agitated by, for example, the 'book discovery' problem in the way they did way back then? I'm not sure... but I don't think I do. At the very least, the book 'discovery' problem has largely become an 'appropriate copy' problem - at least for most people? Well, actually, let's face it... for most people the book 'discovery' and 'appropriate copy' problems have been solved by Amazon!

I also find the co-location of libraries, museums and archives, in the context of this particular discussion, rather uncomfortable. If anything, this grouping serves only to prolong the discussion and put off any decision making?

Overall then, I left the meeting feeling somewhat bemused about where this current activity has come from and where it is likely to go.


July 29, 2010

legislation.gov.uk

I woke up this morning to find a very excited flurry of posts in my Twitter stream pointing to the launch by the UK National Archives of the legislation.gov.uk site, which provides access to all UK legislation, including revisions made over time. A post on the data.gov.uk blog provides some of the technical background and highlights the ways in which the data is made available in machine-processable forms. Full details are provided in the "Developer Zone" documents.

I don't for a second pretend to have absorbed all the detail of what is available, so I'll just highlight a couple of points.

First and foremost, this is being delivered with an eye firmly on the Linked Data principles. From the blog post I mentioned above:

For the web architecturally minded, there are three types of URI for legislation on legislation.gov.uk. These are identifier URIs, document URIs and representation URIs. Identifier URIs are of the form http://www.legislation.gov.uk/id/{type}/{year}/{number} and are used to denote the abstract concept of a piece of legislation - the notion of how it was, how it is and how it will be. These identifier URIs are designed to support the use of legislation as part of the web of Linked Data. Document URIs are for the document. Representation URIs are for the different types of possible rendition of the document, so htm, pdf or xml.

(Aside: I admit to a certain squeamishness about the notion of "representation URIs" and I kinda prefer to think in terms of URIs for Generic Documents and for Specific Documents, along the lines described by Tim Berners-Lee in his "Generic Resources" note, but that's a minor niggle of terminology on my part, and not at all a disagreement with the model.)
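
For what it's worth, the identifier URI pattern quoted above is regular enough to generate programmatically. A quick sketch follows; the identifier pattern is the one quoted from the blog post, but the document and representation patterns are my inference from examples on the site, so treat those as illustrative rather than official.

```python
BASE = "http://www.legislation.gov.uk"

def identifier_uri(leg_type, year, number):
    # Identifier URIs denote the abstract piece of legislation
    # (this is the pattern quoted from the data.gov.uk post).
    return f"{BASE}/id/{leg_type}/{year}/{number}"

def document_uri(leg_type, year, number):
    # Document URI: the same path without the /id/ step
    # (inferred from the site, so treat as illustrative).
    return f"{BASE}/{leg_type}/{year}/{number}"

def representation_uri(leg_type, year, number, fmt):
    # Representation URIs name one concrete rendition: htm, pdf or xml
    # (again inferred; the exact path may differ).
    return f"{BASE}/{leg_type}/{year}/{number}/data.{fmt}"
```

So `identifier_uri("ukpga", 2010, 24)` gives `http://www.legislation.gov.uk/id/ukpga/2010/24`.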

A second aspect I wanted to highlight (given some of my (now slightly distant) past interests) is that, on looking at the RDF data (e.g. http://www.legislation.gov.uk/ukpga/2010/24/contents/data.rdf), I noticed that it appears to make use of a FRBR-based model to deal with the challenge of representing the various flavours of "versioning" relationships.

I haven't had time to look in any detail at the implementation, other than to observe that the data can get quite complex - necessarily so - when dealing with a lot of whole-part and revision-of/variant-of/format-of relationships. (There was one aspect where I wondered if the FRBR concepts were being "stretched" somewhat, but I'm writing in haste and I may well be misreading/misinterpreting the data, so I'll save that question for another day.)

It's fascinating to see the FRBR approach being deployed as a practical solution to a concrete problem, outside of the library community in which it originated.

Pretty cool stuff, and congratulations to all involved in providing it. I look forward to seeing how the data is used.

July 21, 2010

Getting techie... what questions should we be asking of publishers?

The Licence Negotiation team here are thinking about the kinds of technical questions they should be asking publishers and other content providers as part of their negotiations with them. The aim isn't to embed the answers to those questions in contractual clauses - rather, it is to build up a useful knowledge base of surrounding information that may be useful to institutions and others who are thinking about taking up a particular agreement.

My 'starter for 10' set of questions goes like this:

  • Do you make any commitment to the persistence of the URLs for your published content? If so, please give details. Do you assign DOIs to your published content? Are you members of CrossRef?
  • Do you support a search API? If so, what standard(s) do you support?
  • Do you support a metadata harvesting API? If so, what standard(s) do you support?
  • Do you expose RSS and/or Atom feeds for your content? If so, please describe what feeds you offer?
  • Do you expose any form of Linked Data about your published content? If so, please give details.
  • Do you generate OpenURLs as part of your web interface? Do you have a documented means of linking to your content based on bibliographic metadata fields? If so, please give details.
  • Do you support SAML (Service Provider) as a means of controlling access to your content? If so, which version? Are you a member of the UK Access Management Federation? If you also support other methods of access control, please give details.
  • Do you grant permission for the preservation of your content using LOCKSS, CLOCKSS and/or PORTICO? If so, please give details.
  • Do you have a statement about your support for the Web Accessibility Initiative (WAI)? If so, please give details.

Does this look like a reasonable and sensible set of questions for us to be asking of publishers? What have I missed? Something about open access perhaps?

July 16, 2010

Finding e-books - a discovery to delivery problem

Some of you will know that we recently ran a quick survey of academic e-book usage in the UK - I hope to be able to report on the findings here shortly. One of the things that we didn't ask about in the survey but that has come up anecdotally in our discussions with librarians is the ease (or not) with which it is possible to find out if a particular e-book title is available.

A typical scenario goes like this. "Lecturer adds an entry for a physical book to a course reading list. Librarian checks the list and wants to know if there is an e-book edition of the book, in order to offer alternatives to the students on that course". Problemo. Having briefly asked around, it seems (somewhat surprisingly?) that there is no easy solution to this problem.

If we assume that the librarian in question knows the ISBN of the physical book, what can be done to try and ease the situation? Note that in asking this question I'm conveniently ignoring the looming, and potentially rather massive, issue around "what the hell is an e-book anyway?" and "how are we going to assign identifiers to them once we've worked out what they are?" :-). For some discussion around this see Eric Hellman's recent piece, What IS an eBook, anyway?

But, let's ignore that for now... we know that OCLC's xISBN service allows us to navigate different editions of the same book (I'm desperately trying not to drop into FRBR-speak here). Taking a quick look at the API documentation for xISBN yesterday, I noticed that the metadata returned for each ISBN can include both the fact that something is a 'Book' and that it is 'Digital' (its form codes include both 'BA' and 'DA') - that sounds like the working definition of an e-book to me (at least for the time being) - as well as listing the ISBNs for all the other editions/formats of the same book. So I knocked together a quick demonstrator. The result is e-Book Finder and you are welcome to have a play. To get you started, here are a couple of examples:
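
As an aside, the form-code test described above amounts to a simple filter. A sketch (the record shape here is a simplification standing in for an xISBN editions response, not its exact format):

```python
def ebook_isbns(records):
    # An edition counts as an e-book, on the working definition above,
    # if its form codes include both 'BA' (Book) and 'DA' (Digital).
    return [r["isbn"] for r in records
            if {"BA", "DA"} <= set(r.get("form", []))]

# Dummy records standing in for the editions returned for one ISBN.
editions = [
    {"isbn": "1111111111", "form": ["BA"]},        # print edition
    {"isbn": "2222222222", "form": ["BA", "DA"]},  # e-book edition
]
```

Here `ebook_isbns(editions)` picks out only the second record.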

Of course, because e-Book Finder is based on xISBN, which is in turn based on WorldCat, you can only use it to find e-books that are listed in the catalogues of WorldCat member libraries (but I'm assuming that is a big enough set of libraries that the coverage is pretty good). Perhaps more importantly, it also only represents the first stage of the problem. It allows you to 'discover' that an e-book exists - but it doesn't get the thing 'delivered' to you.

Wouldn't it be nice if e-Book Finder could also answer questions like, "is this e-book covered by my existing institutional subscriptions?", "can I set up a new institutional subscription that would cover this e-book?" or simply "can I buy a one-off copy of this e-book?". It turns out that this is a pretty hard problem. My Licence Negotiation colleagues at Eduserv suggested doing some kind of search against myilibrary, dawsonera, Amazon, eBrary, eblib and SafariBooksOnline. The bad news is that (as far as I can tell), of those, only Amazon and SafariBooksOnline allow users to search their content before making them sign in and only Amazon offer an API. (I'm not sure why anyone would design a website that has the sole purpose of selling stuff such that people have to sign in before they can find out what is on offer, nor why that information isn't available in an openly machine-readable form but anyway...). So in this case, moving from discovery to delivery looks to be non-trivial. Shame. Even if each of these e-book 'aggregators' simply offered a list1 of the ISBNs of all the e-books they make available, it would be a step in the right direction.

On the other hand, maybe just pushing the question to the institutional OpenURL resolver would help answer these questions. Any suggestions for how things could be improved?

1. It's a list so that means RSS or Atom, right?

July 08, 2010

Going LOCAH: a Linked Data project for JISC

Recently I worked with Adrian Stevenson of UKOLN and Jane Stevenson and Joy Palmer of MIMAS, University of Manchester on a bid for a project under the JISC O2/10 call, Deposit of research outputs and Exposing digital content for education and research, and I'm very pleased to be able to say that the proposal has been accepted and the project has been funded.

The project is called "Linked Open Copac Archives Hub" (LOCAH). It aims to address the "expose" section of the call, and focuses on making available data hosted by the Copac and Archives Hub services hosted by MIMAS - i.e. library catalogue data and data from archival finding aids - in the form of Linked Data; developing some prototype applications illustrating the use of that data; and analysing some of the issues arising from that work. The main partners in the work are UKOLN and MIMAS, with contributions from Eduserv, OCLC and Talis. The Eduserv contribution will take the form of some input from me, probably mostly in the area of working with Jane on modelling some of the archival finding aid data, currently held in the form of EAD-encoded XML documents, so that it can be represented in RDF - though I imagine I'll be sticking my oar in on various other aspects along the way.

UKOLN is managing the project and hosting a project weblog. I'm not sure at the moment how I'll divide up thoughts between here and there; I'll probably end up with a bit of duplication along the way.

May 05, 2010

RDFa for the Eduserv Web site

Another post that I've been intermittently chiselling away at in the draft pile for a while... A few weeks ago, I was asked by Lisa Price, our Website Communications Manager, to make some suggestions of how Eduserv might make use of the RDFa in XHTML syntax to embed structured data in pages on the Eduserv Web site, which is currently in the process of being redesigned. I admit this is coming mostly from the starting point of wanting to demonstrate the use of the technology rather than from a pressing use case, but OTOH there is a growing interest in RDFa amongst some of Eduserv's public sector clients so a spot of "eating our own dogfood" would be a Good Thing, and furthermore there are signs of a gradual but significant adoption of RDFa by some major Web service providers.

It seems to me Eduserv might use RDFa to describe, or make assertions about:

  • (Perhaps rather trivially) Web pages themselves i.e. reformulating the (fairly limited) "document metadata" we supply as RDFa.
  • (Perhaps rather more interestingly) some of the "things" that Eduserv pages "are about", or that get mentioned in those pages (e.g. persons, organisations, activities, events, topics of interest, etc).

Within that category of data about "things", we need to decide which data it is most useful to expose. We could:

  • look at those classes of data that are processed by tools/services that currently make use of RDFa (typically using specified RDF vocabularies); or
  • focus on data that we know already exists in a "structured" form but is currently presented in X/HTML either only in human-readable form or using microformats (or even new data which isn't currently surfaced at all on the current site)

Another consideration was the question of whether data was covered by existing models and vocabularies or required some analysis and modelling.

To be honest, there's a fairly limited amount of "structured" information on the site currently. There is some data on licence agreements for software and data, currently made available as HTML tables and Excel spreadsheets. While I think some of the more generic elements of this might be captured using a product/service ontology such as Good Relations, the licence-specific aspects would require some additional modelling. For the short term at least, we've taken a somewhat "pragmatic" approach and focused mainly on that first class of data for which there are some identifiable consuming applications, based on the use of specified RDF vocabularies - and more specifically on data that Google and Yahoo make particular reference to in their documentation for creators/publishers of Web pages.

That's not to say there won't be more use of RDFa on the site in the future: at the moment, this is something of a "dipping toes in the water" exercise, I think.

The following is my best effort to summarise Google and Yahoo support for RDFa at the time of writing. Please note that this is something which is evolving - as I was writing up this post, I just noticed that the Google guidelines have changed slightly since I sent my initial notes to Lisa. And I'm still not at all sure I've captured the complete picture here, so please do check their current documentation for content providers to get an idea of the current state of play.

Google and RDFa

Google's support for RDFa is part of a larger programme of support for structured data embedded in X/HTML that they call "rich snippets" (announced here), which includes support for RDFa, microformats and microdata. (The latter, I think, is a relatively recent addition).

Google functionality extends to extracting specified categories of RDFa data in (some) pages it indexes, and displaying that in search result sets (and in place pages in Google Maps). It also provides access to the data in its Custom Search platform.

Initially at least, Google required the use of its own RDF vocabularies, which attracted some criticism (see e.g. Ian Davis' response), but it appears to have fairly quietly introduced some support for other RDF vocabularies. "In addition to the Person RDFa format, we have added support for the corresponding fields from the FOAF and vCard vocabularies for all those of you who asked for it." And Martin Hepp has pointed to Google displaying data encoded using the Good Relations product/service ontology.

The nature of the RDFa syntax is such that it is often fairly straightforward to use multiple RDF vocabularies in RDFa e.g. triples using the same subject and object but different predicates can be encoded using a single RDFa attribute with multiple white-space-separated CURIEs - though things do tend to get more messy if the vocabularies are based on different models (e.g. time periods as literals v time periods as resources with properties of their own).
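
The multiple-CURIE point can be seen with a toy extractor (a deliberately naive sketch, not a conformant RDFa parser; the fragment and prefixes are illustrative):

```python
import xml.etree.ElementTree as ET

# One RDFa property attribute carrying two whitespace-separated CURIEs
# yields two triples sharing the same subject and object.
fragment = """
<div xmlns="http://www.w3.org/1999/xhtml" about="http://example.org/event/1">
  <span property="dc:title rdfs:label">Open Day</span>
</div>
"""

def triples(xhtml):
    root = ET.fromstring(xhtml)
    subject = root.get("about")
    found = []
    for el in root.iter():
        prop = el.get("property")
        if prop:
            for curie in prop.split():  # split the CURIE list on whitespace
                found.append((subject, curie, el.text))
    return found
```

Running `triples(fragment)` produces one `dc:title` triple and one `rdfs:label` triple, both with the same subject and literal value.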

Google provides specific recommendations to content creators on the embedding of data to describe:

Yahoo and RDFa

Yahoo's support for RDFa is through its SearchMonkey platform. Like Google, it provides a set of "standard" result set enhancements, based on the use of specified RDF vocabularies for a small set of resource types:

In addition, my understanding is that although Yahoo defines some RDF vocabularies of its own, and describes the use of specified vocabularies in the guidelines for the resource types above, it exposes any RDFa data in pages it indexes to developers on its SearchMonkey platform, to allow the building of custom search enhancements. Several existing vocabularies are discussed in the SearchMonkey guide and the FAQ in Appendix D of that document notes "You may use any RDF or OWL vocabulary".

Linked Data

The decentralised extensibility built into RDF means that a provider can choose to extend what data they expose beyond that specified in the guidelines mentioned above.

In addition, I tried to take into account some other general "good practice" points that have emerged from the work of the Linked Data community, captured in sources such as:

So in the Eduserv case, for example (I hope!) URIs will be assigned to "things" like events, distinct from the pages describing them, with suitable redirects put in place on the HTTP server and suitable triples in the data linking those things and the corresponding pages.

Summary

Anyway, on the basis of the above sources, I tried to construct some suggestions, taking into account both the Google and Yahoo guidelines, for descriptions of people, organisations and events, which I'll post here in the next few entries.

Postscript: Facebook

Even more recently, of course, has come the news of Facebook's announcement at the f8 conference of their Open Graph Protocol. This makes use of RDFa embedded in the headers of XHTML pages using meta elements to provide (pretty minimal) metadata "about" things described by those pages (films, songs, people, places, hotels, restaurants etc - see the Facebook page for a full (and I imagine, growing) list of resource types supported).
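
Concretely, the head-embedded metadata takes the form of meta elements with Open Graph Protocol property names (the values below are invented), and even a crude extractor can pull it out:

```python
import re

# Sketch of the OGP meta elements described above; the page values
# are made up for illustration.
head = """
<meta property="og:title" content="A Sample Film" />
<meta property="og:type" content="movie" />
<meta property="og:url" content="http://example.org/film" />
"""

def og_properties(html):
    # Crude regex extraction, enough for this illustration; a real
    # consumer would use an RDFa-aware parser rather than regexes.
    return dict(re.findall(r'property="(og:[^"]*)"\s+content="([^"]*)"', html))
```

`og_properties(head)` returns a dict mapping each `og:` property to its content value.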

Facebook makes use of the data to drive its "Like" application: a "button" can be embedded in the page to allow a Facebook user to post the data to their Fb account to signal an "I like this" relationship with the thing described. Or as Dare Obasanjo expresses it, an Fb user can add a node for the thing to their Fb social graph, making it into a "social object". This results in the data being displayed at appropriate points in their Fb stream, while the button displays, as a minimum, a count of the "likers" of the resource on the source page itself; logged-in Fb users would, I think, see information about whether any of their "friends" had liked it.

My reporting of these details of the interface is somewhat "second-hand" as I no longer use Facebook - I deleted my account some time ago because I was concerned about their approaches to the privacy of personal information (see these three recent posts by Tony Hirst for some thoughts on the most recent round of changes in that sphere).

Perhaps unsurprisingly given the popularity of Fb and its huge user base, the OGP announcement seems to have attracted a very large amount of attention within a very short period of time, and it may turn out to be a significant milestone for the use of XHTML-embedded metadata in general and of RDFa in particular. The substantial "carrot" of supporting the Fb "Like" application and attracting traffic from Fb users is likely to be the primary driver for many providers to generate this data, and indeed some commentators (see e.g. this BBC article) have gone as far as to suggest that this represents a move by Facebook to challenge Google as the primary filter of resources for people searching and navigating the Web.

However, I also think it is important to distinguish between the data on the one hand and that particular Facebook app on the other. Having this data available, minimal as it may be, also opens up the possibility of other applications by other parties making use of that same data.

And this is true also, of course, for the case of data constructed following the Google and Yahoo guidelines.

The future of UK Dublin Core application profiles

I spent yesterday morning up at UKOLN (at the University of Bath) for a brief meeting about the future of JISC-funded Dublin Core application profile development in the UK.

I don't intend to report on the outcomes of the meeting here since it is not really my place to do so (I was just invited as an interested party and I assume that the outcomes of the meeting will be made public in due course). However, attending the meeting did make me think about some of the issues around the way application profiles have tended to be developed to date and these are perhaps worth sharing here.

By way of background, the JISC have been funding the development of a number of Dublin Core application profiles in areas such as scholarly works, images, time-based media, learning objects, GIS and research data over the last few years.  An application profile provides a model of some subset of the world of interest and an associated set of properties and controlled vocabularies that can be used to describe the entities in that model for the purposes of some application (or service) within a particular domain. The reference to Dublin Core implies conformance with the DCMI Abstract Model (which effectively just means use of the RDF model) and an inherent preference for the use of Dublin Core terms whenever possible.

The meeting was intended to help steer any future UK work in this area.

I think (note that this blog post is very much a personal view) that there are two key aspects of the DC application profile work to date that we need to think about.

Firstly, DC application profiles are often developed by a very small number of interested parties (sometimes just two or three people), and engagement in the process by the wider community is quite hard to achieve. This isn't just a problem with the UK JISC-funded work on application profiles, by the way. Almost all of the work undertaken within the DCMI community on application profiles suffers from the same problem - mailing lists and meetings with very little active engagement beyond a small core set of people.

Secondly, whilst the importance of enumerating the set of functional requirements that the application profile is intended to meet has not been underestimated, it is true to say that DC application profiles are often developed in the absence of an actual 'software application'. Again, this is also true of the application profile work being undertaken by the DCMI. What I mean here is that there is not a software developer actually trying to build something based on the application profile at the time it is being developed. This is somewhat odd (to say the least) given that they are called application profiles!

Taken together, these two issues mean that DC application profiles often take on a rather theoretical status - and an associated "wouldn't it be nice if" approach. The danger is a growth in the complexity of the application profile and a lack of any real business drivers for the work.

Speaking from the perspective of the Scholarly Works Application Profile (SWAP) (the only application profile for which I've been directly responsible), in which we adopted the use of FRBR, there was no question that we were working to a set of perceived functional requirements (e.g. "people need to be able to find the latest version of the current item"). However, we were not driven by the concrete needs of a software developer who was in the process of building something. We were in the situation where we could only assume that an application would be built at some point in the future (a UK repository search engine in our case). I think that the missing link to an actual application, with actual developers working on it, directly contributed to the lack of uptake of the resulting profile. There were other factors as well of course - the conceptual challenge of basing the work on FRBR and the fact that existing repository software was not RDF-ready for example - but I think that was the single biggest factor overall.

Oddly, I think JISC funding is somewhat to blame here because, in making funding available, JISC helps the community to side-step the part of the business decision-making that says, "what are the costs (in time and money) of developing, implementing and using this profile vs. the benefits (financial or otherwise) that result from its use?".

It is perhaps worth comparing current application profile work and other activities. Firstly, compare the progress of SWAP with the progress of the Common European Research Information Format (CERIF), about which the JISC recently reported:

EXRI-UK reviewed these approaches against higher education needs and recommended that CERIF should be the basis for the exchange of research information in the UK. CERIF is currently better able to encode the rich information required to communicate research information, and has the organisational backing of EuroCRIS, ensuring it is well-managed and sustainable.

I don't want to compare the merits of these two approaches at a technical level here. What is interesting however, is that if CERIF emerges as the mandated way in which research information is shared in the UK then there will be a significant financial driver to its adoption within systems in UK institutions. Research information drives a significant chunk of institutional funding which, in turn, drives compliance in various applications. If the UK research councils say, "thou shalt do CERIF", that is likely what institutions will do.  They'll have no real choice. SWAP has no such driver, financial or otherwise.

Secondly, compare the current development of Linked Data applications within the UK data.gov.uk initiative with the current application profile work. Current government policy in the UK effectively says, 'thou shalt do Linked Data' but isn't really any more prescriptive. It encourages people to expose their data as Linked Data and to develop useful applications based on that data. Ignoring any discussion about whether Linked Data is a good thing or not, what has resulted is largely ground-up. Individual developers are building stuff and, in the process, are effectively developing their own 'application profiles' (though they don't call them that) as part of exposing/using the Linked Data. This approach results in real activity. But it also brings with it the danger of redundancy, in that every application developer may model their Linked Data differently, inventing their own RDF properties and so on as they see fit.

As Paul Walk noted at the meeting yesterday, at some stage there will be a huge clean-up task to make any widespread sense of the UK government-related Linked Data that is out there. Well, yes... there will. Conversely, there will be no clean up necessary with SWAP because nobody will have implemented it.

Which situation is better!? :-)

I think the issue here is partly to do with setting the framework at the right level. In trying to specify a particular set of application profiles, the JISC is setting the framework very tightly - not just saying, "you must use RDF" or "you must use Dublin Core" but saying "you must use Dublin Core in this particular way". On the other hand, the UK government have left the field of play much more open. The danger with the DC application profile route is lack of progress. The danger with the government approach is too little consistency.

So, what are the lessons here? The first, I think, is that it is important to lobby for your preferred technical solution at a policy level as well as at a technical level. If you believe that a Linked Data-compliant Dublin Core application profile is the best technical way of sharing research information in the UK then it is no good just making that argument to software developers and librarians. Decisions made by the research councils (in this case) will be binding irrespective of technical merit and will likely trump any decisions made by people on the ground.

The second is that we have to understand the business drivers for the adoption, or not, of our technical solutions rather better than we do currently. Who makes the decisions? Who has the money? What motivates the different parties? Again, technically beautiful solutions won't get adopted if the costs of adoption are perceived to outweigh the benefits, or if the people who hold the purse strings don't see any value in spending their money in that particular way, or if people simply don't get it.

Finally, I think we need to be careful that centralised, top-down initiatives (particularly those with associated funding) don't distort the environment to such an extent that the 'real' drivers, both financial and user demand, can be ignored in the short term, leading to unsustainable situations in the longer term. The trick is to pump-prime those things that the natural drivers will support in the long term - not always an easy thing to pull off.

April 27, 2010

RDFa 1.1 drafts available from W3C

Last week, the W3C RDFa Working Group announced the availability of two new "First Public Working Drafts", which it is circulating for comment.

Ivan Herman, the W3C Semantic Web Activity Lead, and a co-editor of these documents, has provided a very helpful summary of their main features, and particularly of some of the differences they introduce when compared with the current W3C Recommendation for RDFa, RDFa in XHTML: Syntax and Processing: A collection of attributes and processing rules for extending XHTML to support RDF. I think the intent is that the new drafts maintain compatibility with the current recommendation, in the sense that all the features used in XHTML+RDFa 1.0 are also present in RDFa 1.1. I should reiterate what Ivan says at the start of his piece: these are drafts and features may change based on feedback received.

Some of the most interesting features in these drafts, at least for data creators, are those which enable a more concise/compact style of RDFa markup. One of the criticisms of the initial version of RDFa, particularly from communities unfamiliar with RDF syntaxes, was the dependency on the use of prefixed names, in the form of CURIEs, two-part names made up of a "prefix" and a "reference", mapped to URIs by associating the prefix with a "base" URI, and concatenating the reference part of the CURIE with that URI. In XHTML the prefix-URI association was made through an XML Namespace Declaration. In particular, arguments against this approach focused on problems of "copy-and-paste", where a document fragment including RDFa markup was extracted from a source document without also copying the in-scope XML Namespace declarations, and as a result the RDF interpretation of the fragment in the context of a different (paste target) document was changed. More generally, there were some concerns that the use of prefixes was difficult to explain and understand, at least when compared with the "unprefixed name" styles typically adopted in approaches like microformats.

The new drafts introduce several mechanisms which can simplify markup for authors.

I should emphasise that my examples below are based on my fairly rapid reading of the drafts, and any errors and misrepresentations are mine!

The @vocab attribute

@vocab is a new RDFa attribute which provides a means for defining a "default" "vocabulary URI" to which the "terms" in attribute values are appended to construct a URI.

Aside: I should note here that I'm using the word "term" in the sense it is used in the RDFa 1.1 draft, where it refers to a datatype for a string used in an attribute value; this differs from usage by e.g. DCMI, where "term" typically refers to a property, class, vocabulary encoding scheme or syntax encoding scheme, i.e. to the "conceptual resource" identified by a DCMI URI, rather than to a syntactic component. In RDFa 1.1, "terms" have the syntactic constraints of the NCName production in the XML Namespaces specification.

This mechanism provides an alternative to the use of a CURIE (with prefix, reference and namespace declaration) to represent a URI.

Consider an example based on those from my recent post about RDFa (1.0) and document metadata (This is a "hybrid" of the examples 1.1.5, 1.3.5, and 2.1.5 in that post):

XHTML+RDFa 1.0:

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN"
    "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/terms/"
      xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
      version="XHTML+RDFa 1.0">
      
  <head>
    <title>My World Cup 2010 Review</title>
  </head>

  <body>
    <h1 property="dc:title">My World Cup 2010 Review</h1>

    <p>About: 
      <a rel="dc:subject"
        href="http://example.org/resource/2010_FIFA_World_Cup">
        The 2010 World Cup
      </a>
    </p>

    <p>Date last modified: 
      <span property="dc:modified"
        datatype="xsd:date">2010-07-04</span>
    </p>

  </body>
  
</html>

This represents the following three triples (in Turtle):

@prefix dc: <http://purl.org/dc/terms/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex: <http://example.org/resource/> .

<> dc:title "My World Cup 2010 Review" .
<> dc:subject ex:2010_FIFA_World_Cup .
<> dc:modified "2010-07-04"^^xsd:date .

XHTML+RDFa 1.1 using @vocab:

Using the @vocab attribute on the body element to set http://purl.org/dc/terms/ as the default vocabulary URI, I could write this as:

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.1//EN"
    "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-2.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
      version="XHTML+RDFa 1.1">
      
  <head>
    <title>My World Cup 2010 Review</title>
  </head>

  <body vocab="http://purl.org/dc/terms/">
    <h1 property="title">My World Cup 2010 Review</h1>

    <p>About: 
      <a rel="subject"
        href="http://example.org/resource/2010_FIFA_World_Cup">
        The 2010 World Cup
      </a>
    </p>

    <p>Date last modified: 
      <span property="modified"
        datatype="xsd:date">2010-07-04</span>
    </p>

  </body>
  
</html>

In that case, where just three properties are referenced, the reduction in the number of characters is minimal, but if several properties from the same vocabulary were referenced, then the saving could be more substantial.

The @vocab approach provides limited help where, as is often the case, terms from multiple RDF vocabularies are used in combination (e.g. the example above continues to use a CURIE for the URI of the XML Schema date datatype), but other features of RDFa 1.1 are useful in those cases.
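The expansion rule at work here can be sketched in a few lines of Python. This is purely illustrative, not code from the drafts: the function name `expand_property` and the simplified lookup logic are my own invention, and a real RDFa processor has to handle many more cases (safe CURIEs, relative URI references, and so on).

```python
# Hypothetical sketch of how a processor might expand the value of
# @property (or @rel, @datatype) into a URI: try a CURIE first, then
# fall back to the in-scope default vocabulary set by @vocab.

def expand_property(value, vocab=None, prefixes=None):
    """Expand an attribute value to a URI, or return None if no
    mapping is available."""
    prefixes = prefixes or {}
    if ":" in value:  # looks like a CURIE, e.g. "dc:title"
        prefix, reference = value.split(":", 1)
        if prefix in prefixes:
            return prefixes[prefix] + reference
    if vocab is not None:  # bare term, e.g. "title" with @vocab in scope
        return vocab + value
    return None

# With vocab="http://purl.org/dc/terms/" in scope, the bare term
# "title" expands to http://purl.org/dc/terms/title; with an explicit
# prefix mapping (e.g. from xmlns:xsd), "xsd:date" expands to
# http://www.w3.org/2001/XMLSchema#date.
```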

RDFa Profiles and the @profile attribute

Perhaps more powerful than the @vocab attribute is the new RDFa 1.1 feature known as the RDFa profile, and the @profile attribute:

RDFa Profiles are optional external documents that define collections of terms and/or prefix mappings. These documents must be defined in an approved RDFa Host Language (currently XHTML+RDFa [XHTML-RDFA]). They may also be defined in other RDF serializations as well (e.g., RDF/XML [RDF-SYNTAX-GRAMMAR] or Turtle [TURTLE]). RDFa Profiles are referenced via @profile, and can be used by document authors to simplify the task of adding semantic markup.

Let's take each of these two functions - defining terms and defining prefix mappings - in turn.

Defining term mappings in an RDFa profile

An RDFa profile can provide mappings between "terms" and URIs. The following example provides four such "term mappings", for the URIs of three properties from the DC Terms RDF vocabulary and for the URI of one XML Schema datatype:

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.1//EN"
    "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-2.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:rdfa="http://www.w3.org/ns/rdfa#"
      version="XHTML+RDFa 1.1">
      
  <head>
    <title>My RDFa Profile for a few DC and XSD terms</title>
  </head>

  <body>

    <h1>My RDFa Profile for a few DC and XSD terms</h1>

    <ul>

      <li typeof="rdfa:TermMapping">
        <span property="rdfa:term">title</span> :
        <span property="rdfa:uri">http://purl.org/dc/terms/title</span>
      </li>

      <li typeof="rdfa:TermMapping">
        <span property="rdfa:term">about</span> :
        <span property="rdfa:uri">http://purl.org/dc/terms/subject</span>
      </li>

      <li typeof="rdfa:TermMapping">
        <span property="rdfa:term">modified</span> :
        <span property="rdfa:uri">http://purl.org/dc/terms/modified</span>
      </li>

      <li typeof="rdfa:TermMapping">
        <span property="rdfa:term">xsddate</span> :
        <span property="rdfa:uri">http://www.w3.org/2001/XMLSchema#date</span>
      </li>

    </ul>

  </body>
  
</html>

Note that - in contrast to the case of CURIE references - the content of the "term" doesn't have to match the trailing characters of the URI; so, for example, here I've mapped the term "about" to the URI http://purl.org/dc/terms/subject. So sets of "terms" corresponding to various community-specific or domain-specific lexicons could be mapped to a single set of URIs.

Also, a single RDFa profile might provide mappings for URIs from different URI owners - the example above references three DCMI-owned URIs for properties and a W3C-owned URI for a datatype. Conversely, different subsets of URIs owned by a single agency may be referenced in different RDFa profiles.

If the URI of this RDFa profile is http://example.org/profile/terms/, then I can reference it in an XHTML+RDFa 1.1 document, and make use of the term mappings it defines. So taking the example above again, and now using @profile to reference the profile and its term mappings:

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.1//EN"
    "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-2.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      version="XHTML+RDFa 1.1">
      
  <head>
    <title>My World Cup 2010 Review</title>
  </head>

  <body profile="http://example.org/profile/terms/">
    <h1 property="title">My World Cup 2010 Review</h1>

    <p>About: 
      <a rel="about"
        href="http://example.org/resource/2010_FIFA_World_Cup">
        The 2010 World Cup
      </a>
    </p>

    <p>Date last modified: 
      <span property="modified"
        datatype="xsddate">2010-07-04</span>
    </p>

  </body>
  
</html>

The @profile attribute may appear on any XML element, so it is possible that an element with a @profile attribute referencing profile A may contain a child element with a @profile attribute referencing profile B.

  <body profile="http://example.org/profile/a/">
    <h1 property="title">My World Cup 2010 Review</h1>

    <div profile="http://example.org/profile/b/">
 
      <p>About: 
        <a rel="about"
          href="http://example.org/resource/2010_FIFA_World_Cup">
          The 2010 World Cup
        </a>
      </p>

    </div>

  </body>
  

And the value of a single @profile attribute may be a whitespace-separated list of URIs.

  <body profile="http://example.org/profile/a/
    http://example.org/profile/b/">

  </body>

One of the questions I'm not quite sure about is what happens if the same "term" is mapped to different URIs in different profiles. I think, but I'm not 100% sure, that only a single mapping is used and a single triple is generated, but I'm not sure about the precedence rules for determining which mapping is to be used.

As Ivan notes, probably the most common pattern for deploying RDFa profiles will be for the owners/publishers of RDF vocabularies (such as DCMI) to publish profiles for their vocabularies, and for data providers to simply reference those profiles, rather than creating their own.
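As an illustration (not anything defined by the drafts), the effect of a profile's term mappings is essentially a lookup table consulted when the processor meets a bare term. The dictionary below mirrors the example profile above; the function name `resolve_term` is invented for this sketch.

```python
# A term -> URI table, as contributed by the example RDFa profile.
# Note that "about" maps to dcterms:subject - the term need not match
# the trailing characters of the URI.
PROFILE_TERMS = {
    "title":    "http://purl.org/dc/terms/title",
    "about":    "http://purl.org/dc/terms/subject",
    "modified": "http://purl.org/dc/terms/modified",
    "xsddate":  "http://www.w3.org/2001/XMLSchema#date",
}

def resolve_term(term, term_mappings):
    """Map a bare term (e.g. from @property or @rel) to a URI using
    the in-scope profile term mappings; None if unmapped."""
    return term_mappings.get(term)
```

So, under these mappings, the markup `rel="about"` in the example document yields the property URI http://purl.org/dc/terms/subject.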

Defining prefix mappings in an RDFa profile

RDFa 1.1 continues to support the use of XML Namespace Declarations to associate CURIE prefixes with URIs (see my first example above and the use of the XML Schema datatype) but it also introduces other mechanisms for achieving this. One of these is the ability to supply CURIE prefix to URI mappings in RDFa profiles.

The following example provides four such "prefix mappings", for the URIs of three DCMI vocabularies and for the URI of the XML Schema datatype vocabulary:

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.1//EN"
    "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-2.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:rdfa="http://www.w3.org/ns/rdfa#"
      version="XHTML+RDFa 1.1">
      
  <head>
    <title>My RDFa Profile for DC and XSD prefixes</title>
  </head>

  <body>

    <h1>My RDFa Profile for DC and XSD prefixes</h1>

    <ul>

      <li typeof="rdfa:PrefixMapping">
        <span property="rdfa:prefix">dc</span> :
        <span property="rdfa:uri">http://purl.org/dc/terms/</span>
      </li>

      <li typeof="rdfa:PrefixMapping">
        <span property="rdfa:prefix">dcam</span> :
        <span property="rdfa:uri">http://purl.org/dc/dcam/</span>
      </li>

      <li typeof="rdfa:PrefixMapping">
        <span property="rdfa:prefix">dcmitype</span> :
        <span property="rdfa:uri">http://purl.org/dc/dcmitype/</span>
      </li>

      <li typeof="rdfa:PrefixMapping">
        <span property="rdfa:prefix">xsd</span> :
        <span property="rdfa:uri">http://www.w3.org/2001/XMLSchema#</span>
      </li>

    </ul>

  </body>
  
</html>

If the URI of this RDFa profile is http://example.org/profile/prefixes/, then I can reference it in an XHTML+RDFa 1.1 document, and make use of the prefix mappings it defines. Taking the example above again, and using @profile to reference this second profile and its prefix mappings:

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.1//EN"
    "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-2.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      version="XHTML+RDFa 1.1">
      
  <head>
    <title>My World Cup 2010 Review</title>
  </head>

  <body profile="http://example.org/profile/prefixes/">
    <h1 property="dc:title">My World Cup 2010 Review</h1>

    <p>About: 
      <a rel="dc:subject"
        href="http://example.org/resource/2010_FIFA_World_Cup">
        The 2010 World Cup
      </a>
    </p>

    <p>Date last modified: 
      <span property="dc:modified"
        datatype="xsd:date">2010-07-04</span>
    </p>

  </body>
  
</html>

As in the case of term mappings, the issue arises of what happens in the case that two profiles provide different prefix-URI mappings for the same prefix. I think the CURIE datatype is based on the notion that, at any point in a document, a single prefix-URI mapping is in force for a given prefix, so I assume there are precedence rules for establishing which of the profile prefix mappings is to be applied.

Access to profiles and changes to triples?

Although the RDFa 1.1 profile mechanism is powerful, it also introduces a new element of complexity for consumers of RDFa. In RDFa 1.0, an XHTML+RDFa document is "self-contained", by which I mean an RDFa processor can construct an interpretation of the document as a set of RDF triples using only the content of the document itself. In RDFa 1.1, however, the interpretation of terms and prefixes may be determined by the term mappings and prefix mappings specified in profiles external to the document containing the RDFa markup.

Consider my last example above. When the processor encounters the @profile attribute, it retrieves the profile and obtains a list of prefix-URI mappings to be applied in subsequent processing, and when it encounters the CURIE "dc:title" it generates the URI http://purl.org/dc/terms/title.

But if, for some reason, the processor is unable to dereference the profile URI, and doesn't have a cached copy of the referenced profile, then it does not have those mappings available. In that case, for my example above, when the processor encounters the CURIE "dc:title" it would not have a mapping for the "dc" prefix, and (I think?) would instead (with the new "URI everywhere" rules in force) treat the string "dc:title" as a URI. (See e.g. the section on CURIE and URI Processing.)

In the case where two profiles are referenced, and both provide a mapping for the same prefix, then it seems possible that the prefix mapping in force might change depending on the availability of access to the profiles.
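To make that concern concrete, here is a hypothetical sketch of how the in-scope prefix mappings might vary with profile availability. The "later profile wins" merge rule below is only an assumption for the purposes of illustration - the drafts may well specify a different precedence - and the second profile's mapping of "dc" to the DC elements namespace is simply a contrived conflict.

```python
# Two (contrived) profiles that both map the prefix "dc".
PROFILE_A = {"dc": "http://purl.org/dc/terms/"}
PROFILE_B = {"dc": "http://purl.org/dc/elements/1.1/"}

def in_scope_prefixes(fetched_profiles):
    """Merge the prefix mappings of whichever referenced profiles
    could actually be retrieved. Assumes (for illustration only)
    that later profiles override earlier ones."""
    merged = {}
    for profile in fetched_profiles:
        merged.update(profile)
    return merged

# If both profiles are retrievable, B's mapping shadows A's (under
# this assumed rule); if B is unreachable, "dc:title" expands against
# A's mapping instead - so the generated triples differ depending on
# which profiles the processor could fetch.
both = in_scope_prefixes([PROFILE_A, PROFILE_B])
only_a = in_scope_prefixes([PROFILE_A])
```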

I lurk on the RDFa WG list, and I've seen various discussions of how these sorts of issues should be handled - see, for example, this thread on "What happens when you can't dereference a profile document?", though related issues surface in other discussions too. I suspect the current draft is far from the "last word" in this area, and these are the sorts of issues on which the authors are seeking feedback.

Summary

I've focused here only on a few "highlights" of the RDFa 1.1 drafts, and Ivan's post covers a couple more which I won't discuss here (the use of the @prefix attribute to provide CURIE prefix mappings and the ability to use URIs in contexts where previously CURIEs were required), but I hope they give a flavour of the sort of functionality which is being introduced. The examples here are based on my understanding of the current drafts, but I may have made mistakes, so please do check out the drafts rather than relying on my interpretations.

It seems to me the WG is trying hard to address some of the criticisms made of RDFa 1.0, and to provide mechanisms that make the provision of RDFa markup simpler while retaining the power and flexibility of the syntax and ensuring that RDFa 1.0 data remains compatible. In particular, it seems to me the "term mapping" feature of RDFa profiles may be very useful in "shielding" data providers from some of the complexity of name-URI mappings and prefixed names, especially once the owners of commonly used RDF vocabularies start to make such profiles available.

However, such flexibility doesn't come without its own challenges, and it also seems that the profile mechanism in particular introduces some complexity which I imagine will become a focus of some discussion during the comment period for these drafts. Comments on the drafts themselves should be sent to the RDFa Working Group list.

April 22, 2010

Document metadata using DC-HTML and using RDFa

In the context of various bits and pieces of work recently (more of which I'll write about in some upcoming posts), I've been finding myself describing how document metadata that can be represented using DCMI's DC-HTML metadata profile, described in Expressing Dublin Core metadata using HTML/XHTML meta and link elements, might also be represented using RDFa. (N.B. Here I'm considering only the current RDFa in XHTML W3C Recommendation, not the newly announced drafts for RDFa 1.1). So I thought I'd quickly list some examples here. Please note: I don't intend this to be a complete tutorial on using RDFa. Far from it; here I focus only on the case of "document metadata", whereas of course RDFa can be used to represent data "about" any resources. And these are really little more than a few rough notes which one day I might reuse somewhere else.

I really just wanted to illustrate that:

  • in terms of its use with the XHTML meta and link elements, RDFa has many similarities to the DC-HTML profile - unsurprisingly, as the RDF model underlies both; and
  • RDFa also provides the power and flexibility to represent data that cannot be expressed using the DC-HTML profile.

The main differences between using RDFa in XHTML and using the DC-HTML profile are:

  • RDFa supports the full RDF model, not just the particular subset supported by DC-HTML
  • RDFa introduces some new XML attributes (@about, @property, @resource, @datatype, @typeof)
  • RDFa uses a datatype called CURIE for the abbreviation of URIs; DC-HTML uses a prefixed name convention which is essentially specific to that profile (though it was also adopted by the Embedded RDF profile)
  • Perhaps most significantly, RDFa can be used anywhere in an XHTML document, so the same syntactic conventions can be used both for document metadata and for data ("about" any resources) embedded in the body of the document

I'm presenting these examples following the description set model of the DCMI Abstract Model, and in more or less the same order that the DC-HTML specification presents the same set of concepts.

For each example, I present the data:

  • using DC-Text
  • using Turtle
  • in XHTML using DC-HTML
  • in XHTML+RDFa, using meta and link elements
  • in XHTML+RDFa, using block and inline elements (to illustrate that the same data could be embedded in the body of an XHTML document, rather than only in the head)

As an aside, it is possible to use the DC-HTML profile alongside RDFa in the same document, but I haven't bothered to show that here.

Footnote: Hmmm. Considering that I said to myself at the start of the year that I was rather tired of thinking/writing about syntax, I still seem to be doing an awful lot of it! Will try to write about other things soon....

1. Literal Value Surrogates

See DC-HTML 4.5.1.2.

1.1 Plain Value String

1.1.1 DC-Text:

@prefix dc: <http://purl.org/dc/terms/> .
DescriptionSet (
  Description (
    Statement (
      PropertyURI ( dc:title )
      LiteralValueString ( "My World Cup 2010 Review" )
    )
  )
)

1.1.2 Turtle:

@prefix dc: <http://purl.org/dc/terms/> .
<> dc:title "My World Cup 2010 Review" .

1.1.3 XHTML using DC-HTML:

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">

  <head 
    profile="http://dublincore.org/documents/2008/08/04/dc-html/">
    <title>My World Cup 2010 Review</title>
    <link rel="schema.DC" href="http://purl.org/dc/terms/" />
    <meta name="DC.title" content="My World Cup 2010 Review" />
  </head>

</html>

1.1.4 XHTML+RDFa using meta and link

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN"
    "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/terms/"
      version="XHTML+RDFa 1.0">
      
  <head>
    <title>My World Cup 2010 Review</title>
    <meta property="dc:title" content="My World Cup 2010 Review" />
  </head>

</html>

In this example, it would also be possible to simply add an attribute to the title element, instead of introducing the meta element:

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN"
    "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/terms/"
      version="XHTML+RDFa 1.0">
      
  <head>
    <title property="dc:title">My World Cup 2010 Review</title>
  </head>

</html>

1.1.5 XHTML+RDFa in body

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN"
    "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/terms/"
      version="XHTML+RDFa 1.0">
      
  <head>
    <title>My World Cup 2010 Review</title>
  </head>

  <body>
    <h1 property="dc:title">My World Cup 2010 Review</h1>
  </body>
  
</html>

1.2 Plain Value String with Language Tag

1.2.1 DC-Text:

@prefix dc: <http://purl.org/dc/terms/> .
DescriptionSet (
  Description (
    Statement (
      PropertyURI ( dc:title )
      LiteralValueString ( "My World Cup 2010 Review" 
        Language ( en )
      )
    )
  )
)

1.2.2 Turtle:

@prefix dc: <http://purl.org/dc/terms/> .
<> dc:title "My World Cup 2010 Review"@en .

1.2.3 XHTML using DC-HTML:

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">

  <head 
    profile="http://dublincore.org/documents/2008/08/04/dc-html/">
    <title>My World Cup 2010 Review</title>
    <link rel="schema.DC" href="http://purl.org/dc/terms/" />
    <meta name="DC.title" 
      xml:lang="en" content="My World Cup 2010 Review" />
  </head>

</html>

1.2.4 XHTML+RDFa using meta and link

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN"
    "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/terms/"
      version="XHTML+RDFa 1.0">
      
  <head>
    <title>My World Cup 2010 Review</title>
    <meta property="dc:title"
      xml:lang="en" content="My World Cup 2010 Review" />
  </head>

</html>

1.2.5 XHTML+RDFa in body

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN"
    "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/terms/"
      version="XHTML+RDFa 1.0">
      
  <head>
    <title>My World Cup 2010 Review</title>
  </head>

  <body>
    <h1 property="dc:title" xml:lang="en">My World Cup 2010 Review</h1>
  </body>
  
</html>

1.3 Typed Value String

1.3.1 DC-Text:

@prefix dc: <http://purl.org/dc/terms/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
DescriptionSet (
  Description (
    Statement (
      PropertyURI ( dc:modified )
      LiteralValueString ( "2010-07-04"
        SyntaxEncodingSchemeURI ( xsd:date )
      )
    )
  )
)

1.3.2 Turtle:

@prefix dc: <http://purl.org/dc/terms/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
<> dc:modified "2010-07-04"^^xsd:date .

1.3.3 XHTML using DC-HTML:

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">

  <head 
    profile="http://dublincore.org/documents/2008/08/04/dc-html/">
    <title>My World Cup 2010 Review</title>
    <link rel="schema.DC" href="http://purl.org/dc/terms/" />
    <link rel="schema.XSD" href="http://www.w3.org/2001/XMLSchema#" />
    <meta name="DC.modified" 
      scheme="XSD.date" content="2010-07-04" />
  </head>

</html>

1.3.4 XHTML+RDFa using meta and link

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN"
    "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/terms/"
      xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
      version="XHTML+RDFa 1.0">
      
  <head>
    <title>My World Cup 2010 Review</title>
    <meta property="dc:modified" 
      datatype="xsd:date" content="2010-07-04" />
  </head>

</html>

1.3.5 XHTML+RDFa in body

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN"
    "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/terms/"
      xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
      version="XHTML+RDFa 1.0">
      
  <head>
    <title>My World Cup 2010 Review</title>
  </head>

  <body>
    <p>Date last modified: 
      <span property="dc:modified"
        datatype="xsd:date">2010-07-04</span>
    </p>
  </body>
  
</html>

2. Non-Literal Value Surrogates

See DC-HTML 4.5.2.2.

2.1 Value URI

2.1.1 DC-Text:

@prefix dc: <http://purl.org/dc/terms/> .
@prefix ex: <http://example.org/resource/> .
DescriptionSet (
  Description (
    Statement (
      PropertyURI ( dc:subject )
      ValueURI ( ex:2010_FIFA_World_Cup )
    )
  )
)

2.1.2 Turtle:

@prefix dc: <http://purl.org/dc/terms/> .
@prefix ex: <http://example.org/resource/> .
<> dc:subject ex:2010_FIFA_World_Cup .

2.1.3 XHTML using DC-HTML:

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">

  <head 
    profile="http://dublincore.org/documents/2008/08/04/dc-html/">
    <title>My World Cup 2010 Review</title>
    <link rel="schema.DC" href="http://purl.org/dc/terms/" />
    <link rel="DC.subject"
      href="http://example.org/resource/2010_FIFA_World_Cup" />
  </head>

</html>

2.1.4 XHTML+RDFa using meta and link

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN"
    "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/terms/"
      version="XHTML+RDFa 1.0">
      
  <head>
    <title>My World Cup 2010 Review</title>
    <link rel="dc:subject"
      href="http://example.org/resource/2010_FIFA_World_Cup" />
  </head>

</html>

2.1.5 XHTML+RDFa in body

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN"
    "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/terms/"
      version="XHTML+RDFa 1.0">
      
  <head>
    <title>My World Cup 2010 Review</title>
  </head>

  <body>
    <p>About: 
      <a rel="dc:subject"
        href="http://example.org/resource/2010_FIFA_World_Cup">
        The 2010 World Cup
      </a>
    </p>
  </body>
  
</html>

2.2 Value URI with Plain Value String

2.2.1 DC-Text:

@prefix dc: <http://purl.org/dc/terms/> .
@prefix ex: <http://example.org/resource/> .
DescriptionSet (
  Description (
    Statement (
      PropertyURI ( dc:subject )
      ValueURI ( ex:2010_FIFA_World_Cup )
      ValueString ( "2010 FIFA World Cup" )
    )
  )
)

2.2.2 Turtle:

@prefix dc: <http://purl.org/dc/terms/> .
@prefix ex: <http://example.org/resource/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
<> dc:subject ex:2010_FIFA_World_Cup .
ex:2010_FIFA_World_Cup rdf:value "2010 FIFA World Cup" .

2.2.3 XHTML using DC-HTML:

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">

  <head 
    profile="http://dublincore.org/documents/2008/08/04/dc-html/">
    <title>My World Cup 2010 Review</title>
    <link rel="schema.DC" href="http://purl.org/dc/terms/" />
    <link rel="DC.subject"
      href="http://example.org/resource/2010_FIFA_World_Cup"
      title="2010 FIFA World Cup" />
  </head>

</html>

2.2.4 XHTML+RDFa using meta and link

Here the single DCAM statement is made up of two RDF triples, and in RDFa both a link and a meta element are used:

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN"
    "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/terms/"
      xmlns:ex="http://example.org/resource/"
      xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
      version="XHTML+RDFa 1.0">
      
  <head>
    <title>My World Cup 2010 Review</title>
    <link rel="dc:subject"
      href="http://example.org/resource/2010_FIFA_World_Cup" />
    <meta about="[ex:2010_FIFA_World_Cup]"
      property="rdf:value" content="2010 FIFA World Cup" />
  </head>

</html>

2.2.5 XHTML+RDFa in body

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN"
    "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/terms/"
      xmlns:ex="http://example.org/resource/"
      xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
      version="XHTML+RDFa 1.0">
      
  <head>
    <title>My World Cup 2010 Review</title>
  </head>

  <body>
    <p>About: 
      <a rel="dc:subject"
        href="http://example.org/resource/2010_FIFA_World_Cup">
        <span property="rdf:value">2010 FIFA World Cup</span>
      </a>
    </p>
  </body>
  
</html>

2.3 Value URI with Plain Value String with Language Tag

2.3.1 DC-Text:

@prefix dc: <http://purl.org/dc/terms/> .
@prefix ex: <http://example.org/resource/> .
DescriptionSet (
  Description (
    Statement (
      PropertyURI ( dc:subject )
      ValueURI ( ex:2010_FIFA_World_Cup )
      ValueString ( "2010 FIFA World Cup" 
        Language ( en )
      )
    )
  )
)

2.3.2 Turtle:

@prefix dc: <http://purl.org/dc/terms/> .
@prefix ex: <http://example.org/resource/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
<> dc:subject ex:2010_FIFA_World_Cup .
ex:2010_FIFA_World_Cup rdf:value "2010 FIFA World Cup"@en .

2.3.3 XHTML using DC-HTML:

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">

  <head 
    profile="http://dublincore.org/documents/2008/08/04/dc-html/">
    <title>My World Cup 2010 Review</title>
    <link rel="schema.DC" href="http://purl.org/dc/terms/" />
    <link rel="DC.subject"
      href="http://example.org/resource/2010_FIFA_World_Cup"
      xml:lang="en" title="2010 FIFA World Cup" />
  </head>

</html>

2.3.4 XHTML+RDFa using meta and link

Again, the single DCAM statement is made up of two RDF triples, and in RDFa both a link and a meta element are used:

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN"
    "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/terms/"
      xmlns:ex="http://example.org/resource/"
      xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
      version="XHTML+RDFa 1.0">
      
  <head>
    <title>My World Cup 2010 Review</title>
    <link rel="dc:subject"
      href="http://example.org/resource/2010_FIFA_World_Cup" />
    <meta about="[ex:2010_FIFA_World_Cup]"
      property="rdf:value" 
      xml:lang="en" content="2010 FIFA World Cup" />
  </head>

</html>

With RDFa, multiple value strings might be provided, using multiple meta elements (which is not supported in DC-HTML):

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN"
    "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/terms/"
      xmlns:ex="http://example.org/resource/"
      xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
      version="XHTML+RDFa 1.0">
      
  <head>
    <title>My World Cup 2010 Review</title>
    <link rel="dc:subject"
      href="http://example.org/resource/2010_FIFA_World_Cup" />
    <meta about="[ex:2010_FIFA_World_Cup]"
      property="rdf:value" 
      xml:lang="en" content="2010 FIFA World Cup" />
    <meta about="[ex:2010_FIFA_World_Cup]"
      property="rdf:value" 
      xml:lang="es" content="Copa Mundial de Fútbol de 2010" />
  </head>

</html>

2.3.5 XHTML+RDFa in body

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN"
    "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/terms/"
      xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
      version="XHTML+RDFa 1.0">
      
  <head>
    <title>My World Cup 2010 Review</title>
  </head>

  <body>
    <p>About: 
      <a rel="dc:subject" 
        href="http://example.org/resource/2010_FIFA_World_Cup">
        <span property="rdf:value" 
	  xml:lang="en">2010 FIFA World Cup</span>
      </a>
    </p>
  </body>
  
</html>

2.4 Value URI with Typed Value String

2.4.1 DC-Text:

@prefix dc: <http://purl.org/dc/terms/> .
@prefix ex: <http://example.org/resource/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
DescriptionSet (
  Description (
    Statement (
      PropertyURI ( dc:language )
      ValueURI ( ex:English )
      ValueString ( "en"
        SyntaxEncodingSchemeURI ( xsd:language )
      )
    )
  )
)

2.4.2 Turtle:

@prefix dc: <http://purl.org/dc/terms/> .
@prefix ex: <http://example.org/resource/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
<> dc:language ex:English .
ex:English rdf:value "en"^^xsd:language .

2.4.3 XHTML using DC-HTML:

Not supported by DC-HTML.

2.4.4 XHTML+RDFa using meta and link

Again, the single DCAM statement is made up of two RDF triples, and in RDFa both a link and a meta element are used:

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN"
    "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/terms/"
      xmlns:ex="http://example.org/resource/"
      xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
      xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
      version="XHTML+RDFa 1.0">
      
  <head>
    <title>My World Cup 2010 Review</title>
    <link rel="dc:language"
      href="http://example.org/resource/English" />
    <meta about="[ex:English]"
      property="rdf:value" datatype="xsd:language" content="en" />
  </head>

</html>

2.4.5 XHTML+RDFa in body

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN"
    "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/terms/"
      xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
      xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
      version="XHTML+RDFa 1.0">
      
  <head>
    <title>My World Cup 2010 Review</title>
  </head>

  <body>
    <p>Language:
      <a rel="dc:language"
        href="http://example.org/resource/English">
        <span property="rdf:value" 
          datatype="xsd:language" content="en">English</span>
      </a>
    </p>
  </body>

</html>

2.5 Value URI with Vocabulary Encoding Scheme URI

2.5.1 DC-Text:

@prefix dc: <http://purl.org/dc/terms/> .
@prefix ex: <http://example.org/resource/> .
DescriptionSet (
  Description (
    Statement (
      PropertyURI ( dc:subject )
      ValueURI ( ex:2010_FIFA_World_Cup )
      VocabularyEncodingSchemeURI ( ex:MyScheme )
    )
  )
)

2.5.2 Turtle:

@prefix dc: <http://purl.org/dc/terms/> .
@prefix dcam: <http://purl.org/dc/dcam/> .
@prefix ex: <http://example.org/resource/> .
<> dc:subject ex:2010_FIFA_World_Cup .
ex:2010_FIFA_World_Cup dcam:memberOf ex:MyScheme .

2.5.3 XHTML using DC-HTML:

Not supported by DC-HTML.

2.5.4 XHTML+RDFa using meta and link

Again, the single DCAM statement is made up of two RDF triples, and in XHTML using RDFa two link elements are used:

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN"
    "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/terms/"
      xmlns:dcam="http://purl.org/dc/dcam/"
      xmlns:ex="http://example.org/resource/"
      version="XHTML+RDFa 1.0">
      
  <head>
    <title>My World Cup 2010 Review</title>
    <link rel="dc:subject"
      href="http://example.org/resource/2010_FIFA_World_Cup" />
    <link about="[ex:2010_FIFA_World_Cup]"
      rel="dcam:memberOf"
      href="http://example.org/resource/MyScheme" />
  </head>

</html>

2.5.5 XHTML+RDFa in body

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN"
    "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/terms/"
      xmlns:dcam="http://purl.org/dc/dcam/"
      version="XHTML+RDFa 1.0">
      
  <head>
    <title>My World Cup 2010 Review</title>
  </head>

  <body>
    <p>About: 
      <a rel="dc:subject"
        href="http://example.org/resource/2010_FIFA_World_Cup">
	<span rel="dcam:memberOf"
	  resource="http://example.org/resource/MyScheme" />
	The 2010 World Cup
      </a>
    </p>
  </body>
  
</html>

April 13, 2010

A small GRDDL (XML, really) gotcha

I've written previously here about DCMI's use of an HTML meta data profile for document metadata, and the use of a GRDDL profile transformation to extract RDF triples from an XHTML document. DCMI has made use of an HTML profile for many years, but providing a "GRDDL-enabled" version is a more recent development - and one which, I admit, I was quietly pleased to see put in place, as I felt it illustrated rather neatly how DCMI was trying to implement some of the "follow your nose" principles of Web Architecture.

A little while ago, I noticed that the Web-based tools which I usually use to test GRDDL processing (the W3C GRDDL service and the librdf parser demonstrator) were generating errors when I tried to process documents which reference the profile. I've posted a more detailed account of my investigations to the dc-architecture Jiscmail list, and I won't repeat them all here, but in short it comes down to the use of the entity references (&nbsp; and &copy;) in the profile document, which itself is subject to a GRDDL transformation to extract the pointer to the profile transformation.

The problem arises because XHTML defines those entity references in the XHTML DTD, i.e. externally to the document itself, and a non-validating XML processor is not required to read that DTD when parsing the document, with the consequence that it may fail to resolve the references - and there's no guarantee that a GRDDL processor will employ a validating parser. There's a more extended discussion of these issues in a post by Lachlan Hunt from 2005 which concludes:

Character entity references can be used in HTML and in XML; but for XML, other than the 5 predefined entities, need to be defined in a DTD (such as with XHTML and MathML). The 5 predefined entities in XML are: &amp;, &lt;, &gt;, &quot; and &apos;. Of these, you should note that &apos; is not defined in HTML. The use of other entities in XML requires a validating parser, which makes them inherently unsafe for use on the web. It is recommended that you stick with the 5 predefined entity references and numeric character references, or use a Unicode encoding.

And the GRDDL specification itself cautions:

Document authors, particularly XHTML document authors, who wish their documents to be unambiguous when used with GRDDL should avoid dependencies on an external DTD subset; specifically:

  • Explicitly include the XHTML namespace declaration in an XHTML document, or an appropriate namespace in an XML document.
  • Avoid use of entity references, except those listed in section 4.6 of the XML specification.
  • And, more generally, follow the rules listed for the standalone document validity constraint.
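
The failure is easy to reproduce with a non-validating XML parser. As a sketch (not from the original post), Python's expat-based ElementTree resolves only the five predefined entities and rejects &nbsp;, because its definition lives in the external XHTML DTD, which the parser never reads:

```python
import xml.etree.ElementTree as ET

# The five predefined XML entities parse without any DTD...
ET.fromstring('<p>&amp; &lt; &gt; &quot; &apos;</p>')

# ...but &nbsp; is defined only in the external XHTML DTD, which a
# non-validating parser is not required to read, so parsing fails.
try:
    ET.fromstring('<p>&nbsp;</p>')
    entity_resolved = True
except ET.ParseError:
    entity_resolved = False

print(entity_resolved)  # False: the entity reference was not resolved
```

This is exactly the behaviour the GRDDL caution anticipates: whether the document parses at all depends on which class of parser the processor happens to use.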

A note will be added to the DC-HTML profile document to emphasise this point (and the offending references removed).

I guess I was surprised that no-one else had reported the error, particularly as it potentially affects the processing of all instance documents. The fact that they hadn't does rather lend weight to the suspicion I voiced here a few weeks ago that few implementers are actually making use of the DC-HTML GRDDL profile transformation.

February 15, 2010

VCard in RDF

Via a post to the W3C Semantic Web Interest Group mailing list from Renato Iannella, I noticed that the W3C has published an updated specification for expressing vCard data in RDF, Representing vCard Objects in RDF.

A comment from the W3C Team provides some of the background to the document, and explains that it represents a consolidation of an earlier (2001) formulation of vCard in RDF by Renato published by the W3C and a subsequent ontology created by Norman Walsh, Brian Suda and Harry Halpin, the latter also used, at least in part, by Yahoo SearchMonkey (see Local, Event, Product, Discussion).

The new document also provides an answer to a question which I've been unsure about for years: whether the class of vCards was disjoint with the class of "agents" (e.g. persons or organisations). Or, to put it another way, I wasn't sure whether the properties of a vCard could be used to describe persons or organisations, e.g. "this document was created by the agent with the vCard name 'John Smith'":

@prefix dc: <http://purl.org/dc/terms/> .
@prefix v: <http://www.w3.org/2006/vcard/ns#> .
 <> dc:creator
  [
    v:fn "John Smith"
  ] .

(This, I think, is the pattern recommended by Yahoo SearchMonkey: see, for example, the extended "product" example where the manufacturer of a product is an instance of both the v:VCard class and the commerce:Business class.)

The alternative would be that those properties had to be used to describe a vCard-as-"document" (or as something other than an agent, at least), which was in turn related to a person or organisation, e.g. "this document was created by the agent who 'has a vCard' with the vCard name 'John Smith'" (I invented an example "hasVCard" property here):

@prefix dc: <http://purl.org/dc/terms/> .
@prefix v: <http://www.w3.org/2006/vcard/ns#> .
@prefix ex: <http://example.org/terms/> .
 <> dc:creator
  [
    ex:hasVCard
     [
       v:fn "John Smith"
     ]
 ] .

To keep the examples brief I used blank nodes above, but URIs might equally have been used to refer to the vCard and agent resources.

In its description of the VCard class, the new document says: Resources that are vCards and the URIs that denote these vCards can also be the same URIs that denote people/orgs. That phrasing seems a bit awkward, but the intent seems pretty clear: the classes are not disjoint, a single resource can be both a vCard and a person, and the properties of a vCard can be applied to a person. So I can use the pattern in my first example above without creating any sort of contradiction, and the second pattern is also permitted.

One consequence of this is that consumers of data need to allow for both patterns - in the general case, at least, though it may be that they have additional information that within a particular dataset only a single pattern is in use.
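
By way of illustration only (using abbreviated URIs and the same invented ex:hasVCard property as above), a consumer's allowance for both patterns might be sketched over plain triple tuples:

```python
# Sketch: extract a creator's vCard formatted name (v:fn) whether the
# data uses the direct pattern (the agent IS the vCard) or the indirect
# pattern (the agent ex:hasVCard a separate vCard node). Triples are
# plain (subject, predicate, object) tuples with abbreviated URIs.
DC_CREATOR = "dc:creator"
V_FN = "v:fn"
EX_HAS_VCARD = "ex:hasVCard"  # invented property, as in the post

def creator_names(triples, doc):
    names = []
    for s, p, o in triples:
        if s == doc and p == DC_CREATOR:
            agent = o
            # direct pattern: v:fn asserted on the agent itself
            names += [v for a, q, v in triples if a == agent and q == V_FN]
            # indirect pattern: follow ex:hasVCard to a vCard node first
            for a, q, card in triples:
                if a == agent and q == EX_HAS_VCARD:
                    names += [v for c, r, v in triples
                              if c == card and r == V_FN]
    return names

direct = [("<>", DC_CREATOR, "_:a"), ("_:a", V_FN, "John Smith")]
indirect = [("<>", DC_CREATOR, "_:a"), ("_:a", EX_HAS_VCARD, "_:c"),
            ("_:c", V_FN, "John Smith")]
print(creator_names(direct, "<>"))    # ['John Smith']
print(creator_names(indirect, "<>"))  # ['John Smith']
```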

In the example above, I used an example.org URI for the property that relates an agent to a vCard. The discussion on the list highlights that there are a couple of contenders for properties to meet this requirement: ov:businessCard from Michael Hausenblas and hcterms:hasCard from Toby Inkster. A proposal to add such a property to the FOAF vocabulary has been made on the FOAF community wiki.

February 01, 2010

HTML5, document metadata and Dublin Core

I recently received a query about the encoding of Dublin Core metadata in HTML5, the revision of the HTML language being developed jointly by the W3C HTML Working Group and the Web Hypertext Application Technology Working Group (WHATWG). It has also been the topic of some recent discussion on the dc-general mailing list. While I've been aware of some of the discussions around metadata features in HTML5, until now I haven't looked in much detail at the current drafts.

There are various versions of the specification(s), all drafts under development and still changing (at times, quite quickly):

  • The WHATWG has a working draft titled HTML5 (including next generation additions still in development). This document is constantly changing; the content at the time I'm writing is dated 30 January 2010, but will no doubt have changed by the time you read this. Of this spec, the WHATWG says: This draft is a superset of the HTML5 work that is being published at the W3C: everything that is in HTML5 is also in the WHATWG HTML spec. Some new experimental features are being added to the WHATWG HTML draft, to continue developing extensions to the language while the W3C work through their milestones for HTML5. In other words, the WHATWG HTML specification is the next generation of the language, while HTML5 is a more limited subset with a narrower scope.
  • The W3C has a "latest public version" of HTML 5: A vocabulary and associated APIs for HTML and XHTML, currently the version dated 25 August 2009. (The content of that "date-stamped" version should continue to be available.)
  • The W3C always has a "latest editor's draft" of that document, which at the time of writing is dated 30 January 2010, but also continues to change at frequent intervals. Note that, compared to the previous "latest public version", this draft incorporates some element of restructuring of the content, with some content separated out into "modules".

I can't emphasise too strongly that HTML5 is still a draft and liable to change; as the spec itself says in no uncertain terms: Implementors should be aware that this specification is not stable. Implementors who are not taking part in the discussions are likely to find the specification changing out from under them in incompatible ways.

For the purposes of this discussion I've looked primarily at the third document above, the W3C latest editor's draft. This post is really an attempt to raise some initial questions (and probably to expose my own confusion) rather than to provide any definitive answers. It is based on my (incomplete and very probably imperfect) reading of the drafts as they stand at this point in time - and it represents a personal view only, not a DCMI view.

1. Dublin Core metadata in HTML4 and XHTML

(This section covers DCMI's current recommendations for embedding metadata in X/HTML, so feel free to skip it if you are already familiar with this.)

To date, DCMI's specifications for embedding metadata in X/HTML documents have concerned themselves with representing metadata "about" the document as a whole, "document metadata", if you like. And in HTML4/XHTML, the principal source of document metadata is the head element (HTML4, 7.4). Within the head element:

  • the meta element (HTML4, 7.4.4.2) provides for the representation of "property name" (the value of the @name attribute)/"property value" (the value of the @content attribute) pairs which apply to the document. It also offers the ability to supplement the value with the name of a "scheme" (the value of the @scheme attribute) which is used "to interpret the property's value".
  • the link element (HTML4, 12.3) provides a means of representing a relationship between the document and another resource. It also - in attributes like @hreflang and @title - supports the provision of some metadata "about" that second resource.

(I should note here that the above text uses the terminology of the HTML4 specification, not of RDF or the DCMI Abstract Model (DCAM).)

The current DCMI recommendation for embedding document metadata in X/HTML is Expressing Dublin Core metadata using HTML/XHTML meta and link elements - which from here on I'll just refer to as "DC-HTML". Although the current recommendation is dated 2008, that version is only a minor "modernisation" of conventions that DCMI has recommended since the late 1990s. The specification describes a convention for representing what the DCAM calls a description (of the document) - a set of RDF triples - using the HTML meta and link elements and their attributes (and conversely, for interpreting a sequence of HTML meta and link elements as a set of RDF triples/DCAM description set). Contrary to some misconceptions, the convention is not limited to the use of DCMI-owned "terms"; indeed it does not assume the use of any DCMI-owned terms at all.

Consider the example of the following two RDF triples:

@prefix dc: <http://purl.org/dc/terms/> .
@prefix ex: <http://example.org/terms/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
<> dc:modified "2007-07-22"^^xsd:date ;
   ex:commentsOn <http://example.org/docs/123> .

Aside: from the perspective of the DCMI Abstract Model, these would be equivalent to the following description set, expressed using the DC-Text syntax, but for the rest of this post, to keep things simple, I'll just refer to the RDF triples.

@prefix dc: <http://purl.org/dc/terms/> .
@prefix ex: <http://example.org/terms/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
DescriptionSet (
  Description (
    Statement (
      PropertyURI ( dc:modified )
      LiteralValueString ( "2007-07-22"
        SyntaxEncodingSchemeURI ( xsd:date )
      )
    )
    Statement (
      PropertyURI ( ex:commentsOn )
      ValueURI ( <http://example.org/docs/123> )
    )
  )
)

Following the conventions of DC-HTML, those triples are represented in XHTML as:

Example 1: DC-HTML profile in XHTML

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head profile="http://dublincore.org/documents/2008/08/04/dc-html/">
    <title>Document 001</title>
    <link rel="schema.DC"
          href="http://purl.org/dc/terms/" />
    <link rel="schema.EX"
          href="http://example.org/terms/" />
    <link rel="schema.XSD"
          href="http://www.w3.org/2001/XMLSchema#" />
    <meta name="DC.modified"
          scheme="XSD.date" content="2007-07-22" />
    <link rel="EX.commentsOn"
          href="http://example.org/docs/123" />
  </head>
</html>

A few points to highlight:

  • The example is provided in XHTML but essentially the same syntax would be used in HTML4.
  • The triple with literal object is represented using a meta element.
  • The triple with the URI as object is represented using a link element.
  • The predicate (property URI) may be any URI; the DC-HTML convention is not limited to DCMI-owned URIs, i.e. DC-HTML seeks to support the sort of URI-based vocabulary extensibility provided by RDF. There is no "registry" of a bounded set of terms to be used in metadata represented using DC-HTML; or, rather, "the Web is the registry". All an implementer requires to introduce a new property is the authority to assign a URI in some URI space they own (or in which they have been delegated rights to assign URIs).
  • A convention for representing property URIs and datatype URIs as "prefixed names" is used, and in this example three other link elements (with @rel="schema.{prefix}") are introduced to act as "namespace declarations" to support the convention. When a document using DC-HTML is processed, no RDF triples are generated for those link elements. (Aside: I have occasionally wondered whether this is abusing the rel attribute, which is intended to capture the type of relationship between the document and the target resource, i.e. it is using a mechanism which does carry semantics for an essentially syntactic end (the abbreviation of URIs). But I'll suspend those misgivings for now...)
  • The prefixes used in these "prefixed names" are arbitrary, and DC-HTML does not specify the use/interpretation of a fixed set of @name or @rel attribute values. In the example above, I chose to associate the "DC" prefix with the "namespace URI" http://purl.org/dc/terms/, though "traditionally" it has been more commonly associated with the "namespace URI" http://purl.org/dc/elements/1.1/. Another document creator might associate the same prefix with a quite different URI again.
  • The DC-HTML profile generates triples only for those meta and link elements where the values of the @name and @rel attributes contain a prefixed name with a prefix for which there is a corresponding "namespace declaration".
  • The datatype of the typed literal is represented by the value of the meta/@scheme attribute.
  • There is no support for RDF blank nodes.
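
The prefixed-name convention described in the bullet points above can be sketched as a small extractor. This is an illustration only, not a conforming DC-HTML processor: it ignores @scheme/datatypes and language tags, and assumes well-formed XHTML input:

```python
import xml.etree.ElementTree as ET

XHTML = "{http://www.w3.org/1999/xhtml}"

def dc_html_pairs(xhtml):
    """Map head meta/link elements to (property URI, value) pairs using
    the link[@rel="schema.X"] "namespace declarations" - a sketch only."""
    head = ET.fromstring(xhtml).find(XHTML + "head")
    ns = {}  # prefix -> namespace URI, from the schema.X declarations
    for link in head.findall(XHTML + "link"):
        rel = link.get("rel", "")
        if rel.startswith("schema."):
            ns[rel[len("schema."):]] = link.get("href")
    pairs = []
    for el in head:
        name = el.get("name") or el.get("rel") or ""
        prefix, _, local = name.partition(".")
        # only prefixed names with a matching declaration yield triples
        if local and prefix in ns:
            value = el.get("content") or el.get("href")
            pairs.append((ns[prefix] + local, value))
    return pairs

doc = """<html xmlns="http://www.w3.org/1999/xhtml"><head>
  <link rel="schema.DC" href="http://purl.org/dc/terms/" />
  <meta name="DC.modified" content="2007-07-22" />
  <link rel="EX.commentsOn" href="http://example.org/docs/123" />
</head></html>"""
print(dc_html_pairs(doc))
# [('http://purl.org/dc/terms/modified', '2007-07-22')]
```

Note that the EX.commentsOn element yields nothing here, because no schema.EX declaration is present - mirroring the rule that only declared prefixes generate triples.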

For the purposes of this discussion, perhaps the main point to make is that this use/interpretation of meta and link elements is specific to DC-HTML, not a general interpretation defined by the HTML4 specification. The mapping of prefixed names to URIs using link[@rel="schema.ABC"] "namespace declarations" is a DCMI convention, not part of X/HTML. And this is made possible through the use of a feature of HTML4 and XHTML called a "meta data profile": the document creator signals (by providing a specific URI as value of the head/@profile attribute) that they are applying the DC-HTML conventions and the presence of that attribute value licences a consumer to apply that interpretation of the data in a document. And, further, under that profile, as I noted for the example of the "DC" prefix, there is no single "meaning" assigned to meta/@name or link/@rel values.

In the XHTML case, the profile-dependent interpretation is made accessible in machine-processable form through the use of GRDDL, more specifically of a GRDDL profile transformation: a GRDDL processor uses the profile URI to access an XHTML "profile document" which provides a pointer to an XSLT transform which, when applied to an XHTML document using the DC-HTML profile, generates an RDF/XML document representing the appropriate RDF triples.

It may also be worth noting at this point that the profile attribute actually supports not just a single URI as value but a space-separated list of URIs, i.e. within a single document, multiple profiles may be "applicable". And, potentially, those multiple profiles may specify different interpretations of a single @name or @rel value. I think the intent is that in that case all the interpretations should be applied - and in the case that multiple GRDDL profile transformations are provided, the output should be the result of merging the RDF graphs output from each individual transformation.
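
That merge step can be sketched as a plain set union over the triples produced by each profile transformation (a simplification: real RDF graph merging also keeps blank nodes from different graphs distinct, which plain tuples ignore):

```python
# Hypothetical outputs of two GRDDL profile transformations applied to
# the same document, as (subject, predicate, object) tuples with
# abbreviated URIs - invented data, for illustration only.
profile_a = {("<>", "dc:title", "Document 001"),
             ("<>", "dc:modified", "2007-07-22")}
profile_b = {("<>", "dc:title", "Document 001"),  # overlapping triple
             ("<>", "ex:commentsOn", "http://example.org/docs/123")}

# Merging the graphs is a union; duplicate triples collapse into one.
merged = profile_a | profile_b
print(len(merged))  # 3
```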

Now then, having laboured the point about the importance of the concept of the profile, I strongly suspect - though I don't have concrete evidence to support my suspicion - that it is not widely used by applications that provide and consume data using the other conventions described in the DC-HTML document.

It is certainly easy to find many providers of document metadata in X/HTML that follow some of the syntactic conventions of DC-HTML but do not include the @profile attribute. This is (currently, at least) the case even for many documents on DCMI's own site. And I suspect only a (small?) minority of applications consuming/processing DC metadata embedded in X/HTML documents do so by applying the DC-HTML GRDDL profile transform in this way.

I suspect the majority of DC metadata embedded in X/HTML documents is processed without reference to the GRDDL transform, probably without using the @profile attribute value as a "trigger", possibly without generating RDF triples, and perhaps even without applying the "prefixed name"-to-URI mapping at all - i.e. these applications are "on level 1" in terms of the DC "interoperability levels" document.

I suspect there are tools which use meta elements to generate simple property/(literal) value indexes, and do so on the basis of a fixed set of meta/@name values, i.e. they index on the basis that the expected values of the meta/@name attribute are "DC.title", "DC.date" (etc) and those tools would ignore values like "ABC.title", even if the "ABC" prefix was associated (via a link[@rel="schema.ABC"] "namespace declaration") with the URI http://purl.org/dc/elements/1.1/ (or http://purl.org/dc/terms/). But yes, I'm entering the realms of speculation here, and we really need some concrete evidence of how applications process such data.

2. RDFa in XHTML and HTML4

Since that DCMI document was published, the W3C has published the RDFa in XHTML specification, RDFa in XHTML: Syntax and Processing. as a W3C Recommendation. RDFa provides a syntax for embedding RDF triples in an XHTML document using attributes (a combination of pre-existing XHTML attributes and additional RDFa-specific attributes.) Unlike the conventions defined by DC-HTML, RDFa supports the representation of any RDF triple, not only triples "about" the document (i.e. with the document URI as subject), and RDFa attributes can be used anywhere in an XHTML document.

Any "document metadata" that could be encoded using the DC-HTML profile could also be represented using RDFa. DCMI has not yet published any guidance on the use of RDFa - not because it doesn't consider RDFa important, I hasten to add, but only because of a lack of available effort. Having said that, (it seems to me) it isn't an area where DCMI would need a new "recommendation", but it may be useful to have some primer-/tutorial-style materials and examples highlighting the use of common constructs used in Dublin Core metadata.

I don't intend to provide a full summary of RDFa, but it is worth noting that, at the syntax level, RDFa introduces the use of a datatype called CURIE which supports the abbreviation of URI references as prefixed names. In XHTML, at least, the prefixes are associated with URIs via XML Namespace declarations. The use of CURIEs in RDFa might be seen as a more generalised, standardised approach to the problem that DC-HTML seeks to address through its own "prefixed name"/"namespace declaration" convention.

It is perhaps worth highlighting one aspect of the RDFa in XHTML processing model here. In RDFa the XHTML link/@rel attribute is treated as supporting both XHTML link types and CURIEs, and any value that matches an entry in the list of link types in the section The rel attribute, MUST be treated as if it was a URI within the XHTML vocabulary, and all other values must be CURIEs. So, the XHTML link types are treated as "reserved keywords", if you like, and a @rel attribute value of "next" is mapped to an RDF predicate of http://www.w3.org/1999/xhtml/vocab#next. For the case of XHTML, those "reserved keywords" are defined as part of the XHTML+RDFa document. They are also listed in the "namespace document" http://www.w3.org/1999/xhtml/vocab, which itself is an XHTML+RDFa document (though, N.B., there are other terms "in that namespace" which are not intended for use as link/@rel values). For a @rel value that is neither a member of the list nor a valid CURIE (e.g. rel="foobar" or rel="DC.modified" or rel="schema.DC"), no RDF triple is generated by an RDFa processor. As a consequence, RDFa "co-exists" well with the DC-HTML profile, by which I mean that an RDFa processor should generate no unanticipated triples from DC-HTML data in an XHTML+RDFa document.
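
That @rel processing rule can be sketched as a small resolver (a simplification: the real CURIE grammar and the full link-type list are considerably longer): reserved keywords map into the XHTML vocabulary, resolvable CURIEs use the in-scope namespace declarations, and anything else yields no triple:

```python
XHTML_VOCAB = "http://www.w3.org/1999/xhtml/vocab#"
# a few of the reserved XHTML link types (the full list is longer)
LINK_TYPES = {"next", "prev", "stylesheet", "alternate", "license"}

def resolve_rel(value, ns):
    """Return the predicate URI for a @rel value, or None if the value
    is neither a reserved link type nor a resolvable CURIE. Sketch only."""
    if value.lower() in LINK_TYPES:
        return XHTML_VOCAB + value.lower()
    prefix, sep, local = value.partition(":")
    if sep and prefix in ns:
        return ns[prefix] + local
    return None  # e.g. "foobar", "DC.modified", "schema.DC"

ns = {"dc": "http://purl.org/dc/terms/"}
print(resolve_rel("next", ns))        # http://www.w3.org/1999/xhtml/vocab#next
print(resolve_rel("dc:subject", ns))  # http://purl.org/dc/terms/subject
print(resolve_rel("schema.DC", ns))   # None - no triple generated
```

The last case is why the two conventions co-exist: DC-HTML's dotted names contain no colon, so an RDFa processor simply produces nothing for them.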

Using RDFa in XHTML, then, the two example triples above could be represented as follows:

Example 2: RDFa in XHTML

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN"
    "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/terms/"
      xmlns:ex="http://example.org/terms/"
      xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
      version="XHTML+RDFa 1.0">
  <head>
    <title>Document 001</title>
    <meta property="dc:modified"
          datatype="xsd:date" content="2007-07-22" />
    <link rel="ex:commentsOn"
          href="http://example.org/docs/123" />
  </head>
</html>

And of course document metadata could be embedded elsewhere in the XHTML+RDFa document, e.g. instead of using the meta and link elements, the data above could be represented in the body of the document:

Example 3: RDFa in XHTML (2)

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN"
    "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/terms/"
      xmlns:ex="http://example.org/terms/"
      xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
      version="XHTML+RDFa 1.0">
  <head>
    <title>Document 001</title>
  </head>
  <body>
    <p>
      Last modified on:
      <span property="dc:modified"
            datatype="xsd:date" content="2007-07-22">22 July 2007</span>
    </p>
    <p>
      Comments on:
      <a rel="ex:commentsOn"
          href="http://example.org/docs/123">Document 123</a>
    </p>
  </body>
</html>

These examples do not make use of a head/@profile attribute. According to Appendix C of the RDFa in XHTML specification, the use of @profile is optional: a @profile value of http://www.w3.org/1999/xhtml/vocab may be included to support a GRDDL-based transform, but it is not required by an RDFa processor. (Having said that, looking at the profile document http://www.w3.org/1999/xhtml/vocab, I can't see a reference to a GRDDL profile transformation in that document.)

The initial RDFa in XHTML specification covered the case of XHTML only. But RDFa is intended as an approach to be used with other markup languages too, and recently a working draft HTML+RDFa has been published. Again, this is a draft which is liable to change. This document describes how RDFa can be used in HTML5 (in both the XML and non-XML syntax), but the rules are also intended to be applicable to HTML4 documents interpreted through the HTML5 parsing rules. For the most part, it provides a set of minor changes to the syntax and processing rules specified in the RDFa in XHTML document.

I think (but I'm not sure!) the above example in HTML4 would look like the following, the only differences (for this example) being the change in the empty element syntax and the use of a different DTD for validation:

Example 4: RDFa in HTML4

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01+RDFa 1.0//EN"
    "http://www.w3.org/MarkUp/DTD/html401-rdfa-1.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/terms/"
      xmlns:ex="http://example.org/terms/"
      xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
      version="HTML+RDFa 1.0">
  <head>
    <title>Document 001</title>
    <meta property="dc:modified"
          datatype="xsd:date" content="2007-07-22" >
    <link rel="ex:commentsOn"
          href="http://example.org/docs/123" >
  </head>
</html>

3. HTML5

The document HTML5 differences from HTML4 offers a summary of the principal differences between HTML4 and HTML5. One general point to note here is that HTML5 is defined as an "abstract language", specified in terms of the HTML Document Object Model, which can be serialised in a format compatible with HTML4 and also in an XML format. The "differences" document has little to say on issues specifically related to "document metadata", but it does highlight the removal from the language of some elements and attributes, a topic I'll return to below.

As I mentioned above, the current editor's draft version of HTML5 separates some content out into modules. In the current drafts, three items would seem to be of interest when considering conventions for representing metadata "about" a document:

  • the "Document metadata" section of the "core" HTML5 specification
  • the microdata specification
  • the draft HTML+RDFa specification

I'll discuss each of these sources in turn (though I think there is some interdependency in the first two).

3.1. Document Metadata in HTML5

The "Document metadata" section defines the meta and link elements in HTML5. In terms of evaluating how the DC-HTML conventions might be used within HTML5, the following points seem significant:

  • For the @name attribute of the meta element, the spec defines some values, and it provides for a wiki-based registry of other values (HTML5ED, 4.2.5.2).
  • The @scheme attribute of the meta element has been made obsolete and "must not be used by authors".
  • In the property/value pairs represented by meta elements, the value must not be a URI.
  • For the @rel attribute of the link element, the spec defines some values - strictly speaking, tokens that can occur in the attribute value - and it provides for a wiki-based registry of other values (HTML5ED, 5.12.3.19).
  • The @profile attribute of the head element has been made obsolete and "must not be used by authors".

On the validation of values for the meta/@name attribute, HTML5 says:

Conformance checkers must use the information given on the WHATWG Wiki MetaExtensions page to establish if a value is allowed or not: values defined in this specification or marked as "proposed" or "ratified" must be accepted, whereas values marked as "discontinued" or not listed in either this specification or on the aforementioned page must be rejected as invalid. Conformance checkers may cache this information (e.g. for performance reasons or to avoid the use of unreliable network connectivity).

When an author uses a new metadata name not defined by either this specification or the Wiki page, conformance checkers should offer to add the value to the Wiki, with the details described above, with the "proposed" status.

So I think this means that, in order to pass this conformance check as valid, all values of the meta/@name attribute must be registered. The registry currently contains an entry (with status "proposed") for all names beginning "DC.", though I'm not sure whether the registration process is really intended to support such "wildcard" entries. The entry does not indicate whether the intent is that the names correspond to properties of the Dublin Core Metadata Element Set (i.e. with URIs beginning http://purl.org/dc/elements/1.1/) or of the DC Terms collection (i.e. with URIs beginning http://purl.org/dc/terms/). Further, as noted above, the current DC-HTML spec does not prescribe a bounded set of @name values; rather it allows for an open-ended set of prefixed name values, not just names referring to the "terms" owned by DCMI. In HTML5, the expectation seems to be that all such values should be registered. So, for example, when DCMI worked with the Library of Congress to make available a set of RDF properties corresponding to the MARC Relator Codes, identified by LoC-owned URIs, I think the expectation would be that, for data using those properties to be encoded, a corresponding set of @name values would need to be registered. Similarly if an implementer group coins a new URI for a property they require, a new @name value would be required.

If the registration process for HTML5 turns out to be relatively "permissive" (which the text above suggests it may be), this may not be an issue, but it does seem to create a new requirement not present in HTML4/XHTML. However, I note that the registration page currently includes a note that suggests a "high bar" for terms to be "Accepted": for the "Status" section to be changed to "Accepted", the proposed keyword must be defined by a W3C specification in the Candidate Recommendation or Recommendation state; if it fails to go through this process, it is marked "Unendorsed".

Having said that, the microdata specification refers to the possibility that @name values are URIs, and I think that the implication is that such URI values are exempt from the registration requirement (though this does not seem clear from the discussion of registration in the "core" HTML5 spec).

The meta/@scheme attribute, used in DC-HTML to represent datatype URIs for typed literals, is no longer permitted in HTML5. Section 10.2, which offers suggestions for alternatives for some of the features that have been made obsolete, suggests "Use only one scheme per field, or make the scheme declaration part of the value", which I think means either using a different meta/@name value for each potential scheme value (e.g. "date-W3CDTF", "date-someOtherDateFormat") or using some sort of structured string for the @content value with the scheme name embedded (e.g. "2007-07-22|http://purl.org/dc/terms/W3CDTF").
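If the second of those workarounds were adopted, a consumer would have to parse the scheme back out of the @content string. For example (and note the "value|scheme" layout with a "|" separator is purely my invention for illustration, not anything the spec mandates):

```python
# Hypothetical parsing of a @content value that embeds the scheme URI.
# The "value|scheme" convention is invented here purely for illustration.
def split_scheme(content):
    """Return (value, scheme URI or None) from a structured @content string."""
    value, sep, scheme = content.partition("|")
    return (value, scheme if sep else None)

print(split_scheme("2007-07-22|http://purl.org/dc/terms/W3CDTF"))
# ('2007-07-22', 'http://purl.org/dc/terms/W3CDTF')
print(split_scheme("2007-07-22"))
# ('2007-07-22', None)
```

Either workaround means every consuming application has to know the ad hoc convention in use, which is rather less satisfactory than a dedicated attribute.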

The section on the registration of meta/@name attribute values includes the paragraph:

Metadata names whose values are to be URLs must not be proposed or accepted. Links must be represented using the link element, not the meta element.

This constraint appears to rule out the use of meta/@name to represent the property in cases where (in RDF terms) the object is a literal URI. (This is different from the case where the object is an RDF URI reference, which in DC-HTML is covered by the use of the link element.) For example, the DCMI dc:identifier and dcterms:identifier properties may be used in this way to provide a URI which identifies the document - that may be the document URI itself, or it may be another URI which identifies the same document.

A similar issue to that above for the registration of meta/@name attribute values arises for the case of the link/@rel attribute, for which HTML5 says:

Conformance checkers must use the information given on the WHATWG Wiki RelExtensions page to establish if a value is allowed or not: values defined in this specification or marked as "proposed" or "ratified" must be accepted when used on the elements for which they apply as described in the "Effect on..." field, whereas values marked as "discontinued" or not listed in either this specification or on the aforementioned page must be rejected as invalid. Conformance checkers may cache this information (e.g. for performance reasons or to avoid the use of unreliable network connectivity).

When an author uses a new type not defined by either this specification or the Wiki page, conformance checkers should offer to add the value to the Wiki, with the details described above, with the "proposed" status.

AFAICT, the registry currently contains no entries related specifically to DC-HTML or the DCMI vocabularies.

As for the case of name, the microdata specification refers to the possibility that @rel values are URIs, and again I think the implication is that such URI values are exempt from the registration requirement (though, again, this does not seem clear from the discussion in the "core" HTML5 spec).

Finally, the head/@profile attribute is no longer available in HTML5, and Section 10.2 says:

When used for declaring which meta terms are used in the document, unnecessary; omit it altogether, and register the names.

When used for triggering specific user agent behaviors: use a link element instead.

I think DC-HTML's use of head/@profile places it into the second of these categories: the profile doesn't "declare" a bounded set of terms, but it specifies how a (potentially "open-ended") set of attribute values are to be interpreted/processed.

Furthermore, the draft HTML+RDFa document proposes the (optional) use of a link/@rel value of "profile", and there is a corresponding entry in the registry for @rel values. This seems to be a mechanism for (re-)introducing the HTML4 concept of the meta data profile, using a different syntactic form i.e. using link/@rel in place of the profile attribute. I'm not clear about the extent to which this has support within the wider HTML5 community. If it was adopted, I imagine the GRDDL specification would also evolve to use this mechanism, but that is guesswork on my part.

Julian Reschke summarises most of these issues related to DC-HTML in a recent message to the public-html mailing list here.

3.2. Microdata

Microdata is a new specification, specific to HTML5. The "latest editor's draft" version is described as "a module that forms part of the HTML5 series of specifications published at the W3C". The content was previously a part of the "core" HTML5 specification, but the decision was taken recently to separate it from the main spec.

Microdata offers similar functionality to that offered by RDFa in that it allows for the embedding of data anywhere in an HTML5 document. Like RDFa, microdata is a generalised mechanism, not one tied to any particular set of terms, and also like RDFa, microdata introduces a new set of attributes, to be used in combination with existing HTML5 attributes. The syntactic conventions used in microdata are inspired principally by the conventions used in various microformats.

As for the case of RDFa, my purpose here is not to provide a full description of microdata, but to examine whether and how microdata can express the data that in HTML4/XHTML is expressed using the conventions of the DC-HTML profile.

The model underlying microdata is one of nested lists of name-value pairs:

The microdata model consists of groups of name-value pairs known as items.

Each group is known as an item. Each item can have an item type, a global identifier (if the item type supports global identifiers for its items), and a list of name-value pairs. Each name in the name-value pair is known as a property, and each property has one or more values. Each value is either a string or itself a group of name-value pairs (an item).

The microdata model is independent of the RDF model, and is not designed to represent the full RDF model. In particular, microdata does not require the use of URIs as identifiers for properties, though it does allow for the use of URIs. Microdata does not offer - as many RDF syntaxes, including RDFa, do - a mechanism for abbreviating property URIs. But the microdata spec does include an algorithm that provides a (partial, I think?) mapping from microdata to a set of RDF triples.

Probably the main feature of RDF that has no correspondence in microdata is literal datatyping - see the discussion by Jeni Tennison here - though there is a distinct element/attribute for date-time values.

Given this constraint, I don't think it is possible to express the first triple of my example above using microdata. If the typed literal is replaced with a plain literal (i.e. the object is "2007-07-22", rather than "2007-07-22"^^xsd:date), then I think the two triples could be encoded (using the XML serialisation) as follows, i.e. the property URIs appear in full as attribute values:

Example 5: Microdata in HTML5

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE HTML>
<html>
  <head>
    <title>Document 001</title>
    <meta name="http://purl.org/dc/terms/modified"
          content="2007-07-22" />
    <link rel="http://example.org/terms/commentsOn"
          href="http://example.org/docs/123" />
  </head>
</html>

As for the case of RDFa, microdata supports the embedding of data in the body of the document, so the triples could (I think!) also be represented as:

Example 6: Microdata in HTML5 (2)

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE HTML>
<html>
  <head>
    <title>Document 001</title>
  </head>
  <body>
    <div itemscope="" itemid="http://example.org/doc/001">
      <p>
        Last modified on:
        <time itemprop="http://purl.org/dc/terms/modified"
              datetime="2007-07-22">22 July 2007</time>
      </p>
      <p>
        Comments on:
        <a itemprop="http://example.org/terms/commentsOn"
           href="http://example.org/docs/123">Document 123</a>
      </p>
    </div>
  </body>
</html>

My understanding is that the itemid attribute is necessary to set the subject of the triple to the URI of the document, but I could be wrong on this point.

Also I think it's worth noting that the microdata-to-RDF algorithm specifies an RDF interpretation for some "core" HTML5 elements and attributes. For example:

  • the head/title element is mapped to a triple with the predicate http://purl.org/dc/terms/title and the element content as literal object
  • meta elements with name and content attributes are mapped to triples where the predicate is either (if the name attribute value is a URI) the value of the name attribute, or the concatenation of the string "http://www.w3.org/1999/xhtml/vocab#" and the value of the name attribute. So, e.g., a name attribute value of "DC.modified" would generate a predicate http://www.w3.org/1999/xhtml/vocab#DC.modified.
  • A similar rule applies for the link element. So, e.g., a rel attribute value of "EX.commentsOn" would generate a predicate http://www.w3.org/1999/xhtml/vocab#EX.commentsOn and a rel attribute value of "schema.DC" would generate a predicate http://www.w3.org/1999/xhtml/vocab#schema.DC
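The name/rel mapping described in the last two points can be sketched as follows (this is a simplification of the algorithm in the draft: I'm standing in for its absolute-URL detection with a crude scheme check):

```python
# Simplified sketch of the microdata-to-RDF predicate mapping for
# meta/@name and link/@rel values: URIs pass through unchanged, and any
# other value is concatenated onto the XHTML vocabulary URI.
XHTML_VOCAB = "http://www.w3.org/1999/xhtml/vocab#"

def to_predicate(value):
    # Crude absolute-URI test for illustration; the draft's rules are fuller.
    if value.startswith(("http://", "https://")):
        return value
    return XHTML_VOCAB + value

print(to_predicate("DC.modified"))
# http://www.w3.org/1999/xhtml/vocab#DC.modified
print(to_predicate("http://purl.org/dc/terms/modified"))
# http://purl.org/dc/terms/modified
```

The point to note is that, unlike DC-HTML's prefix convention, there is no prefix-to-URI lookup here at all: a value like "DC.modified" is treated as an opaque token and simply appended to the XHTML vocabulary URI.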

As far as I can tell, these are rules to be applied to any HTML5 document - there is no "flag" to say that they apply to document A but not to document B - so would need to be taken into consideration in any DCMI convention for using meta/@name and link/@rel attributes in HTML5. For example, given the following HTML5 document (and leaving aside for a moment the registration issue, and assuming that "EX.commentsOn", "schema.DC" and "schema.EX" are registered values for @name and @rel)

Example 7: Microdata in HTML5 (3)

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE HTML>
<html>
  <head>
    <title>Document 001</title>
    <link rel="schema.DC"
          href="http://purl.org/dc/terms/" />
    <link rel="schema.EX"
          href="http://example.org/terms/" />
    <meta name="DC.modified"
          content="2007-07-22" />
    <link rel="EX.commentsOn"
          href="http://example.org/docs/123" />
  </head>
</html>

the microdata-to-RDF algorithm would generate the following five RDF triples:

@prefix dc: <http://purl.org/dc/terms/> .
@prefix xhv: <http://www.w3.org/1999/xhtml/vocab#> .
<> dc:title "Document 001" ;
   xhv:schema.DC <http://purl.org/dc/terms/> ;
   xhv:schema.EX <http://example.org/terms/> ;
   xhv:DC.modified "2007-07-22" ;
   xhv:EX.commentsOn <http://example.org/docs/123> .

It's probably worth emphasising that although the URIs generated here are non-DCMI-owned URIs, it would be quite possible to assert an "equivalence" between a property with a URI beginning http://www.w3.org/1999/xhtml/vocab# and a corresponding DCMI-owned URI, which would imply a second triple using that DCMI-owned URI (e.g. <http://www.w3.org/1999/xhtml/vocab#DC.modified> owl:equivalentProperty <http://purl.org/dc/terms/modified>) - though, AFAICT, no such equivalence is suggested at the moment.

3.3. RDFa in HTML5

I noted above that, although the initial RDFa syntax specification had focused on the case of XHTML, a recent draft sought to extend this by describing the use of RDFa in HTML, including the case of HTML5.

As I already discussed, using RDFa, it is quite possible to represent any data that could be represented in HTML4/XHTML using the DC-HTML profile. So, using RDFa in HTML5, my two example triples could be represented (using the XML serialisation) as:

Example 8: RDFa in HTML5

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE HTML>
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/terms/"
      xmlns:ex="http://example.org/terms/"
      xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
      version="XHTML+RDFa 1.0">
  <head>
    <title>Document 001</title>
    <meta property="dc:modified"
          datatype="xsd:date" content="2007-07-22" />
    <link rel="ex:commentsOn"
          href="http://example.org/docs/123" />
  </head>
</html>

Note that the RDFa in HTML draft requires the use of the html/@version attribute which, in the current draft, is obsoleted by the "core" HTML5 specification. As noted above, RDFa in HTML also proposes the (optional) use of a link/@rel value of "profile".

In the initial discussion of RDFa above, I noted the existence of a list of "reserved keyword" values for the link/@rel attribute in XHTML, which an RDFa processor maps to RDF predicate URIs by prefixing them with the XHTML vocabulary URI. In HTML5, that list of reserved values is defined not by the HTML5 specification but by the WHATWG registry of @rel values. So there may be cases where a value used in a link/@rel attribute in an HTML4/XHTML document does not result in an RDFa processor generating a triple (because that value is not included in the HTML4/XHTML reserved list), but the same value in a link/@rel attribute in an HTML5 document does cause an RDFa processor to generate a triple (because that value is included in the HTML5 @rel registry). My understanding is that the RDFa/CURIE processing model is designed to cope with such host-language-specific variations, but it is something document creators will need to be aware of.

4. Some concluding thoughts

DCMI's specifications for embedding metadata in X/HTML have focused on "document metadata", data describing the document. The current DCMI Recommendation for encoding DC metadata in HTML was created in 2008, and is based on the DCMI Abstract Model and on RDF. The syntactic conventions are largely those developed by DCMI in the late 1990s. The current document was developed with reference to HTML4 and XHTML only, and it does not take into consideration the changes introduced by HTML5. The conventions described are not limited to the use of a fixed set of DCMI-owned properties, but support the representation of document metadata using any RDF properties.

Looking at the HTML5 drafts raises various issues:

  1. HTML5 removes some of the syntactic components used by the DC-HTML profile in HTML4/XHTML, namely the scheme and profile attributes.
  2. HTML5 introduces a requirement for the registration of meta/@name and link/@rel attribute values; the current DC-HTML specification makes the assumption that an "open-ended" set of values is available for those attributes.
  3. The status of the concept of the meta data profile in HTML5 seems unclear. On the one hand, the profile attribute has been removed, but the proposed registration of a link/@rel value suggests that the profile approach is still available in HTML5.
  4. The microdata specification provides "global" RDF interpretations for meta and link elements in HTML5.
  5. Microdata offers a mechanism for embedding data in HTML5 documents, and that mechanism can be used for embedding some RDF data in HTML5 documents. Microdata has some limitations (the absence of support for typed literals), but microdata could be used to express a large subset of the data that (in HTML4/XHTML) is expressed using the DC-HTML profile. The microdata specification is still a working draft and liable to change.
  6. RDFa also offers a mechanism for embedding RDF data in HTML5 documents. RDFa is designed to support the RDF model, and RDFa could be used in HTML5 to express the same data that (in HTML4/XHTML) is expressed using the DC-HTML profile. The specification for using RDFa in HTML5 is still a working draft and liable to change.

There seems to be a possible tension between HTML5's requirement for the registration of meta/@name and link/@rel values and the assumption in DC-HTML that an "open-ended" set of values is available. Also, the microdata specification's mapping of registered meta/@name and link/@rel values to URIs by simple concatenation differs from DC-HTML's use of a prefix-to-URI mapping. However, as I suggested above, it seems quite probable that at least some applications using Dublin Core metadata in HTML do indeed operate on the basis of a small set of invariant meta/@name and link/@rel values corresponding to (a subset of) the DCMI-owned properties, i.e. they use a subset of the conventions of DC-HTML to represent document metadata using only a limited set of DCMI-owned properties - to represent data conforming to what DCMI would call a single "description set profile". With the addition of assertions of equivalence between properties (see above), it would be possible to represent data conforming to the version of "Simple Dublin Core" that I described a little while ago - i.e. using only the properties of the Dublin Core Metadata Element Set with plain literal values - using HTML5 meta/@name (with a registered set of 15 values) and meta/@content attributes.
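To make that "Simple Dublin Core in HTML5" scenario concrete, here is a sketch of emitting such a record as HTML5 meta elements, on the assumption (and it is only an assumption) that the "DC.<element>" names used are among the registered @name values:

```python
# Sketch: serialising a "Simple Dublin Core" record as HTML5 meta elements,
# assuming the "DC.<element>" @name values in question are registered.
from html import escape

def simple_dc_meta(record):
    """Render {element: value} pairs as HTML5 meta elements."""
    lines = []
    for element, value in record.items():
        lines.append('<meta name="DC.%s" content="%s">'
                     % (element, escape(value, quote=True)))
    return "\n".join(lines)

print(simple_dc_meta({"title": "Document 001", "date": "2007-07-22"}))
```

Note that, under the microdata-to-RDF rules discussed above, such names would map to predicates in the http://www.w3.org/1999/xhtml/vocab# namespace, so the equivalence assertions really are doing the work of connecting the data back to the DCMI-owned properties.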

Both microdata and RDFa, on the other hand, are extensions to HTML5 that are designed to provide generalised, "vocabulary-agnostic" conventions for embedding data in HTML5 documents. Using both microdata and RDFa, data may include, but is not limited to, "document metadata". RDFa is designed to represent RDF data; and microdata can be used to represent some RDF data (it lacks support for typed literals). RDFa includes abbreviation mechanisms for URIs that are broadly similar to those used in the DC-HTML profile (in the sense that they both use a sort of "prefixed name" to abbreviate URIs); microdata does not provide such a mechanism and (I think) would require the use of URIs in full.

Both microdata and RDFa address the problem that DCMI seeks to address via the DC-HTML profile, in the context of a more generalised mechanism for embedding data in HTML5 documents. Both microdata and RDFa could be used in HTML5 to represent document metadata that in HTML4/XHTML is represented using the DC-HTML profile (partially, for the case of microdata because of the absence of datatyping support). Currently, the documents describing RDFa in HTML5 and microdata are both still in draft and the subjects of vigorous debate, and it remains to be seen how/whether they progress through W3C process, and how implementers respond. But it seems to me the use of either would offer an opportunity for DCMI to move away from the maintenance of a DCMI-defined convention and to align with more generalised approaches.

January 31, 2010

Readability and linkability

In July last year I noted that the terminology around Linked Data was not necessarily as clear as we might wish it to be.  Via Twitter yesterday, I was reminded that my colleague, Mike Ellis, has a very nice presentation, Don't think websites, think data, in which he introduces the term MRD - Machine Readable Data.

It's worth a quick look if you have time:

We also used the 'machine-readable' phrase in the original DNER Technical Architecture, the work that went on to underpin the JISC Information Environment, though I think we went on to use both 'machine-understandable' and 'machine-processable' in later work (both even more of a mouthful), usually with reference to what we loosely called 'metadata'.  We also used 'm2m - machine to machine' a lot, a phrase introduced by Lorcan Dempsey I think.  Remember that this was back in 2001, well before the time when the idea of offering an open API had become as widespread as it is today.

All these terms suffer, it seems to me, from emphasising the 'readability' and 'processability' of data over its 'linkedness'. Linkedness is what makes the Web what it is. With hindsight, the major thing that our work on the JISC Information Environment got wrong was to play down the importance of the Web, in favour of a set of digital library standards that focused on sharing 'machine-readable' content for re-use by other bits of software.

Looking at things from the perspective of today, the terms 'Linked Data' and 'Web of Data' both play up the value in content being inter-linked as well as it being what we might call machine-readable.

For example, if we think about open access scholarly communication, the JISC Information Environment (in line with digital libraries more generally) promotes the sharing of content largely through the harvesting of simple DC metadata records, each of which typically contains a link to a PDF copy of the research paper, which, in turn, carries only human-readable citations to other papers.  The DC part of this is certainly MRD... but, overall, the result isn't very inter-linked or Web-like. How much better would it have been to focus some effort on getting more Web links between papers embedded into the papers themselves - using what we would now loosely call a 'micro format'?  One of the reasons I like some of the initiatives around the DOI (though I don't like the DOI much as a technology), CrossRef springs to mind, is that they potentially enable a world where we have the chance of real, solid, persistent Web links between scholarly papers.

RDF, of course, offers the possibility of machine-readability, machine-processable semantics, and links to other content - which is why it is so important and powerful and why initiatives like data.gov.uk need to go beyond the CSV and XML files of this world (which some people argue are good enough) and get stuff converted into RDF form.

As an aside, DCMI have done some interesting work on Interoperability Levels for Dublin Core Metadata. While this work is somewhat specific to DC metadata I think it has some ideas that could be usefully translated into the more general language of the Semantic Web and Linked Data (and probably to the notions of the Web of Data and MRD).

Mike, I think, would probably argue that this is all the musing of a 'purist' and that purists should be ignored - and he might well be right.  I certainly agree with the main thrust of the presentation that we need to 'set our data free', that any form of MRD is better than no MRD at all, and that any API is better than no API.  But we also need to remember that it is fundamentally the hyperlink that has made the Web what it is and that those forms of MRD that will be of most value to us will be those, like RDF, that strongly promote the linkability of content, not just to other content but to concepts and people and places and everything else.

The labels 'Linked Data' and 'Web of Data' are both helpful in reminding us of that.

November 23, 2009

Memento and negotiating on time

Via Twitter, initially in a post by Lorcan Dempsey, I came across the work of Herbert Van de Sompel and his comrades from LANL and Old Dominion University on the Memento project:

The project has since been the topic of an article in New Scientist.

The technical details of the Memento approach are probably best summarised in the paper "Memento: Time Travel for the Web", and Herbert has recently made available a presentation which I'll embed here, since it includes some helpful graphics illustrating some of the messaging in detail:

Memento seeks to take advantage of the Web Architecture concept that interactions on the Web are concerned with exchanging representations of resources. And for any single resource, representations may vary - at a single point in time, variant representations may be provided, e.g. in different formats or languages, and over time, variant representations may be provided reflecting changes in the state of the resource. The HTTP protocol incorporates a feature called content negotiation which can be used to determine the most appropriate representation of a resource - typically according to variables such as content type, language, character set or encoding. The innovation that Memento brings to this scenario is the proposition that content negotiation may also be applied to the axis of date-time: i.e. in the same way that a client might express a preference for the language of the representation based on a standard request header, it could also express a preference that the representation should reflect resource state at a specified point in time, using a custom accept header (X-Accept-Datetime).
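A sketch of what such a request's headers might look like (this is only my illustration of the idea: I'm just constructing headers, not assuming any particular server behaviour, and the header name is the X-Accept-Datetime one from the Memento materials):

```python
# Sketch of datetime negotiation as Memento proposes: an ordinary GET
# carrying an X-Accept-Datetime header, formatted like an HTTP date.
import calendar
from datetime import datetime, timezone
from email.utils import formatdate

def memento_headers(dt):
    """Build request headers asking for the representation as of dt (UTC)."""
    stamp = calendar.timegm(dt.utctimetuple())
    return {
        "X-Accept-Datetime": formatdate(stamp, usegmt=True),
        "Accept-Language": "en",  # ordinary content negotiation still applies
    }

headers = memento_headers(datetime(2007, 7, 22, 12, 0, tzinfo=timezone.utc))
print(headers["X-Accept-Datetime"])
# Sun, 22 Jul 2007 12:00:00 GMT
```

The appeal of the approach is precisely that it reuses the existing negotiation machinery: nothing about the request is unusual except the extra header expressing a preference on the time axis.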

More specifically, Memento uses a flavour of content negotiation called "transparent content negotiation" where the server provides details of the variant representations available, from which the client can choose. Slides 26-50 in Herbert's presentation above illustrate how this technique might be applied to two different cases: one in which the server to which the initial request is sent is itself capable of providing the set of time-variant representations, and a second in which that server does not have those "archive" capabilities but redirects to (a URI supported by) a second server which does.

This does seem quite an ingenious approach to the problem, and one that potentially has many interesting applications, several of which Herbert alludes to in his presentation.

What I want to focus on here is the technical approach, which did raise a question in my mind. And here I must emphasise that I'm really just trying to articulate a question that I've been trying to formulate and answer for myself: I'm not in a position to say that Memento is getting anything "wrong", just trying to compare the Memento proposition with my understanding of Web architecture and the HTTP protocol, or at least the use of that protocol in accordance with the REST architectural style, and understand whether there are any divergences (and if there are, what the implications are).

In his dissertation in which he defines the REST architectural style, Roy Fielding defines a resource as follows:

More precisely, a resource R is a temporally varying membership function MR(t), which for time t maps to a set of entities, or values, which are equivalent. The values in the set may be resource representations and/or resource identifiers. A resource can map to the empty set, which allows references to be made to a concept before any realization of that concept exists -- a notion that was foreign to most hypertext systems prior to the Web. Some resources are static in the sense that, when examined at any time after their creation, they always correspond to the same value set. Others have a high degree of variance in their value over time. The only thing that is required to be static for a resource is the semantics of the mapping, since the semantics is what distinguishes one resource from another.

On representations, Fielding says the following, which I think is worth quoting in full. The emphasis in the first and last sentences is mine.

REST components perform actions on a resource by using a representation to capture the current or intended state of that resource and transferring that representation between components. A representation is a sequence of bytes, plus representation metadata to describe those bytes. Other commonly used but less precise names for a representation include: document, file, and HTTP message entity, instance, or variant.

A representation consists of data, metadata describing the data, and, on occasion, metadata to describe the metadata (usually for the purpose of verifying message integrity). Metadata is in the form of name-value pairs, where the name corresponds to a standard that defines the value's structure and semantics. Response messages may include both representation metadata and resource metadata: information about the resource that is not specific to the supplied representation.

Control data defines the purpose of a message between components, such as the action being requested or the meaning of a response. It is also used to parameterize requests and override the default behavior of some connecting elements. For example, cache behavior can be modified by control data included in the request or response message.

Depending on the message control data, a given representation may indicate the current state of the requested resource, the desired state for the requested resource, or the value of some other resource, such as a representation of the input data within a client's query form, or a representation of some error condition for a response. For example, remote authoring of a resource requires that the author send a representation to the server, thus establishing a value for that resource that can be retrieved by later requests. If the value set of a resource at a given time consists of multiple representations, content negotiation may be used to select the best representation for inclusion in a given message.

So at a point in time t1, the "temporally varying membership function" maps to one set of values, and - in the case of a resource whose representations vary over time - at another point in time t2, it may map to another, different set of values. To take a concrete example, suppose at the start of 2009, I launch a "quote of the day", and I define a single resource that is my "quote of the day", to which I assign the URI http://example.org/qotd/. And I provide variant representations in XHTML and plain text. On 1 January 2009 (time t1), my quote is "From each according to his abilities, to each according to his needs", and I provide variant representations in those two formats, i.e. the set of values for 1 January 2009 is those two documents. On 2 January 2009 (time t2), my quote is "Those who do not move, do not notice their chains", and again I provide variant representations in those two formats, i.e. the set of values for 2 January 2009 (time t2) is two XHTML and plain text documents with different content from those provided at time t1.

So, moving on to that second piece of text I cited, my interpretation of the final sentence as it applies to HTTP (and, as I say, I could be wrong about this) would be that the RESTful use of the HTTP GET method is intended to retrieve a representation of the current state of the resource. It is the value set at that point in time which provides the basis for negotiation. So, in my example here, on 1 January 2009, I offer XHTML and plain text versions of my "From each according to his abilities..." quote via content negotiation, and on 2 January 2009, I offer XHTML and plain text versions of my "Those who do not move..." quote. That is, at two different points in time t1 and t2, different (sets of) representations may be provided for a single resource, reflecting the different state of that resource at those two points in time, but at either point in time, the expectation is that each representation in the set available represents the state of the resource at that point in time, and only members of that set are available via content negotiation. So although representations may vary by language, content type etc., they should be in some sense "equivalent" (Roy Fielding's term) in their representation of the current state of the resource.
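That reading can be sketched as a toy model in Python, with the dates and quotes from the example above: at any instant the server holds a single value set (the variants of the current state), and negotiation selects only from within that set. Media types and the fallback behaviour are simplified for illustration:

```python
# Toy model of the "quote of the day" resource on 2 January 2009.
# Yesterday's quote is no longer in the value set, so it cannot be
# negotiated for; only the current state's variants are on offer.
current_state = {
    "application/xhtml+xml":
        "<p>Those who do not move, do not notice their chains</p>",
    "text/plain":
        "Those who do not move, do not notice their chains",
}

def negotiate(accept):
    """Choose among the current state's variants (crude fallback to
    plain text when the preferred type is unavailable)."""
    return current_state.get(accept, current_state["text/plain"])

print(negotiate("text/plain"))
```

The Memento proposition, by contrast, would have the value set span the states from multiple points in time.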

I think the Memento approach suggests that on 2 January 2009, I could, using the date-time-based negotiation convention, offer all four of those variants listed above (and on each day into the future, a set which increases in membership as I add new quotes). But it seems to me that is at odds with the REST style, because the Memento approach requires that representations of different states of the resource (i.e. the state of the resource at different points in time) are all made available as representations at a single point in time.

I appreciate that (even if my interpretation is correct, which it may not be) the constraints specified by the REST architectural style are just that: a set of constraints which, if observed, generate certain properties/characteristics in a system. And if some of those constraints are relaxed or ignored, then those properties change. My understanding is not good enough to pinpoint exactly what the implications of this particular point of divergence (if indeed it is one!) would be - though as Herbert notes in his presentation, it would appear that there would be implications for caching.

But as I said, I'm really just trying to raise the questions which have been running around my head and which I haven't really been able to answer to my own satisfaction.

As an aside, I think Memento could probably achieve quite similar results by providing some metadata (or a link to another document providing that metadata) which expressed the relationships between the time-variant resource and all the time-specific variant resources, rather than seeking to manage this via HTTP content negotiation.

Postscript: I notice that, in the time it has taken me to draft this post, Mark Baker has made what I think is a similar point in a couple of messages (first, second) to the W3C public-lod mailing list.

November 20, 2009

COI guidance on use of RDFa

Via a post from Mark Birbeck, I notice that the UK Central Office for Information has published some guidelines called Structuring information on the Web for re-usability which include some guidance on the use of RDFa to provide specific types of information, about government consultations and about job vacancies.

This is exciting news: as far as I know, this is the first formal document from UK central government to provide this sort of quite detailed, resource-type-specific guidance with recommendations on the use of particular RDF vocabularies - guidance of the sort I think will be an essential component in the effective deployment of RDFa and the Linked Data approach. It's also the sort of thing that is of considerable interest to Eduserv, as a developer of Web sites for several government agencies. The document builds directly on the work Mark has been doing in this area, which I mentioned a while ago.

As Mark notes in his post, the document is unequivocal in its expression of the government's commitment to the Linked Data approach:

Government is committed to making its public information and data as widely available as possible. The best way to make structured information available online is to publish it as Linked Data. Linked Data makes the information easier to cut and combine in ways that are relevant to citizens.

Before the announcement of these guidelines, I recently had a look at the "argot" for consultations ("argot" is Mark's term for a specification of how a set of terms from multiple RDF vocabularies is used to meet some application requirement; as I noted in that earlier post, I think it is similar to what DCMI calls an "application profile"), and I had intended to submit some comments. I fear it is now somewhat late in the day for me to be doing this, but the release of this document has prompted me to write them up here. My comments are concerned primarily with the section titled "Putting consultations into Linked Data".

The guidelines (correctly, I think) establish a clear distinction between the consultation on the one hand and the Web page describing the consultation on the other by (in paragraphs 30/31) introducing a fragment identifier for the URI of the consultation (via the about="#this" attribute). The consultation itself is also modelled as a document, an instance of the class foaf:Document, which in turn "has as parts" the actual document(s) on which comment is being sought, and for which a reply can be sent to some agent.

I confess that my initial "instinctive" reaction to this was that this seemed a slightly odd choice, as a "consultation" seemed to me to be more akin to an event or a process, taking place during an interval of time, which had as "inputs" a set of documents on which comments were sought, and (typically at least) resulted in the generation of some other document as a "response". And indeed the page describing the Consultation argot introduces the concept as follows (emphasis added):

A consultation is a process whereby Government departments request comments from interested parties, so as to help the department make better decisions. A consultation will usually be focused on a particular topic, and have an accompanying publication that sets the context, and outlines particular questions on which feedback is requested. Other information will include a definite start and end date during which feedback can be submitted and contact details for the person to submit feedback to.

I admit I find it difficult to square this with the notion of a "document". And I think a "consultation-as-event" (described by a Web page) could probably be modelled quite neatly using the Event Ontology or the similar LODE ontology (with some specialisation of classes and properties if required).

Anyway, I appreciate that aspect may be something of a "design choice". So for the remainder of the comments here, I'll stick to the actual approach described by the guidelines (consultation as document).

The RDF properties recommended for the description of the consultation are drawn mainly from Dublin Core vocabularies, and more specifically from the "DC Terms" vocabulary.

The first point to note is that, as Andy noted recently, DCMI has made some fairly substantive changes to the DC Terms vocabulary, as a result of which the majority of the properties are now the subject of rdfs:range assertions, which indicate whether the value of the property is a literal or a non-literal resource. The guidelines recommend the use of the publisher (paragraphs 32-37), language (paragraphs 38-39), and audience (paragraph 46) properties, all with literal values, e.g.

<span property="dc:publisher" content="Ministry of Justice"></span>

But according to the term descriptions provided by DCMI, the ranges of these properties are the classes dcterms:Agent, dcterms:LinguisticSystem and dcterms:AgentClass respectively. So I think that would require the use of an XHTML-RDFa construct something like the following, introducing a blank node (or a URI for the resource if one is available):

<div rel="dc:publisher"><span property="foaf:name" content="Ministry of Justice"></span></div>
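To spell out the difference, here are the triples the two constructs would yield, written as simple subject/predicate/object tuples in Python (nothing here parses RDFa; "_:b0" stands for a blank node, prefixed names are abbreviated, and the page URI is invented for illustration):

```python
# Hand-written triples for the two patterns, for comparison.
PAGE = "<http://example.gov.uk/consultation#this>"  # invented URI

# Literal value, as the guidelines recommend: the publisher *is* a string.
literal_pattern = {
    (PAGE, "dc:publisher", '"Ministry of Justice"'),
}

# Non-literal value, as the dcterms:Agent range implies: the publisher is
# a resource (here a blank node) which *has* that name.
non_literal_pattern = {
    (PAGE, "dc:publisher", "_:b0"),
    ("_:b0", "foaf:name", '"Ministry of Justice"'),
}

print(len(literal_pattern), len(non_literal_pattern))  # 1 2
```

The second form costs an extra triple, but it makes the publisher something an application can say more about (give it a URI, other names, other properties) rather than an opaque string.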

Second, I wasn't sure about the recommendation for the use of the dcterms:source property (paragraph 37). This is used to "indicate the source of the consultation". For the case where this is a distinct resource (i.e. distinct from the consultation and this Web page describing it), this seems OK, but the guidelines also offer the option of referring to the current document (i.e. the Web page) as the source of the consultation:

<span rel="dc:source" resource=""></span>

DCMI's definition of the property is "A related resource from which the described resource is derived", but it seems to me the Web page is acting as a description of the consultation-as-document, rather than a source of it.

Third, the guidelines recommend the use of some of the DCMI date properties (paragraph 42):

  • dcterms:issued for the publication date of the consultation
  • dcterms:available for the start date for receiving comments
  • dcterms:valid for the closing date ("a date through which the consultation is 'valid'")

I think the use of dcterms:valid here is potentially confusing. DCMI's definition is "Date (often a range) of validity of a resource", so on this basis I think the implication of the recommended usage is that the consultation is "valid" only on that date, which is not what is intended. The recommendations for dcterms:issued and dcterms:available are probably OK - though I do think the event-based approach might have helped make the distinction between dates related to documents and dates related to consultation-as-process rather clearer!

Oh dear, this must read like an awful lot of pedantic nitpicking on my part, but my intent is to try to ensure that widely used vocabularies like those provided by DCMI are used as consistently as possible. As I said at the start I'm very pleased to see this sort of very practical guidance appearing (and I apologise to Mark for not submitting my comments earlier!)

November 04, 2009

"Simple DC" Revisited

In a recent post I outlined a picture of Simple Dublin Core as a pattern for constructing DC description sets (a Description Set Profile), in which statements referred to one of the fifteen properties of the Dublin Core Metadata Element Set, with a literal value surrogate in which syntax encoding schemes were not permitted.

While I think this is a reasonable reflection of the informal descriptions of "Simple DC" provided in various DCMI documents, this approach does tend to give primacy to a pattern based solely on the use of literal values. It also ignores the important work done more recently by DCMI in "modernising" its vocabularies, emphasising the distinction between literal and other values, and reflecting that in the formal RDFS descriptions of its terms.

In a document presented to a DCMI Usage Board meeting at DC-2007 in Singapore, an alternative, "modern" approach to "Simple DC" was proposed by Mikael Nilsson and Tom Baker. (I don't have a current URI for the particular document, but it is part of the "meeting packet"). That proposal suggested a view of "Simple DC" as a DSP (actually, it proposed a DCAP, but I'm focussing here on the "structural constraints" component) in which the properties referenced are not the "original" fifteen properties of the DCMES, but rather the fifteen new properties added to the "DC Terms" collection as part of that modernisation exercise:

  • A description set must contain exactly one description (Description Template: Minimum occurrence constraint = 1; Maximum occurrence constraint = 1)
  • That description may be of a resource of any type (Description Template: Resource class constraint: none (default))
  • For each statement in that description, the type of value surrogate supported depends on the range of the property:
    • For the following property URIs: dcterms:title, dcterms:identifier, dcterms:date, dcterms:description (Statement Template: Property List constraint):
      • There may be no such statement; there may be many (Statement Template: Minimum occurrence constraint = 0; Maximum occurrence constraint = unbounded)
      • A literal value surrogate is required (Statement Template: Type constraint = literal)
      • Within that literal value surrogate, the use of a syntax encoding scheme URI is not permitted (Statement Template/Literal Value: Syntax Encoding Scheme Constraint = disallowed)
    • For the following property URIs: dcterms:creator, dcterms:contributor, dcterms:publisher, dcterms:type, dcterms:language, dcterms:format, dcterms:source, dcterms:relation, dcterms:subject, dcterms:coverage, dcterms:rights (Statement Template: Property List constraint):
      • There may be no such statement; there may be many (Statement Template: Minimum occurrence constraint = 0; Maximum occurrence constraint = unbounded)
      • A non-literal value surrogate is required (Statement Template: Type constraint = non-literal)
      • Within that non-literal value surrogate
        • the use of a value URI is not permitted (Statement Template/Non-Literal Value: Value URI Constraint = disallowed)
        • the use of a vocabulary encoding scheme URI is not permitted (Statement Template/Non-Literal Value: Vocabulary Encoding Scheme Constraint = disallowed)
        • a single value string is required (Statement Template/Non-Literal Value/Value String: Minimum occurrence constraint = 1; Maximum occurrence constraint = 1) and the use of a syntax encoding scheme URI is not permitted (Statement Template/Non-Literal Value/Value String: Syntax Encoding Scheme Constraint = disallowed)
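The constraints above can be sketched mechanically. The following is illustrative Python only, not DCMI software: a statement is modelled as a dict, and the dict keys and function name are invented for this sketch.

```python
# Check one statement against the "modern Simple DC" constraints above.
# Keys: "property", "value_type" ("literal"/"non-literal"), optional
# "ses_uri"/"ves_uri"/"value_uri" flags, and a "value_strings" list.
LITERAL_PROPS = {
    "dcterms:title", "dcterms:identifier", "dcterms:date",
    "dcterms:description",
}
NON_LITERAL_PROPS = {
    "dcterms:creator", "dcterms:contributor", "dcterms:publisher",
    "dcterms:type", "dcterms:language", "dcterms:format",
    "dcterms:source", "dcterms:relation", "dcterms:subject",
    "dcterms:coverage", "dcterms:rights",
}

def statement_ok(stmt):
    prop = stmt["property"]
    if prop in LITERAL_PROPS:
        # literal value surrogate, no syntax encoding scheme URI
        return stmt["value_type"] == "literal" and not stmt.get("ses_uri")
    if prop in NON_LITERAL_PROPS:
        # non-literal value surrogate: no value URI, no vocabulary
        # encoding scheme, exactly one plain value string
        return (stmt["value_type"] == "non-literal"
                and not stmt.get("value_uri")
                and not stmt.get("ves_uri")
                and len(stmt.get("value_strings", [])) == 1)
    return False  # property outside both fifteen-member lists

print(statement_ok({"property": "dcterms:creator",
                    "value_type": "non-literal",
                    "value_strings": ["Andy Powell"]}))  # True
```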

This pattern seeks to combine the simplicity of the "traditional" "Simple DC" approach of using only 15 properties with a recognition of the value of using literal and non-literal values as appropriate for each property. However, it is, by definition, slightly more complex than the "all literal values" pattern outlined in the earlier post, and it differs from the patterns described informally in existing DCMI documentation (and I think it would be difficult to argue that it is the pattern represented by formats like the oai_dc XML format, which of course predated the creation of the new properties by several years).

This does not have to be an either/or choice. It may well be that there is a use for both patterns, and if they are clearly named (I don't really care what they are called as long as the names are different!) and documented, there is no reason why two such DSPs should not co-exist.

Having said all that, I'd just re-emphasise that I think both of these patterns are fairly limited in the sort of functionality they can support. It seems to me the notion of "Simple DC" emerged at a time when the emphasis was still very much on the indexing and searching of textual values, and it largely ignores the Web principle of making links between resources. It would be difficult to categorise "Simple DC" - in either of the forms suggested - as a "linked data" friendly approach. I fear a lot of effort has been spent trying to build services on the basis of "Simple DC" when it may have been more appropriate to recognise the inherent limitations of that approach, and to focus instead on richer patterns designed from the outset to support more complex functions.

P.S. I know, I know, I promised a post on "Qualified DC". It's on its way....

October 19, 2009

Helpful Dublin Core RDF usage patterns

In the beginning [*] there was the HTML meta element and we used to write things like:

<meta name="DC.Creator" content="Andy Powell">
<meta name="DC.Subject" content="something, something else, something else again">
<meta name="DC.Date.Available" scheme="W3CDTF" content="2009-10-19">
<meta name="DC.Rights" content="Open Database License (ODbL) v1.0">

Then came RDF and a variety of 'syntax' guidance from DCMI and we started writing:

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns:dcterms="http://purl.org/dc/terms/"
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="http://example.net/something">
    <dc:creator>Andy Powell</dc:creator>
    <dcterms:available>2009-10-19</dcterms:available>
    <dc:subject>something</dc:subject>
    <dc:subject>something else</dc:subject>
    <dc:subject>something else again</dc:subject>
    <dc:rights>Open Database License (ODbL) v1.0</dc:rights>
  </rdf:Description>
</rdf:RDF>

Then came the decision to add 15 new properties to the DC terms namespace which reflected the original 15 DC elements but which added a liberal smattering of domains and ranges.  So, now we write:

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
  xmlns:dcterms="http://purl.org/dc/terms/"
  xmlns:foaf="http://xmlns.com/foaf/0.1/"
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="http://example.net/something">
    <dcterms:creator>
      <dcterms:Agent>
        <rdf:value>Andy Powell</rdf:value>
        <foaf:name>Andy Powell</foaf:name>
      </dcterms:Agent>
    </dcterms:creator>
    <dcterms:available
rdf:datatype="http://purl.org/dc/terms/W3CDTF">2009-10-19</dcterms:available>
    <dcterms:subject>
      <rdf:Description>
        <rdf:value>something</rdf:value>
      </rdf:Description>
    </dcterms:subject>
    <dcterms:subject>
      <rdf:Description>
        <rdf:value>something else</rdf:value>
      </rdf:Description>
    </dcterms:subject>
    <dcterms:subject>
      <rdf:Description>
        <rdf:value>something else again</rdf:value>
      </rdf:Description>
    </dcterms:subject>
    <dcterms:rights
rdf:resource="http://opendatacommons.org/licenses/odbl/1.0/" />
  </rdf:Description>
</rdf:RDF>

Despite the added verbosity and rather heavy use of blank nodes in the latter, I think there are good reasons why moving towards this kind of DC usage pattern is a 'good thing'.  In particular, this form allows the same usage pattern to indicate a subject term by URI or literal (or both - see addendum below) meaning that software developers only need to code to a single pattern. It also allows for the use of multiple literals (e.g. in different languages) attached to a single value resource.

The trouble is, a lot of existing usage falls somewhere between the first two forms shown here.  I've seen examples of both coming up in discussions/blog posts about both open government data and open educational resources in recent days.

So here are some useful rules of thumb around DC RDF usage patterns:

  • DC properties never, ever, start with an upper-case letter (i.e. dcterms:Creator simply does not exist).
  • DC properties never, ever, contain a full-stop character (i.e. dcterms:date.available does not exist either).
  • If something can be named by its URI rather than a literal (e.g. the ODbL licence in the above examples) do so using rdf:resource.
  • Always check the range of properties before use.  If the range is anything other than a literal (as is the case with both dcterms:subject and dcterms:creator for example) and you don't know the URI of the value, use a blank or typed node to indicate the value and rdf:value to indicate the value string.
  • Do not provide lists of separate keywords as a single dc:subject value.  Repeat the property multiple times, as necessary.
  • Syntax encoding schemes, W3CDTF in this case, are indicated using rdf:datatype.
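The first two rules can even be captured mechanically. This trivial sketch is mine (the regex and function name are not anything DCMI publishes):

```python
import re

def plausible_dc_property_name(qname):
    """Rules 1 and 2 above as a single pattern: a dc:/dcterms: prefix,
    then a local name that starts lower-case and contains no full stop."""
    return re.fullmatch(r"(dc|dcterms):[a-z][A-Za-z]*", qname) is not None

print(plausible_dc_property_name("dcterms:creator"))         # True
print(plausible_dc_property_name("dcterms:Creator"))         # False
print(plausible_dc_property_name("dcterms:date.available"))  # False
```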

See the Expressing Dublin Core metadata using the Resource Description Framework (RDF) DCMI Recommendation for more examples and guidance.

[*] The beginning of Dublin Core metadata obviously! :-)

Addendum

As Bruce notes in the comments below, the dcterms:subject pattern that I describe above applies in those situations where you do not know the URI of the subject term. In cases where you do know the URI (as is the case with LCSH for example), the pattern becomes:

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
  xmlns:dcterms="http://purl.org/dc/terms/"
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="http://example.net/something">
    <dcterms:subject>
      <rdf:Description rdf:about="http://id.loc.gov/authorities/sh85101653#concept">
        <rdf:value>Physics</rdf:value>
      </rdf:Description>
    </dcterms:subject>
  </rdf:Description>
</rdf:RDF>

October 14, 2009

Open, social and linked - what do current Web trends tell us about the future of digital libraries?

About a month ago I travelled to Trento in Italy to speak at a Workshop on Advanced Technologies for Digital Libraries organised by the EU-funded CACOA project.

My talk was entitled "Open, social and linked - what do current Web trends tell us about the future of digital libraries?" and I've been holding off blogging about it or sharing my slides because I was hoping to create a slidecast of them. Well... I finally got round to it and here is the result:

Like any 'live' talk, there are bits where I don't get my point across quite as I would have liked but I've left things exactly as they came out when I recorded it. I particularly like my use of "these are all very bog standard... err... standards"! :-)

Towards the end, I refer to David White's 'visitors vs. residents' stuff, about which I note he has just published a video. Nice one.

Anyway... the talk captures a number of threads that I've been thinking and speaking about for the last while. I hope it is of interest.

October 12, 2009

Description Set Profiles for DC-2009

For the first time in a few years, I'm not attending the Dublin Core conference, which is taking place in Seoul, Korea this week. I've been actively trying to cut back on the long-haul travel, mainly to shrink my personal carbon footprint, which for several years was probably the size of a medium-sized Hebridean island. But also I have to admit that increasingly I've found it pretty exhausting to try to engage with a long meeting or deliver a presentation after a long flight, and on occasions I've found myself questioning whether it's the "best" use of my own energy reserves!

More broadly, I suppose I think we - the research community generally, not just the DC community - really have to think carefully about the sustainability of the "traditional" model of large international conferences with hundreds of people, some of them travelling thousands of miles to participate.

But that's a topic for another time, and of course, this week I will miss catching up with friends, practising the fine art of waving my hands around animatedly while also trying to maintain control of a full lunch plate, and hatching completely unrealistic plans over late-night beers. Good luck and safe travels to everyone who is attending the conference in Seoul.

Anyway, Liddy Nevile, who is chairing the Workshop Committee (and has over recent years made efforts to enable and encourage remote participation in the conference), invited Karen Coyle and me to contribute a pre-recorded presentation on Description Set Profiles. We only have a ten minute slot so it's necessarily a rather hasty sprint through the topic, but the results are below.

I think this is the first time I've recorded my own voice since I was in a language lab in an A-Level French class in 1990! To date, I've done my best to avoid anything resembling podcasting, or retrospectively adding audio to any of my presentation slide decks on Slideshare, mostly because I hate listening to the sound of my own voice, perhaps all the more so because I know I tend to "umm" and "err" quite a lot. And even if I write myself a "script" (which in most cases I don't) I seem to find it hard to resist the temptation to change bits and insert additional comments on the fly, and then I realise I'm altering the structure of a sentence and repeating things, and I "umm" and "err" even more... Argh. I'm sure I recorded every sentence of this in Audacity at least three times over! :-)

October 07, 2009

What is "Simple Dublin Core"?

Over the last couple of weeks I've exchanged some thoughts, on Twitter and by email, with John Robertson of CETIS, on the topic of "Qualified Dublin Core", and as we ended up discussing a number of areas where it seems to me there is a good deal of confusion, I thought it might be worth my trying to distill them into a post here (well, it's actually turned into a couple of posts!).

I'm also participating in an effort by the DCMI Usage Board to modernise some of DCMI's core documentation, and I hope this can contribute to that work. However, at this point this is, I should emphasise, a personal view only, based on my own interpretation of historical developments, not all of which I was around to see at first hand, and should be treated accordingly.

The exchange began with a couple of posts from John on Twitter in which he expressed some frustration in getting to grips with the notion referred to as "Qualified Dublin Core", and its relationship to the concept of "DC Terms".

First, I think it's maybe worth taking a step back from the "Qualified DC" question, and looking at the other concepts John mentions in his first question: "the Dublin Core Metadata Element Set (DCMES)" and "Simple Dublin Core", and that's what I'll focus on in this post.

The Dublin Core Metadata Element Set (DCMES) is a collection of (what DCMI calls) "terms" - and it's a collection of "terms" of a single type, a collection of properties - each of which is identified by a URI beginning http://purl.org/dc/elements/1.1/; the URIs are "in that namespace". Historically, DCMI referred to this set of properties as "elements".

Although I'm not sure it is explicitly stated anywhere, I think there is a policy that - at least barring any quite fundamental changes of approach by DCMI - no new terms will be added to that collection of fifteen terms; it is a "closed" set whose membership is fixed in number.

A quick aside: for completeness, I should emphasise that those fifteen properties have not been "deprecated" by DCMI. Although, as I'll discuss in the next post, a new set of properties has been created in the "DC terms" set of terms, the DCMES properties are still available for use in just the same way as the other terms owned by DCMI. The DCMES document says:

Implementers may freely choose to use these fifteen properties either in their legacy dc: variant (e.g., http://purl.org/dc/elements/1.1/creator) or in the dcterms: variant (e.g., http://purl.org/dc/terms/creator) depending on application requirements. The RDF schemas of the DCMI namespaces describe the subproperty relation of dcterms:creator to dc:creator for use by Semantic Web-aware applications. Over time, however, implementers are encouraged to use the semantically more precise dcterms: properties, as they more fully follow emerging notions of best practice for machine-processable metadata.

The intent behind labelling them as "legacy" is, as Tom Baker puts it, to "gently promote" the use of the more recently defined set of properties.

Perhaps the most significant characteristic of that set of terms is that it was created as a "functional" set, by which I mean that it was created with the notion that that set of fifteen properties could and would be used together in combination in the descriptions of resources. And I think this is reflected, for instance, in the fact that some of the "comments" provided for individual properties refer to other properties in that set (e.g. dc:subject/dc:coverage, dc:format/dc:type).

And there was particular emphasis placed on one "pattern" for the construction of descriptions using those fifteen properties, in which a description could contain statements referring only to those fifteen properties, all used with literal values, and any of those 15 properties could be referred to in multiple statements (or in none). In that pattern of usage, the fifteen properties were all "optional and repeatable", if you like. And that pattern is often referred to as "Simple Dublin Core".

Such a "pattern" is what today - if viewed from the perspective of the DCMI Abstract Model and the Singapore Framework - we would call a Description Set Profile (DSP).

So "Simple Dublin Core" might be conceptualised as a DSP designed, initially at least, for use within a very simple, general purpose DC Application Profile (DCAP), constructed to support some functions related to the discovery of a broad range of resources. That DSP specifies the following constraints:

  • A description set must contain exactly one description (Description Template: Minimum occurrence constraint = 1; Maximum occurrence constraint = 1)
  • That description may be of a resource of any type (Description Template: Resource class constraint: none (default))
  • For each statement in that description:
    • The property URI must be drawn from a list of the fifteen URIs of the DCMES properties (Statement Template: Property list constraint: (the 15 URIs))
    • There must be at least one such statement; there may be many (Statement Template: Minimum occurrence constraint = 1; Maximum occurrence constraint = unbounded)
    • A literal value surrogate is required (Statement Template: Type constraint = literal)
    • Within that literal value surrogate, the use of a syntax encoding scheme URI is not permitted (Statement Template/Literal Value: Syntax Encoding Scheme Constraint = disallowed)
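Expressed as code, the constraints above amount to a small checker. The following Python sketch models a description set as plain lists and dicts; the data shapes and names like `matches_simple_dc` are my own illustrative shorthand, not anything defined by DCMI:

```python
# A rough sketch of the "Simple DC" DSP constraints as a checker.
# A description set is modelled as a list of descriptions; each
# description is a list of statements (dicts). These structures are
# illustrative only, not part of any DCMI specification.

DCMES_NS = "http://purl.org/dc/elements/1.1/"
DCMES_PROPERTIES = {DCMES_NS + name for name in [
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
]}

def matches_simple_dc(description_set):
    # Exactly one description per description set
    if len(description_set) != 1:
        return False
    statements = description_set[0]
    # At least one statement; no upper bound on repetition
    if not statements:
        return False
    for st in statements:
        # Property URI must be one of the fifteen DCMES properties
        if st.get("propertyURI") not in DCMES_PROPERTIES:
            return False
        # A literal value surrogate is required...
        if "literal" not in st:
            return False
        # ...and no syntax encoding scheme URI is permitted
        if "syntaxEncodingSchemeURI" in st:
            return False
    return True

record = [[
    {"propertyURI": DCMES_NS + "title", "literal": "Opticks"},
    {"propertyURI": DCMES_NS + "creator", "literal": "Isaac Newton"},
]]
print(matches_simple_dc(record))  # True
```

Note that nothing here mentions a particular resource type: under this pattern any of the fifteen properties may appear any number of times, or not at all.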

And this DSP represents a "pattern" that is quite widely deployed, perhaps most notably in the context of systems supporting the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), which requires that an OAI-PMH repository expose records using an XML format called oai_dc, which is essentially a serialisation format for this DSP. (There may be an argument that the "Simple DC" pattern has been overemphasised at the expense of other patterns, and as a result people have poured their effort into using that pattern when a different one might have been more appropriate for the task at hand, but that's a separate discussion!)

It seems to me that, historically, the association between the DCMES as a set of terms on the one hand and that particular pattern of usage of those terms on the other was so close that, at least in informal accounts, the distinction between the two was barely made at all. People tended to (and still do) use the terms "Dublin Core Metadata Element Set" and "Simple Dublin Core" interchangeably. So, for example, in the introduction to the Usage Guide, one finds comments like "Simple Dublin Core comprises fifteen elements" and "The Dublin Core basic element set is outlined in Section 4. Each element is optional and may be repeated." I'd go as far as to say that many uses of the generic term "Dublin Core", informal ones at least, are actually references to this one particular pattern of usage. (I think the glossary of the Usage Guide does try to establish the difference, referring to "Simple Dublin Core" as "The fifteen Dublin Core elements used without qualifiers, that is without element refinement or encoding schemes.")

The failure to distinguish more clearly between a set of terms and one particular pattern of usage of those terms has caused a good deal of confusion, and I think this will become more apparent when we consider the (rather more complex) case of "Qualified Dublin Core", as I'll do in the next post, and it's an area which I'm hoping will be addressed as part of the Usage Board review of documentation.

If you look at the definitions of the DCMES properties, in the human-readable document, and especially in the RDF Schema descriptions provided in the "namespace document" http://purl.org/dc/elements/1.1/, with the possible exceptions of the "cross-references" I mentioned above, those definitions don't formally say anything about using those terms together as a set, or about "optionality/repeatability": they just define the terms; they are silent about any particular "pattern of usage" of those terms.

So, such patterns of usage of a collection of terms exist distinct from the collection of terms. And it is possible to define multiple patterns of usage, multiple DSPs, referring to that same set of 15 properties. In addition to the "all optional/repeatable" pattern, I might find myself dealing with some set of resources which all have identifiers and all have names, and operations on those identifiers and names are important to my application, so I could define a pattern/DSP ("PeteJ's Basic DC" DSP) where I say all my descriptions must contain at least one statement referring to the dc:identifier property and at least one statement referring to the dc:title property, and the other thirteen properties are optional/repeatable, still all with literal values. Another implementer might find themselves dealing with some stuff where everything has a topic drawn from some specified SKOS concept scheme, so they define a pattern ("Fred's Easy DC" DSP) which says all their descriptions must contain at least one statement referring to the dc:subject property and they require the use, not of a literal, but of a value URI from that specific set of URIs. So now we have three different DC Application Profiles, incorporating three different patterns for constructing description sets (three different DSPs), each referring to the same set of 15 properties.
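The three hypothetical profiles just described might be jotted down as data, to make the point that the vocabulary stays constant while the constraints vary. A rough Python sketch - the dict layout and the `meets_required` helper are my own shorthand, and "PeteJ's Basic DC" and "Fred's Easy DC" are the invented examples above, not published profiles:

```python
# The same fifteen DCMES properties under three different sets of
# structural constraints. Only the constraints differ; the vocabulary
# referenced is shared. (Illustrative sketch only.)
DC = "http://purl.org/dc/elements/1.1/"

profiles = {
    # all fifteen optional and repeatable, literal values only
    "Simple DC": {"required": {}, "values": "literal"},
    # identifier and title mandatory, the rest optional/repeatable
    "PeteJ's Basic DC": {
        "required": {DC + "identifier": 1, DC + "title": 1},
        "values": "literal",
    },
    # subject mandatory, with a value URI from a SKOS concept scheme
    "Fred's Easy DC": {
        "required": {DC + "subject": 1},
        "values": "value URI",
    },
}

def meets_required(statements, profile):
    # Check only the minimum-occurrence constraints of the profile
    return all(
        sum(1 for s in statements if s["propertyURI"] == prop) >= n
        for prop, n in profile["required"].items()
    )

stmts = [{"propertyURI": DC + "title", "literal": "Opticks"}]
print(meets_required(stmts, profiles["Simple DC"]))         # True
print(meets_required(stmts, profiles["PeteJ's Basic DC"]))  # False
```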

It's also worth noting that the "Simple DC" pattern of usage, a single set of structural constraints, could be deployed in multiple DC Application Profiles, supporting different applications and containing different human-readable guidelines. (I was going to point to the document Using simple Dublin Core to describe eprints as an actual example of this, but having read that document again, I think strictly speaking it probably introduces additional structural constraints (i.e. introduces a different DSP), e.g. it requires that statements using the dc:type property refer to values drawn from a bounded set of literal values.)

The graphic below is an attempt to represent what I see as those relationships between the DCMES vocabulary, DSPs and DCAPs:

Slide1

Finally, it's worth emphasising that the 15 properties of the DCMES, or indeed any subset of them - there is no requirement that a DSP refer to all, or indeed any, of the properties of the DCMES - may be referred to in other DSPs in combination with other terms from other vocabularies, owned either by DCMI or by other parties.

Slide2

The point that DCMI's concept of an "application profile" is not based either on the use of the DCMES properties in particular or on the "Simple DC" pattern is an important one. Designing a DC application profile does not require taking either the DCMES or the "Simple DC" pattern as a starting point; any set of properties, classes, vocabulary encoding schemes and syntax encoding schemes, owned by any agency, can be referenced. But that is rather leading me into the next post, where I'll consider the similar (but rather more messy) picture that emerges once we start talking about "Qualified DC".

October 05, 2009

QNames, URIs and datatypes

I posted a version of this on the dc-date Jiscmail list recently, but I thought it was worth a quick posting here too, as I think it's a slightly confusing area and the difference in naming systems is something which I think has tripped up some DC implementers in the past.

The Library of Congress has developed a (very useful looking) XML schema datatype for dates, called the Extended Date Time Format (EDTF). It appears to address several of the issues which were discussed some time ago by the DCMI Date working group. It looks like the primary area of interest for the LoC is the use of the data type in XML and XML Schema, and the documentation for the new datatype illustrates its usage in this context.

However, as far as I can see, it doesn't provide a URI for the datatype, which is a prerequisite for referencing it in RDF. I think sometimes we assume that providing what the XML Namespaces spec calls an expanded name is sufficient, and/or that doing that automatically also provides a URI, but I'm not sure that is the case.

In XML Schema, a datatype is typically referred to using an XML Qualified Name (QName), which is essentially a "shorthand" for an "expanded name". An expanded name has two parts: a (possibly null-valued) "namespace name" and a "local name".

Take the case of the built-in datatypes defined as part of the XML Schema specification. In XML/XML Schema, these datatypes are referenced using these two part "expanded names", typically represented in XML documents as QNames e.g.

<count xsi:type="xsd:int">42</count>

where the prefix xsd is bound to the XML namespace name "http://www.w3.org/2001/XMLSchema". So the two-part expanded name of the XML Schema integer datatype is

( http://www.w3.org/2001/XMLSchema , int )

Note: no trailing "#" on that namespace name.

The key point here is that the expanded name system is a different naming system from the URI system. True, the namespace name component of the expanded name is a URI (if it isn't null), but the expanded name as a whole is not. And furthermore, there is no generalised mapping from expanded name to URI.

To refer to a datatype in RDF, I need a URI, and for the built-in datatypes provided, the XML Schema Datatyping spec tells me explicitly how to construct such a URI:

Each built-in datatype in this specification (both *primitive* and *derived*) can be uniquely addressed via a URI Reference constructed as follows:

  1. the base URI is the URI of the XML Schema namespace
  2. the fragment identifier is the name of the datatype

For example, to address the int datatype, the URI is:

* http://www.w3.org/2001/XMLSchema#int

The spec tells me that for the names of this set of datatypes, to generate a URI from the expanded name, I use the local part as the fragment id - but some other mechanism might have been applied.

And this is different from, say, the way RDF/XML maps the QNames of some XML elements to URIs, where the mechanism is simple concatenation of "namespace name" and "local name"
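The two conventions can be put side by side in a couple of lines (a Python sketch; the function names are mine):

```python
# Two different expanded-name-to-URI conventions, side by side.
# Neither is a general rule: each applies only where its spec says so.
XSD_NS = "http://www.w3.org/2001/XMLSchema"   # no trailing "#"

def xsd_builtin_uri(namespace_name, local_name):
    # XML Schema datatypes spec: the local name becomes the fragment id
    return namespace_name + "#" + local_name

def rdfxml_element_uri(namespace_name, local_name):
    # RDF/XML: simple concatenation of namespace name and local name
    return namespace_name + local_name

print(xsd_builtin_uri(XSD_NS, "int"))
# http://www.w3.org/2001/XMLSchema#int
print(rdfxml_element_uri(XSD_NS, "int"))
# http://www.w3.org/2001/XMLSchemaint  -- not the datatype URI!
```

The second result is exactly the trap for anyone carrying the hash-less XML namespace name over into a context where concatenation is the rule.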

(And, as an aside, I think this makes for a bit of "gotcha" for XML folk coming to RDF syntaxes like Turtle where datatype URIs can be represented as prefixed names, because the "namespace URI" you use in Turtle, where the simple concatenation mechanism is used, needs to include the trailing "#" (http://www.w3.org/2001/XMLSchema#), whereas in XML/XML Schema people are accustomed to using the "hash-less" XML namespace name (http://www.w3.org/2001/XMLSchema)).

But my understanding is that the mapping defined by the XML Schema datatyping document is specific to the datatypes defined by that document ("Each built-in datatype in this specification... can be uniquely addressed..." (my emphasis)), and the spec is silent on URIs for other user-defined XML Schema datatypes.

So it seems to me that if a community defines a datatype, and they want to make it available for use in both XML/XML Schema and in RDF, they need to provide both an expanded name and a URI for the datatype.

The LoC documentation for the datatype has plenty of XML examples so I can deduce what the expanded name is:

(info:lc/xmlns/edtf-v1 , edt)

But that doesn't tell me what URI I should use to refer to the datatype.

I could make a guess that I should follow the same mapping convention as that provided for the XML Schema built-in datatypes, and decide that the URI to use is

info:lc/xmlns/edtf-v1#edt

But given that there is no global expanded name-URI mapping, and the XML Schema specs provide a mapping only for the names of the built-in types, I think I'd be on shaky ground if I did that, and really the owners of the datatype need to tell me explicitly what URI to use - as the XML Schema spec does for those built-in types.

Douglas Campbell responded to my message pointing to a W3C TAG finding which provides this advice:

If it would be valuable to directly address specific terms in a namespace, namespace owners SHOULD provide identifiers for them.

Ray Denenberg has responded with a number of suggestions: I'm less concerned about the exact form of the URI than that a URI is explicitly provided.

P.S. I didn't comment on the list on the choice of URI scheme, but of course I agree with Andy's suggestion that value would be gained from the use of the http URI scheme.... :-)

September 22, 2009

VoCamp Bristol

At the end of the week before last, I spent a couple of days (well, a day and a half as I left early on Friday) at the VoCamp Bristol meeting, at ILRT, at the University of Bristol.

To quote the VoCamp wiki:

VoCamp is a series of informal events where people can spend some dedicated time creating lightweight vocabularies/ontologies for the Semantic Web/Web of Data. The emphasis of the events is not on creating the perfect ontology in a particular domain, but on creating vocabs that are good enough for people to start using for publishing data on the Web.

I admit that I went into the event slightly unprepared, as I didn't have any firm ideas about any specific vocabulary I wanted to work on, but happy to join in with anyone who was working on anything of interest. Some of the outputs of the various groups are listed on the wiki page.

As well as work on specific vocabularies, the opening discussions highlighted an interest in a small set of more general issues, which included the expression of "structural constraints" and "validation"; broader questions of collecting and interpreting vocabulary usage; representing RDF data using JSON; and the features available in OWL 2. Friday morning was set aside for those topics, which meant I had an opportunity to talk a little bit about the work being done within the Dublin Core Metadata Initiative on "Description Set Profiles", which I've mentioned in some recent posts here. I did hastily knock up a few slides, mainly as an aide memoire to make sure I mentioned various bits and pieces:

There was a useful discussion around various different approaches for representing such patterns of constraints at the level of the RDF graph, either based on query patterns, or on the use of OWL (with a "closed-world" assumption that the "world" in question is the graph at hand). Some of the new features in OWL 2, such as capabilities for expressing restrictions on datatypes, seem to make it quite an attractive candidate for this sort of task.

I was asked about whether we had considered the use of OWL in the DCMI context. IIRC, we decided against it mostly because we wanted an approach that built explicitly on the description model of the DCMI Abstract Model (i.e. talked in terms of "descriptions" and "statements" and patterns of use of those particular constructs), though I think the "open-world" considerations were also an issue (See this piece for a discussion of some of the "gotchas" that can arise).

Having said that, it would seem a good idea to explore to what extent the constraint types permitted by the DSP model might be mapped into other form(s) of expressing constraints which might be adopted.

All in all, it was a very enjoyable couple of days: a fairly low-key, thoughtful, gentle sort of gathering - no "pitches", no prizes, no "dragons" in their "dens", or other cod-"bizniz" memes :-) - just an opportunity for people to chat and work together on topics that interested them. Thank you to Tom & Damian & Libby for doing the organisation (and introducing me to a very nice Chinese restaurant in Bristol on the Thursday night!)

September 16, 2009

Edinburgh publish guidance on research data management

The University of Edinburgh has published some local guidance about the way that research data should be managed, Research data management guidance, covering How to manage research data and Data sharing and preservation, as well as detailing local training, support and advice options.

One assumes that this kind of thing will become much more common at universities over the next few years.

Having had a very quick look, it feels like the material is more descriptive than prescriptive - which isn't meant as a negative comment, it just reflects the current state of play. The section on Data documentation & metadata for example, gives advice as simple as:

Have you created a "readme.txt" file to describe the contents of files in a folder? Such a simple act can be invaluable at a later date.

but also provides a link to the UK Data Archive's guidance on Data Documentation and Metadata, which at first sight appears hugely complex. I'm not sure what your average researcher will make of it.

(In passing, I note that the UKDA seem to be promoting the use of the Data Documentation Initiative standard at what they call the 'catalogue' level, a standard that I've not come across before but one that appears to be rooted firmly outside the world of linked data, which is a shame.)

Similarly, the section on Methods for data sharing lists a wide range of possible options (from "posting on a University website" thru to "depositing in a data repository") without being particularly prescriptive about which is better and why.

(As a second aside, I am continually amazed by this firm distinction in the repository world between 'posting on the website' and 'depositing in a repository' - from the perspective of the researcher, both can, and should, achieve the same aims, i.e. improved management, more chance of persistence and better exposure.)

As we have found with repositories of research publications, it seems to me that research data repositories (the Edinburgh DataShare in this case) need to hide much of this kind of complexity, and do most of the necessary legwork, in order to turn what appears to be a simple and obvious 'content management' workflow (from the point of view of the individual researcher) into a well managed, openly shared, long term resource for the community.

September 01, 2009

Experiments with DSP and Schematron

There has been some discussion recently around DCMI's draft Description Set Profile specification, both on the dc-architecture Jiscmail list and, briefly, on Twitter.

From my perspective, the DSP specification is one of the most interesting recent technical developments made by DCMI. For me, it provides the long-needed piece of the jigsaw that enables us to construct a coherent picture of what a "DC application profile" is. What do these tabular lists of "terms", or combinations of terms, that have typically appeared in these documents people call "DC application profiles" actually "say"? What does "use dc:subject with vocabulary encoding schemes S and T" actually "mean"? How can we formalise this information?

To recap, the DSP specification takes the approach that what is at issue here is a set of "structural constraints" on the information structure that the DCMI Abstract Model calls a "description set". The DCAM itself defines the basic structure (a "description set" contains "descriptions"; a "description" contains "statements"; a "statement" contains a "property URI", an optional "value URI" and "vocabulary encoding scheme URI", and so on). But that's where the DCAM stops: it doesn't say anything about any particular set of property URIs or vocabulary encoding scheme URIs; it doesn't specify whether, in the particular set of description sets I'm creating, plain literals should be in English or Spanish. This is where the DSP spec comes in. The DSP model allows us to say, "I want to apply a more specific set of requirements: a description of a book must provide a title (i.e. must include a statement with property URI http://purl.org/dc/terms/title) and must include exactly two subject terms from LCSH (i.e. must include two statements with property URI http://purl.org/dc/terms/subject and vocabulary encoding scheme URI http://purl.org/dc/terms/LCSH); a description of a person is optional, but if included it must provide a name (i.e. must include a statement with property URI http://xmlns.com/foaf/0.1/name)."

To express these constraints, the spec defines a model of "Description Templates", in turn containing sets of "Statement Templates". A set of such templates provides a set of "patterns", if you like, to which some set of actual "instance" descriptions sets can "match" or "conform". The specification also defines both an XML syntax and an RDF vocabulary for representing such a set of constraints.
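As an illustration, the book/person requirements above might be written out in that template model as plain data. This is my own shorthand for the Description Template / Statement Template structure, not the XML or RDF syntax defined in the draft; `None` stands for "unbounded":

```python
# The book/person requirements as Description Templates containing
# Statement Templates. Illustrative shorthand only.
DCTERMS = "http://purl.org/dc/terms/"
FOAF = "http://xmlns.com/foaf/0.1/"

book_dsp = [
    {   # Description Template: the book (exactly one required)
        "min": 1, "max": 1,
        "statements": [
            # must provide a title (at least one)
            {"propertyURI": DCTERMS + "title",
             "min": 1, "max": None},
            # exactly two subject terms from LCSH
            {"propertyURI": DCTERMS + "subject",
             "min": 2, "max": 2,
             "vocabularyEncodingSchemeURI": DCTERMS + "LCSH"},
        ],
    },
    {   # Description Template: the person (optional)
        "min": 0, "max": None,
        "statements": [
            # if a person is described, a name is mandatory
            {"propertyURI": FOAF + "name", "min": 1, "max": None},
        ],
    },
]
```

A conformance checker then walks an "instance" description set against these templates, counting descriptions and statements against the minimum/maximum occurrence constraints.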

As an aside, it's also worth noting that a single description set may be matched against multiple profiles, depending on the context (or indeed against none: there is no absolute requirement that a description set matches any DSP at all). The same description set may be tested against a fairly permissive set of constraints in one context, and a "tighter" set of constraints in another: the same description set may match the former, and fail to match the latter. To paraphrase James Clark's comments on XML schema, "validity" should be treated not as a property of a description set but as a relationship between a description set and a description set profile.

The current draft is very much just that, a draft on which feedback is being gathered. Are the current constraints fully/clearly specified? Is the processing algorithm complete/unambiguous? Are the current constraint types the ones typically required? Are there other constraint types which would be useful? And it is almost certain that there will be changes made in a future version, but nevertheless, it seems to me it is a very solid first step, and it's very encouraging to see that implementers are starting to test out the current model in earnest.

One of the questions that I've been asked in discussions is that of how the DSP model relates to XML schema languages.

A description set might be represented in many different concrete formats, including XML formats. XML schema languages (and here I'm using that term in a generic sense to refer to the family of technologies, not specifically to W3C XML Schema, one particular XML schema language) allow you to express a set of structural constraints on an XML document.

An XML format which is designed to serialise the description set structure provides a mapping between the components in that structure and some set of components in an XML document (XML elements and attributes, their names and their content and values).

And so, for such an XML format, it should be possible to map a DSP - a set of structural constraints on a description set - into a corresponding set of constraints on an instance of that XML format. I say "should" because there are a number of factors to be taken into consideration here:

  • The current draft DSP model includes some constraints which are not strictly structural. For example, the model allows a Statement Template to include a "Sub-property Constraint" (6.4.2), which allows a DSP to "say" things like "This statement template applies to a statement referring to any subproperty of the property dc:contributor". A processor attempting to determine whether or not a particular statement referring to some property ex:property matches such a constraint requires information about that property external to the description set itself in order to know whether the DSP requirement is met or not.
  • Whether all the constraints can be reflected in an XML schema depends on the characteristics of the XML format and on the features of the XML schema language. Different XML schema languages have different capabilities when it comes to expressing structural constraints, and, for a single XML format, one schema language may be able to express constraints which another can not. So for the case of mapping the DSP constraints into an XML schema, it may be that, depending on the nature of the XML format, one XML schema language is capable of capturing more of the constraints on the XML document than another.

Anyway, to try to illustrate one possible application of the DSP model, I've spent some time recently playing around with XSLT and Schematron to try to create an XSLT transformation which:

  • takes as input an instance of the DSP-XML format described in the current draft (version of 2008-03-31) i.e. a representation of a DSP in XML; and
  • provides as output a Schematron schema containing a corresponding set of patterns expressing constraints on an instance of the XML format described in the proposed recommendation for the XML format known as DC-DS XML (version of 2008-09-01).

I should emphasise that I'm very much a newcomer to Schematron, my XSLT is a bit rusty, I haven't tested what I've done exhaustively, and I've worked on this on and off over a few days and haven't done a great deal to tidy up the results. So I'm sure there are more elegant and efficient ways of achieving this, but, FWIW, I've put where I've got to on a page on the DCMI Architecture Forum wiki.

The transform is dsp2sch-dcds.xsl

To illustrate its use, I created a short DSP-XML document and a few DC-DS XML instances.

bookdsp.xml is a DSP-XML representation of a short example DSP. It's loosely based on the book-person example that Tom Baker and Karen Coyle used in their recently published Guidelines for Dublin Core Application Profiles, but I've tweaked and extended it to include a broader range of constraints.

Running the transform against that DSP generates a Schematron schema: dsp-dcds.xml.

The page on the wiki lists a few example DC-DS XML instances, and the results of validating those instances against this Schematron schema. So for example, book4.xml is a DC-DS XML instance which conforms to the syntactic rules of the format, but fails to match some of the constraints of the Book DSP (the DSP allows the "book" description to have only two statements using the dc:creator property, and the example has three; and the DSP allows only two "person" descriptions, and the example has three). The result of validation using the Schematron schema is the document valbook4.xml. (The Schematron processor outputs an XML format called Schematron Validation Report Language (SVRL), which is a bit verbose, but fairly self-explanatory; it could be post-processed into a more human-readable format).

The approach taken is, roughly, that the transform generates:

  • a Schematron pattern with a rule with context dcds:descriptionSet, which, for each Description Template, tests for the number of child dcds:description elements that satisfy that Description Template's Resource Class Membership Constraint (more on this below), using a corresponding XPath predicate. e.g. from the bookdsp example dcds:description[dcds:statement[@dcds:propertyURI='http://www.w3.org/1999/02/22-rdf-syntax-ns#type' and (@dcds:valueURI='http://purl.org/dc/dcmitype/Collection')]]
  • for each DSP Description Template, a Schematron pattern with a rule with context dcds:description[the resource class membership predicate above], which tests the Standalone Constraints, and then, for each Statement Template, tests for the number of child dcds:statement elements that satisfy the Statement Template's Property Constraint, using a corresponding XPath predicate. e.g. from the bookdsp example dcds:statement[@dcds:propertyURI='http://purl.org/dc/terms/title']
  • for each DSP Statement Template that specifies a Type Constraint, a Schematron pattern with a rule with context dcds:description[the resource class membership predicate above]/dcds:statement[the property predicate above], which tests for the various other (Literal or Non-Literal) constraints specified within the Statement Template.
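The counting that the generated Schematron rules perform via XPath predicates can be mimicked in a few lines over a DC-DS XML instance. A Python sketch using the standard library; note that the namespace URI below is a placeholder rather than the real DC-DS XML namespace, and attributes are left unprefixed here for simplicity (the actual format uses a dcds-prefixed attribute):

```python
import xml.etree.ElementTree as ET

# Counting statements per property URI in a DC-DS-XML-like instance,
# the same test the generated Schematron rules express as
# dcds:statement[@dcds:propertyURI='...'] predicates.
DCDS = "http://example.org/dcds#"  # placeholder namespace, not the real one

doc = ET.fromstring(f"""
<descriptionSet xmlns="{DCDS}">
  <description>
    <statement propertyURI="http://purl.org/dc/terms/title"/>
    <statement propertyURI="http://purl.org/dc/terms/creator"/>
    <statement propertyURI="http://purl.org/dc/terms/creator"/>
  </description>
</descriptionSet>""")

def count_statements(description, property_uri):
    # Count child statement elements carrying the given property URI
    return sum(1 for st in description.findall(f"{{{DCDS}}}statement")
               if st.get("propertyURI") == property_uri)

desc = doc.find(f"{{{DCDS}}}description")
print(count_statements(desc, "http://purl.org/dc/terms/creator"))  # 2
```

A DSP-driven checker would then compare each count against the Statement Template's minimum and maximum occurrence constraints, which is essentially what the generated Schematron schema does.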

A few thoughts and notes are in order.

  1. The transform is specific to the version of the DSP-XML format specified in the current draft, and to the current version of the DC-DS XML format. If either of these changes, the transform will require modification. Another transform could be written to generate patterns for another XML format, e.g. for RDF/XML (or maybe more easily, a defined "profile" of RDF/XML) or even for the use of the DC-HTML profile for embedding data in XHTML meta/link elements (subject to the limitations of that profile in terms of which aspects of the DCAM description model are supported).
  2. It assumes that the DSP XML instance is valid, and that the DC-DS XML instance is valid, in the sense that it conforms to the basic syntactic rules of that format. (I've got some additional general, DSP-independent Schematron patterns for DC-DS XML, which in theory could be "included" in the generated schema, but I haven't managed to get that to work correctly yet.)
  3. The output from the current version includes a lot of "informational" reporting ("this description contains three statements" etc), as well as actual error messages for mismatches with the DSP constraints. Mostly this was to help me debug the transform and get my head round how Schematron was working, but it makes the output rather verbose. I've left it in for now, but I might remove or reduce it in a subsequent version.
  4. What I've come up with currently implements only a subset of the model in the DSP draft. In particular, I've ignored constraints that go beyond the structural and require checking beyond the syntactic level (like the Subproperty Constraint I mentioned above). And for some other constraints, I've adopted a "strictly structural" interpretation: this is the case for the Description Template Resource Class Membership Constraint (5.5), which I interpreted as "the description should contain a statement referring to the rdf:type property with a value URI matching one of the listed URIs", and for the Statement Template Value Class Membership Constraint (6.6.2), which I interpreted as "there should be a description of the value containing a statement referring to the rdf:type property with a value URI matching one of the listed URIs". i.e. I haven't allowed for the possibility that an application might derive information about the resource type from property semantics (e.g. from inferencing based on RDF schema range/domain).
  5. Finally, the handling of literals is somewhat simplistic. In particular, I haven't given any thought to the handling of XML literals, but even leaving that aside it probably needs some additional character escaping.
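The "strictly structural" reading of the Resource Class Membership Constraint described in point 4 amounts to a check along these lines (a Python sketch over the same sort of illustrative statement dicts used above; no inferencing from range/domain semantics is attempted):

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

def matches_resource_class(statements, allowed_class_uris):
    # "Strictly structural" reading: look only for an explicit rdf:type
    # statement whose value URI is one of the listed class URIs.
    return any(
        st.get("propertyURI") == RDF_TYPE
        and st.get("valueURI") in allowed_class_uris
        for st in statements
    )

stmts = [{"propertyURI": RDF_TYPE,
          "valueURI": "http://purl.org/dc/dcmitype/Collection"}]
print(matches_resource_class(
    stmts, {"http://purl.org/dc/dcmitype/Collection"}))  # True
```

An application that instead inferred the type from property semantics would accept descriptions this check rejects, which is exactly the gap the "strictly structural" interpretation leaves open.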

Anyway, I intend this not as any sort of "authorised" tool, nor as "the finished article", but as a fairly rough first stab at an example of the sort of XML-schema-based functionality that I think can be built starting from the DSP model, and as a contribution to the ongoing discussion of the current working draft.

August 20, 2009

A socio-political analysis of the history of Dublin Core terms

No... this isn't one. Sorry about that! But reading Eric Hellman's recent blog post, Can Librarians Be Put Directly Onto the Semantic Web?, made me wonder if such a thing would be interesting (in a "how much metadata can you fit on the head of a pin" kind of way!).

Eric's post discusses the need for, or not, inverse properties in the Semantic Web and the necessary changes of mindset in moving from thinking about 'metadata' to thinking about 'ontologies':

In many respects, the most important question for the library world in examining semantic web technologies is whether librarians can successfully transform their expertise in working with metadata into expertise in working with ontologies or models of knowledge. Whereas traditional library metadata has always been focused on helping humans find and make use of information, semantic web ontologies are focused on helping machines find and make use of information. Traditional library metadata is meant to be seen and acted on by humans, and as such has always been an uncomfortable match with relational database technology. Semantic web ontologies, in contrast, are meant to make metadata meaningful and actionable for machines. An ontology is thus a sort of computer program, and the effort of making an RDF schema is the first step of telling a computer how to process a type of information.

I think there's probably some interesting theorising to be done about the history of the Dublin Core metadata properties, in particular about the way they have been named over the years and the way some have explicit inverse properties but others don't.

So, for example, dcterms:creator and dcterms:hasVersion use different naming styles ('creator' rather than 'hasCreator') and dcterms:hasVersion has an explicit inverse (dcterms:isVersionOf) whereas dcterms:creator does not (there is no dcterms:isCreatorOf).

Unfortunately, I don't recall much of the detail of why these changes in attitude to naming occurred over the years. My suspicion is that it has something to do with the way our understanding of 'web' metadata has evolved over time.  Two things in particular, I guess.  Firstly, there has been a gradual change from understanding properties as 'attributes with string values' (very much the view when dc:creator was invented) to understanding properties as 'the meat between two resources in an RDF triple'.  And, secondly, a change from thinking first and foremost about 'card catalogues' and/or relational databases to thinking about triple stores (perhaps better characterised (as Eric did) as a transition from thinking about metadata as something that is viewed by humans to something that is acted upon by software).

I strongly suspect that both these changes in attitude are very much ongoing (at least in the DC community - possibly elsewhere?).

Note also the difference in naming between dcterms:valid and dcterms:dateCopyrighted (both of which are refinements of dcterms:date). The former emerged at a time when the preferred encoding syntaxes tended to prefix 'valid' with 'DC.date.' to give 'DC.date.valid', whereas the latter emerged at a time when properties were recognised as being stand-alone entities (i.e. after the emergence of Semantic Web thinking).

If nothing else, working with the Dublin Core community over the years has served as a very useful reminder about the challenges 'ordinary' (I don't mean that in any way negatively) people face in understanding what some 'geeks' might perceive to be very simple Semantic Web concepts.  I've lost track of the number of 'strings vs. things' type discussions I've been involved in!  And to an extent, part of the reason for developing the DCMI Abstract Model was to try to bridge the gap between a somewhat 'old-skool' (dare I say, 'traditional librarian'?) view of the world and the Semantic Web view of the world.  Of course, one can argue about whether we succeeded in that aim.

June 19, 2009

Repositories and linked data

Last week there was a message from Steve Hitchcock on the UK jisc-repositories@jiscmail.ac.uk mailing list noting Tim Berners-Lee's comments that "giving people access to the data 'will be paradise'". In response, I made the following suggestion:

If you are going to mention TBL on this list then I guess that you really have to think about how well repositories play in a Web of linked data?

My thoughts... not very well currently!

Linked data has 4 principles:

  • Use URIs as names for things
  • Use HTTP URIs so that people can look up those names.
  • When someone looks up a URI, provide useful information.
  • Include links to other URIs, so that they can discover more things.

Of these, repositories probably do OK at 1 and 2 (though, as I’ve argued before, one might question the coolness of some of the http URIs in use and, I think, the use of cool URIs is implicit in 2).

3, at least according to TBL, really means “provide RDF” (or RDFa embedded into HTML I guess), something that I presume very few repositories do?

Given the lack of 3, I guess that 4 is hard to achieve. Even if one were to ignore the lack of RDF or RDFa, the fact that content is typically served in PDF or MS Office formats probably means that links to other things are reasonably well hidden?

It’d be interesting (academically at least), and probably non-trivial, to think about what a linked data repository would look like. OAI-ORE is a helpful step in the right direction in this regard.
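Principle 3 - "provide useful information" - in practice usually means serving RDF, or at least offering it alongside the human-readable page via content negotiation. As a purely hypothetical sketch of the decision a linked-data-aware repository might make per request (the function and its behaviour are my own invention, not any repository platform's API):

```python
# Hypothetical sketch: very naive content negotiation for a repository item.
# A linked-data client sending "Accept: application/rdf+xml" gets RDF;
# everyone else falls back to the HTML splash page.
SUPPORTED = ["application/rdf+xml", "text/html"]

def choose_representation(accept_header):
    """Return the first supported media type named in the Accept header,
    falling back to text/html. Ignores q-values for simplicity."""
    for part in accept_header.split(","):
        media_type = part.split(";")[0].strip().lower()
        if media_type in SUPPORTED:
            return media_type
    return "text/html"

print(choose_representation("application/rdf+xml"))        # application/rdf+xml
print(choose_representation("text/html,application/xml"))  # text/html
print(choose_representation("*/*"))                        # text/html
```

A real implementation would honour q-values and wildcards (and serve RDFa in the HTML anyway), but even this much would be a step beyond what most repositories did at the time.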

In response, various people noted that there is work in this area: Mark Diggory on work at DSpace, Sally Rumsey (off-list) on the Oxford University Research Archive and parallel data repository (DataBank), and Les Carr on the new JISC dotAC Rapid Innovation project. And I'm sure there is other stuff as well.

In his response, Mark Diggory said:

So the question of "coolness" of URI tends to come in second to ease of implementation and separation of services (concerns) in a repository. Should "Coolness" really be that important? We are trying to work on this issue in DSpace 2.0 as well.

I don't get the comment about "separation of services". Coolness of URIs is about persistence. It's about our long term ability to retain the knowledge that a particular URI identifies a particular thing and to interact with the URI in order to obtain a representation of it. How coolness is implemented is not important, except insofar as it doesn't impact on our long term ability to meet those two aims.

Les Carr also noted the issues around a repository minting URIs "for things it has no authority over (e.g. people's identities) or no knowledge about (e.g. external authors' identities)" suggesting that the "approach of dotAC is to make the repository provide URIs for everything that we consider significant and to allow an external service to worry about mapping our URIs to "official" URIs from various "authorities"". An interesting area.

As I noted above, I think that the work on OAI-ORE is an important step in helping to bring repositories into the world of linked data. That said, there was some interesting discussion on Twitter during the recent OAI6 conference about the value of ORE's aggregation model, given that distinct domains will need to layer their own (different) domain models onto those aggregations in order to do anything useful. My personal take on this is that it probably is useful to have abstracted out the aggregation model, but that the hypothesis still to be tested is whether primitive aggregation is useful despite every domain needing its own richer data - and, indeed, we need to see whether the way the ORE model gets applied in the field turns out to be sensible and useful.

May 22, 2009

Google Rich Snippets

As ever, I'm slow off the mark with this, but last week's big news within the Web metadata and Semantic Web communities was the announcement by Google of a feature they are calling Rich Snippets, which provides support for the parsing of structured data within HTML pages - based on a selection of microformats and on RDFa using a specified RDF vocabulary - and the surfacing of that data in Google search result sets. In the first instance, at least, only a selected set of sources are being indexed, with the hope of extending participation soon (see the discussion in the O'Reilly interview with Othar Hansson and Guha.)
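The kind of data a Rich Snippets-style consumer pulls out of a page is easy to illustrate. The sketch below collects (property, value) pairs from RDFa-annotated HTML using only the standard library; it is a toy, not a conformant RDFa processor (no prefix mapping, no @about/@resource chaining, no datatypes), and the v: property names follow the style of Google's review examples from memory, so treat them as assumptions:

```python
# Toy sketch of extracting RDFa (property, value) pairs from HTML.
# A value comes either from an explicit @content attribute or, failing
# that, from the element's text content.
from html.parser import HTMLParser

class RdfaPropertyCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.pairs = []
        self._open_property = None   # property awaiting its text value

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "property" in attrs:
            if "content" in attrs:
                self.pairs.append((attrs["property"], attrs["content"]))
            else:
                self._open_property = attrs["property"]

    def handle_data(self, data):
        if self._open_property and data.strip():
            self.pairs.append((self._open_property, data.strip()))
            self._open_property = None

page = """<div xmlns:v="http://rdf.data-vocabulary.org/#" typeof="v:Review">
  <span property="v:itemreviewed">Blended</span>
  <span property="v:rating" content="4.0"></span>
</div>"""

collector = RdfaPropertyCollector()
collector.feed(page)
print(collector.pairs)  # [('v:itemreviewed', 'Blended'), ('v:rating', '4.0')]
```

Even this crude extraction shows why the attribute-based syntax is attractive to a search engine: the structured values sit alongside, rather than instead of, the visible page content.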

A number of commentators, including Ian Davis, Tom Scott, and Jeni Tennison have pointed out that Google's support for RDFa, at least as currently described, is somewhat partial, and its reliance on a centralised Google-owned URI space for terms is at odds with RDF's support for the distributed creation of vocabularies - and indeed, in coining that Google vocabulary, Google appears to have ignored the existence of some already widely deployed vocabularies.

And of course, Yahoo was ahead of the game here with their (complete) support for RDFa in their Search Monkey platform (which I mentioned here).

Nevertheless, it's hard to disagree with Timothy O'Brien's recognition of the huge power that Google wields in this space:

Google is certainly not the first search engine to support RDFa and Microformats, but it certainly has the most influence on the search market. With 72% of the search market, Google has the influence to make people pay attention to RDFa and Microformats.

Or, to put it another way, we may be approaching a period in which, to quote Dries Buytaert of the Drupal project, "structured data is the new search engine optimization" - with, I might add, both the pros and cons that come with that particular emphasis!

One of the challenges to an approach based on consuming structured data from the open Web is, of course, that of dealing with inaccuracies, spammers and "gamers" - see, for example, Cory Doctorow's "metacrap" piece, from back in 2001. But as Jeni Tennison notes towards the end of her post, having Google in a position where they have an interest in tackling this problem must be a good thing for the data web community more widely:

They will now have a stake in answering the difficult questions around trust, confidence, accuracy and time-sensitivity of semantic information.

Google's announcement is also one of the topics discussed in the newly released Semantic Web Gang podcast from Talis, and in that conversation - which is well worth a listen as it covers many of the issues I've mentioned here and more besides - Tom Tague from Thomson-Reuters highlights another potential outcome when he expresses optimism that the interest in embedded metadata generated by the Google initiative will also provide an impetus for the development of other tools to consume that data, such as browser plug-ins.

Thinking about activities that I have some involvement in, I think the use of RDFa in general is an area that should be appearing on the radar of the "repositories" community in their efforts to improve access to the outputs of scholarly research.

It's also an area that I think the Dublin Core Metadata Initiative should be engaging with. Embedding metadata in HTML pages with the intent of facilitating the discovery of those pages using search engines was probably one of the primary motivating cases, at least in the early days of work on DC, though of course there has historically been little support from the global search engines for the approach, in large part because of the sort of problems identified by Doctorow. The current DCMI recommendation for doing this makes use of an HTML metadata profile (associated with a GRDDL namespace transformation). While on the one hand, RDFa is "just another syntax for RDF", it might be useful for DCMI to produce a short document illustrating the use of RDFa (and perhaps to consider the use of RDFa in its own documents). Of course, as far as the use of DCMI's own RDF vocabularies in data exposed to Google is concerned, it remains to be seen whether support for RDF vocabularies other than Google's own will be introduced. (Having said that, it's also worth noting that one of the strengths of RDFa is that the attribute-based syntax is fairly amenable to the use of multiple vocabularies in combination.)

Finally, I think this is an area which Eduserv should be tracking carefully with regard to its relevance to the services it provides to the education sector and to the wider public sector in the UK: it seems notable that, as I mentioned a few weeks ago, some of the early deployments of RDFa have been within UK government services.

May 08, 2009

The Nature of OAI, identifiers and linked data

In a post on Nascent, Nature's blog on web technology and science, Tony Hammond writes that Nature now offer an OAI-PMH interface to articles from over 150 titles dating back to 1869.

Good stuff.

Records are available in two flavours - simple Dublin Core (as mandated by the protocol) and Prism Aggregator Message (PAM), a format that Nature also use to enhance their RSS feeds.  (Thanks to Scott Wilson and TicTocs for the Jopml listing).

Taking a quick look at their simple DC records (example) and their PAM records (example) I can't help but think that they've made a mistake in placing a doi: URI rather than an http: URI in the dc:identifier field.

Why does this matter?

Imagine you are a common-or-garden OAI aggregator.  You visit the Nature OAI-PMH interface and you request some records.  You don't understand the PAM format so you ask for simple DC.  So far, so good.  You harvest the requested records.  Wanting to present a clickable link to your end-users, you look to the dc:identifier field only to find a doi: URI:

doi:10.1038/nature01234

If you understand the doi: URI scheme you are fine because you'll know how to convert it to something useful:

http://dx.doi.org/10.1038/nature01234

But if not, you are scuppered!  You'll just have to present the doi: URI to the end-user and let them work it out for themselves :-(

Much better for Nature to put the http: URI form in dc:identifier.  That way, any software that doesn't understand DOIs can simply present the http: URI as a clickable link (just like any other URL).  Any software that does understand DOIs, and that desperately wants to work with the doi: URI form, can do the conversion for itself trivially.

Of course, Nature could simply repeat the dc:identifier field and offer both the http: URI form and the doi: URI form side-by-side. Unfortunately, this would run counter to the W3C recommendation not to mint multiple URIs for the same resource (section 2.3.1 of the Architecture of the World Wide Web):

A URI owner SHOULD NOT associate arbitrarily different URIs with the same resource.

On balance I see no value (indeed, I see some harm) in surfacing the non-HTTP forms of DOI:

10.1038/nature01234

and

doi:10.1038/nature01234

both of which appear in the PAM record (somewhat redundantly?).

The http: URI form

http://dx.doi.org/10.1038/nature01234

is sufficient.  There is no technical reason why it should be perceived as a second-class form of the identifier (e.g. on persistence grounds).

I'm not suggesting that Nature gives up its use of DOIs - far from it.  Just that they present a single, useful and usable variant of each DOI, i.e. the http: URI form, whenever they surface them on the Web, rather than provide a mix of the three different forms currently in use.
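Part of the point here is that normalising whichever of the three forms turns up is trivial for software that knows about DOIs, while software that doesn't is stuck. A throwaway sketch (the helper is mine, not Nature's):

```python
# Hypothetical helper: map the three DOI forms in circulation --
# "10.1038/...", "doi:10.1038/..." and the http: URI form -- to the
# single clickable http://dx.doi.org/ form. Anything that wants the
# bare DOI back can simply strip the resolver prefix again.
RESOLVER = "http://dx.doi.org/"

def doi_to_http(identifier):
    """Normalise known DOI forms to the http: URI form; leave other
    identifiers untouched."""
    if identifier.startswith(RESOLVER):
        return identifier
    if identifier.startswith("doi:"):
        return RESOLVER + identifier[len("doi:"):]
    if identifier.startswith("10.") and "/" in identifier:
        return RESOLVER + identifier
    return identifier

print(doi_to_http("doi:10.1038/nature01234"))  # http://dx.doi.org/10.1038/nature01234
print(doi_to_http("10.1038/nature01234"))      # http://dx.doi.org/10.1038/nature01234
```

The asymmetry is the argument: DOI-aware consumers lose nothing if only the http: form is surfaced, but DOI-unaware consumers gain a working link.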

This would be very much in line with recommended good practice for linked data:

  • Use URIs as names for things
  • Use HTTP URIs so that people can look up those names.
  • When someone looks up a URI, provide useful information.
  • Include links to other URIs, so that they can discover more things.

April 24, 2009

More RDFa in UK government

It's quite exciting to see various initiatives within UK government starting to make use of Semantic Web technologies, and particularly of RDFa. At the recent OKCon conference, I heard Jeni Tennison talk about her work on using RDFa in the London Gazette. Yesterday, Mark Birbeck published a post outlining some of his work with the Central Office of Information.

The example Mark focuses on is that of a job vacancy, where RDFa is used to provide descriptions of various related resources: the vacancy, the job for which the vacancy is available, a person to contact, and so on. Mark provides an example of a little display app built on the Yahoo SearchMonkey platform which processes this data.

As a footnote (a somewhat lengthy one, now that I've written it!), I'd just draw attention to Mark's description of developing what he calls an RDF "argot" for constructing such descriptions:

The first vocabularies -- or argots -- that I defined were for job vacancies, but in order to make the terminology usable in other situations, I broke out argots for replying to the vacancy, the specification of contact details, location information, and so on.

An argot doesn't necessarily involve the creation of new terms, and in fact most of the argots use terms from Dublin Core, FOAF and vCard. So although new terms have been created if they are needed, the main idea behind an argot is to collect together terms from various vocabularies that suit a particular purpose.

I was struck by some of the parallels between this and DCMI's descriptions of developing what it calls a "DC application profile" - with the caveat that DCMI typically talks in terms of the DCMI Abstract Model rather than directly of the RDF model. For example, the Singapore Framework notes:

In a Dublin Core Application Profile, the terms referenced are, as one would expect, terms of the type described by the DCMI Abstract Model, i.e. a DCAP describes, for some class of metadata descriptions, which properties are referenced in statements and how the use of those properties may be constrained by, for example, specifying the use of vocabulary encoding schemes and syntax encoding schemes. The DC notion of the application profile imposes no limitations on whether those properties or encoding schemes are defined and managed by DCMI or by some other agency

And in the draft Guidelines for Dublin Core Application Profiles:

the entities in the domain model -- whether Book and Author, Manifestation and Copy, or just a generic Resource -- are types of things to be described in our metadata. The next step is to choose properties for describing these things. For example, a book has a title and author, and a person has a name; title, author, and name are properties.

The next step, then, is to scan available RDF vocabularies to see whether the properties needed already exist. DCMI Metadata Terms is a good source of properties for describing intellectual resources like documents and web pages; the "Friend of a Friend" vocabulary has useful properties for describing people. If the properties one needs are not already available, it is possible to declare one's own

And indeed the Job Vacancy argot which Mark points to would, I think, probably be fairly recognisable to those familiar with the DCAP notion: compare, for example, with the case of the Scholarly Works Application Profile. The differences are that (I think) an "argot" focuses on the description of a single resource type, and I don't think it goes as far as a formal description of structural constraints in quite the same way DCMI's Description Set Profile model does.

eFoundations is powered by TypePad