January 10, 2012

Introducing Bricolage

My last post here (two months ago - yikes, must do better...) was an appeal to anyone who might be interested in my making a contribution to a project under the JISC call 16/11: JISC Digital infrastructure programme to get in touch. I'm pleased to say that Jasper Tredgold of ILRT at the University of Bristol contacted me about a proposal he was putting together for a project called Bricolage, with the prospect of my doing some consultancy. The proposal was to work with the University Library's Special Collections department and the University Geology Museum to make available metadata for two collections - the archive of Penguin Ltd. and the specimen collection of the Geology Museum - as Linked Open Data.

And just before I went off for the Christmas break, Jasper let me know that the proposal had been accepted and the project was being funded. I'm very pleased to have another opportunity to carry on applying some of what I've learned in the other JISC-funded projects I've contributed to recently, and also to explore some new categories of data. It's also nice to be working with a local university - I worked on a few projects with folks from ILRT during my time at UKOLN, and from a selfish perspective I look forward to project meetings which involve a twenty-minute walk up the hill for me rather than a 7am start and a three or four hour train journey!

The project will start in February and run through to July. I'm sure there'll be a project blog once we get going and I'll add a URI here when it is available.

November 04, 2011

JISC Digital Infrastructure programme call

JISC currently has a call, 16/11: JISC Digital infrastructure programme, open for project proposals in a number of "strands"/areas, including the following:

  • Resource Discovery: "This programme area supports the implementation of the resource discovery taskforce vision by funding higher education libraries archives and museums to make open metadata about their collections available in a sustainable way. The aim of this programme area is to develop more flexible, efficient and effective ways to support resource discovery and to make essential resources more visible and usable for research and learning."

This strand advances the work of the UK Discovery initiative, and is similar to the "Infrastructure for Resource Discovery" strand of the JISC 15/10 call under which the SALDA project (in which I worked with the University of Sussex Library on the Mass Observation Archive data) was funded. There is funding for up to ten projects of between £25,000 and £75,000 per project in this strand

First, I should say this is a great opportunity to explore this area of work and I think we're fortunate that JISC is able to fund this sort of activity. A few particular things I noticed about the current call:

  • a priority for "tools and techniques that can be used by other institutions"
  • a focus on unique resources/collections not duplicated elsewhere
  • should build on lessons of earlier projects, but must avoid duplication/reinvention
  • a particular mention of "exploring the value of integrating structured data into webpages using microformats, microdata, RDFa and similar technologies" as an area in scope
  • an emphasis on sharing the experience/lessons learned: "The lessons learned by projects funded under this call are expected to be as important as the open metadata produced. All projects should build sharing of lessons into their plans. All project reporting will be managed by a project blog. Bidders should commit to sharing the lessons they learn via a blog"

Re that last point, as I've said before, one of the things I most enjoyed about the SALDA and LOCAH projects was the sense that we were interested in sharing the ideas as well as getting the data out there.

I'm conscious the clock is ticking towards the submission deadline, and I should have posted this earlier, but if anyone reading is considering a proposal and thinks that I could make a useful contribution, I'd be interested to hear from you. My particular areas of experience/interest are around Linked Data, and are probably best reflected by the posts I made on the LOCAH and SALDA blogs, i.e. data modelling, URI pattern design, identification/selection of useful RDF vocabularies, identification of potential relationships with things described in other datasets, construction of queries using SPARQL, etc. I do have some familiarity with RDFa, rather less with microdata and microformats. I'm not a software developer, but I can do a little bit of XSLT (and maybe enough PHP to be dangerous hack together rather flakey basic demonstrators). And I'm not a technical architect, but I did get some experience of working with triple stores in those recent projects.

My recent work has been mainly with archival metadata, and I'd be particularly interested in further work which complements that. I'm conscious of the constraint in the call of not repeating earlier work, so I don't think "reapplying" the sort of EAD to RDF work I did with LOCAH and SALDA would fit the bill. (I'd love to do something around the event/narrative/storytelling angle that I wrote about recently here, for example.) Having said that, I certainly don't mean to limit myself to archival data. Anyway, if you think I might be able to help, please do get in touch ([email protected]).

October 05, 2011

Storytelling, archives and linked data

Yesterday on Twitter I saw Owen Stephens (@ostephens) post a reference to a presentation titled "Every Story has a Beginning", by Tim Sherratt (@wragge), "digital historian, web developer and cultural data hacker" from Canberra, Australia.

The presentation content is available here, and the text of the talk is here. I think you really need to read the text in one window and click through the presentation in another. I found it fascinating, and pretty inspiring, from several perspectives.

First, I enjoyed the form of the presentation itself. The content is built up incrementally on the screen, with an engaging element of "dynamism" but kept simple enough to avoid the sort of vertiginous barrage that seems to characterise the few Prezi presentations I've witnessed. And perhaps most important of all, the presentation itself is very much "a thing of the Web": many of the images are hyperlinked through to the "live" resources pictured, providing not only a record of "provenance" for the examples, but a direct gateway into the data sources themselves, allowing people to explore the broader context of those individual records or fragments or visualisations.

Second, it provides some compelling examples of how digitised historical materials and data extracted or derived from them can be brought together in new combinations and used to uncover and (re)tell stories - and stories not just of the "famous", the influential and powerful, but of ordinary people whose life events were captured in historical records of various forms. (Aside: Kate Bagnall has some thoughtful posts looking at some of the ethical issues of making people who were "invisible" "visible").

Finally, what really intrigued me from the technical perspective was that - if I understand correctly - the presentation is being driven by a set of RDF data. (Tim said on Twitter he'd post more explaining some of the mechanics of what he has done, and I admit I'm jumping the gun somewhat in writing this post, so I apologise for any misinterpretations.) In his presentation, Tim says:

What we need is a data framework that sits beneath the text, identifying people, dates and places, and defining relationships between them and our documentary sources. A framework that computers could understand and interpret, so that if they saw something they knew was a placename they could head off and look for other people associated with that place. Instead of just presenting our research we’d be creating a whole series of points of connection, discovery and aggregation.

Sounds a bit far-fetched? Well it’s not. We have it already — it’s called the Semantic Web.

The Semantic Web exposes the structures that are implicit in our web pages and our texts in ways that computers can understand. The Linked Data movement takes the basic ideas of the Semantic Web and turns them into a collaborative activity. You share vocabularies, so that other people (and computers) know when you’re talking about the same sorts of things. You share identifiers, so that other people (and computers) know that you’re talking about a specific person, place, object or whatever.

Linked Data is Storytelling 101 for computers. It doesn’t have the full richness, complexity and nuance that we invest in our narratives, but it does at least help computers to fit all the bits together in meaningful ways. And if we talk nice to them, then they can apply their newly-acquired interpretative skills to the things that they’re already good at — like searching, aggregating, or generating the sorts of big pictures that enable us to explore the contexts of our stories.

So, if we look at the RDF data for Tim's presentation, it includes "descriptions" of many different "things", including people, like Alexander Kelley, the subject of his first "story" (to save space, I've skipped the prefix declarations in these snippets but I hope they convey the sense of the data):

story:kelley a foaf1:Person ;
     bio:death story:kelley_death ;
     bio:event
         story:kelley_cremation,
         story:kelley_discharge,
         story:kelley_enlistment,
         story:kelley_reunion,
         story:kelley_wounded_1,
         story:kelley_wounded_2 ;
     foaf1:familyName "Kelley"@en-US ;
     foaf1:givenName "Alexander"@en-US ;
     foaf1:isPrimaryTopicOf story:kelley_moa ;
     foaf1:name "Alexander Dewar Kelley"@en-US ;
     foaf1:page 
       <http://discontents.com.au/shoebox/every-story-has-a-beginning> . 

There is data about events in his life:

story:kelley_discharge a bio:Event ;
     rdfs:label 
       "Discharged from the Australian Imperial Force."@en-US ;
     dc:date "1918-11-22"@en-US . 

story:kelley_enlistment a bio:Event ;
     rdfs:label 
       "Enlistment in the Australian Imperial Force for 
        service in the First World War."@en-US ;
     dc:date "1916-01-22"@en-US . 
     
story:kelley_ww1_service a bio:Interval ;
     bio:concludingEvent story:kelley_discharge ;
     bio:initiatingEvent story:kelley_enlistment ;
     foaf1:isPrimaryTopicOf story:kelley_ww1_record . 

and about the archival materials that record/describe those events:

story:kelley_ww1_record a locah:ArchivalResource ;
     locah:accessProvidedBy 
       <http://dbpedia.org/resource/National_Archives_of_Australia> ;
     dc:identifier "B2455, KELLEY ALEXANDER DEWAR"@en-US ;
     bibo:uri 
       "http://www.aa.gov.au/cgi-bin/Search?O=I&Number=7336927"@en-US . 

The presentation itself, the conference at which it was presented, various projects and researchers mentioned - all of these are also "things" described in the data.

I'd be interested in hearing more about how this data was created, the extent to which it was possible to extract the description of people, events, archival resources etc directly from existing data sources and the extent to which it was necessary to "hand craft" parts of it.

But I get very excited when I think about the potential in this sort of area if (when!?) we do have the data for historical records available as linked data (and available under open licences that support its free use).

Imagine having a "story building tool" which enables a "narrator" to visit a linked data page provided by the National Archives of Australia or the Archives Hub or one of the other projects Tim refers to, and to select and "intelligently clip" a chunk of data which you can then arrange into the "story" you are constructing - in much the way that bookmarklets for tools like Tumblr and Posterous enable you to do for text and images now. That "clipped chunk of data" could include a description of a person and some of their life events and metadata about digitised archival resources, including URIs of images - as in Tim's examples. You might follow pointers to other datasets from which additional data could be pulled. You might supplement the "clipped" data with your own commentary. Then imagine doing the same with data from the BBC describing a TV programme or radio broadcast related to the same person or events, or with data from a research repository describing papers about the person or events. The tool could generate some "provenance data" for each "chunk" saying "this subgraph was part of that graph over there, which was created by agency A on date D" in much the way that the blogging bookmarklets provide backlinks to their sources.

And the same components might be reorganised, or recombined with others, to tell different stories, or variants of the same story.

Now, yes, I'm sure there are some thorny issues to grapple with here, and coming up with an interface that balances usability and the required degree of control may be a challenge - so maybe I'm getting carried way with my enthusiasm, but it doesn't seem to be an entirely unreasonable proposition.

I think it's important here that, as Tim emphasises towards the end of his text, it is the human "narrator", not an algorithm, who decides on the structure of the story and selects its components and brings them into (possibly new) relationships with each other.

I'm aware that there's other work in this area of narratives and stories, particularly from some of the people at the BBC, but I admit I haven't been able to keep up with it in detail. See e.g. Tristan Ferne on "The Mythology Engine" and Michael Smethurst's thoughtful "Storytellin'".

For me, Tim's very concrete examples made the potential of these approaches seem very real. They suggest a vision of Linked Data not as one more institutional "output", but as a basis for individual and collective creativity and empowerment, for the (re)telling of stories that have been at least partly concealed - stories which may even challenge the "dominant" stories told by the powerful. It seems all too infrequent these days that I come across something that reminds me why I bothered getting interested in metadata and data on the Web in the first place: Tim's presentation was one of those things.

September 19, 2011

Things & their conceptualisations: SKOS, foaf:focus & modelling choices

I thought I'd try to write up some thoughts around an issue which I've come across in a few different contexts recently, and which as a shorthand I sometimes think of as "the foaf:focus question". It was prompted mainly by:

  • my work on modelling the Archives Hub data during the LOCAH project, and looking at datasets to which we wanted to make links, such as VIAF;
  • looking at the data model for the recent Linked Open BNB dataset from the British Library, and how some Dublin Core properties were being used, and some email discussions I had around that;
  • a recent message by Dan Brickley to the foaf-dev mailing list, explaining how the design of some FOAF properties was conditioned by the context at the time, and reflecting on how that context had changed, and what the implications of those changes might be.

Rather by chance on Friday evening, just as I was about to try to tie up what had become a rather long and rambling post, I noticed a conversation on Twitter, initiated by John Goodwin (@gothwin), which I think addressed the broader issue which I'd been circling around without quite addressing it: the use of different "modelling styles" and the issues which arise as a result when we try to link or merge data.

After much chopping and changing, the post adopts a very roughly "chronological" approach. The initial parts cover areas and activities that I wasn't directly involved in at the time, so I am providing my own retrospective interpretation based on my reading of the sources rather than a first-hand account, and I apologise in advance for any omissions or misrepresentations.

FOAF and "interests"

The FOAF Project was an initiative launched by a community of Semantic Web enthusisasts back in 2000, which explored the use of the - then newly emerging - Resource Description Framework specification to express information about individuals, their contact details and their interests and projects - the sort of information that was typically presented on a "personal home page" - and also some of the practical considerations in providing and consuming such information on the Web as RDF. The principle "deliverable" of the project is the Friend of a Friend (FOAF) RDF vocabulary, which continues to evolve and is now very widely used.

As Dan Brickley notes in his recent post, when looking at some of the FOAF properties from the perspective of 2011, their design may seem slightly "unwieldy" to the newcomer, and this is in part because their design was shaped by the context of how the Web was being used at the time of their creation, perhaps nine or ten years ago. At that point, as Dan notes, while there were URIs available for Web documents, and a growing recognition of the importance of maintaining the stability of those URIs, the use of http URIs to identify things other than documents was much less widely adopted, and we often lacked stable URIs for those things of other types that we wanted to "talk about".

One of the use cases covered by FOAF is to express the "interests" of an individual - where that "interest" might be an idea, a place, a person, an event or series of events, anything at all. To work around the issue of the availability of a URI of that thing, FOAF adopted a convention of "indirection" in some of its early properties. So, for example, the foaf:interest property expresses a relation, not between agent and idea/place/person (etc), but between agent and document: it says "this agent is interested in the topic of this page - the thing the page is 'about'" - where that might be anything at all. Using this convention, the topic itself is not explicitly referenced, so the question of its URI does not arise.

So, for example, to express an interest in the Napoleonic Wars, one might make use of the URI of the Wikipedia page 'about' that thing, and say:

@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix person: <http://example.org/id/person/> .

person:fred 
        foaf:interest
          <http://en.wikipedia.org/wiki/Napoleonic_Wars> .

A second property, foaf:topic_interest, does allow for the expression of that "direct" relationship, linking the agent to the thing of interest - which again might be anything at all. (I'm not sure whether thse two properties were created at the same time or whether one preceded the other). Even in the absence of URIs for concepts and people and places, RDF allows for the use of "blank nodes" to refer to such things.

@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix person: <http://example.org/id/person/> .

person:fred 
        foaf:topic_interest [ rdfs:label "Napoleonic Wars"@en ] .

However, a blank node is limited in its scope as an identifier to the graph within which it is found: if Fred provides a blank node for the notion of the Napoleonic Wars in his graph and Freda provides a blank node for an interest in her graph, I can't tell (from that information alone) whether those two nodes are references to the same thing or to two different things (e.g. the historical event and a book about the event). Again, historically, one solution to this problem was to introduce the URI of a document, together with some inferencing based on OWL:

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix person: <http://example.org/id/person/> .

person:fred 
        foaf:topic_interest 
          [ rdfs:label "Napoleonic Wars"@en ;
            foaf:isPrimaryTopicOf 
              <http://en.wikipedia.org/wiki/Napoleonic_Wars> ] .

According to the FOAF documentation for foaf:isPrimaryTopicOf:

The isPrimaryTopicOf property is inverse functional: for any document that is the value of this property, there is at most one thing in the world that is the primary topic of that document. This is useful, as it allows for data merging

i.e. if Freda's graph says:

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix person: <http://example.org/id/person/> .

person:freda 
        foaf:topic_interest 
          [ rdfs:label "Napoleonic Wars"@en ;
            foaf:isPrimaryTopicOf 
              <http://en.wikipedia.org/wiki/Napoleonic_Wars> ] .

then my application can conclude that they are indeed both interested in the same thing - though that does depend on that application having some "built-in knowledge" of the OWL inferencing rules (or access to another service which does).

http URIs for Things

The "httpRange-14 resolution" by the W3C Technical Architecture Group, the publication of the W3C Note on Cool URIs for the Semantic Web, the adoption of the principles of Linked Data and the emergence of a large number of datasets based on those principles has, of course, changed the landscape considerably, and the use of http URIs to identify things other than documents has become commonplace - even if there remain concerns about the practical challenges of implementing of some of the recommended techniques.

So, now DBpedia assigns a distinct http URI for the thing the Wikipedia page http://en.wikipedia.org/wiki/Napoleonic_Wars "is about", and provides a description of that thing in which it says (amongst other things):

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

<http://dbpedia.org/resource/Napoleonic_Wars> 
        rdfs:label "Napoleonic Wars"@en ;
        a <http://dbpedia.org/ontology/Event> ,
          <http://dbpedia.org/ontology/MilitaryConflict> ,
          <http://umbel.org/umbel/rc/Event> ,
          <http://umbel.org/umbel/rc/ConflictEvent> ;
        foaf:page 
          <http://en.wikipedia.org/wiki/Napoleonic_Wars> .

i.e. that thing, the topic of the page, is an event, a military conflict etc.

We could substitute this new DBpedia URI for the blank nodes in our foaf:topic_interest data above:

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix person: <http://example.org/id/person/> .

person:fred 
        foaf:topic_interest 
          <http://dbpedia.org/resource/Napoleonic_Wars> .

<http://dbpedia.org/resource/Napoleonic_Wars>
        rdfs:label "Napoleonic Wars"@en ;
        foaf:isPrimaryTopicOf 
          <http://en.wikipedia.org/wiki/Napoleonic_Wars> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix person: <http://example.org/id/person/> .

person:freda 
        foaf:topic_interest
          <http://dbpedia.org/resource/Napoleonic_Wars> .

<http://dbpedia.org/resource/Napoleonic_Wars>
        rdfs:label "Napoleonic Wars"@en ;
        foaf:isPrimaryTopicOf 
          <http://en.wikipedia.org/wiki/Napoleonic_Wars> .

When those two graphs are merged, the use of the common URI now makes it trivial to determine that Fred and Freda share the same interest.

Concept Schemes, SKOS and Document Metadata

The other factor Dan mentions in his message was the emergence of of the Simple Knowledge Organisation System (SKOS) RDF vocabulary, which after a long evolution became a W3C Recommendation in 2009.

SKOS is designed to provide an RDF representation of the various flavours of "knowledge organisation systems" and "controlled vocabularies" which information managers have traditionally used to organise information about various resources (books in libraries, objects in museums etc etc etc).

The core class in SKOS is that of the concept (skos:Concept). Each concept can be labelled with one or more names; documented with notes of various types; grouped into collections; related to other concepts through relationships such as "broader"/"narrower"/"related"; or mapped to other concepts in other collections.

The Library of Congress has published several library thesauri/classification schemes/controlled vocabularies as SKOS RDF data, including the Library of Congress Subject Headings, which includes a concept named "Napoleonic Wars, 1800-1815" with the URI http://id.loc.gov/authorities/subjects/sh85089767 (this is a subset of the actual data provided):

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix lcsh: <http://id.loc.gov/authorities/subjects/> .

lcsh:sh85089767
        a skos:Concept ;
        rdfs:label "Napoleonic Wars, 1800-1815"@en ;
        skos:prefLabel "Napoleonic Wars, 1800-1815"@en ;
        skos:altLabel "Napoleonic Wars, 1800-1814"@en ;
        skos:broader lcsh:sh85045703 ;
        skos:narrower lcsh:sh85144863 ;
        skos:inScheme 
          <http://id.loc.gov/authorities/subjects> .

A metadata creator coming from a bibliographic background and providing Dublin Core-based metadata for the Wikipedia page http://en.wikipedia.org/wiki/Napoleonic_Wars might well use this concept URI to provide the "subject" of that page:

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix lcsh: <http://id.loc.gov/authorities/subjects/> .

<http://en.wikipedia.org/wiki/Napoleonic_Wars>
        a foaf:Document ;
        rdfs:label "Napoleonic Wars"@en ;
        dcterms:subject lcsh:sh85089767 .

And indeed the concept URI could (I think) also be used with the foaf:topic or foaf:primaryTopic properties:

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix lcsh: <http://id.loc.gov/authorities/subjects/> .

<http://en.wikipedia.org/wiki/Napoleonic_Wars>
        a foaf:Document ;
        rdfs:label "Napoleonic Wars"@en ;
        foaf:topic lcsh:sh85089767 ;
        foaf:primaryTopic lcsh:sh85089767 .

Note that all three of these properties (dcterms:subject, foaf:topic, and foaf:primaryTopic) are defined in such a way that they are not limited to being used with concepts. We've seen this above where the DBpedia data includes:

@prefix foaf: <http://xmlns.com/foaf/0.1/> .

<http://dbpedia.org/resource/Napoleonic_Wars> 
        foaf:page 
          <http://en.wikipedia.org/wiki/Napoleonic_Wars> .

which (since foaf:page and foaf:topic are inverse properties) implies:

@prefix foaf: <http://xmlns.com/foaf/0.1/> .

<http://en.wikipedia.org/wiki/Napoleonic_Wars> 
        foaf:topic 
          <http://dbpedia.org/resource/Napoleonic_Wars> .

It is also true for the dcterms:subject property. Although traditionally the Dublin Core community has highlighted the use of formal classification schemes like LCSH, and it may well be true that the dcterms:subject property is often used to link things to concepts, it is not limited to taking concepts as values, and one could also link the document to the event using dcterms:subject:

@prefix dcterms: <http://purl.org/dc/terms/> .

<http://en.wikipedia.org/wiki/Napoleonic_Wars> 
        dcterms:subject 
          <http://dbpedia.org/resource/Napoleonic_Wars> .

I should acknowledge here that some in the Dublin Core community might disagree with my last example above, and argue that the values of dcterms:subject should be concepts. I think my position is backed up by the current DCMI documentation, and particularly by the fact that when they assigned ranges to the DCMI Terms properties in 2008, the DCMI Usage Board did not specify a range for the dcterms:subject property i.e. the intention is that dcterms:subject may link to a resource of any type.

I also note in passing that when DCMI created the DCMI Abstract Model in its attempt to reflect the "classical view" of Dublin Core (perhaps best expressed in Tom Baker's "A Grammar of Dublin Core") in an RDF-based model, the notion of a "Vocabulary Encoding Scheme" was defined as a set of things of any type, not specifically as a set of concepts.

Things and their Conceptualisations: foaf:focus

The SKOS approach and the SKOS Concept class introduce a new sort of "indirection" from our "things-in-the-world". As Dan puts it in a message to the W3C public-esw-thes list:

a SKOS "butterflies" concept is a social and technological artifact designed to help interconnect descriptions of butterflies, documents (and data) about butterflies, and people with interest or expertise relating to butterflies. I'm quite consciously avoiding saying what a "butterflies" concept in SKOS "refers to", because theories of reference are hard to choose between. Instead, I prefer to talk about why we bother building SKOS and what we hope can be achieved by it.

So, although both the DBpedia URI http://dbpedia.org/resource/Napoleonic_Wars) and the Library of Congress URI http://id.loc.gov/authorities/subjects/sh85089767) may be used in the triple patterns shown above, those two URIs identify two different resources - and both of them are distinct from the Wikipedia page which we cited in the examples back at the very start of this post.

i.e. we now have three separate URIs identifying three separate resources:

  1. a Wikipedia page, a document, created and modified by Wikipedia contributors between 2002 and the present identified by the URI http://en.wikipedia.org/wiki/Napoleonic_Wars
  2. the Napoleonic Wars as event taking place between 1800 and 1815, something with a duration in time, which occurred in physical locations, and in which human beings participated, identified by the DBpedia URI http://dbpedia.org/resource/Napoleonic_Wars
  3. a "conceptualisation of" the Napoleonic Wars, "a social and technological artifact designed to help interconnect", an "abstraction" created by the authors of LCSH editors for the purposes of classifying works; it has "semantic" relationships to other concepts, and is identified by the Library of Congress URI http://id.loc.gov/authorities/subjects/sh85089767

As we've seen, properties like dcterms:subject, foaf:topic/foaf:page, foaf:primaryTopic/foaf:isPrimaryTopicOf provide the vocabulary to express the relationships between the first and second of these resources, and between the first and third. But what about the relationship between the second and third, between the "thing in the world" and its conceptualisation in a classification scheme? Or to make the issue concrete, what happens if, in their "interest" graphs, Fred cites the DBpedia event URI and Frida cites the LCSH concept URI? How do we establish that their interests are indeed related? Can a publisher of an SKOS concept scheme indicate a relationship between a concept and a "conceptualised thing" (person, place, event etc)?

Dan provides a rather neat diagram illustrating the issue, using the example of Ronald Reagan. The arcs Dan labels "it" represent this "missing" (at the time the diagram was drawn) relationship type/property.

The resolution was to create a new property in the FOAF vocabulary, called foaf:focus. (See this page on the FOAF Project Wiki) for some of the discussion of its name).

The FOAF vocabulary specification says of the property:

The focus property relates a conceptualisation of something to the thing itself. Specifically, it is designed for use with W3C's SKOS vocabulary, to help indicate specific individual things (typically people, places, artifacts) that are mentioned in different SKOS schemes (eg. thesauri).

W3C SKOS is based around collections of linked 'concepts', which indicate topics, subject areas and categories. In SKOS, properties of a skos:Concept are properties of the conceptualization (see 2005 discussion for details); for example administrative and record-keeping metadata. Two schemes might have an entry for the same individual; the foaf:focus property can be used to indicate the thing in they world that they both focus on. Many SKOS concepts don't work this way; broad topical areas and subject categories don't typically correspond to some particular entity. However, in cases when they do, it is useful to link both subject-oriented and thing-oriented information via foaf:focus.

It's worth emphasising the point made in the penultimate sentence: not all concepts "have a focus"; some concepts are "just concepts" (poetry, slavery, conscientious objection, anarchism etc etc etc).

Dan summarises how he sees the new property being used in a message to the W3C public-esw-thes list:

The addition of foaf:topic is intended as a modest and pragmatic bridge between SKOS-based descriptions of topics, and other more entity-centric RDF descriptions. When a SKOS Concept stands for a person or agent, FOAF and its extensions are directly applicable; however we expect foaf:focus to also be used with places, events and other identifiable entities that are covered both by SKOS vocabularies as well as by factual datasets like wikipedia/dbpedia and Freebase.

A single "thing in the world" may be "the focus of" multiple concepts: e.g. several different library classification schemes may include concepts for the Napoleonic Wars or Ronald Reagan or Paris. Even within a single scheme, it may be that there are multiple concepts each reflecting different facets or aspects of a single entity.

VIAF

This aspect of the relationship between conceptualisation and "thing in the world" is illustrated in VIAF, the Virtual International Authority File, a service provided by OCLC. VIAF aggregates library "authority records" from multiple library "name authority files" maintained mainly by national libraries. Each record provides a "preferred form" of the name of a person or corporate entity, and multiple "alternate forms" - though that preferred form may vary from one file to the next. VIAF analyses and collates the aggregated data to establish which records refer to the same person or corporate entity, and presents the results as Linked Data.

Jeff Young of OCLC summarises the VIAF model in a post on the Outgoing blog. The post actually describes the transition between an earlier, slightly more complex model and the current model. For the purposes of this discussion, the thing to look at is the "Example (After)" graphic at the top right of the post (direct link)

Consider Dan's example of Ronald Reagan, identified by the VIAF URI http://viaf.org/viaf/76321889. The RDF description provided shows that there are eleven concepts linked to the person resource by a foaf:focus link. Each of those concepts (I think!) corresponds to a record in an authority file harvested by VIAF. Each concept has a preferred label (the preferred form of the name in that authority file) and may have a number of alternate labels. An abridged version of the VIAF description is below:

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> . 

<http://viaf.org/viaf/76321889> a foaf:Person ;
        foaf:name "Reagan, Ronald" ,
                  "Reagan, Ronald W." , 
                  "Reagan, Ronald, 1911-2004" , 
                  "Reagan, Ronald W. 1911-2004" , 
                  "Reagan, Ronald Wilson 1911-2004" , 
                  "Reagan, Elvis 1911-2004" ;
# plus various other names!
        owl:sameAs 
          <http://d-nb.info/gnd/118598724> ,
          <http://dbpedia.org/resource/Ronald_Reagan> , 
          <http://libris.kb.se/resource/auth/237204> , 
          <http://www.idref.fr/027091775/id> .

<http://viaf.org/viaf/sourceID/BNE%7CXX1025345#skos:Concept>
        a skos:Concept ;
        skos:prefLabel "Reagan, Ronald, 1911-2004" ;
        skos:altLabel "Reagan, Elvis 1911-2004" , 
                      "Reagan, Ronald W. 1911-2004", 
                      "Reagan, Ronald Wilson 1911-2004" ;
        skos:inScheme 
          <http://viaf.org/authorityScheme/BNE> ;
        foaf:focus 
          <http://viaf.org/viaf/76321889> .

<http://viaf.org/viaf/sourceID/BNF%7C11921304#skos:Concept>
        a skos:Concept ;
        skos:prefLabel "Reagan, Ronald, 1911-2004" ;
        skos:altLabel "Reagan, Ronald Wilson 1911-2004" ;
        skos:inScheme 
          <http://viaf.org/authorityScheme/BNF> ;
        foaf:focus 
          <http://viaf.org/viaf/76321889> .

# plus nine other concepts

(There really is an alternate label of "Reagan, Elvis 1911-2004" in the actual data!)

Fig1

Note that the owl:sameAs links here are between the VIAF person resource and person resources in external datasets.

LOCAH and index terms

My own first engagement with foaf:focus came during the LOCAH project. In deciding how to represent the content of the Archives Hub EAD documents as RDF, we had to decide how to model the use of "index terms" provided using the EAD <controlaccess> element. Within the Hub EAD documents, those index terms are names of one of the following categories of resource:

  • Concepts
  • Persons
  • Families
  • Organisations
  • Places
  • Genres or Forms
  • Functions

The names are sometimes (but not always) drawn from some sort of "controlled list", which is also named in the data. In other cases, they are constructed using some specified set of rules, again named in the data.

For some of these categories (Concepts, Genres/Forms, Functions), the "thing" named is simply a concept, an abstraction; for others (Persons, Families, Organisations and Places), there is a second "non-abstract" entity "out there in the world". And for this second case, we chose to represent the two distinct things, each with their own distinct URI, linked by a triple using the foaf:focus property.

The LOCAH data model is illustrated in the diagram in this post. The Concept entity type is in the lower centre; directly below are four boxes for the related "conceptualised" types (Person, Family, Organisation and Place), each linked from the Concept by the foaf:focus property.

As in the VIAF case, for a single person/family/organisation/place, there may be multiple distinct "conceptualisations", reflecting the fact that different data providers have referred to the same "thing in the world" by citing entries from different "authority files". The nature of the process by which the LOCAH RDF data is generated - the EAD documents are processed on a "document by document" basis - means that in this case, multiple URIs for the person are generated, and the "reconciliation" of these URIs as co-references to a single entity is performed as a subsequent step.

The British Library Linked Data

The British Library recently announced the release of Linked Open BNB, a new Linked Data dataset covering a subset of the British National Bibliography. The approach taken is described in a post by Richard Wallis of Talis, who worked with the BL as consultants in preparing the dataset.

The data model for the BNB data shows quite extensive use of the Concept-foaf:focus-Thing pattern.

For the "subjects" of Bibliographic Resources, a dcterms:subject link is made to a skos:Concept, reflecting an authority file or classification scheme entry, and which is in turn linked using foaf:focus to a Person, Family, Organisation or Place. In other cases, the Bibliographic Resource is linked directly to the "thing-in-the-world" and a corresponding Concept is also provided, linking to the "thing-in-the world" using foaf:focus. This is the case for languages, for persons as creators or contributors to the bibliographic resource, and for the Dublin Core "spatial coverage property".

So for the example http://bnb.data.bl.uk/id/resource/009436036 (again, this is a very stripped-down version of the actual data):

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix wgs84_pos: <http://www.w3.org/2003/01/geo/wgs84_pos#> .

<http://bnb.data.bl.uk/id/resource/009436036>
        a dcterms:BibliographicResource ;
        dcterms:creator 
          <http://bnb.data.bl.uk/id/person/KingGRD%28GeoffreyRD%29> ; 
        dcterms:subject
          <http://bnb.data.bl.uk/id/concept/place/lcsh/Aden%28Yemen%29> ;
        dcterms:spatial
          <http://bnb.data.bl.uk/id/place/Aden%28Yemen%29> ;
        dcterms:language 
          <http://lexvo.org/id/iso639-3/eng> .

<http://bnb.data.bl.uk/id/concept/place/lcsh/Aden%28Yemen%29> 
        a skos:Concept ;
        foaf:focus 
          <http://bnb.data.bl.uk/id/place/Aden%28Yemen%29> .

<http://bnb.data.bl.uk/id/place/Aden%28Yemen%29>        
        a dcterms:Location, wgs84_pos:SpatialThing .

In this case, the subject-concept and the spatial coverage-place happen to be linked to each other, but the point I wanted to illustrate was that the object of the dcterms:subject triple is the URI of a concept, and is the subject of an "outgoing" foaf:focus link to a "thing", but the object of the dcterms:spatial triple is the URI of a location/place, and is the object of an "incoming" foaf:focus link from a concept.

The two cases are perhaps best illustrated using the graph representation:

Fig2

Authority Files, Concepts, foaf:focus and Dublin Core

As discussed above, the dcterms:subject property is defined in such a way that it can be used to link to a thing of any type, although there may be a preference amongst some implementers to use dcterms:subject to link only to concepts.

For the other four properties I highlighted in the BL model, DCMI specifies an rdfs:range for the properties:

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dcterms: <http://purl.org/dc/terms/> .

dcterms:creator rdfs:range dcterms:Agent .
dcterms:contributor rdfs:range dcterms:Agent .
dcterms:spatial rdfs:range dcterms:Location .
dcterms:language rdfs:range dcterms:LinguisticSystem .

Those three classes (dcterms:Agent, dcterms:LinguisticSystem, dcterms:Location) are described using RDFS, and although there is no formal statement that they are disjoint from the class skos:Concept, I think their human-readable descriptions, taken together with those of the properties, carry a fairly strong suggestion that instances of these classes are the "things in the world" rather than their conceptualisations - certainly for the first three cases at least. A dcterms:Agent is "A resource that acts or has the power to act", which a concept can not; a dcterms:Location is "A spatial region or named place", which again seems distinct from a concept.

The case of dcterms:LinguisticSystem, "A system of signs, symbols, sounds, gestures, or rules used in communication", seems a bit less clear, as this is a "conceptual thing" but I think one can argue that the actual linguistic system practiced by a community of language speakers is a distinct concept from the "conceptualisation" created within a classification scheme.

And this is, I think, reflected in the patterns used in the BL data model.

As I noted earlier, the Library of Congress has published SKOS representations of a number of controlled vocabularies, and these include:

In each case, the entries/members of the vocabularies are modelled as instances of skos:Concept. Following the argument I just constructed above, then, to use these vocabularies with the dcterms:spatial and dcterms:languageproperties, strictly speaking, one should adopt the patterns used in the BL model, where the concept URI is not the direct object, but is linked to a thing (location, linguistic system) URI by a foaf:focus link.

Finally, I'd draw attention to lexvo.org, which also provides Linked Data representations for ISO639-3 languages and ISO 3166-1 / UN M.49 geographical regions. In contrast to the Library of Congress SKOS representations, lexvo.org models "the things themselves", the languages and geographical regions, e.g. (again, a subset of the actual data)

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix lvont: <http://lexvo.org/ontology#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .

<http://lexvo.org/id/iso639-3/spa>
        a lvont:Language ;
        rdfs:label "Spanish"@en ;
        lvont:usedIn 
          <http://lexvo.org/id/iso3166/ES> ;
        owl:sameAs 
          <http://dbpedia.org/resource/Spanish_language> .
        
<http://lexvo.org/id/iso3166/ES>
        a lvont:GeographicRegion ;
        rdfs:label "Spain"@en ;
        lvont:memberOf 
          <http://lexvo.org/id/un_m49/039> ;
        owl:sameAs 
          <http://sws.geonames.org/2510769> .
        
<http://lexvo.org/id/un_m49/039>
        a lvont:GeographicRegion ;
        rdfs:label "Southern Europe"@en ;
        lvont:hasMember 
          <http://lexvo.org/id/iso3166/ES> ;
        lvont:memberOf 
          <http://lexvo.org/id/un_m49/150> .
 
<http://lexvo.org/id/un_m49/150>
        a lvont:GeographicRegion ;
        rdfs:label "Europe"@en ;
        lvont:hasMember 
          <http://lexvo.org/id/un_m49/039> ;
        lvont:memberOf 
          <http://lexvo.org/id/un_m49/001> .

In the BL data one finds lexvo.org language URIs used as objects of dcterms:language (which we do in the LOCAH data too).

The lexvo.org language URIs are the subjects of properties such as lvont:usedIn linking language to place where it is used or spoken and of owl:sameAs triples linking to language URIs in other datasets. And the geographic region URIs are the subjects of properties such as lvont:memberOf linking one region to another region of which it is part and of owl:sameAs triples linking to place URIs in other datasets.

Compare this with the relationships between the SKOS-based "conceptualisations of languages" and "conceptualisations of geographic in the Library of Congress dataset (again, a subset of the actual data):

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .

<http://id.loc.gov/vocabulary/iso639-2/spa>
        a skos:Concept ;
        skos:prefLabel "Spanish | Castilian"@en ;
        skos:altLabel "Spanish"@en ,
                      "Castilian"@en ;
        skos:note "Bibliographic Code"@en ;
        skos:exactMatch 
          <http://id.loc.gov/vocabulary/languages/spa> ,
          <http://id.loc.gov/vocabulary/iso639-1/es> ;
        skos:inScheme 
          <http://id.loc.gov/vocabulary/iso639-2> .
                        
<http://id.loc.gov/vocabulary/countries/sp>
        a skos:Concept ;
        skos:prefLabel "Spain"@en ;
        skos:altLabel "Balearic Islands"@en ,
                      "Canary Islands"@en ;
        skos:exactMatch 
          <http://id.loc.gov/vocabulary/geographicAreas/e-sp> ;
        skos:broadMatch 
          <http://id.loc.gov/vocabulary/geographicAreas/e> ;
        skos:inScheme 
          <http://id.loc.gov/vocabulary/countries> .
        
<http://id.loc.gov/vocabulary/geographicAreas/e-sp>
        a skos:Concept ;        
        skos:prefLabel "Spain"@en ;
        skos:exactMatch 
          <http://id.loc.gov/vocabulary/countries/sp> ;
        skos:broader 
          <http://id.loc.gov/vocabulary/geographicAreas/e> ;
        skos:inScheme 
          <http://id.loc.gov/vocabulary/geographicAreas> .

<http://id.loc.gov/vocabulary/geographicAreas/e>
        a skos:Concept ;        
        skos:prefLabel "Europe"@en ;
        skos:narrower 
          <http://id.loc.gov/vocabulary/geographicAreas/e-sp> ;
        skos:narrowMatch 
          <http://id.loc.gov/vocabulary/countries/sp> ;
        skos:inScheme 
          <http://id.loc.gov/vocabulary/geographicAreas> .
        

Here the types of relationships involved are the SKOS "semantic relations" and "mapping properties" between concepts: e.g. Spain-as-concept has-broader-concept Europe-as-concept, and so on.

Conclusions

If anyone has read this far, I can imagine it is not without some rolling of eyes at the pedantic distinctions I seem to be unpicking!

Given my understanding of SKOS, FOAF and Dublin Core, I do think the designers of the BL data have done a sterling job in trying to "get it right", carefully observing the way terms have been described by their owners and seeking to use those terms in ways that are consistent with those descriptions.

At the same time, I admit I can well imagine that to many Dublin Core implementers who see the Dublin Core properties as providing a relatively simple approach, this will seem over-complicated.

And I rather expect that we will see uses of the dcterms:language and dcterms:spatial properties which simply link directly to the concept:

@prefix dcterms: <http://purl.org/dc/terms/> .

<http://example.org/doc/1234>
        a dcterms:BibliographicResource ;
        dcterms:spatial 
          <http://id.loc.gov/vocabulary/countries/sp> ;
        dcterms:language 
          <http://id.loc.gov/vocabulary/iso639-2/spa> .

I think this is particularly likely for the case of dcterms:spatial as it is perceived as "similar" to dcterms:subject - amongst other things, it covers "aboutness" in the case where the topic is a place - particularly since, as in the BL example above, the concept URI may be used with dcterms:subject in the very same graph.

Returning to the recent post by Dan which in part prompted me to start this post, he makes a similar point, focusing on the FOAF properties with which I introduced the post. He recognises that "in some contexts having careful names for all these inter-relations is very important" but suggests

We should consider making foaf:interest vaguer, such that any of those three are acceptable. Publishers aren't going to go looking up obscure distinctions between foaf:interest and foaf:topic_interest and that other pattern using SKOS ... they just want to mark that this is a relationship between someone and a URI characterising one of their interests.

So I suggest we become more tolerant about the URIs acceptable as values of foaf:interest

(See the message in full for Dans examples.)

Part of me is tempted to suggest that similar reasoning might be applied to the Dublin Core Terms properties i.e. that consideration might be given to "relaxing" the range of dcterms:spatial to allow for the use of either the place or the concept, to resolve the dilemna that, with the current design, the patterns are different for dcterms:subject and dcterms:spatial. But where do we stop? Do we also relax dcterms:language? I really don't know. And I think I'd be quite uneasy making the suggestion for the "agent" properties.

The fundamental issue here is that the thesaurus-based approach introduces a layer of abstraction - Dan's "social and technological artifact[s] designed to help interconnect descriptions" - and that is reflected in SKOS's notion of the skos:Concept. Outside the "knowledge organisation" community, however, many RDFS/OWL modellers "model the world directly": they consider the language as system used by a community of speakers (as modelled by lexvo.org), the place as thing in space (as modelled by Geonames), and so on. Constructs such as foaf:focus help us bridge the two approaches - but not without some complexity.

On Twitter on Friday evening, I noticed a (slightly provocative!) comment on Twitter from John Goodwin (@gothwin) which I think is related:

coming to the conclusion that #SKOS is being waaay over used #linkeddata

It attracted some sympathetic responses, e.g. from Rob Styles (@mmmmmmrob) 1, 2:

@gothwin @juansequeda if you're not publishing an existing subject heading scheme on the cheap then SKOS is the wrong tool.

@johnlsheridan @juansequeda @gothwin Paris is not "narrower than" France, it's "capital of". That's the problem.

which I think echoes my examples above of Spain and Europe.

And from Leigh Dodds (@ldodds): 1, 2:

@gothwin that's all it was designed for: converting one specific class of datasets into RDF. It's just been mistaken for a modelling tool

@juansequeda @gothwin @kendall only use #skos if you're CONVERTING a taxonomy. Otherwise, just model the domain

These comments struck home with me, and I think I may have made exactly this sort of mistake in some other work I've been doing recently, and I need to revisit it, to check whether what I've done is really necessary or useful. As John suggests, I think sometimes I reach for SKOS as a tool for something that has a "controlled list" feel to it without thinking hard enough whether it is really the appropriate tool for the task at hand.

Having said that, I do also think we will need to manage the "mixed" cases - particularly in datasets coming from communities such as libraries where the use of thesauri is commonplace, and a growing number are available as SKOS - where we end up needing the sort of bridge which foaf:focus provides, so some of this complexity may be unavoidable.

In any case, I think it's an area where some guidance/best practice notes - with lots of examples! - would be very helpful.

August 15, 2011

Two ends and one start

The end of July formally marked the end of two projects I've been contributing to recently:

There are still some things to finish for LOCAH, particularly the publication of the Copac data.

A few (not terribly original or profound) thoughts, prompted mainly by my experience of working with the Archives Hub data:

  • Sometimes data is "clean" and consistent and "normalised", but more often than not it has content in inconsistent forms or is incomplete: it is "messy". Data aggregated from multiple sources over a long period of time is probably going to be messier. (With an XML format like EAD, with what I think of as its somewhat "hybrid" document/data character, the potential for messiness increases).
  • Doing things with data requires some sort of processing by software, and while there are off-the-shelf apps and various configurable tools and frameworks that can provide some functions, some element of software development is usually required.
  • It may be relatively easy to identify in advance the major tasks where developer effort is required, and to plan for that, but sometimes there are additional tasks which it's more difficult to anticipate; rather they emerge as you attempt some of the other processes (and I think messy data probably makes that more likely).
  • Even with developers on board, development work has to be specified, and that is a task in itself, and one that can be quite complex and time-consuming (all the more so if you find yourself trying to think through and describe what to do with a range of different inputs from a messy data source).

It's worth emphassing that most of the above is not specific to generating Linked Data: it applies to data in all sorts of formats, and it applies to all sorts of tasks, whether you're building a plain old Web site or exposing RSS feeds or creating some other Web application to do something with the data.

Sort of related to all of the above, I was reminded that my limited programming skills often leave me in the position where I can identify what needs doing but I'm not able to "make it happen", and that is something I'd like to try to change. I can "get by" in XSLT, and I can do a bit of PHP, but I'd like to get to the point where I can do more of the simpler data manipulation tasks and/or pull together simple presentation prototypes.

I've enjoyed working on both the projects. I'm pleased that we've managed to deliver some Linked Data datasets, and I'm particularly pleased to have been able to contribute to doing this for archival description data, as it's a field in which I worked quite a long time ago.

Both projects gave me opportunities to learn by actually doing stuff, rather than by reading about it or prodding at other people's data. And perhaps more than anything, I've enjoyed the experience of working together as a group. We had a lot of discussion and exchange of ideas, and I'd like to think that this process was valuable in itself. In an early post on the SALDA blog, Karen Watson noted:

it is perhaps not the end result that is most important but the journey to get there. What we hope to achieve with SALDA is skills and knowledge to make our catalogues Linked Data and use those skills and that knowledge to inform decisions about whether it would be beneficial for make all our data Linked Data.

Of course it wasn't all plain sailing, and particularly near the start there were probably times when we ran up against differences of perception and terminology. Jane Stevenson has written about some of these issues from her point of view on the LOCAH blog (e.g. here, here and here). As the projects progressed, I think we moved closer towards more of a shared understanding - and I think that is a valuable "output" in its own right, even if it is one which it may be rather hard to "measure".

So, a big thank you to everyone I worked with on both projects.

Looking forward, I'm very pleased to be able to say that Jane prepared a proposal for some further work with the Archives Hub data, under JISC's Capital Funded Service Enhancements initiative, and that bid has been successful, and I'll be contributing some work to that project as a consultant. The project is called "Linking Lives" and is focused on providing interfaces for researchers to explore the data (as well as making any enhancements/extensions to the data and the "back-end" processes that are required to enable that). More on that work to come once we get moving.

Finally, as I'm sure many of you are aware, JISC recently issued some new calls for projects, including call 13/11 for projects "to develop services that aim to make content from museums, libraries and archives more visible and to develop innovative ways to reuse those resources". If there are any institutions out there who are considering proposals and think that I could make a useful contribution - despite my lamenting my limitations above, I hope my posts on the LOCAH and SALDA blogs give an idea of the sorts of areas I can contribute to! - , please do get in touch.

May 11, 2011

LOCAH releases Linked Archives Hub dataset

The LOCAH project, one of the two JISC-funded projects to which I've been contributing, this week announced the availability of an initial batch of data derived from a small subset of the Archives Hub EAD data as linked data. The homepage for what we have called the "Linked Archives Hub" dataset is http://data.archiveshub.ac.uk/

The project manager, Adrian Stevenson of UKOLN, provides an overview of the dataset, and yesterday I published a post which provides a few example SPARQL queries.

I'm very pleased we've got this data "out there": it feels as if we've been at the point of it being "nearly ready" for a few weeks now, but a combination of ironing out various technical wrinkles (I really must remember to look at pages in Internet Explorer now and again) and short working weeks/public holidays held things up a little. It is very much a "first draft": we have additional work planned on making more links with other resources, and there are various things which could be improved (and it seems to be one of those universal laws that as soon as you open something up, you spot more glitches...). But I hope it's enough to demonstrate the approach we are taking to the data, and to provide a small testbed that people can poke at and experiment with.

(If you have any specific comments on content of the LOCAH dataset, it's probably better to post them over on the LOCAH blog where other members of the project team can see and respond to them).

March 25, 2011

RDTF metadata guidelines - next steps

A few weeks ago I blogged about the work that Pete and I have been doing on metadata guidelines as part of the JISC/RLUK Resource Discovery Task Force, RDTF metadata guidelines - Limp Data or Linked Data?.

In discussion with the JISC we have agreed to complete our current work in this area by:

  • delivering a single summary document of the consultation process around the current draft guidelines, incorporating the original document and all the comments made using the JISCpress site during the comment period; and
  • suggesting some recommendations about any resulting changes that we would like to see made to the draft guidelines.

For the former, a summary view of the consultation is now available. It's not 100% perfect (because the links between the comments and individual paragraphs are not retained in the summary) but I think it is good enough to offer a useful overview of the draft and the comments in a single piece of text. Furthermore, the production of this summary was automated (by post-processing the export 'dump' from Wordpress), so the good news is that a similar view can be obtained for any future (or indeed past) JISCpress consultations.

For the latter, this blog post forms our recommendations.

As noted previously, there were 196 comments during the comment period (which is not bad!), many of which were quite detailed in terms of particular data models, formats and so on. On the basis that we do not know quite what form any guidelines might take from here on (that is now the responsibility of the RDTF team at MIMAS I think), it doesn't seem sensible to dig into the details too much. Rather, we will make some comments on the overall shape of the document and suggest some areas where we think it might be useful for JISC and RLUK to undertake additional work.

You may recall that our original draft proposed three approaches to exposing metadata, which we refered to as the community formats approach, the RDF data approach and the Linked Data approach. In light of comments (particularly those from Owen Stephens and Paul Walk) we have been putting some thought into how the shape of the whole document might be better conceptualised. The result is the following four-quadrant model:

Rdtf
Like any simple conceptualisation, there is some fuzziness in this but we think it's a useful way of thinking about the space.

Traditionally (in the world of library, museum and archives at least), most sharing of metadata has happened in the bottom-left quadrant - exchanging bulk files of MARC records for example. And, to an extent, this approach continues now, even outside those sectors. Look at the large amount of 'open data' that is shared as CSV files on sites like data.gov.uk for example. Broadly speaking, this is what we refered to as the community formats approach (though I think our inclusion of the OAI-PMH in that area probably muddied the waters a little - see below).

One can argue that moving left to right across the quadrants offers semantically richer metadata in a 'small pieces, loosely joined' kind of way (though this quickly becomes a religious argument with no obvious point of closure! :-) ) and that moving bottom to top offers the ability to work with individual item descriptions rather than whole collections of them - and that, in particular, it allows for the assignment of 'http' URIs to those descriptions and the dereferencing of those URIs to serve them.

Our three approaches covered the bottom-left, bottom-right and top-right quadrants. The web, at least in the sense of serving HTML pages about things of interest in libraries, museums and archives, sits in the top-left quadrant (though any trend towards embedded RDFa in HTML pages moves us firmly towards the top-right).

Interestingly, especially in light of the RDTF mantra to "build better websites", our guidelines managed to miss that quadrant. In their comments, Owen and Paul argued that moving from bottom to top is more important than moving left to right - and, on balance, we tend to agree.

So, what does this mean in terms of our recommendations?

We think that the guidelines need to cover all four quadrants and that, in particular, much greater emphasis needs to be placed on the top-left quadrant. Any guidance needs to be explicit that the 'http' URIs assigned to descriptions served on the web are not URIs for the things being described; that, typically, multiple descriptions may be served for the things being described (an HTML page and an XML document for example, each of which will have separate URIs) and that mechanisms such as '<link rel="alternative" ...>' can be used to tie them together; and that Google sitemaps (on the left) and semantic sitemaps (on the right) can be used to guide robots to the descriptions (either individually or in collections).

Which leaves the issue of the OAI-PMH. In a sense, this sits along-side the top-left quadrant - which is why, I think, it didn't fit particularly well with our previous three approaches. If you think about a typical repository for example, it is making descriptions of the content it holds available as HTML 'splash' pages (sometimes with embedded links to descriptions in other formats). In that sense it is functioning in top-left, "page per thing", mode. What the OAI-PMH does is to give you a protocol mechanism for getting at that those descriptions in a way that is useful for harvesting.

Several people noted that Atom and RSS might be used as an alternative to both sitemaps and the OAI-PMH, and we agree - though it may be that some additional work is needed to specify the exact mechanisms for doing so.

There were some comments on our suggestion to follow the UK government guidelines on assigning URIs. On reflection, we think it would make more sense to recommend only the W3C guidelines on Cool URIs for the Semantic Web, particularly on the separation of things from the desriptions of things, suggesting that it may be sensible to fund (or find) more work in this area making specific recommendations around persistent URIs (for both things and their descriptions).

Finally, there were a lot of comments on the draft guidelines about our suggested models and formats - notably on FRBR, with many commenters suggesting that this was premature given significant discussion around FRBR elsewhere. We think it would make sense to separate out any guidance on conceptual models and associated vocabularies, probably (again) as a separate piece of work.

To summarise then, we suggest:

  • that the four-quadrant model above is used to frame the guidelines - we think all four quadrants are useful, and that there should probably be some guidance on each area;
  • that specific guidance be developed for serving an HTML page description per 'thing' of interest (possibly with associated, and linked, alternative formats such as XML);
  • that guidance be developed (or found) about how to sensibly assign persistent 'http' URIs to everything of interest (including both things and descriptions of things);
  • that the definition of 'open' needs more work (particularly in the context of whether commercial use is allowed) but that this needs to be sensitive to not stirring up IPR-worries in those domains where they are less of a concern currently;
  • that mechanisms for making statement of provenance, licensing and versioning be developed where RDF triples are being made available (possibly in collaboration with Europeana work); and
  • that a fuller list of relevant models that might be adopted, the relationships between them, and any vocabularies commonly associated with them be maintained separately from the guidelines themselves (I'm trying desperately not to use the 'registry' word here!).

March 10, 2011

Term-based thesauri and SKOS (Part 4): Change over time (ii)

This is the fourth in a series of posts (previously: part 1, part 2, part 3) on making a thesaurus available as linked data using the SKOS and SKOS-XL RDF vocabularies. In the previous post, I examined some of the ways the thesaurus can change over time, and problems that arose with my proposed mapping to RDF. Here I'll outline one potential solution to those problems.

The last three cases I described in the previous post, where an existing preferred term loses that status and is "relegated" to a non-preferred term, all present a problem for my suggested simple mapping, because the URI for a concept disappears from the generated RDF graph - and this creates a conflict with the principles of URI stability and reliability I advocated at the start of that post.

My first thoughts on a solution circled around generating concept URIs, not just for the preferred term, but also for all the non-preferred terms, and using owl:sameAs (or skos:exactMatch?) to indicate that the concept URIs derived from the terms associated with a single preferred term were synonyms, i.e. each of them identified the same concept. That way the change from preferred term to non-preferred term would not result in the loss of a concept URI. But the proliferation of URIs here feels fundamentally flawed - the problem is not one that is solved by having multiple URIs for a single concept; the issue is the persistence of a single URI. Introducing the multiple URIs also seems like a recipe for a lot of practical difficulties in managing the impact of changes on external applications, particularly if URIs which were once synonyms cease to be so.

After some searching, I found a couple of useful pages on the W3C wiki: some notes on versioning (which as far as I know didn't make it into the final SKOS specifications) and particularly this page on "Concept Evolution" in SKOS. The latter is rather more a collection of pointers than the concrete set of examples and guidelines I was hoping for, but one of those pointers is to a thread on the W3C public-esw-thes mailing list, starting with this message from Rob Tice, which I think describes (in his point 2) exactly the situation I'm dealing with in the problem cases in the previous post:

How should we identify and manage change between revisions of concept schemes as this 'seems' to result in imprecision. e.g. a concept 'a' is currently in thes 'A' and only has a preferred label. A new revision of thes 'A' is published and what was concept 'a' is now a non preferred concept and thus becomes simply a non preferred label for a new concept 'b'.

It seems to me that this operation loses some of the semantic meaning of the change as all references to the concept id of 'concept a' would be lost as it now is only a non preferred label of a different concept with a different id (concept 'b').

The suggested approach emerging from that discussion has two elements:

  1. A notion that a concept can be marked as "deprecated" (using e.g. a "status" property with a value of "deprecated" or a "deprecated" property with Boolean (yes/no) values) or as being "valid" or "applicable" only for a specified bounded period of time (see the messages from Johan De Smedt and from Margarita Sini)
  2. Such a "deprecated" concept can be the subject of a "replaced by" relationship linking it to the "preferred term" concept (see the message from Michael Panzer)

The application of these two elements in combination is illustrated in this example by Joachim Neubert (again, I think, addressing the same scenario).

I wasn't aware of the owl:deprecated property before, but as far as I can tell, it would be appropriate for this case.

Joachim's message highlights the question of what to do about skos:prefLabel/skosxl:prefLabel or skos:altLabel/skosxl:altLabel properties for the deprecated concept. In the term-based thesaurus, the term has become a non-preferred term for another term: in the SKOS model, the term is now the alternate label for a different concept, and the preferred label for no concept. So on that basis, I'm inclined to follow Joachim's suggestion that the deprecated concept should be the subject of neither skos:prefLabel/skosxl:prefLabel nor skos:altLabel/skosxl:altLabel properties, though it could, as Joachim's example shows, retain an rdfs:label property. And similarly it is no longer the subject or object of semantic relationships.

I did wonder about the option of introducing a set of properties, parallel to the SKOS ones, to indicate those former relationships, e.g. ex:hadPrefLabel, ex:hadAltLabel, ex:hadRelated, ex:hadBroader, ex:hadNarrower, essentially as documentation. But I'm really not sure how useful this is: the semantic relationships in which those other target concepts are involved may themselves change. And I suppose in principle (though it seems unlikely in practice) a single concept may itself go through several status changes (e.g. from active to deprecated to active to deprecated) and accrue various different "former" relationships in the course of that. If this level of information is required, then I think it probably has to be provided using some other approach - like the use of a set of date-stamped graphs/documents that reflect the state of a concept at a point in time.

So applying Joachim's approach to Case 8 from the examples in the previous post, where the current preferred term "Political violence" is to become a non-preferred term for "Collective violence", we end up with the concept con:C2 as a "deprecated" concept with a Dublin Core dcterms:isReplacedBy relationship to concept con:C6 (and the inverse from con:C6 to con:C2):

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix con: <http://example.org/id/concept/polthes/> .
@prefix term: <http://example.org/id/term/polthes/> .

con:C2 a skos:Concept;
       rdfs:label "Political violence"@en;
       owl:deprecated "true"^^xsd:boolean;
       dcterms:isReplacedBy con:C6 .
       
term:T2 a skosxl:Label;
        skosxl:literalForm "Political violence"@en.

con:C6 a skos:Concept;
       skos:prefLabel "Collective violence"@en;
       skosxl:prefLabel term:T6;
       skos:altLabel "Political violence"@en;
       skosxl:altLabel term:T2; 
       dcterms:replaces con:C2 .       

term:T6 a skosxl:Label;
        skosxl:literalForm "Collective violence"@en.

Using this approach then, the full output graph for Case 8 would be as follows (the highlighting indicates the difference between this graph and that for Case 8 in the previous post):

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix con: <http://example.org/id/concept/polthes/> .
@prefix term: <http://example.org/id/term/polthes/> .

term:T1 a skosxl:Label;
        skosxl:literalForm "Civil violence"@en.
        
con:C6 a skos:Concept;
       skos:prefLabel "Collective violence"@en;
       skosxl:prefLabel term:T6;
       skos:altLabel "Political violence"@en;
       skosxl:altLabel term:T2;
       skos:altLabel "Civil violence"@en;
       skosxl:altLabel term:T1;
       skos:altLabel "Violent protest"@en;
       skosxl:altLabel term:T5;
       skos:broader con:C4;
       skos:narrower con:C3; 
       dcterms:replaces con:C2 .       

term:T6 a skosxl:Label;
        skosxl:literalForm "Collective violence"@en.

con:C2 a skos:Concept;
       rdfs:label "Political violence"@en;
       owl:deprecated "true"^^xsd:boolean;
       dcterms:isReplacedBy con:C6 .
       
term:T2 a skosxl:Label;
        skosxl:literalForm "Political violence"@en.

con:C3 a skos:Concept;
       skos:prefLabel "Terrorism"@en;
       skosxl:prefLabel term:T3;
       skos:broader con:C6 .

term:T3 a skosxl:Label;
        skosxl:literalForm "Terrorism"@en.
       
con:C4 a skos:Concept;
       skos:prefLabel "Violence"@en;
       skosxl:prefLabel term:C4;
       skos:narrower con:C6 .

term:T4 a skosxl:Label;
        skosxl:literalForm "Violence"@en.

term:T5 a skosxl:Label;
        skosxl:literalForm "Violent protest"@en.

Now our graph retains the URI con:C2 and provides a description of that resource as a "deprecated concept".

And for Case 9 (again the highlighting indicates the difference from the initial graph for Case 9):

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix con: <http://example.org/id/concept/polthes/> .
@prefix term: <http://example.org/id/term/polthes/> .

term:T1 a skosxl:Label;
        skosxl:literalForm "Civil violence"@en.

con:C1 a skos:Concept;
       skos:prefLabel "Civil violence"@en;
       skosxl:prefLabel term:T1;
       skos:altLabel "Political violence"@en;
       skosxl:altLabel term:T2;
       skos:altLabel "Violent protest"@en;
       skosxl:altLabel term:T5;
       skos:broader con:C4;
       skos:narrower con:C3; 
       dcterms:replaces con:C2 .       
        
con:C6 a skos:Concept;
       skos:prefLabel "Collective violence"@en;
       skosxl:prefLabel term:T6.

term:T6 a skosxl:Label;
        skosxl:literalForm "Collective violence"@en.

con:C2 a skos:Concept;
       rdfs:label "Political violence"@en;
       owl:deprecated "true"^^xsd:boolean;
       dcterms:isReplacedBy con:C1 .

term:T2 a skosxl:Label;
        skosxl:literalForm "Political violence"@en.

con:C3 a skos:Concept;
       skos:prefLabel "Terrorism"@en;
       skosxl:prefLabel term:T3;
       skos:broader con:C1 .

term:T3 a skosxl:Label;
        skosxl:literalForm "Terrorism"@en.
       
con:C4 a skos:Concept;
       skos:prefLabel "Violence"@en;
       skosxl:prefLabel term:C4;
       skos:narrower con:C1 .

term:T4 a skosxl:Label;
        skosxl:literalForm "Violence"@en.

term:T5 a skosxl:Label;
        skosxl:literalForm "Violent protest"@en.

In the (unlikely?) event that the (previously preferred) non-preferred term is once again restored to the state of preferred term, then the concept con:C2 loses its deprecated status and the dcterms:isReplacedBy relationship, and acquires skos:prefLabel/skos:altLabel properties as normal.

Generating these graphs does, however, imply a change to the process of generating the RDF representation. As I noted at the start of the previous post, my first cut at this was based on being able to process a snaphot of the thesaurus "stand-alone" without knowledge of previous versions. But the capacity to detect deprecated concepts depends on knowledge of the current state of the thesaurus, i.e. when the transformation process encounters a non-preferred term x, it needs to behave differently depending on whether:

  1. concept con:Cx exists in the current thesaurus dataset (as either an "active" or "deprecated" concept), in which case a "deprecated concept" con:Cx should be output, as well as term:Tx (as alternate label for some other concept, con:Cy); or
  2. concept con:Cx does not exist in the current thesaurus dataset, in which case only term:Tx (as alternate label for a concept con:Cy) is required

I think that test has to be made against the current RDF thesaurus dataset rather than using the previous XML snapshot in time, as the "deprecation" may have taken place several snapshots ago.

I have to admit this does make the transformation process rather more complicated than I had hoped. The only way alternative would be if it is somehow possible to distinguish the "deprecation" case from the "static" non-preferred term case from the input data alone, but as far as I know this isn't possible.

Summary

The previous post highlighted that for one particular category of change, where an existing preferred term is "relegated" to the status of a non-preferred term, the results of the suggested simple mapping into SKOS had problematic consequences.

Based on some investigation of how others approach similar scenarios (and here I should note I'm very grateful to the contributors to the wiki page on concept evolution and to those discussions linked from it, as I was struggling to see clearly how to deal with these scenarios), I've sketched above an approach to representing a concept which has been "deprecated", or is no longer applicable, and is replaced by another concept. I'm sure it isn't the only way of addressing the problem, but it seems a reasonable one to try.

I think this creates new challenges for implementing this approach in the transformation process and I need to work on that to test it, but I think it is achievable. But I would also be very grateful for any comments, particularly if there are gaping holes in this which I haven't spotted!

Term-based thesauri and SKOS (Part 3): Change over time (i)

This is the third in a series of posts (previously: part 1, part 2) on making a thesaurus available as linked data using the SKOS and SKOS-XL RDF vocabularies. In this post, I'll examine some of the ways the thesaurus can change over time, and how such changes are reflected when applying the mapping I described earlier.

A note on "workflow"

In the case I'm working on, the term-based thesaurus is managed in a purpose-built application, from which a snapshot is exported (as an XML document) at regular intervals. These XML documents are the inputs to a transformation process which generates an SKOS/SKOS-XL RDF version, to be exposed as linked data.

Currently at least, each "run" of that transformation operates on a single snaphot of the thesaurus "stand-alone" i.e. the transform process has no "knowledge" of the previous snapshot, and the expectation is that the output generated from processing will replace the output of the previous run (either in full, or through a process of establishing the differences and then removing some triples and adding others). This "stand-alone" approach may be something I have to revisit.

The mapping

To summarise the transformation described in the previous post, a single preferred term and its set of zero or more non-preferred terms are treated as labels for a single concept. For each such set:

  • a single SKOS concept is created with a URI based on the term number of the preferred term
  • the concept is related to the literal form of the preferred term by an skos:prefLabel property
  • an SKOS-XL label is created with a URI based on the term number of the preferred term
  • the label is related to the literal form of the preferred term by an skosxl:literalForm property
  • the concept is related to the label by an skosxl:prefLabel property
  • the "hierarchical" (broader term, narrower term) and "associative" (related term) relationships between preferred terms are represented as "semantic" relationships between concepts
  • And for each non-preferred term in the set
    • the concept is related to the literal form of the non-preferred term by an skos:altLabel property
    • an SKOS-XL label is created with a URI based on the term number of the non-preferred term
    • the label is related to the literal form of the preferred term by an skosxl:literalForm property
    • the concept is related to the label by an skosxl:altLabel property

In the discussion below, I'll take the following "snapshot" of a notional thesaurus - it's another version of the example used in the previous posts, extended with an additional preferred term - as a starting point:

Civil violence
USE Political violence
TNR 1

Collective violence
TNR 6

Political violence
UF Civil violence
UF Violent protest
BT Violence
NT Terrorism
TNR 2

Terrorism
BT Political violence
TNR 3

Violence
NT Political violence
TNR 4

Violent protest
USE Political violence
TNR 5

Using the mapping above, it is represented as follows in RDF using SKOS/SKOS-XL:

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix con: <http://example.org/id/concept/polthes/> .
@prefix term: <http://example.org/id/term/polthes/> .

term:T1 a skosxl:Label;
        skosxl:literalForm "Civil violence"@en.
        
con:C6 a skos:Concept;
       skos:prefLabel "Collective violence"@en;
       skosxl:prefLabel term:T6.

term:T6 a skosxl:Label;
        skosxl:literalForm "Collective violence"@en.

con:C2 a skos:Concept;
       skos:prefLabel "Political violence"@en;
       skosxl:prefLabel term:T2;
       skos:altLabel "Civil violence"@en;
       skosxl:altLabel term:T1;
       skos:altLabel "Violent protest"@en;
       skosxl:altLabel term:T5;
       skos:broader con:C4;
       skos:narrower con:C3 .

term:T2 a skosxl:Label;
        skosxl:literalForm "Political violence"@en.

con:C3 a skos:Concept;
       skos:prefLabel "Terrorism"@en;
       skosxl:prefLabel term:T3;
       skos:broader con:C2 .

term:T3 a skosxl:Label;
        skosxl:literalForm "Terrorism"@en.
       
con:C4 a skos:Concept;
       skos:prefLabel "Violence"@en;
       skosxl:prefLabel term:C4;
       skos:narrower con:C2 .

term:T4 a skosxl:Label;
        skosxl:literalForm "Violence"@en.

term:T5 a skosxl:Label;
        skosxl:literalForm "Violent protest"@en.
Fig1

"Versioning" and change over time

Once our resource URIs are generated and published, they will be used/cited by other agencies in their data - in other linked data datasets, in other thesauri, or in simple Web documents which reference terms or concepts using those URIs. From the linked data perspective, it is important that once generated and published the resource URIs, which will be http: URIs, remain stable and reliable. I'm using the terms "stable" and "reliable" as they are used by Henry Thompson and Jonathan Rees in their note Guidelines for Web-based naming, which I've found very helpful in breaking down the various aspects of what we tend to call "persistence". And for "stability", I'm thinking particularly of what they call "resource stability". So

  • once a URI is created, we should continue to use that URI to denote/identify the same resource
  • it should continue to be possible to obtain some information "about" the identified resource using the HTTP protocol - though that information obtained may change over time

For our particular case, the requirement is only that the "current version" of the thesaurus is available at any point in time, i.e. for each concept and for each term/label, at any point in time, it is necessary to serve only a description of the current state of that resource.

So, in my previous post, I mentioned that the Cabinet Office guidelines Designing URI Sets for the UK Public Sector allow for the case of creating a set of "date-stamped" document URIs, to provide variant descriptions of a resource at different points in time. I don't think that is required for this case, so for each term and concept, we'll have a URI for the that "thing", a "Document URI" for a "generic document" (current) description of that thing, and "Representation URIs" for each "specific document" in a particular format.

The formats provided will include a human-readable HTML version, an RDF/XML version and possibly other RDF formats. Over time, additional formats can be added as required through the addition of new "Representation URIs".

My primary focus here is the changes to the thesaurus content. Over time, various changes are possible. New terms may be added, and the relationships between terms may change. Terms are not deleted from the theasurus, however.

The most common type of change is the "promotion" of an existing non-preferred term to the status of a preferred term, but all of the following types of change can occur, even if some are infrequent:

  1. Addition of new semantic relationships between existing preferred terms
  2. Removal of existing semantic relationships between existing preferred terms
  3. Addition of new preferred terms
  4. Addition of new non-preferred terms (for existing preferred terms)
  5. An existing non-preferred term becomes a new preferred term
  6. An existing non-preferred term becomes a non-preferred term for a different existing preferred term
  7. An existing non-preferred term becomes a non-preferred term for a newly-added preferred term
  8. An existing preferred term becomes a non-preferred term for another existing preferred term
  9. An existing preferred term become a non-preferred term for a term which is currently a non-preferred term for it (and vice versa)
  10. An existing preferred term becomes a non-preferred term for a newly added preferred term

Below, I'll try to walk through an example of each of those changes in turn, starting from the example thesaurus above, showing the results using the mapping suggested above, and examining any issues which arise.

Case 1: Addition of new semantic relationship

The addition of new broader term (BT), narrower term (NT) or related term (RT) relationships is straightforward, as it involves only the creation of additional assertions of relationships between concepts, using the skos:broader, skos:narrower or skos:related properties, not the creation of new resources.

So if the example above is extended to add a BT relation between the "Collective violence" (term no 6) and "Violence" (term no 4) terms (and the inverse NT relation):

Civil violence
USE Political violence
TNR 1

Collective violence
BT Violence
TNR 6

Political violence
UF Civil violence
UF Violent protest
BT Violence
NT Terrorism
TNR 2

Terrorism
BT Political violence
TNR 3

Violence
NT Political violence
NT Collective violence
TNR 4

Violent protest
USE Political violence
TNR 5

resulting in the RDF graph

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix con: <http://example.org/id/concept/polthes/> .
@prefix term: <http://example.org/id/term/polthes/> .

term:T1 a skosxl:Label;
        skosxl:literalForm "Civil violence"@en.
        
con:C6 a skos:Concept;
       skos:prefLabel "Collective violence"@en;
       skosxl:prefLabel term:T6;
       skos:broader con:C4 .

term:T6 a skosxl:Label;
        skosxl:literalForm "Collective violence"@en.

con:C2 a skos:Concept;
       skos:prefLabel "Political violence"@en;
       skosxl:prefLabel term:T2;
       skos:altLabel "Civil violence"@en;
       skosxl:altLabel term:T1;
       skos:altLabel "Violent protest"@en;
       skosxl:altLabel term:T5;
       skos:broader con:C4;
       skos:narrower con:C3 .

term:T2 a skosxl:Label;
        skosxl:literalForm "Political violence"@en.

con:C3 a skos:Concept;
       skos:prefLabel "Terrorism"@en;
       skosxl:prefLabel term:T3;
       skos:broader con:C2 .

term:T3 a skosxl:Label;
        skosxl:literalForm "Terrorism"@en.
       
con:C4 a skos:Concept;
       skos:prefLabel "Violence"@en;
       skosxl:prefLabel term:C4;
       skos:narrower con:C2 ;
       skos:narrower con:C6 .

term:T4 a skosxl:Label;
        skosxl:literalForm "Violence"@en.

term:T5 a skosxl:Label;
        skosxl:literalForm "Violent protest"@en.

i.e. two new triples are added to the RDF graph

con:C6 skos:broader con:C4 .
con:C4 skos:narrower con:C6 .

The addition of the triples means that, from a linked data perspective, the graphs served as descriptions of the resources con:C6 and con:C4 change. They each include one additional triple for the concise bounded description case; two triples for the symmetric bounded description case (see the previous post for the discussion of different forms of bounded description). So the contents of the representations of documents http://example.org/doc/concept/polthes/C4 and http://example.org/doc/concept/polthes/C6 change - but no new resources are generated, and no new URIs required.

Case 2: Removal of existing semantic relationship

The removal of existing broader term (BT), narrower term (NT) or related term (RT) relationships is similarly straightforward, as it involves only the deletion of assertions of relationships between concepts, using the skos:broader, skos:narrower or skos:related properties, without the removal of existing resources.

I won't bother writing out an example in full for this case, but imagine the case of the previous example reverting to its initial state.

Again, from a linked data perspective, the graphs served as descriptions of the resources con:C6 and con:C4 change, with each containing one triple less for the CBD case or two triples less for the SCBD case, but we still have the same set of term URIs and concept URIs.

Case 3: Addition of new preferred terms

The addition of a new preferred term is again a matter of extending the graph with new information, though in this case some new URIs are also introduced.

Suppose a new preferred term "Revolution" (term no 7) is added to our initial example:

Civil violence
USE Political violence
TNR 1

Collective violence
TNR 6

Political violence
UF Civil violence
UF Violent protest
BT Violence
NT Terrorism
TNR 2

Revolution
TNR 7

Terrorism
BT Political violence
TNR 3

Violence
NT Political violence
TNR 4

Violent protest
USE Political violence
TNR 5

resulting in the following graph:

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix con: <http://example.org/id/concept/> .
@prefix term: <http://example.org/id/term/> .

term:T1 a skosxl:Label;
        skosxl:literalForm "Civil violence"@en.
        
con:C6 a skos:Concept;
       skos:prefLabel "Collective violence"@en;
       skosxl:prefLabel term:T6.

term:T6 a skosxl:Label;
        skosxl:literalForm "Collective violence"@en.

con:C2 a skos:Concept;
       skos:prefLabel "Political violence"@en;
       skosxl:prefLabel term:T2;
       skos:altLabel "Civil violence"@en;
       skosxl:altLabel term:T1;
       skos:altLabel "Violent protest"@en;
       skosxl:altLabel term:T5;
       skos:broader con:C4;
       skos:narrower con:C3 .

term:T2 a skosxl:Label;
        skosxl:literalForm "Political violence"@en.

con:C7 a skos:Concept;
       skos:prefLabel "Revolution"@en;
       skosxl:prefLabel term:T7 .

term:T7 a skosxl:Label;
        skosxl:literalForm "Revolution"@en.
        
con:C3 a skos:Concept;
       skos:prefLabel "Terrorism"@en;
       skosxl:prefLabel term:T3;
       skos:broader con:C2 .

term:T3 a skosxl:Label;
        skosxl:literalForm "Terrorism"@en.
       
con:C4 a skos:Concept;
       skos:prefLabel "Violence"@en;
       skosxl:prefLabel term:C4;
       skos:narrower con:C2 .

term:T4 a skosxl:Label;
        skosxl:literalForm "Violence"@en.

term:T5 a skosxl:Label;
        skosxl:literalForm "Violent protest"@en.

The following triples are added:

con:C7 a skos:Concept;
       skos:prefLabel "Revolution"@en;
       skosxl:prefLabel term:T7.

term:T7 a skosxl:Label;
        skosxl:literalForm "Revolution"@en.

The RDF representation now includes an additional concept and label, each with a new URI. So now there are two new resources, with new URIs (con:C7 and term:T7), and a corresponding set of new Document URIs and Representation URIs for descriptions of those resources.

It is quite probable that the addition of a new preferred term is accompanied by the assertion of semantic relationships with other existing preferred terms. This is the equivalent of following this step, then a second step of the type shown in case 1.

Case 4: Addition of new non-preferred term (for existing preferred term)

The addition of a new non-preferred term is, again, a matter of adding new information, and new URIs.

Suppose a new term "Assault" (term no 8) is added as a new non-preferred term for "Violence" (term no 4):

Assault
USE Violence
TNR 8

Civil violence
USE Political violence
TNR 1

Collective violence
TNR 6

Political violence
UF Civil violence
UF Violent protest
BT Violence
NT Terrorism
TNR 2

Terrorism
BT Political violence
TNR 3

Violence
UF Assault
NT Political violence
TNR 4

Violent protest
USE Political violence
TNR 5

which results in the graph

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix con: <http://example.org/id/concept/> .
@prefix term: <http://example.org/id/term/> .

term:T8 a skosxl:Label;
        skosxl:literalForm "Assault"@en.

term:T1 a skosxl:Label;
        skosxl:literalForm "Civil violence"@en.
        
con:C6 a skos:Concept;
       skos:prefLabel "Collective violence"@en;
       skosxl:prefLabel term:T6.

term:T6 a skosxl:Label;
        skosxl:literalForm "Collective violence"@en.

con:C2 a skos:Concept;
       skos:prefLabel "Political violence"@en;
       skosxl:prefLabel term:T2;
       skos:altLabel "Civil violence"@en;
       skosxl:altLabel term:T1;
       skos:altLabel "Violent protest"@en;
       skosxl:altLabel term:T5;
       skos:broader con:C4;
       skos:narrower con:C3 .

term:T2 a skosxl:Label;
        skosxl:literalForm "Political violence"@en.

con:C3 a skos:Concept;
       skos:prefLabel "Terrorism"@en;
       skosxl:prefLabel term:T3;
       skos:broader con:C2 .

term:T3 a skosxl:Label;
        skosxl:literalForm "Terrorism"@en.
       
con:C4 a skos:Concept;
       skos:prefLabel "Violence"@en;
       skosxl:prefLabel term:C4;
       skos:altLabel "Assault"@en;
       skosxl:altLabel term:T8;
       skos:narrower con:C2 .

term:T4 a skosxl:Label;
        skosxl:literalForm "Violence"@en.

term:T5 a skosxl:Label;
        skosxl:literalForm "Violent protest"@en.

i.e. the following triples are added:

term:T8 a skosxl:Label;
        skosxl:literalForm "Assault"@en.

con:C4 skos:altLabel "Assault"@en;
       skosxl:altLabel term:T8 .

So from a linked data perspective, there is a new resource with a new URI (term:T8) (and its own new description with a new Document URI), and the existing URI con:C4 is the subject of two new triples, an skos:altLabel for the literal, and an skosxl:altLabel link to the new label, so the graph served as description of that existing resource changes to include additional triples.

Case 5: Existing non-preferred term becomes new preferred term

Suppose the existing term "Civil violence", initially a non-preferred term for "Political violence" is "promoted" and made a preferred term in its own right

Civil violence
USE Political violence
BT Violence
TNR 1

Collective violence
TNR 6

Political violence
UF Civil violence
UF Violent protest
BT Violence
NT Terrorism
TNR 2

Terrorism
BT Political violence
TNR 3

Violence
NT Civil violence
NT Political violence
TNR 4

Violent protest
USE Political violence
TNR 5

resulting in the following graph

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix con: <http://example.org/id/concept/polthes/> .
@prefix term: <http://example.org/id/term/polthes/> .

con:C1 a skos:Concept;
       skos:prefLabel "Civil violence"@en;
       skosxl:prefLabel term:T1;
       skos:broader con:C4 .

term:T1 a skosxl:Label;
        skosxl:literalForm "Civil violence"@en.
        
con:C6 a skos:Concept;
       skos:prefLabel "Collective violence"@en;
       skosxl:prefLabel term:T6.

term:T6 a skosxl:Label;
        skosxl:literalForm "Collective violence"@en.

con:C2 a skos:Concept;
       skos:prefLabel "Political violence"@en;
       skosxl:prefLabel term:T2;
       skos:altLabel "Violent protest"@en;
       skosxl:altLabel term:T5;
       skos:broader con:C4;
       skos:narrower con:C3 .

term:T2 a skosxl:Label;
        skosxl:literalForm "Political violence"@en.

con:C3 a skos:Concept;
       skos:prefLabel "Terrorism"@en;
       skosxl:prefLabel term:T3;
       skos:broader con:C2 .

term:T3 a skosxl:Label;
        skosxl:literalForm "Terrorism"@en.
       
con:C4 a skos:Concept;
       skos:prefLabel "Violence"@en;
       skosxl:prefLabel term:C4;
       skos:narrower con:C2;
       skos:narrower con:C1 .

term:T4 a skosxl:Label;
        skosxl:literalForm "Violence"@en.

term:T5 a skosxl:Label;
        skosxl:literalForm "Violent protest"@en.

For this case, the following new triples are added

con:C1 a skos:Concept;
       skos:prefLabel "Civil violence"@en;
       skosxl:prefLabel term:T1;
       skos:broader con:C4 .

con:C4 skos:narrower con:C1 .

and also the following existing triples are removed

con:C2 skos:altLabel "Civil violence"@en;
       skosxl:altLabel term:T1 .

So from a linked data perspective, there is a new resource with a new URI (concept:C1) (and its own new description with a new Document URI), and the graph served as description of the existing resources con:C2 and con:C4 both change: the former loses the skos:altLabel and skosxl:altLabel triples and the latter includes a new skos:narrower triple. If symmetric bounded descriptions are used, the description of term:T1 changes too.

Case 6: Existing non-preferred term becomes non-preferred term for a different existing preferred term

Suppose we decide that "Civil violence", initially a non-preferred term for "Political violence", is to become a non-preferred term for "Collective violence".

Civil violence
USE Political violence
USE Collective violence
TNR 1

Collective violence
UF Civil violence
TNR 6

Political violence
UF Civil violence
UF Violent protest
BT Violence
NT Terrorism
TNR 2

Terrorism
BT Political violence
TNR 3

Violence
NT Political violence
TNR 4

Violent protest
USE Political violence
TNR 5

This generates the following graph:

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix con: <http://example.org/id/concept/polthes/> .
@prefix term: <http://example.org/id/term/polthes/> .

term:T1 a skosxl:Label;
        skosxl:literalForm "Civil violence"@en.
        
con:C6 a skos:Concept;
       skos:prefLabel "Collective violence"@en;
       skosxl:prefLabel term:T6;
       skos:altLabel "Civil violence"@en;
       skosxl:altLabel term:T1.

term:T6 a skosxl:Label;
        skosxl:literalForm "Collective violence"@en.

con:C2 a skos:Concept;
       skos:prefLabel "Political violence"@en;
       skosxl:prefLabel term:T2;
       skos:altLabel "Violent protest"@en;
       skosxl:altLabel term:T5;
       skos:broader con:C4;
       skos:narrower con:C3 .

term:T2 a skosxl:Label;
        skosxl:literalForm "Political violence"@en.

con:C3 a skos:Concept;
       skos:prefLabel "Terrorism"@en;
       skosxl:prefLabel term:T3;
       skos:broader con:C2 .

term:T3 a skosxl:Label;
        skosxl:literalForm "Terrorism"@en.
       
con:C4 a skos:Concept;
       skos:prefLabel "Violence"@en;
       skosxl:prefLabel term:C4;
       skos:narrower con:C2 .

term:T4 a skosxl:Label;
        skosxl:literalForm "Violence"@en.

term:T5 a skosxl:Label;
        skosxl:literalForm "Violent protest"@en.

For this case, the following new triples are added

con:C6 skos:altLabel "Civil violence"@en;
       skosxl:altLabel term:T1.

and also the following existing triples are removed

con:C2 skos:altLabel "Civil violence"@en;
       skosxl:altLabel term:T1 .

The graphs served as descriptions of the existing resources con:C2 and con:C6 both change: the former loses the skos:altLabel and skosxl:altLabel triples and the latter gains skos:altLabel and skosxl:altLabel triples. If symmetric bounded descriptions are used then the description of term:T1 also changes.

Case 7: Existing non-preferred term becomes non-preferred term for a newly added preferred term

I think this case is just a combination of Case 3 (addition of new preferred term) and Case 6 (existing non-preferred term becomes non-preferred term for a different existing preferred term) in sequence. We've seen above that these changes can be made without problems, so the "composite" case should be OK too, and I won't bother working through a full example here.

Case 8: An existing preferred term becomes a non-preferred term for another existing preferred term

Suppose the current preferred term "Political violence" is to be "relegated" to become a non-preferred term for "Collective violence", with the latter becoming the participant in hierarchical relations previously involving the former. (I appreciate that these two terms probably don't constitute a great example, but let’s suppose it works, for the sake of the discussion!)

Civil violence
USE Political violence
USE Collective violence
TNR 1

Collective violence
UF Civil violence
UF Political violence
UF Violent protest
BT Violence
NT Terrorism
TNR 6

Political violence
USE Collective violence
UF Civil violence
UF Violent protest
BT Violence
NT Terrorism
TNR 2

Terrorism
BT Political violence
BT Collective violence
TNR 3

Violence
NT Political violence
NT Collective violence
TNR 4

Violent protest
USE Political violence
USE Collective violence
TNR 5

This maps to the rather substantially changed RDF graph

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix con: <http://example.org/id/concept/polthes/> .
@prefix term: <http://example.org/id/term/polthes/> .

term:T1 a skosxl:Label;
        skosxl:literalForm "Civil violence"@en.
        
con:C6 a skos:Concept;
       skos:prefLabel "Collective violence"@en;
       skosxl:prefLabel term:T6;
       skos:altLabel "Political violence"@en;
       skosxl:altLabel term:T2;
       skos:altLabel "Civil violence"@en;
       skosxl:altLabel term:T1;
       skos:altLabel "Violent protest"@en;
       skosxl:altLabel term:T5;
       skos:broader con:C4;
       skos:narrower con:C3 .

term:T6 a skosxl:Label;
        skosxl:literalForm "Collective violence"@en.

term:T2 a skosxl:Label;
        skosxl:literalForm "Political violence"@en.

con:C3 a skos:Concept;
       skos:prefLabel "Terrorism"@en;
       skosxl:prefLabel term:T3;
       skos:broader con:C6 .

term:T3 a skosxl:Label;
        skosxl:literalForm "Terrorism"@en.
       
con:C4 a skos:Concept;
       skos:prefLabel "Violence"@en;
       skosxl:prefLabel term:C4;
       skos:narrower con:C6 .

term:T4 a skosxl:Label;
        skosxl:literalForm "Violence"@en.

term:T5 a skosxl:Label;
        skosxl:literalForm "Violent protest"@en.

The following RDF triples have been added

con:C6 skos:altLabel "Political violence"@en;
       skosxl:altLabel term:T2;
       skos:altLabel "Civil violence"@en;
       skosxl:altLabel term:T1;
       skos:altLabel "Violent protest"@en;
       skosxl:altLabel term:T5;
       skos:broader con:C4;
       skos:narrower con:C3 .

con:C3 skos:broader con:C6 .

con:C4 skos:narrower con:C6 .

And the following RDF triples have been removed

con:C2 a skos:Concept;
       skos:prefLabel "Political violence"@en;
       skosxl:prefLabel term:T2;
       skos:altLabel "Civil violence"@en;
       skosxl:altLabel term:T1;
       skos:altLabel "Violent protest"@en;
       skosxl:altLabel term:T5;
       skos:broader con:C4;
       skos:narrower con:C3 .

con:C3 skos:broader con:C2 .

con:C4 skos:narrower con:C2 .

So the graphs served as descriptions of the concepts con:C3 and con:C4 change (with the removal of a triple and the addition of a new one); and that for concept con:C6 changes with the addition of several triples.

So far, so good.

However, the URI con:C2 has now completely disappeared from the graph. If this new graph simply replaces the previous graph, then there will be no description available for resource con:C2.

Case 9: An existing preferred term become a non-preferred term for a term which is currently a non-preferred term for it (and vice versa)

Suppose that the current non-preferred term "Civil violence" is to become preferred to "Political violence", and the latter is to become a non-preferred term for the former - both "relegation" and "promotion" taking place together, if you like.

Civil violence
USE Political violence
UF Political violence
UF Violent protest
BT Violence
NT Terrorism
TNR 1

Collective violence
TNR 6

Political violence
USE Civil violence
UF Civil violence
UF Violent protest
BT Violence
NT Terrorism
TNR 2

Terrorism
BT Political violence
BT Civil violence
TNR 3

Violence
NT Political violence
NT Civil violence
TNR 4

Violent protest
USE Political violence
USE Civil violence
TNR 5
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix con: <http://example.org/id/concept/polthes/> .
@prefix term: <http://example.org/id/term/polthes/> .

term:T1 a skosxl:Label;
        skosxl:literalForm "Civil violence"@en.

con:C1 a skos:Concept;
       skos:prefLabel "Civil violence"@en;
       skosxl:prefLabel term:T1;
       skos:altLabel "Political violence"@en;
       skosxl:altLabel term:T2;
       skos:altLabel "Violent protest"@en;
       skosxl:altLabel term:T5;
       skos:broader con:C4;
       skos:narrower con:C3 .
        
con:C6 a skos:Concept;
       skos:prefLabel "Collective violence"@en;
       skosxl:prefLabel term:T6.

term:T6 a skosxl:Label;
        skosxl:literalForm "Collective violence"@en.

term:T2 a skosxl:Label;
        skosxl:literalForm "Political violence"@en.

con:C3 a skos:Concept;
       skos:prefLabel "Terrorism"@en;
       skosxl:prefLabel term:T3;
       skos:broader con:C1 .

term:T3 a skosxl:Label;
        skosxl:literalForm "Terrorism"@en.
       
con:C4 a skos:Concept;
       skos:prefLabel "Violence"@en;
       skosxl:prefLabel term:C4;
       skos:narrower con:C1 .

term:T4 a skosxl:Label;
        skosxl:literalForm "Violence"@en.

term:T5 a skosxl:Label;
        skosxl:literalForm "Violent protest"@en.

The following RDF triples have been added

con:C1 a skos:Concept;
       skos:prefLabel "Civil violence"@en;
       skosxl:prefLabel term:T1;
       skos:altLabel "Political violence"@en;
       skosxl:altLabel term:T2;
       skos:altLabel "Violent protest"@en;
       skosxl:altLabel term:T5;
       skos:broader con:C4;
       skos:narrower con:C3 .
       
con:C3 skos:broader con:C1 .

con:C4 skos:narrower con:C1 .

And the following RDF triples have been removed

con:C2 a skos:Concept;
       skos:prefLabel "Political violence"@en;
       skosxl:prefLabel term:T2;
       skos:altLabel "Civil violence"@en;
       skosxl:altLabel term:T1;
       skos:altLabel "Violent protest"@en;
       skosxl:altLabel term:T5;
       skos:broader con:C4;
       skos:narrower con:C3 .

con:C3 skos:broader con:C2 .

con:C4 skos:narrower con:C2 .

The outcome here is similar to that of the previous case.

The graphs served as descriptions of the concepts con:C3 and con:C4 change (with the removal of a triple and the addition of a new one). A new concept con:C1 is created. But again the URI con:C2 has completely disappeared from the graph, with the same consequences that no description will be available.

Case 10: An existing preferred term becomes a non-preferred term for a newly added preferred term

I think this case is just a combination of Case 3 (addition of new preferred term) and Case 8 (existing preferred term becomes a non-preferred term for another existing preferred term) in sequence.

The same problem will arise with the URI of the existing concept disappearing from the new output graph.

Summary

I've walked through in detail the different types of changes which can occur to the content of the thesaurus. This highlighted that for one particular category of change, where an existing preferred term is "relegated" to the status of a non-preferred term, exemplified by my cases 8, 9 and 10 above, the results of the suggested simple mapping into SKOS had problematic consequences: the URI for a concept disappears from the generated RDF graph - and this creates a conflict with the principles of URI stability and reliability I advocated at the start of this post.

In the next post, I'll suggest one way of (I hope!) addressing this problem.

March 01, 2011

Term-based thesauri and SKOS (Part 2): Linked Data

In my previous post on this topic I outlined how I was approaching making a thesaurus available using the SKOS and SKOS-XL RDF vocabularies. In that post I focused on:

  • how the thesaurus content is modelled using a "concept-based" approach - what are the "things of interest", their attributes, and the relationships between them;
  • how those things (concepts, terms/labels) are named/identified using http URIs;
  • how those things can be "described" using the simple "triple" statement model of RDF, and using the SKOS and SKOS-XL RDF vocabularies;
  • an example of how an expresssion of the thesaurus using the term-based model is mapped or transformed into an SKOS RDF expression

What I didn't really address in that post is how that resulting RDF data is made available and accessed on the Web - which is more specifically where the "Linked Data" principles articulated by Tim Berners-Lee come into play.

(A good deal of the content of this post is probably familiar stuff for those of you already working with Linked Data, but I thought it was worth documenting it both to fill out the picture of some of the "design choices" to be made in this particular example, and to provide some more background to others less familiar with the approaches.)

Linked Data, URIs, things, documents and HTTP

The use of http URIs as identifiers provides two features:

  • a global naming system, and a set of processes by which authority for assigning names can be delegated/distributed;
  • through the HTTP protocol, a well understood and widely deployed mechanism for providing access to information "about", or descriptions of, the things identified by those URIs (in our case, the concepts and terms/labels).

As a user/consumer of an http URI, given only the URI, I can "look up" that URI using the HTTP protocol, i.e. I can provide it to a tool (like a Web browser) and that tool can issue a request to obtain a description of the thing identified by the URI. And conversely as the owner/provider of a URI, I can configure my server to respond to such requests with a document providing a description of the thing.

And the HTTP protocol incorporates features which we can use to "optimise" this process. So, for example, the "content negotiation" feature allows a client to specify a preference for the format in which it wishes to receive data, and allows a server to select - from amongst several it may have available - format which the server determines is most appropriate for the client. In the terminology of the Web Architecture, the description can have multiple "representations", each of which can vary by format (or by other criteria). In the context of Linked Data, this technique is typically used to support the provision of document representations in formats suitable for a human reader (HTML, XHTML) and in one or more RDF syntaxes (usually, at least as RDF/XML). (The emergence of the RDFa syntax, which enables the embedding of RDF data in HTML/XHTML documents, and the growing support for RDFa in RDF tools, offers the possibility, in principle at least, of a single format serving both purposes.)

The widespread use of the HTTP protocol and tools that support it mean that these techniques are widely available (in theory at least; experience suggests that in practice the ability (or authority) to set up things like HTTP redirects (more below) can create something of a barrier). It also means that the "Web of Linked Data" is part of the existing Web of documents that we are accustomed to navigating using a Web browser.

One of the principles underpinning RDF's use of URIs as names is that we should try to avoid ambiguity in our use of those names, i.e. we should use different URIs for different things, and avoid using the same URI as a name for two different things. One of the issues I've slightly glossed over in the last few paragraphs is the distinction between a "thing" and a document describing that thing as two different resources. After all, if I provide a page describing the Mona Lisa, both the page and the Mona Lisa have creators, creation dates, terms of use, but they have different creators, creation dates and terms of use. And if I want to provide such information in RDF, then I need to take care to avoid confusing the two objects - by using two different URIs, one for my document and one for the painting, and citing those URIs in my RDF statements accordingly.

However, as I emphasise above, we also want to be in a position where, given only a "thing URI", I can obtain a document describing that thing: I shouldn't need in advance information about a second URI, the URI of a document about that thing.

The W3C Note Cool URIs for the Semantic Web describes some possible approaches to addressing this issue, broadly using two different techniques:

  • the use of URIs containing "fragment identifiers" ("hash URIs") (i.e. URIs of the form http://example.org/doc/123#thing). In this case, the "fragment identifier" part of the URI is always "trimmed" from the URI when the client makes a request to the server, and this allows the use of the URI with fragment identifier as "thing URI", leaving the trimmed URI without fragment id as a document URI.
  • the use of a convention of HTTP "redirects". In this case, when a server receives a request for a URI which it "knows" is a "thing URI" rather than a document URI, it returns a response which provides a second URI as a source of information about the thing, and the client then sends a second request for that second URI. Formally, the initial response uses the HTTP "303 See Also" status code, which sometimes leads to these being called colloquially "303 URIs", even though there's nothing special about the URIs themselves.

I'm conscious that I'm skipping over some of the details here; for a more detailed description, particularly of the "flow" of the interactions involved, and some consideration of the pros and cons of the two approaches, see Cool URIs for the Semantic Web.

URI Sets for the UK Public Sector

The Cool URIs note focuses mainly on the patterns of "interaction" for handling the two approaches to moving from "thing URI" to document URI. Its examples include example URIs, but the exact form of those URIs is intended to be illustrative rather than prescriptive. I think it's important to note that in the redirect case, it is the server's notification to the client of the second URI that provides the client with that information. There is no technical requirement for a structural similarity in the forms of the "thing URI" and the document URI, and consumers of the URIs should rely on the information provided to them by the server rather than making assumptions about URI structure.

Having said that, the use of a shared, consistent set of URI patterns within a community can provide some useful "cues" to human readers of those URIs. It can also simplify the work of data providers - for example by facilitating the use of similar HTTP server configurations or the reuse of scripts/tools for serving "Linked Data" documents. With this (and other factors such as URI stability) in mind, the UK Cabinet Office has provided a set of guidelines, Designing URI Sets for the UK Public Sector which build on the W3C Cool URIs note, but offer more specific guidance, particularly on the design of URIs.

For the purposes of this discussion, of particular interest is the document's specification (in the "Definitions, frameworks and principles" section) of several distinct "types of URI", or perhaps more accurately, URIs for different categories of resource, and (in the "The path structure for URIs" section) of suggested structural patterns for each:

  • Identifier URIs (what I have been calling above "thing URIs") name "real-world things" and should use patterns like:
    • http://{domain}/{concept}/{reference}#id or
    • http://{domain}/id/{concept}/{reference}
    where:
    • {concept} is "a word or string to capture the essence of the real-world 'Thing' that the set names e.g. school". (I think this is roughly what I think of as the name of a "resource type" - note this is a more generic use of the word "concept" than that of the SKOS concept)
    • {reference} is "a string that is used by the set publisher to identify an individual instance of concept".
    The document allows for the use of a hierarchy of concept-reference pairs in a single URI if appropriate, so for a specific class within a specific school, the path might be /id/school/123/class/5
  • Document URIs name the documents that provide information about, descriptions of, "real-world things". The suggested pattern is
    • http://{domain}/doc/{concept}/{reference}
    These documents are, I think, what Berners-Lee calls Generic Resources. For each such document, multiple representations may be available, each in different formats, and each of those multiple "more specific" concrete formats may be available as a separate resource in its own right (see "Representation URIs" below). If descriptions vary over time, and those variants are to be exposed then a series of "date-stamped" URIs can be used, with the pattern
    • http://{domain}/doc/{concept}/{reference}/{yyyy-mm-dd}
  • Representation URIs name a document in a specific format. The suggested pattern is
    • http://{domain}/doc/{concept}/{reference}/{doc.file-extension}
    This can also be applied to a date-stamped version:
    • http://{domain}/doc/{concept}/{reference}/{yyyy-mm-dd}/{doc.file-extension}

The guidelines also distinguish a category of "Ontology URIs" which use the pattern http://{domain}/def/{concept}. I had interpreted "Ontology URIs" as applying to the identification of classes and properties, and I was treating the terms/concepts of a thesaurus as "conceptual things" which would fall under the /id/ case. But I do notice that in an example in which she refers to these guidelines, Jeni Tennison uses the /def/ pattern for an SKOS example. I don't think it's really much of an issue - and pretty much all the other points I discuss apply anyway - but any advice on this point would be appreciated.

So, applying these general rules for the thesaurus case, where, as I discussed in the previous post, the primary types of thing of interest in our SKOS-modelled thesaurus are "concepts" and "terms":

  • Term URI Pattern: http://example.org/id/term/T{termid}
  • Concept URI Pattern: http://example.org/id/concept/C{termid}

However, if we bear in mind that within the URI space of the example.org domain, we're likely to want to represent, and coin URIs for the components of, multiple thesauri, and our "termid" references (drawn from the term numbers in the input) are unique only within the scope of a single thesaurus, then we should include some sort of thesaurus-specific component in the path to "qualify" those term numbers. Let's use the token "polthes" for this example:

  • Term URI Pattern: http://example.org/id/term/{schemename}/T{termid}
    Example: http://example.org/id/term/polthes/T2
  • Concept URI Pattern: http://example.org/id/concept/{schemename}/C{termid}
    Example: http://example.org/id/concept/polthes/C2

We should also include a URI for the thesaurus as a whole. The SKOS model provides a generic class of "concept scheme" to cover aggregations of concepts:

  • Concept Scheme URI Pattern: http://example.org/id/concept-scheme/{schemename}
    Example: http://example.org/id/concept-scheme/polthes

where each concept and term in the thesaurus is linked to this concept scheme by a triple using the skos:inScheme property. (I omitted this from the example in the previous post so that it was easier to focus on the concept-term and concept-concept relationships, and to try to keep the already rather complex diagrams slightly readable!)

Aside: An alternative for the concept and term URI patterns would be to use the "hierarchical concept-reference" approach and use patterns like:

  • Term URI Pattern: http://example.org/id/concept-scheme/{schemename}/term/T{termid}
    Example: http://example.org/id/concept-scheme/polthes/term/T2
  • Concept URI Pattern: http://example.org/id/concept-scheme/{schemename}/concept/C{termid}
    Example: http://example.org/id/concept-scheme/polthes/concept/C2

My only slight misgiving about this approach is that (bearing in mind that strictly speaking the URIs should be treated as opaque and such information obtained from the descriptions provided by the server) in the (non-hierarchical) form I suggested initially, the string indicating the resource type ("concept", "term") is fairly clear to the human reader from its position following the "/id/" component in the URI (e.g. http://example.org/id/concept/polthes/C2). But with the hierarchical form, it perhaps becomes slightly less clear (e.g. http://example.org/id/concept-scheme/polthes/concept/C2). But that is a minor gripe, and really the hierarchical form would serve just as well. For the remainder of this document, in the examples, I'll continue with the initial non-hierarchical pattern I suggested above, but it may be something to revisit if the hierarchical form is more in line with the intent - and current usage - of the guidelines. (So again, comments are welcome on this point.)

For each of these "Identifier URIs", there should be a corresponding "Document URI" naming a document describing the thing, and following the /doc/ pattern:

  • Description of Concept Scheme: http://example.org/doc/concept-scheme/polthes
  • Description of Term: http://example.org/doc/term/polthes/T{termid}
  • Description of Concept: http://example.org/doc/concept/polthes/C{termid}

And for each format in which the description is available, a corresponding "Representation URI":

  • Description of Concept Scheme (HTML): http://example.org/doc/concept-scheme/polthes/doc.html
  • Description of Concept Scheme (RDF/XML): http://example.org/doc/concept-scheme/polthes/doc.rdf
  • Description of Concept (HTML): http://example.org/doc/concept/polthes/C{termid}/doc.html
  • Description of Concept (RDF/XML): http://example.org/doc/concept/polthes/C{termid}/doc.rdf
  • Description of Term (HTML): http://example.org/doc/term/polthes/T{termid}.html
  • Description of Term (RDF/XML): http://example.org/doc/term/polthes/T{termid}.html

Descriptions and "boundedness"

The three documents I've mentioned so far (Berners-Lee's Linked Data Design Issues note, the W3C Cool URIs document, or the Cabinet Office URI patterns document) don't have a great deal to say on the topic of the content of the document which is returned as a description of a "thing". This is discussed briefly in the "Linked Data Tutorial" document by Chris Bizer, Richard Cyganiak and Tom Heath, How to Publish Linked Data on the Web.

In principle at least, it is quite possible to provide a single document which describes several resources. This approach has been quite common in association with the use of "hash URIs" in a pattern where a number of "thing URIs" differ only by fragment identifier, and share the same "non-fragment" part (http://example.org/school#1, http://example.org/school#2, ... http://example.org/school#99), and a number of common ontologies make use of this sort of approach. One consequence is that a client interested only in a single resource always retrieves the full set of descriptions. If my thesaurus really did consist only of the half-dozen concepts and terms I described in the example in my previous post, retrieving a document describing them all would probably not be a problem, but for the "real world" case where there are several thousand terms involved, it would represent a significant overhead if every request results in the transfer of several megabytes of data.

Generally, the approach taken is for the data provider to generate some set of "useful information" "about" the requested resource - though saying that rather begs the question of what constitutes "useful" (and whether there is a single answer to that question that is applicable across different datasets dealing with different resource types). Typically the generation of a description is based on some set of rules which, for a specified node in the dataset RDF graph (a specified "thing URI"), selects a "nearby" subgraph of the graph, representing a "bounded description" made up of triples/statements "about" the thing itself and maybe also "about" closely related resources.

Various such algorithms for generating such descriptions are possible and I don't intend to attempt any sort of rigorous analysis or comparison of them here - for further discussion see e.g. Patrick Stickler's CBD - Concise Bounded Description or Bounded Descriptions in RDF from the Talis Platform wiki. But there is one aspect which I think it is worth mentioning in the context of the thesaurus example. One of the key differences between the algorithms used to generate descriptions is how they treat the "directionality" of arcs in the RDF graph, i.e. whether they base the description only on arcs "outbound from" the selected node (RDF triples with that URI as subject), or whether they take into account both arcs "outbound from" and "inbound to" the node (triples with the URI as either subject or object).

That probably sounds like a very abstract point, and the significance is perhaps best illustrated through a concrete example. Let's take the graph for the example from my previous post (tweaked to use the slightly amended URI patterns above - I've continued to leave out the concept scheme links to keep things simple) and suppose this is the dataset to which I'm applying the rules.

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix con: <http://example.org/id/concept/polthes/> .
@prefix term: <http://example.org/id/term/polthes/> .

term:T1 a skosxl:Label;
        skosxl:literalForm "Civil violence"@en.

con:C2 a skos:Concept;
       skos:prefLabel "Political violence"@en;
       skosxl:prefLabel term:T2;
       skos:altLabel "Civil violence"@en;
       skosxl:altLabel term:T1;
       skos:altLabel "Violent protest"@en;
       skosxl:altLabel term:T5;
       skos:broader con:C4;
       skos:narrower con:C3 .

term:T2 a skosxl:Label;
        skosxl:literalForm "Political violence"@en.

con:C3 a skos:Concept;
       skos:prefLabel "Terrorism"@en;
       skosxl:prefLabel term:T3;
       skos:broader con:C2 .

term:T3 a skosxl:Label;
        skosxl:literalForm "Terrorism"@en.
       
con:C4 a skos:Concept;
       skos:prefLabel "Violence"@en;
       skosxl:prefLabel term:C4;
       skos:narrower con:C2 .

term:T4 a skosxl:Label;
        skosxl:literalForm "Violence"@en.

term:T5 a skosxl:Label;
        skosxl:literalForm "Violent protest"@en.

And in graphical form (as before with the rdf:type triples omitted):

Fig1

(In the figures below, I've tried to represent the idea that a subgraph is being selected by "fading out" the parts which aren't selected, and leaving the selected part fully visible. I hope the images are sufficiently clear for this to be effective!)

Let's first take the approach known as the "concise bounded description (CBD)" - formally defined here, but essentially based on "outbound" links. For the concept C2 (http://example.org/id/concept/polthes/C2), the CBD would consist of the following subgraph (i.e. the document http://example.org/doc/concept/polthes/C2 would contain this data):

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix con: <http://example.org/id/concept/polthes/> .
@prefix term: <http://example.org/id/term/polthes/> .

con:C2 a skos:Concept;
       skos:prefLabel "Political violence"@en;
       skosxl:prefLabel term:T2;
       skos:altLabel "Civil violence"@en;
       skosxl:altLabel term:T1;
       skos:altLabel "Violent protest"@en;
       skosxl:altLabel term:T5;
       skos:broader con:C4;
       skos:narrower con:C3 .
Fig2

For the term T2 (http://example.org/id/term/polthes/T2), corresponding to the "preferred label" (i.e. the document http://example.org/doc/term/polthes/T2 would contain):

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix con: <http://example.org/id/concept/polthes/> .
@prefix term: <http://example.org/id/term/polthes/> .

term:T2 a skosxl:Label;
        skosxl:literalForm "Political violence"@en.
Fig3

For the term T5 (http://example.org/id/term/polthes/T5), corresponding to the "alternate label" (i.e. the document http://example.org/doc/term/polthes/T5 would contain):

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix con: <http://example.org/id/concept/polthes/> .
@prefix term: <http://example.org/id/term/polthes/> .

term:T5 a skosxl:Label;
        skosxl:literalForm "Violent protest"@en.
Fig4

Note that for the two terms, the "concise bounded description" is quite minimal (though remember I've simplified it a bit): in particular, it does not include the relationship between the term and the concept. This is because using the SKOS-XL vocabulary that relationship is expressed as a triple in which the concept URI is the subject and the term URI is the object - an "inbound arc" to the term URI in the graph - which the CBD approach does not take into account when constructing the description of the term.

But the fact that the relationship is represented only in this way - a link from concept to term, without an inverse term to concept link - is arguably slightly arbitrary.

An alternative approach, the "symmetric bounded description" seeks to address this sort of issue, by taking into account both "outbound" and "inbound". For the same three cases, it produces the following results:

Concept C2:

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix con: <http://example.org/id/concept/polthes/> .
@prefix term: <http://example.org/id/term/polthes/> .

con:C2 a skos:Concept;
       skos:prefLabel "Political violence"@en;
       skosxl:prefLabel term:T2;
       skos:altLabel "Civil violence"@en;
       skosxl:altLabel term:T1;
       skos:altLabel "Violent protest"@en;
       skosxl:altLabel term:T5;
       skos:broader con:C4;
       skos:narrower con:C3 .

con:C3 skos:broader con:C2 .

con:C4 skos:narrower con:C2 .
Fig5

Term T2:

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix con: <http://example.org/id/concept/polthes/> .
@prefix term: <http://example.org/id/term/polthes/> .

term:T2 a skosxl:Label;
        skosxl:literalForm "Political violence"@en.

con:C2 skosxl:prefLabel term:T2 .
Fig6

Term T5:

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix con: <http://example.org/id/concept/polthes/> .
@prefix term: <http://example.org/id/term/polthes/> .

term:T5 a skosxl:Label;
        skosxl:literalForm "Violent protest"@en.

con:C5 skosxl:altLabel term:T5 .
Fig7

For the concept case, the difference is relatively minor (for the skos:broader and skos:narrower relationships, the inverse triples are ow also included). But for the term cases, the relationship between concept and term is now included.

So (rather long-windedly, I fear!), I'm trying to illustrate that it's worth thinking a little bit about the content of descriptions and how they work as "stand-alone" documents (albeit linked to others). And for this dataset, I think there's an argument that generating "symmetric" descriptions that include inbound links as well as outbound ones probably results in more "useful information" for the consumer of the data.

(Again, I'm simpifying things slightly here to illustrate the point: I've omitted type information and the links to indicate concept scheme membership. Typically the descriptions might (depending on the algorithm) include labels for related resources mentioned, rather than just the URIs, and would include some metadata about the document - its publisher, last modification date, licence/rights information, a link to the dataset of which it is a member, and so on.)

Summary

What I've tried to do in this post is expand on some of the "linked data"-specific aspects of the project, and to examine some of the design choices to be made in applying some of those general rules to this particular case, shaped both by external factors (like the Cabinet Office guidelines on URIs) and by characteristics of the data itself (like the directionality of links made using SKOS-XL). In the next post, I'll move on, as promised, to the questions of how the data changes over time, and any implications of that.

February 22, 2011

SALDA

As I've mentioned here before, I'm contributing to a project called LOCAH, funded by the JISC under its 02/10 call Deposit of research outputs and Exposing digital content for education and research (JISCexpo), working with MIMAS and UKOLN on making available bibliographic and archival metadata as linked data.

Partly as a result of that work, I was approached by Chris Keene from the University of Sussex to be part of a bid they were preparing to another recent JISC call, 15/10: Infrastructure for education and research programme, under the "Infrastructure for resource discovery" strand, which seeks to implement some of the approaches outlined by the Resource Discovery Task Force.

The proposal was to make available metadata records from the Mass Observation Archive, data currently managed in a CALM archival management system, as linked data. I'm pleased to report that the bid was successful, and the project, Sussex Archive Linked Data Application (SALDA), has been funded. It's a short project, running for six months from now (February) to July 2011. There's a brief description on the JISC Web site here, a SALDA project blog has just been launched, and the project manager Karen Watson provides more details of the planned work in her initial post there.

I'm looking forward to working with the Sussex team to adapt and extend some of the work we've done with LOCAH for a new context. I expect most information will appear over on the SALDA blog, but I'll try to post the occcasional update on progress here, particularly on any aspects of general interest.

February 11, 2011

Term-based thesauri and SKOS (Part 1)

I'm currently doing a piece of work on representing a thesaurus as linked data. I'm working on the basis that the output will make use of the SKOS model/RDF vocabulary. Some of the questions I'm pondering are probably SKOS Frequently Asked Questions, but I thought it was worth working through my proposed solution and some of the questions I'm pondering here, partly just to document my own thought processes and partly in the hope that SKOS implementers with more experience than me might provide some feedback or pointers.

SKOS adopts a "concept-based" approach (i.e. the primary focus is on the description of "concepts" and the relationships between them); the source thesaurus uses a "term-based" approach based on the ISO 2788 standard. I found the following sources provided helpful summaries of the differences between these two approaches:

Briefly (and simplifying slightly for the purposes of this discussion - I've omitted discussion of documentary attributes like "scope notes"), in the term-based model (and here I'm dealing only with the case of a simple monolingual thesaurus):

  • the only entities considered are "terms" (words or sequences of words)
  • terms can be semantically equivalent, each expressing a single concept, in which case they are distinguished as "preferred terms"/descriptors or "non-preferred terms"/non-descriptors, using USE (non-preferred to preferred) and UF (use for) (preferred to non-preferred) relations. Each non-preferred term is related to a single preferred term; a preferred term may have many non-preferred terms
  • a preferred term can be related by ("hierarchical") BT (broader term) and NT (narrower term) relations to indicate that it is more specific in meaning, or more general in meaning, than another term
  • a preferred term can also be related by an ("associative") RT (related term) relation to a second term, where there is some other relationship which may be useful for retrieval (e.g. overlapping meanings)

In the SKOS (concept-based) model:

  • the primary entities considered are "concepts", "units of thought"
  • a concept is associated with labels, of which at most one (in the monolingual case) is the preferred label, the others alternative labels or hidden labels
  • a concept can be related by "has broader", "has narrower" and "has related" relations to other concepts

The concept-based model, then, is explicit that the thesaurus "contains" two distinct types of thing: concepts and labels. In the base SKOS model, the labels are modelled as RDF literals, so the expression of relationships between labels is not supported. SKOS-XL provides an extension to SKOS which models labels as RDF resources in their own right, which can be identified, described and linked.

To represent a term-based thesaurus in SKOS, then, it is necessary to make a mapping from the term-based model with its term-term relationships to the concept-based model with its concept-label and concept-concept relationships. (An alternative approach would be to represent the thesaurus in RDF using a model/vocabulary which expresses directly the "native" "term-based" model of the source thesaurus. I'm working on the basis that using the SKOS/concept-based approach will facilitate interoperability with other thesauri/vocabularies published on the Web.

Core Mapping

The key principle underlying the mapping is the notion that, in the term-based approach, a preferred term and its multiple related non-preferred terms are acting as labels for a single concept.

Each set of terms related by USE/UF relations in the term-based approach, then, is mapped to a single SKOS concept, with the single "preferred term" becoming the "preferred label" for that concept and the (possibly multiple) "non-preferred terms" each becoming "alternative labels" for that same concept.

Consider the following example, where the term "Political violence" is a preferred term for "Civil violence" and "Violent unrest", and has a broader term "Violence" and narrower term "Terrorism":


Civil violence
USE Political violence

Political violence
UF Civil violence
UF Violent protest
BT Violence
NT Terrrorism

Terrorism
BT Political violence

Violence
NT Political violence

Violent protest
USE Political violence

(Leaving aside for a moment the question of how the URIs might be generated) an SKOS representation takes the form:


@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix con: <http://example.org/id/concept/> .

con:C2 a skos:Concept;
       skos:prefLabel "Political violence"@en;
       skos:altLabel "Civil violence"@en;
       skos:altLabel "Violent protest"@en;
       skos:broader con:C4;
       skos:narrower con:C3 .

con:C3 a skos:Concept;
       skos:prefLabel "Terrorism"@en;
       skos:broader con:C2 .

con:C4 a skos:Concept;
       skos:prefLabel "Violence"@en;
       skos:narrower con:C2 .

A graphical representation perhaps makes the links clearer (to keep things simple, I've omitted the arcs and nodes corresponding to the rdf:type triples here):

Fig1

One of the consequences of this approach is that some of the "direct" relationships between terms are reflected in SKOS as "indirect" relationships. In the term-based model, a non-preferred term is linked to a preferred term by a simple "USE" relationship. In the SKOS example, to find the "preferred term" for the term "Violent protest", one finds the node for the concept of to which it is linked via the skos:altLabel property, and then locates the literal which is linked to that concept via an skos:prefLabel property.

SKOS-XL and terms as Labels

As mentioned above, the SKOS specification defines an optional extension to SKOS which supports the representation of labels as resources in their own right, typically alongside the simple literal representation.

Using this extension, then the example above might be expressed as:


@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix con: <http://example.org/id/concept/> .
@prefix term: <http://example.org/id/term/> .

term:T1 a skosxl:Label;
        skosxl:literalForm "Civil violence"@en.

con:C2 a skos:Concept;
       skos:prefLabel "Political violence"@en;
       skosxl:prefLabel term:T2;
       skos:altLabel "Civil violence"@en;
       skosxl:altLabel term:T1;
       skos:altLabel "Violent protest"@en;
       skosxl:altLabel term:T5;
       skos:broader con:C4;
       skos:narrower con:C3 .

term:T2 a skosxl:Label;
        skosxl:literalForm "Political violence"@en.

con:C3 a skos:Concept;
       skos:prefLabel "Terrorism"@en;
       skosxl:prefLabel term:T3;
       skos:broader con:C2 .

term:T3 a skosxl:Label;
        skosxl:literalForm "Terrorism"@en.
       
con:C4 a skos:Concept;
       skos:prefLabel "Violence"@en;
       skosxl:prefLabel term:C4;
       skos:narrower con:C2 .

term:T4 a skosxl:Label;
        skosxl:literalForm "Violence"@en.

term:T5 a skosxl:Label;
        skosxl:literalForm "Violent protest"@en.

And in graphical form (as above with the rdf:type triples omitted):

Fig2

That may be rather a lot to take in, so below I've shown a subgraph, including only the labels related to a single concept:

Fig3

Using the SKOS-XL extension in this way does introduce some additional complexity. On the other hand:

  1. it introduces a set of resources (of type skosxl:Label) that have a one-to-one correspondence to the terms in the source thesaurus, so perhaps makes the mapping between the two models more explicit and easier for a human reader to understand.
  2. it makes the labels themselves URI-identifiable resources which can be referenced in this data and in data created by others. So, it becomes possible to make assertions about relationships between the labels, or between labels and other things.

Coincidentally, as I was writing this post, I note that Bob duCharme has posted on the use of SKOS-XL for providing metadata "about" the individual labels, distinct from the concepts. So I might add a triple to indicate, say, the date on which a particular label was created. I don't think there is an immediate requirement for that in the case I'm working on, but there may be in the future.

Another possible use case is the ability for other parties to make links specifically to a label, rather than to the concept which it labels. A document could be "tagged", for example, by association with a particular label from a particular thesaurus, rather than just with the plain literal.

Relationships between Labels

The SKOS-XL vocabulary defines only a single generic relationship between labels (the skosxl:labelRelation property) with the expectation that implementers define subproperties to handle more specific relationship types.

The example given of such a relationship in the SKOS spec is that of a case where one label is an acronym for another label.

One thing I wondered about was introducing properties here to reflect the USE/UF relations in the term-based model e.g. in the above example:


@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix con: <http://example.org/id/concept/> .
@prefix term: <http://example.org/id/term/> .
@prefix ex: <http://example.org/prop/> .

term:T1 a skosxl:Label;
        skosxl:literalForm "Civil violence"@en;
        ex:use term:T2 .

term:T2 a skosxl:Label;
        skosxl:literalForm "Political violence"@en;
        ex:useFor term:T1 ;
        ex:useFor term:T5 .
       
term:T5 a skosxl:Label;
        skosxl:literalForm "Violent protest"@en;
        ex:use term:T2 .

This wouldn't really add any new information; rather it just provides a "shortcut" for the information that is already present in the form of the "indirect" preferred label/alternative label relationships, as noted above.

I hesitate slightly about adding this data, partly on the grounds that it seems slightly redundant, but also on the grounds that this seems like a rather different "category" of relationship from the "is acronym for" relationship. This point is illustrated in figure 4 in the SWAD-E Review of Thesaurus Work document, which divides the concept-based model into three layers:

  • the (upper in the diagram) "conceptual" layer, with the broader/narrower/related relations between concepts
  • the (lower) "lexical" layer, with terms/labels and the relations between them
  • and a middle "indication"/"implication" layer for the preferred/non-preferred relations between concepts and terms/labels

In that diagram, the example lexical relationships exist independently of the preferred/non-preferred relationships e.g. "AIDS" is an abbreviation for "Acquired Immunodeficiency Syndrome", regardless of which is considered the preferred term in the thesaurus. With the use/useFor relationships here, this would not be the case; they would indeed vary with the preferred/non-preferred relationships.

So, having thought initially it might be useful to include these relationships, I'm becoming more rather less sure.

And without them, I'm not sure whether, for this case, the use of the SKOS-XL would be justified - though it may well be, for the other purposes I mentioned above. So, any thoughts on practice in this area would be welcome.

URI Patterns

Following the linked data principles, I want to assign http URIs for all the resources, so that descriptions of those resources can be provided using HTTP. If we go with the SKOS-XL approach, that means assigning http URIs for both concepts and labels.

The input data includes, for each term within the thesaurus, an identifier code, unique within the thesaurus, and which remains stable over time, i.e. as changes are made to the thesaurus a single term remains associated with the same code. (I'll explore the question of how the thesaurus changes over time in a follow-up post.) So in fact the example above looks something like:


Civil violence
USE Political violence
TNR 1

Political violence
UF Civil violence
UF Violent protest
BT Violence
NT Terrrorism
TNR 2

Terrorism
BT Political violence
TNR 3

Violence
NT Political violence
TNR 4

Violent protest
USE Political violence
TNR 5

The proposal is to use that identifier as the basis of URIs, for both labels/terms and concepts:

  • Term URI Pattern: http://example.org/id/term/T{termid}
  • Concept URI Pattern: http://example.org/id/concept/C{termid}

The generation of label/term URIs, then, is straightforward, as each term (with identifier code) in the input data maps to a single label/term in the SKOS RDF data. The concept case is a little more complicated. As I noted above, the mapping between the term-based model and the concept-based model means that there is a many-to-one relationship between terms and concepts: several terms are related to a single concept, one as preferred label, others as alternative labels. In my first cut at this at least (more on this in a follow-up post), I've generated the URI of the concept based on the identifier code associated with the preferred/descriptor term.

So in my example above, the three terms "Civil violence, "Political violence", and "Violent protest" are mapped to labels for a single concept, and the URI of that concept is constructed from the identifier code of the preferred/descriptor term ("Political violence").

Summary

I've tried to cover the basic approach I'm taking to generating the SKOS/SKOS-XL RDF data from the term-based thesaurus input. In this post, I've focused only on the thesaurus as a static entity. In a follow-up post, I'll look in some detail at the changes which can occur over time, and examine any implications for the choices made here, particularly for our choices of URI patterns.

February 03, 2011

Metadata guidelines for the UK RDTF - please comment

As promised last week, our draft metadata guidelines for the UK Resource Discovery Taskforce are now available for comment in JISCPress. The guidelines are intended to apply to UK libraries, museums and archives in the context of the JISC and RLUK Resource Discovery Taskforce activity.

The comment period will last two weeks from tomorrow and we have seeded JISCPress with a small number of questions (see below) about issues that we think are particularly worth addressing. Of course, we welcome comments on all aspects of the guidelines, not just where we have raised issues. (Note that you don't have to leave public comments in JISCPress if you don't want to - an email to me or Pete will suffice. Or you can leave a comment here.)

The guidelines recommend three approaches to exposing metadata (to be used individually or in combination), referred to as:

  1. the community formats approach;
  2. the RDF data approach;
  3. the Linked Data approach.

We've used words like 'must' and 'should' but it is worth noting that at this stage we are not in a position to say how these guidelines will be applied - if at all. Nor whether there will be any mechanisms for compliance put in place. On that basis, treat phrases like 'must do this' as meaning, 'you must do this for your metadata to comply with one or other approach as recommended by these guidelines' - no more, no less. I hope that's clear.

When we started this work, we we began by trying to think about functional requirements - always a good place to start. In this case however, that turned out not to make much sense. We are not starting from a green field here. Lots of metadata formats are already in use and we are not setting out with the intent of changing current cataloguing practice across libraries, museums and archives. What we can say is that:

  1. we have tried to keep as many people happy as possible (hence the three approaches), and
  2. we want to help libraries, museums and archives expose existing metadata (and new metadata created using existing practice) in ways that support the development of aggregator services and that integrate well with the web (of data).

As mentioned previously, the three approaches correspond roughly to the 3-star, 4-star and 5-star ratings in the W3C's Linked Data Star Ratings Scheme. To try and help characterise them, we prepared the following set of bullet points for a meeting of the RDTF Technical Advisory Group earlier this week:

The community data approach

  • the “give us what you’ve got” bit
  • share existing community formats (MARC, MODS, BibTeX, DC, SPECTRUM, EAD, XML, CSV, JSON, RSS, Atom, etc.) over RESTful HTTP or OAI-PMH
  • for RESTful HTTP, use sitemaps and robots.txt to advertise availability and GZip for compression
  • for CSV, give us a column called ‘label’ or ‘title’ so we’ve got something to display and a column called 'identifier' if you have them
  • provide separate records about separate resources
  • simples!

The RDF data approach

  • use RDF
  • model according to FRBR, CIDOC CRM or EAD and ORE where you can
  • re-use existing vocabularies where you can
  • assign URIs to everything of interest
  • make big buckets of RDF (e.g. as RDF/XML, N-Tuples, N-Quads or RDF/Atom) available for others to play with
  • use Semantic Sitemaps and the Vocabulary of Interlinked Datasets (VoID) to advertise availability of the buckets

The Linked Data approach

Dunno if that is a helpful summary but we look forward to your comments on the full draft. Do your worst!

For the record, the issues we are asking questions about mainly fall into the following areas:

  • is offering a choice of three approaches helpful?
  • for the community formats approach, are the example formats we list correct, are our recommendations around the use of CSV useful and are JSON and Atom significant enough that they should be treated more prominently?
  • does the suggestion to use FRBR and CIDOC CRM as the basis for modeling in RDF set the bar too high for libraries and museums?
  • where people are creating Linked Data, should we be recommending particular RDF datasets/vocabularies as the target of external links?
  • do we need to be more prescriptive about the ways that URIs are assigned and dereferenced?

Note that a printable version of the draft is also available from Google Docs.

November 29, 2010

Still here & some news from LOCAH

Contrary to appearances, I haven't completely abandoned eFoundations, but recently I've mostly been working on the JISC-funded LOCAH project which I mentioned here a while ago, and my recent scribblings have mostly been over on the project blog.

LOCAH is working on making available some data from the Archives Hub (a collection of archival finding-aids i.e. metadata about archival collections and their constituent items) and from Copac (a "union catalogue" of bibliographic metadata from major research & specialist libraries) as linked data.

So far, I've mostly been working with the EAD data, with Jane Stevenson and Bethan Ruddock from the Archives Hub. I've posted a few pieces on the LOCAH blog, on the high-level architecture/workflow, on the model for the archival description data (also here), and most recently on the URI patterns we're using for the archival data.

I've got an implementation of this as an XSLT transform that reads an EAD XML document and outputs RDF/XML, and have uploaded the results of applying that to a small subset of data to a Talis Platform instance. We're still ironing out some glitches but there'll be information about that on the project blog coming in the not too distant future.

On a personal note, I'm quite enjoying the project. It gives me a chance to sit down and try to actually apply some of the principles that I read about and talk about, and I'm working through some of the challenges of "real world" data, with all its variability and quirks. I worked in special collections and archives for a few years back in the 1990s, when the institutions where I was working were really just starting to explore the potential of the Web, so it's interesting to see how things have changed (or not! :-)), and to see the impact of and interest in some current technological (and other) trends within those communities. It also gives me a concrete incentive to explore the use of tools (like the Talis Platform) that I've been aware of but have only really tinkered with: my efforts in that space inevitably bring me face to face with the limited scope of my development skills, though it's also nice to find that the availability of a growing range of tools has enabled me to get some results even with my rather stumbling efforts.

It'a also an opportunity for me to discuss the "linked data" approach with the archivists and librarians within the project - in very concrete ways based on actual data - and to try to answer their questions and to understand what aspects are perceived as difficult or complex - or just different from existing approaches and practices.

So while some of my work necessarily involves me getting my head down and analysing input data or hacking away at XSLT or prodding datasets with SPARQL queries, I've been doing my best to discuss the principles behind what I'm doing with Jane and Bethan as I go along, and they in turn have reflected on some of the challenges as they perceive them in posts like Jane's here.

One of the project's tasks is to:

Explore and report on the opportunities and barriers in making content structured and exposed on the Web for discovery and use. Such opportunities and barriers may coalesce around licensing implications, trust, provenance, sustainability and usability.

I think we're trying to take a broad view of this aspect of the project, so that it extends not just to the "technical" challenges in cranking out data and how we address them, but also incorporates some of these "softer" elements of how we, as individuals with backgrounds in different "communities", with different practices and experiences and perspectives, share our ideas, get to grips with some of the concepts and terminology and so on. Where are the "pain points" that cause confusion in this particular context? Which means of explaining or illustrating things work, and which don't? What (if any!) is the value of the "linked data" approach for this sort of data? How is that best demonstrated? What are the implications, if any, for information management practices within this community? It may not be the case that SPARQL becomes a required element of archivists' training any time soon, but having these conversations, and reflecting on them, is, I think, an important part of the LOCAH experience.

November 09, 2010

Is (a debate about) 303 really necessary?

At the tail end of last week Ian Davis of Talis wrote a blog post entitled, Is 303 Really Necessary?, which gave various reasons why the widely adopted (within the Linked Data community) 303-redirect pattern for moving from the URI for a non-Information Resource (NIR) to the URI for an Information Resource (IR) was somewhat less than optimal and offering an alternative approach based on the simpler, and more direct, use of a 200 response. For more information about IRs, NIRs and the use of 303 in this context see the Web Architecture section of How to Publish Linked Data on the Web.

Since then the [email protected] mailing list has gone close to ballistic with discussions about everything from the proposal itself to what happens to the URI you have assigned to China (a NIR) if/when the Chinese border changes. One thing the Linked Data community isn't short of is the ability to discuss how much RDF you can get on the head of a pin ad infinitum.

But as John Sheridan pointed out in a post to the mailing list, there are down-sides to such a discussion, coming at a time when adoption of Linked Data is slowly growing.

debating fundamental issues like this is very destabilising for those of us looking to expand the LOD community and introduce new people and organisations to Linked Data. To outsiders, it makes LOD seem like it is not ready for adoption and use - which is deadly. This is at best the 11th hour for making such a change in approach (perhaps even 5 minutes to midnight?).

I totally agree. Unfortunately the '200 cat' is now well and truly out of the 'NIR bag' and I don't suppose any of us can put it back in again so from that perspective it's already too late.

As to the proposal itself, I think Ian makes some good points and, from my slightly uninformed perspective, I don't see too much wrong with what he is suggesting. I agree with him that being in the (current) situation where clients can infer semantics based on HTTP response codes is less than ideal. I would prefer to see all semantics carried explicitly in the various representations being served on the Web. In that sense, the debate about 303 vs 200 becomes one that is solely about the best mechanism for getting to a representation, rather than being directly about semantics. I also sense that some people are assuming that Ian is proposing that the current 303 mechanism needs to be deprecated at some point in the future, in favour of his 200 proposal. That wasn't my interpretation. I assume that both mechanisms can sit alongside each other quite happily - or, at least, I would hope that to be the case.

November 03, 2010

Google support for GoodRelations

Google have announced support for the GoodRelations vocabulary for product and price information in Web pages, Product properties: GoodRelations and hProduct. This is primarily of interest to ecommerce sites but is more generally interesting because it is likely to lead to a significant rise in the amount of RDF flowing around the Web. It therefore potentially represents a significant step forward for the adoption of the Semantic Web and Linked Data.

Martin Hepp, the inventor of the GoodRelations vocabulary, has written about this development, Semantic SEO for Google with GoodRelations and RDFa, suggesting a slightly modified form of markup which is compatible with that adopted by Google but that is also "understood by ALL RDFa-aware search engines, shopping comparison sites, and mobile services".

October 26, 2010

Attending meetings remotely - a cautionary tale

In these times of financial constraints and environmental concerns, attending meetings remotely (particularly those held overseas) is becoming increasingly common. Such was the case, for me, at 7pm UK time on Friday night last week... I should have been eating pizza in front of the TV with my family but instead was messing about with Skype, my house phone, IRC and Twitter in an attempt to join the joint meeting of the DC Architecture Forum and the W3C Library Linked Data Incubator Group (described in Pete's recent post, The DCMI Abstract Model in 2010).

The meeting started with Tom Baker summarising the history and current state of the DCMI Abstract Model (the DCAM) - a tad long perhaps but basically a sound introduction and overview. Unfortunately, my Skype connection dropped a couple of times during his talk (I have no idea why) and I resorted to using my house phone instead - using the W3C bridge in Bristol. This proved more stable but some combination of my phone and the microphone positioning in the meeting meant that sound, particularly from speakers in the audience, was rather poor.

By the time we got to the meat of the discussion about the future of the DCAM I was struggling to keep up :-(. I made a rather garbled contribution, trying to summarise my views on the possible ways forward but emphasising that all the interesting possibilities had the same end-game - that DCMI would stop using the language of its own world-view, the DCAM, and would instead work within the more widely accepted language of the RDF model and Linked Data - and that the options were really about how best we get there, rather than about where we want to go.

Unfortunately, this is a view that is open to some confusion because the DCAM itself uses the RDF model. So when I say that we should stop using the DCAM and start using RDF and Linked Data its not like saying that we should stop using model A and start using model B. Rather, it's a case of carrying on with the current model (i.e. the RDF model) but documenting it and talking about it using the same language as everyone else, thus joining forces with more active communities elsewhere rather than silo'ing ourselves on the DC mailing lists by having a separate way of talking.

So, anyway, I don't know how well I put my point of view across - one of the problems of being remote is that the only feedback you get is from the person taking minutes, in this case in the W3C IRC channel:

18:50:56 [andypowe11] ok, i'd like to speak at some point

18:52:34 [markva] andypowe11: options 2b, 3 and 4: all work to RDF, which is where we want to get to

18:52:55 [markva] ... which of these is better to get to that end game, wrt time available

18:53:23 [markva] ... 4 seems not ideal, but less effort

18:54:01 [markva] ... lean to 3; 2b has political value by taking along community; but 3 better given time

Stu Weibel spoke after me - a rather animated contribution (or so it seemed from afar). No problem with that... DCMI could probably do with a bit more animation to be honest. I understood him to be saying that we should adopt the Web model and that Linked Data offered us a useful chance to re-align ourselves with other Web activity. As I say, I was struggling to hear properly, so I may have mis-understood him completely. I glanced at the IRC minutes:

18:54:56 [markva] Stu Weibel: frustrated; no productive outcomes all these years

18:55:10 [markva] ... adopt Web as the model

18:55:37 [markva] ... nobody understands DCAM

18:56:03 [markva] ... W3C published architecture document after actual implementation

18:56:45 [markva] ... revive effort: develop reference software; easily drop in data, generate linked data 

I responded positively, trying to make it clear that I was struggling to hear and that I may have mis-interpreted him but noting the reference to 'linked data', which I'd heard as 'Linked Data':

18:57:12 [markva] andypowe11: support Stu 

The minute is factually correct - I did support Stu - but in an 'economical with the truth' kind of way because I only really supported what I thought I'd heard him say - and quite possibly not what he actually said! With hindsight, I wonder if the minute-taker's use of 'linked data' (lower-case) actually reflected some subtlety in what Stu said that I didn't really pick up on at the time. If nothing else, this exchange highlights for me the potential problems caused by those who want to distinguish 'linked data' (lower-case) from 'Linked Data' (upper-case) - there is no mixed-case in conversation, particularly not where it is carried out over a bad phone connection.

So anyway... the meeting moved on to other things and, feeling somewhat frustrated by the whole experience, I dropped off the call and found my cold pizza.

My point here is not about DCMI at all, though I still have no real idea whether I was right or wrong to agree with Stu. My gut feeling is that I probably agreed with some of what he said and disagreed with the rest - and the lesson, for me, is that I should be more careful before opening my mouth! My point is really about the practical difficulties of engaging in quite challenging intellectual debates in the un-even environment of a hybrid meeting where some people are f2f in the same room and others are remote. Or, to mis-quote William Gibson:

The future of hybrid events is here, it's just not evenly distributed yet.

:-(

(Note: none of this is intended to be critical of the minute-taker for the meeting who actually seems to have done a fantastic job of capturing a complex exchange of views in what must have been a difficult environment).

October 19, 2010

The DCMI Abstract Model in 2010

The Dublin Core Metadata Initiative's 2010 conference, DC-2010, takes place this week in Pittsburgh. I won't be attending, but Tom Baker and I have been working on a paper, A review of the DCMI Abstract Model with scenarios for its future for the meeting of the DCMI Architecture Forum - actually, a joint meeeting with the W3C Library Linked Data Incubator Group.

This is a two-part meeting, the first part looking at the position of the DCMI Abstract Model in 2010, five years on from its becoming a DCMI Recommendation, from the perspective of a new context in which the emergence of the "Linked Data" approach has brought a wider understanding and take-up of the RDF model.

The second part of the meeting looks at the question of what the DCMI community calls "application profiles", descriptions of "structural patterns" within data, and "validation" against such patterns. Within the DCMI context, work in this area has built on the DCAM, in the form of the draft Description Set Profile specification. But, as I've mentioned before, there is interest in this topic within some sectors of the "Linked Data" community.

Our paper tries to outline the historical factors which led to the development of the DCAM, to take stock of the current position, and suggest a number of possible paths forward. The aim is to provide a starting point for discussions at the face-to-face meeting, and the suggestions for ways forward are not intended to be an exhaustive list, but we felt it was important to have some concrete choices on the table:

  1. DCMI carries on developing DCAM as before, including developing the DSP specification and concrete syntaxes based on DCAM
  2. DCMI develops a "DCAM 2" specification (initial rough draft here), simplified and better aligned with RDF, and with a cleaner sepration of syntax and semantics, and either:
    1. develops the DSP specification and concrete syntaxes based on DCAM; or
    2. treats "DCAM 2" as a clarification and a transitional step towards promoting the RDF model and RDF abstract syntax
  3. DCMI deprecates the DCAM and henceforth promotes the RDF abstract syntax (and examines the question of "structural constraints" within this framework)
  4. DCMI does nothing to change the statuses of existing DCAM-related specifications

For my own part, in 2010, I do rather tend to look at the DCAM as an artefact "of its time". The DCAM was created during a period when the DCMI community was between two world views, one, which I tend to think of as a "classical view", reflected in Tom's "A Grammar of Dublin Core" 2000 article for Dlib, and based on the use of "appropriate literals" - character strings - as values, and a second based on the RDF model, emphasising the use of URIs as global names and supported by a formal semantics. In developing the DCAM, we tried to do two things:

  • To provide a formalisation of that "classical" view, the "DCMI community" metadata model, if you like: in 2003, DCMI had "a typology of terms" but little consensus on the nature of the data structure(s) in which those terms were referenced.
  • To provide a "bridge" between that "classical" model and the RDF model, through the use of RDF concepts, and the provision of a mapping to the RDF abstract syntax in Expressing Dublin Core metadata using the Resource Description Framework (RDF).

If I'm honest, I think we've had limited success in these immediate aims. In creating the DCAM "description set model" we may have achieved the former in theory, but in practice people coming to the DCAM from a "classical Dublin Core" viewpoint found that model complicated, and difficult to reconcile with their own conceptualisations. So as a "community model" I suspect the "buy-in" from that community isn't as high as we might like to imagine! People coming to the Dublin Core vocabularies with some familiarity with the (much simpler) RDF model, on the other hand, were confused by, and/or didn't see the need for, the description set model. And a third (and perhaps larger still) constituency was engaged primarily in the use of XML-based metadata schemas (like MODS), with little or no notion of an abstract syntax distinct from the XML syntax itself.

However, I think the existence of the DCAM has perhaps provided some more positive outcomes in other areas.

First, I think the very existence of the DCAM helped advance discussions around comparing metadata standards from different communities, particularly in the initiatives championed by Mikael Nilsson in comparing Dublin Core and the IEEE Learning Object Metadata standard, by drawing attention to the importance of articulating the "abstract models" in use in standards when making such comparisons and when trying to establish conditions for "interoperability" between applications based on them. (This work is nicely summarised in a paper for the ProLEARN project Harmonization of Metadata Standards).

Second, while implementation of the Description Set Profile specification itself has been limited, it has provided a focus for exploring the question of describing structural patterns and performing structural validation, based not on concrete syntaxes and on e.g. XML schema technologies, but on the abstract syntax. A recent thread on the Library Linked Data Incubator Group mailing list, starting with Mikael Nilsson's post, provides a very interesting discussion of current thinking, and this area will be the focus of the second part of the Pittsburgh meeting.

And the Singapore Framework's separation of "vocabulary" from patterns for, or constraints on, the use of that vocabulary - leaving aside for a moment the actual techniques for realising that distinction - has received some attention as a general basis for metadata schema development (see, for example, the comments by Scott Wilson in his contribution to the recent JISC CETIS meeting on interoperability standards.

Finally, it's probably stating the obvious that any choice of path forward needs to take into account that DCMI, like many similar organisations, finds itself in an environment in which resources, both human and financial, are extremely limited. Many individuals who devoted time and energy to DCMI activities in the past have shifted their energy to other areas - and while I continue to maintain some engagement with DCMI, mainly through the vocabulary management activity of the Usage Board, I include myself in this category. Many of the DCMI "community" mailing lists show little sign of activity, and what few postings there are seem to receive little response. And some organisations which in the past supported staff to work in this area are choosing to focus their resources elsewhere.

Against this background, more than ever, it seems to me, it is important for DCMI not to try to tackle problems in isolation, but rather to (re)align its approaches firmly with those of the Semantic Web community, to capitalise on the momentum - and the availability of tools, expertise and experience (and good old enthusiasm!) - being generated by the wider take-up of the "Linked Data" approach, and to explore solutions to what might appear to be "DC-specific" problems (but probably aren't) within that broader community. The fact that the Architecture meeting in Pittsburgh is a joint one seems like a good first step in this direction.

September 17, 2010

On the length and winding nature of roads

I attended, and spoke at, the ISKO Linked Data - the future of knowledge organization on the Web event in London earlier this week. My talk was intended to have a kind of "what 10 years of working with the Dublin Core community has taught me about the challenges facing Linked Data" theme but probably came across more like "all librarians are stupid and stuck in the past". Oh well... apologies if I offended anyone in the audience :-).

Here are my slides:

They will hopefully have the audio added to them in due course - in the meantime, a modicum of explanation is probably helpful.

My fundamental point was that if we see Linked Data as the future of knowledge organization on the Web (the theme of the day), then we have to see Linked Data as the future of the Web, and (at the risk of kicking off a heated debate) that means that we have to see RDF as the future of the Web. RDF has been on the go for a long time (more than 10 years), a fact that requires some analysis and explanation - it certainly doesn't strike me as having been successful over that period in the way that other things have been successful. I think that Linked Data proponents have to be able to explain why that is the case rather better than simply saying that there was too much emphasis on AI in the early days, which seemed to be the main reason provided during this particular event.

My other contention was that the experiences of the Dublin Core community might provide some hints at where some of the challenges lie. DC, historically, has had a rather librarian-centric make-up. It arose from a view that the Internet could be manually catalogued for example, in a similar way to that taken to catalogue books, and that those catalogue records would be shipped between software applications for the purposes of providing discovery services. The notion of the 'record' has thus been quite central to the DC community.

The metadata 'elements' (what we now call properties) used to make up those records were semantically quite broad - the DC community used to talk about '15 fuzzy buckets' for example. As an aside, in researching the slides for my talk I discovered that the term fuzzy bucket now refers to an item of headgear, meaning that the DC community could quite literally stick it's collective head in a fuzzy bucket and forget about the rest of the world :-). But I digress... these broad semantics (take a look at the definition of dcterms:coverage if you don't believe me) were seen as a feature, particularly in the early days of DC... but they become something of a problem when you try to transition those elements into well crafted semantic web vocabularies, with domains, ranges and the rest of it.

Couple that with an inherent preference for "strings" vs. "things", i.e. a reluctance to use URIs to identify the entities at the value end of a property relationship - indeed, couple it with a distinct scepticism about the use of 'http' URIs for anything other than locating Web pages - and a large dose of relatively 'flat' and/or fuzzy modelling and you have an environment which isn't exactly a breeding ground for semantic web fundamentalism.

When we worked on the original DCMI Abstract Model, part of the intention was to come up with something that made sense to the DC community in their terms, whilst still being basically the RDF model and, thus, compatible with the Semantic Web. In the end, we alienated both sides - librarians (and others) saying it was still too complex and the RDF-crowd bemused as to why we needed anything other than the RDF model.

Oh well :-(.

I should note that a couple of things have emerged from that work that are valuable I think. Firstly, the notion of the 'record', and the importance of the 'record' as a mechanism for understanding provenance. Or, in RDF terms, the notion of bounded graphs. And, secondly, the notion of applying constraints to such bounded graphs - something that the DC community refers to as Application Profiles.

On the basis of the above background, I argued that some of the challenges for Linked Data lie in convincing people:

  • about the value of an open world model - open not just in the sense that data may be found anywhere on the Web, but also in the sense that the Web democratises expertise in a 'here comes everybody' kind of way.
  • that 'http' URIs can serve as true identifiers, of anything (web resources, real-world objects and conceptual stuff).
  • and that modelling is both hard and important. Martin Hepp, who spoke about GoodRelations just before me (his was my favorite talk of the day), indicated that the model that underpins his work has taken 10 years or so to emerge. That doesn't surprise me. (One of the things I've been thinking about since giving the talk is the extent to which 'models build communities', rather than the other way round - but perhaps I should save that as the topic of a future post).

There are other challenges as well - overcoming the general scepticism around RDF for example - but these things are what specifically struck me from working with the DC community.

I ended my talk by reading a couple of paragraphs from Chris Gutteridge's excellent blog post from earlier this month, The Modeller, which seemed to go down very well.

As to the rest of the day... it was pretty good overall. Perhaps a tad too long - the panel session at the end (which took us up to about 7pm as far as I recall) could easily have been dropped.  Ian Mulvany of Mendeley has a nice write up of all the presentations so I won't say much more here. My main concern with events like this is that they struggle to draw a proper distinction between the value of stuff being 'open', the value of stuff being 'linked', and the value of stuff being exposed using RDF. The first two are obvious - the last less-so. Linked Data (for me) implies all three... yet the examples of applications that are typically shown during these kinds of events don't really show the value of the RDFness of the data. Don't get me wrong - they are usually very compelling examples in their own right but usually it's a case of 'this was built on Linked Data, therefore Linked Data is wonderful' without really making a proper case as to why.

September 06, 2010

If I was a Batman villain I'd probably be...

The Modeller.

OK... not a real Batman villain (I didn't realise there were so many to choose from) but one made up by Chris Gutteridge in a recent blog post of the same name. It's very funny:

I’ve invented a new Batman villain. His name is “The Modeller” and his scheme is to model Gotham city entirely accurately in a way that is of no practical value to anybody. He has an OWL which sits on his shoulder which has the power to absorb huge amounts of time and energy.

...

Over the 3 issues there’s a running subplot about The Modeller's master weapon, the FRBR, which everyone knows is very very powerful but when the citizens of Gotham talk about it none of them can quite agree on exactly what it does.

...

While unpopular with the fans, issue two, “Batman vs the Protégé“, will later be hailed as a Kafkaesque masterpiece. Batman descends further into madness as he realises that every moment he’s the Batman of that second in time, and each requires a URI, and every time he considers a plan of action, the theoretical Batmen in his imagination also require unique distinct identifiers which he must assign before continuing.

I suspect there's a little bit of The Modeller in most of us - certainly those of us who have a predisposition towards Linked Data/the Semantic Web/RDF - and as I said before, I tend to be a bit of a purest, which probably makes me worse than most. I've certainly done my time with the FRBR. The trick is to keep The Modeller's influence under control as far as possible.

July 29, 2010

legislation.gov.uk

I woke up this morning to find a very excited flurry of posts in my Twitter stream pointing to the launch by the UK National Archives of the legislation.gov.uk site, which provides access to all UK legislation, including revisions made over time. A post on the data.gov.uk blog provides some of the technical background and highlights the ways in which the data is made available in machine-processable forms. Full details are provided in the "Developer Zone" documents.

I don't for a second pretend to have absorbed all the detail of what is available, so I'll just highlight a couple of points.

First and foremost, this is being delivered with an eye firmly on the Linked Data principles. From the blog post I mentioned above:

For the web architecturally minded, there are three types of URI for legislation on legislation.gov.uk. These are identifier URIs, document URIs and representation URIs. Identifier URIs are of the form http://www.legislation.gov.uk/id/{type}/{year}/{number} and are used to denote the abstract concept of a piece of legislation - the notion of how it was, how it is and how it will be. These identifier URIs are designed to support the use of legislation as part of the web of Linked Data. Document URIs are for the document. Representation URIs are for the different types of possible rendition of the document, so htm, pdf or xml.

(Aside: I admit to a certain squeamishness about the notion of "representation URIs" and I kinda prefer to think in terms of URIs for Generic Documents and for Specific Documents, along the lines described by Tim Berners-Lee in his "Generic Resources" note, but that's a minor niggle of terminology on my part, and not at all a disagreement with the model.)

A second aspect I wanted to highlight (given some of my (now slightly distant) past interests) is that, on looking at the RDF data (e.g. http://www.legislation.gov.uk/ukpga/2010/24/contents/data.rdf), I noticed that it appears to make use of a FRBR-based model to deal with the challenge of representing the various flavours of "versioning" relationships.

I haven't had time to look in any detail at the implementation, other than to observe that the data can get quite complex - necessarily so - when dealing with a lot of whole-part and revision-of/variant-of/format-of relationships. (There was one aspect where I wondered if the FRBR concepts were being "stretched" somewhat, but I'm writing in haste and I may well be misreading/misinterpreting the data, so I'll save that question for another day.)

It's fascinating to see the FRBR approach being deployed as a practical solution to a concrete problem, outside of the library community in which it originated.

Pretty cool stuff, and congratulations to all involved in providing it. I look forward to seeing how the data is used.

July 08, 2010

Going LOCAH: a Linked Data project for JISC

Recently I worked with Adrian Stevenson of UKOLN and Jane Stevenson and Joy Palmer of MIMAS, University of Manchester on a bid for a project under the JISC O2/10 call, Deposit of research outputs and Exposing digital content for education and research, and I'm very pleased to be able to say that the proposal has been accepted and the project has been funded.

The project is called "Linked Open Copac Archives Hub" (LOCAH). It aims to address the "expose" section of the call, and focuses on making available data hosted by the Copac and Archives Hub services hosted by MIMAS - i.e. library catalogue data and data from archival finding aids - in the form of Linked Data; developing some prototype applications illustrating the use of that data; and analysing some of the issues arising from that work. The main partners in the work are UKOLN and MIMAS, with contributions from Eduserv, OCLC and Talis. The Eduserv contribution will take the form of some input from me, probably mostly in the area of working with Jane on modelling some of the archival finding aid data, currently held in the form of EAD-encoded XML documents, so that it can be represented in RDF - though I imagine I'll be sticking my oar in on various other aspects along the way.

UKOLN is managing the project and hosting a project weblog. I'm not sure at the moment how I'll divide up thoughts between here and there; I'll probably end up with a bit of duplication along the way.

May 05, 2010

RDFa for the Eduserv Web site

Another post that I've been intermittently chiselling away at in the draft pile for a while... A few weeks ago, I was asked by Lisa Price, our Website Communications Manager, to make some suggestions of how Eduserv might make use of the RDFa in XHTML syntax to embed structured data in pages on the Eduserv Web site, which is currently in the process of being redesigned. I admit this is coming mostly from the starting point of wanting to demonstrate the use of the technology rather than from a pressing use case, but OTOH there is a growing interest from RDFa amongst some of Eduserv's public sector clients so a spot of "eating our own dogfood" would be a Good Thing, and furthermore there are signs of a gradual but significant adoption of RDFa by some major Web service providers.

It seems to me Eduserv might use RDFa to describe, or make assertions about:

  • (Perhaps rather trivially) Web pages themselves i.e. reformulating the (fairly limited) "document metadata" we supply as RDFa.
  • (Perhaps rather more interestingly) some of the "things" that Eduserv pages "are about", or that get mentioned in those pages (e.g. persons, organisations, activities, events, topics of interest, etc).

Within that category of data about "things", we need to decide which data it is most useful to expose. We could:

  • look at those classes of data that are processed by tools/services that currently make use of RDFa (typically using specified RDF vocabularies); or
  • focus on data that we know already exists in a "structured" form but is currently presented in X/HTML either only in human-readable form or using microformats (or even new data which isn't currently surfaced at all on the current site)

Another consideration was the question of whether data was covered by existing models and vocabularies or required some analysis and modelling.

To be honest, there's a fairly limited amount of "structured" information on the site currently. There is some data on licence agreements for software and data, currently made available as HTML tables and Excel spreadsheets. While I think some of the more generic elements of this might be captured using a product/service ontology such as Good Relations, the license-specific aspects would require some additional modelling. For the short term at least, we've taken a somewhat "pragmatic" approach and focused mainly on that first class of data for which there are some identifiable consuming applications, based on the use of specified RDF vocabularies - and more specifically on data that Google and Yahoo make particular reference to in their documentation for creators/publishers of Web pages.

That's not to say there won't be more use of RDFa on the site in the future: at the moment, this is something of a "dipping toes in the water" exercise, I think.

The following is by best effort to summarize Google and Yahoo support for RDFa at the time of writing. Please note that this is something which is evolving - as I was writing up this post, I just noticed that the Google guidelines have changed slightly since I sent my initial notes to Lisa. And I'm still not at all sure I've captured the complete picture here, so please do check their current documentation for content providers to get an idea of the current state of play.

Google and RDFa

Google's support for RDFa is part of a larger programme of support for structured data embedded in X/HTML that they call "rich snippets" (announced here), which includes support for RDFa, microformats and microdata. (The latter, I think, is a relatively recent addition).

Google functionality extends to extracting specified categories of RDFa data in (some) pages it indexes, and displaying that in search result sets (and in place pages in Google Maps). It also provides access to the data in its Custom Search platform.

Initially at least, Google required the use of its own RDF vocabularies, which attracted some criticism (see e.g. Ian Davis' response), but it appears to have fairly quietly introduced some support for other RDF vocabularies. "In addition to the Person RDFa format, we have added support for the corresponding fields from the FOAF and vCard vocabularies for all those of you who asked for it." And Martin Hepp has pointed to Google displaying data encoded using the Good Relations product/service ontology.

The nature of the RDFa syntax is such that it is often fairly straightforward to use multiple RDF vocabularies in RDFa e.g. triples using the same subject and object but different predicates can be encoded using a single RDFa attribute with multiple white-space-separated CURIEs - though things do tend to get more messy if the vocabularies are based on different models (e.g. time periods as literals v time periods as resources with properties of their own).

Google provides specific recommendations to content creators on the embedding of data to describe:

Yahoo and RDFa

Yahoo's support for RDFa is through its SearchMonkey platform. Like Google, it provides a set of "standard" result set enhancements, based on the use of specified RDF vocabularies for a small set of resource types:

In addition, my understanding is that although Yahoo defines some RDF vocabularies of its own, and describes the use of specified vocabularies in the guidelines for the resource types above, it exposes any RDFa data in pages it indexes to developers on its SearchMonkey platform, to allow the building of custom search enhancements. Several existing vocabularies are discussed in the SearchMonkey guide and the FAQ in Appendix D of that document notes "You may use any RDF or OWL vocabulary".

Linked Data

The decentralised extensibility built into RDF means that a provider can choose to extend what data they expose beyond that specified in the guidelines mentioned above.

In addition, I tried to take into account some other general "good practice" points that have emerged from the work of the Linked Data community, captured in sources such as:

So in the Eduserv case, for example (I hope!) URIs will be assigned to "things" like events, distinct from the pages describing them, with suitable redirects put in place on the HTTP server and syitable triples in the data linking those things and the corresponding pages.

Summary

Anyway, on the basis of the above sources, I tried to construct some suggestions, taking into acccount both the Google and Yahoo guidelines, for descriptions of people, organisations and events, which I'll post here in the next few entries.

Postscript: Facebook

Even more recently, of course, has come the news of Facebook's announcement at the f8 conference of their Open Graph Protocol. This makes use of RDFa embedded in the headers of XHTML pages using meta elements to provide (pretty minimal) metadata "about" things described by those pages (films, songs, people, places, hotels, restaurants etc - see the Facebook page for a full (and I imagine, growing) list of resource types supported).

Facebook makes use of the data to drive its "Like" application: a "button" can be embedded in the page to allow a Facebook user to post the data to their Fb account to signal an "I like this" relationship with the thing described. Or as Dare Obasanjo expresses it, an Fb user can add a node for the thing to their Fb social graph, making it into a "social object". This results in the data being displayed at appropriate points in their Fb stream, while the button displays, as a minimum, a count of the "likers" of the resource on the source page itself; logged-in Fb users would, I think, see information about whether any of their "friends" had liked it.

My reporting of these details of the interface is somewhat "second-hand" as I no longer use Facebook - I deleted my account some time ago because I was concerned about their approaches to the privacy of personal information (see these three recent posts by Tony Hirst for some thoughts on the most recent round of changes in that sphere).

Perhaps unsurprisingly given the popularity of Fb and its huge user base, the OGP announcement seems to have attracted a very large amount of attention within a very short period of time, and it may turn out to be a significant milestone for the use of XHTML-embedded metadata in general and of RDFa in particular. The substantial "carrot" of supporting the Fb "Like" application and attracting traffic from Fb users is likely to be the primary driver for many providers to generate this data, and indeed some commentators (see e.g. this BBC article) have gone as far as to suggest that this represents a move by Facebook to challenge Google as the primary filter of resources for people searching and navigating the Web.

However, I also think it is important to distinguish between the data on the one hand and that particular Facebook app on the other. Having this data available, minimal as it may be, also opens up the possibility of other applications by other parties making use of that same data.

And this is true also, of course, for the case of data constructed following the Google and Yahoo guidelines.

The future of UK Dublin Core application profiles

I spent yesterday morning up at UKOLN (at the University of Bath) for a brief meeting about the future of JISC-funded Dublin Core application profile development in the UK.

I don't intend to report on the outcomes of the meeting here since it is not really my place to do so (I was just invited as an interested party and I assume that the outcomes of the meeting will be made public in due course). However, attending the meeting did make me think about some of the issues around the way application profiles have tended to be developed to date and these are perhaps worth sharing here.

By way of background, the JISC have been funding the development of a number of Dublin Core application profiles in areas such as scholarly works, images, time-based media, learning objects, GIS and research data over the last few years.  An application profile provides a model of some subset of the world of interest and an associated set of properties and controlled vocabularies that can be used to describe the entities in that model for the purposes of some application (or service) within a particular domain. The reference to Dublin Core implies conformance with the DCMI Abstract Model (which effectively just means use of the RDF model) and an inherent preference for the use of Dublin Core terms whenever possible.

The meeting was intended to help steer any future UK work in this area.

I think (note that this blog post is very much a personal view) that there are two key aspects of the DC application profile work to date that we need to think about.

Firstly, DC application profiles are often developed by a very small number of interested parties (sometimes just two or three people) and where engagement in the process by the wider community is quite hard to achieve. This isn't just a problem with the UK JISC-funded work on application profiles by the way. Almost all of the work undertaken within the DCMI community on application profiles suffers from the same problem - mailing lists and meetings with very little active engagement beyond a small core set of people.

Secondly, whilst the importance of enumerating the set of functional requirements that the application profile is intended to meet has not been underestimated, it is true to say that DC application profiles are often developed in the absence of an actual 'software application'. Again, this is also true of the application profile work being undertaken by the DCMI. What I mean here is that there is not a software developer actually trying to build something based on the application profile at the time it is being developed. This is somewhat odd (to say the least) given that they are called application profiles!

Taken together, these two issues mean that DC application profiles often take on a rather theoretical status - and an associated "wouldn't it be nice if" approach. The danger is a growth in the complexity of the application profile and a lack of any real business drivers for the work.

Speaking from the perspective of the Scholarly Works Application Profile (SWAP) (the only application profile for which I've been directly responsible), in which we adopted the use of FRBR, there was no question that we were working to a set of perceived functional requirements (e.g. "people need to be able to find the latest version of the current item"). However, we were not driven by the concrete needs of a software developer who was in the process of building something. We were in the situation where we could only assume that an application would be built at some point in the future (a UK repository search engine in our case). I think that the missing link to an actual application, with actual developers working on it, directly contributed to the lack of uptake of the resulting profile. There were other factors as well of course - the conceptual challenge of basing the work on FRBR and that fact that existing repository software was not RDF-ready for example - but I think that was the single biggest factor overall.

Oddly, I think JISC funding is somewhat to blame here because, in making funding available, JISC helps the community to side-step the part of the business decision-making that says, "what are the costs (in time and money) of developing, implementing and using this profile vs. the benefits (financial or otherwise) that result from its use?".

It is perhaps worth comparing current application profile work and other activities. Firstly, compare the progress of SWAP with the progress of the Common European Research Information Format (CERIF), about which the JISC recently reported:

EXRI-UK reviewed these approaches against higher education needs and recommended that CERIF should be the basis for the exchange of research information in the UK. CERIF is currently better able to encode the rich information required to communicate research information, and has the organisational backing of EuroCRIS, ensuring it is well-managed and sustainable.

I don't want to compare the merits of these two approaches at a technical level here. What is interesting however, is that if CERIF emerges as the mandated way in which research information is shared in the UK then there will be a significant financial driver to its adoption within systems in UK institutions. Research information drives a significant chunk of institutional funding which, in turn, drives compliance in various applications. If the UK research councils say, "thou shalt do CERIF", that is likely what institutions will do.  They'll have no real choice. SWAP has no such driver, financial or otherwise.

Secondly, compare the current development of Linked Data applications within the UK data.gov.uk initiative with the current application profile work. Current government policy in the UK effectively says, 'thou shalt do Linked Data' but isn't really any more prescriptive. It encourages people to expose their data as Linked Data and to develop useful applications based on that data. Ignoring any discussion about whether Linked Data is a good thing or not, what has resulted is largely ground-up. Individual developers are building stuff and, in the process, are effectively developing their own 'application profiles' (though they don't call them that) as part of exposing/using the Linked Data. This approach results in real activity. But it also brings with it the danger of redundancy, in that every application developer may model their Linked Data differently, inventing their own RDF properties and so on as they see fit.

As Paul Walk noted at the meeting yesterday, at some stage there will be a huge clean-up task to make any widespread sense of the UK government-related Linked Data that is out there. Well, yes... there will. Conversely, there will be no clean up necessary with SWAP because nobody will have implemented it.

Which situation is better!? :-)

I think the issue here is partly to do with setting the framework at the right level. In trying to specify a particular set of application profiles, the JISC is setting the framework very tightly - not just saying, "you must use RDF" or "you must use Dublin Core" but saying "you must use Dublin Core in this particular way". On the other hand, the UK government have left the field of play much more open. The danger with the DC application profile route is lack of progress. The danger with the government approach is too little consistency.

So, what are the lessons here? The first, I think, is that it is important to lobby for your prefered technical solution at a policy level as well as at a technical level. If you believe that a Linked Data-compliant Dublin Core application profile is the best technical way of sharing research information in the UK then it is no good just making that argument to software developers and librarians. Decisions made by the research councils (in this case) will be binding irrespective of technical merit and will likely trump any decisions made by people on the ground.

The second is that we have to understand the business drivers for the adoption, or not, of our technical solutions rather better than we do currently. Who makes the decisions? Who has the money? What motivates the different parties? Again, technically beautiful solutions won't get adopted if the costs of adoption are perceived to outweigh the benefits, or if the people who hold the purse strings don't see any value in spending their money in that particular way, or if people simply don't get it.

Finally, I think we need to be careful that centralised, top-down, initiatives (particularly those with associated funding) don't distort the environment to such an extent that the 'real' drivers, both financial and user-demand, can be ignored in the short term, leading to unsustainable situations in the longer term. The trick is to pump-prime those things that the natural drivers will support in the long term - not always an easy thing to pull off.

April 27, 2010

RDFa 1.1 drafts available from W3C

Last week, the W3C RDFa Working Group announced the availability of two new "First Public Working Drafts" which it is circulating for comment:

Ivan Herman, the W3C Semantic Web Activity Lead, and a co-editor of these documents, has provided a very helpful summary of their main features, and particularly of some of the differences they introduce whem compared with the current W3C Recommendation for RDFa RDFa in XHTML: Syntax and Processing: A collection of attributes and processing rules for extending XHTML to support RDF. I think the intent is that the new drafts maintain compatibility with the current recommendation, in the sense that all the features used in XHTML+RDFa 1.0 are also present in RDFa 1.1. I should reiterate what Ivan says at the start of his piece: these are drafts and features may change based on feedback received.

Some of the most interesting features in these drafts, at least for data creators, are those which enable a more concise/compact style of RDFa markup. One of the criticisms of the initial version of RDFa, particularly from communities unfamiliar with RDF syntaxes, was the dependency on the use of prefixed names, in the form of CURIEs, two-part names made up of a "prefix" and a "reference", mapped to URIs by associating the prefix with a "base" URI, and concatenating the reference part of the CURIE with that URI. In XHTML the prefix-URI association was made through an XML Namespace Declaration. In particular, arguments against this approach focused on problems of "copy-and-paste", where a document fragment including RDFa markup was extracted from a source document without also copying the in-scope XML Namespace declarations, and as a result the RDF interpretation of the fragment in the context of a different (paste target) document was changed. More generally, there were some concerns that the use of prefixes was difficult to explain and understand, at least when compared with the "unprefixed name" styles typically adopted in approaches like microformats.

The new drafts introduce several mechanisms which can simplify markup for authors.

I should emphasise that my examples below are based on my fairly rapid reading of the drafts, and any errors and misrepresentations are mine!

The @vocab attribute

@vocab is a new RDFa attribute which provides a means for defining a "default" "vocabulary URI" to which the "terms" in attribute values are appended to construct a URI.

Aside: I should note here that I'm using the word "term" in the sense it is used in the RDFa 1.1 draft, where it refers to a datatype for a string used in an attribute value; this differs from usage by e.g. DCMI where "term" typically refers to a property, class, vocabulary encoding scheme or syntax encoding scheme i.e. to the "conceptual resource" identified by a DCMI URI, rather than to a syntactic component. In RDFa 1.1, "terms" have the syntactic constraints of the NCName production in the XML Namespaces specification.

This mechanism provides an alternative to the use of a CURIE (with prefix, reference and namespace declaration) to represent a URI.

Consider an example based on those from my recent post about RDFa (1.0) and document metadata (This is a "hybrid" of the examples 1.1.5, 1.3.5, and 2.1.5 in that post):

XHTML+RDFa 1.0:

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN"
    "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/terms/"
      xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
      version="XHTML+RDFa 1.0">
      
  <head>
    <title>My World Cup 2010 Review</title>
  </head>

  <body>
    <h1 property="dc:title">My World Cup 2010 Review</h1>

    <p>About: 
      <a rel="dc:subject"
        href="http://example.org/resource/2010_FIFA_World_Cup">
        The 2010 World Cup
      </a>
    </p>

    <p>Date last modified: 
      <span property="dc:modified"
        datatype="xsd:date">2010-07-04</span>
    </p>

  </body>
  
</html>

This represents the following three triples (in Turtle):

@prefix dc: <http://purl.org/dc/terms/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex: <http://example.org/resource/> .

<> dc:title "My World Cup 2010 Review" .
<> dc:subject ex:2010_FIFA_World_Cup .
<> dc:modified "2010-07-04"^^xsd:date .

XHTML+RDFa 1.1 using @vocab:

Using the @vocab attribute on the body element to set http://purl.org/dc/terms/ as the default vocabulary URI, I could write this as:

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.1//EN"
    "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-2.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
      version="XHTML+RDFa 1.1">
      
  <head>
    <title>My World Cup 2010 Review</title>
  </head>

  <body vocab="http://purl.org/dc/terms/">
    <h1 property="title">My World Cup 2010 Review</h1>

    <p>About: 
      <a rel="subject"
        href="http://example.org/resource/2010_FIFA_World_Cup">
        The 2010 World Cup
      </a>
    </p>

    <p>Date last modified: 
      <span property="modified"
        datatype="xsd:date">2010-07-04</span>
    </p>

  </body>
  
</html>

In that case, where just three properties are referenced, the reduction in the number of characters is minimal, but if several properties from the same vocabulary were referenced, then the saving could be more substantial.

The @vocab approach provides limited help where, as is often the case, terms from multiple RDF vocabularies are used in combination (e.g. the example above continues to use a CURIE for the URI of the XML Schema date datatype), but other features of RDFa 1.1 are useful in those cases.

RDFa Profiles and the @profile attribute

Perhaps more powerful than the @vocab attribute is the new RDFa 1.1 feature known as the RDFa profile, and the @profile attribute:

RDFa Profiles are optional external documents that define collections of terms and/or prefix mappings. These documents must be defined in an approved RDFa Host Language (currently XHTML+RDFa [XHTML-RDFA]). They may also be defined in other RDF serializations as well (e.g., RDF/XML [RDF-SYNTAX-GRAMMAR] or Turtle [TURTLE]). RDFa Profiles are referenced via @profile, and can be used by document authors to simplify the task of adding semantic markup.

Let's take each of these two functions - defining terms and defining prefix mappings - in turn.

Defining term mappings in an RDFa profile

An RDFa profile can provide mappings between "terms" and URIs. The following example provides four such "term mappings", for the URIs of three properties from the DC Terms RDF vocabulary and for the URI of one XML Schema datatype:

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.1//EN"
    "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-2.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:rdfa="http://www.w3.org/ns/rdfa#"
      version="XHTML+RDFa 1.1">
      
  <head>
    <title>My RDFa Profile for a few DC and XSD terms</title>
  </head>

  <body>

    <h1>My RDFa Profile for a few DC and XSD terms</h1>

    <ul>

      <li typeof="rdfa:TermMapping">
        <span property="rdfa:term">title</span> :
        <span property="rdfa:uri">http://purl.org/dc/terms/title</span>
      </li>

      <li typeof="rdfa:TermMapping">
        <span property="rdfa:term">about</span> :
        <span property="rdfa:uri">http://purl.org/dc/terms/subject</span>
      </li>

      <li typeof="rdfa:TermMapping">
        <span property="rdfa:term">modified</span> :
        <span property="rdfa:uri">http://purl.org/dc/terms/modified</span>
      </li>

      <li typeof="rdfa:TermMapping">
        <span property="rdfa:term">xsddate</span> :
        <span property="rdfa:uri">http://www.w3.org/2001/XMLSchema#date</span>
      </li>

    </ul>

  </body>
  
</html>

Note that - in contrast to the case of CURIE references - the content of the "term" doesn't have to match the trailing characters of the URI; so for example, here I've mapped the term "about" to the URI http://purl.org/dc/terms/subject. So sets of "terms" corresponding to various community-specific or domain=specific lexicons could be mapped to a single set of URIs.

Also a single RDFa profile might provide mappings for URIs from different URI owners - the example above reference three DCMI-owned URIs for properties and a W3C-owned URI for a datatype. Conversely, different subsets of URIs owned by a single agency may be referenced in different RDFa profiles.

If the URI of this RDFa profile is http://example.org/profile/terms/, then I can reference it in an XHTML+RDFa 1.1 document, and make use of the term mappings it defines. So taking the example above again, and now using @profile to reference the profile and its term mappings:

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.1//EN"
    "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-2.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      version="XHTML+RDFa 1.1">
      
  <head>
    <title>My World Cup 2010 Review</title>
  </head>

  <body profile="http://example.org/profile/terms/">
    <h1 property="title">My World Cup 2010 Review</h1>

    <p>About: 
      <a rel="about"
        href="http://example.org/resource/2010_FIFA_World_Cup">
        The 2010 World Cup
      </a>
    </p>

    <p>Date last modified: 
      <span property="modified"
        datatype="xsddate">2010-07-04</span>
    </p>

  </body>
  
</html>

The @profile attribute may appear on any XML element, so it is possible that an element with a @profile attribute referencing profile A may contain as a child element with a @profile attribute referencing profile B.

  <body profile="http://example.org/profile/a/">
    <h1 property="title">My World Cup 2010 Review</h1>

    <div profile="http://example.org/profile/b/">
 
      <p>About: 
        <a rel="about"
          href="http://example.org/resource/2010_FIFA_World_Cup">
          The 2010 World Cup
        </a>
      </p>

    </div>

  </body>
  

And the value of a single @profile attribute may be a whitespace-separated list of URIs.

  <body profile="http://example.org/profile/a/
    http://example.org/profile/b/">

  </body>

One of the questions I'm not quite sure about is what happens if the same "term" is mapped to different URIs in different profiles. I think, but I'm not 100% sure, only a single mapping is used and a single triple is generated, but I'm not sure about the precedence rules for determining which mapping is to be used.

As Ivan notes, probably the most common pattern for deploying RDFa profiles will be for the owners/publishers of RDF vocabularies (such as DCMI) to publish profiles for their vocabularies, and for data providers to simply reference those profiles, rather than creating their own.

Defining prefix mappings in an RDFa profile

RDFa 1.1 continues to support the use of XML Namespace Declarations to associate CURIE prefixes with URIs (see my first example above and the use of the XML Schema datatype) but it also introduces other mechanisms for achieving this. One of these is the ability to supply CURIE prefix to URI mappings in RDFa profiles.

The following example provides four such "prefix mappings", for the URIs of three DCMI vocabularies and for the URI of the XML Schema datatype vocabulary:

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.1//EN"
    "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-2.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:rdfa="http://www.w3.org/ns/rdfa#"
      version="XHTML+RDFa 1.1">
      
  <head>
    <title>My RDFa Profile for DC and XSD prefixes</title>
  </head>

  <body>

    <h1>My RDFa Profile for DC and XSD prefixes</h1>

    <ul>

      <li typeof="rdfa:PrefixMapping">
        <span property="rdfa:prefix">dc</span> :
        <span property="rdfa:uri">http://purl.org/dc/terms/</span>
      </li>

      <li typeof="rdfa:PrefixMapping">
        <span property="rdfa:prefix">dcam</span> :
        <span property="rdfa:uri">http://purl.org/dc/dcam/</span>
      </li>

      <li typeof="rdfa:PrefixMapping">
        <span property="rdfa:prefix">dcmitype</span> :
        <span property="rdfa:uri">http://purl.org/dc/dcmitype/</span>
      </li>

      <li typeof="rdfa:PrefixMapping">
        <span property="rdfa:prefix">xsd</span> :
        <span property="rdfa:uri">http://www.w3.org/2001/XMLSchema#</span>
      </li>

    </ul>

  </body>
  
</html>

If the URI of this RDFa profile is http://example.org/profile/prefixes/, then I can reference it in an XHTML+RDFa 1.1 document, and make use of the prefix mappings it defines. Taking the example above again, and using @profile to reference this second profile and its prefix mappings:

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.1//EN"
    "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-2.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      version="XHTML+RDFa 1.1">
      
  <head>
    <title>My World Cup 2010 Review</title>
  </head>

  <body profile="http://example.org/profile/prefixes/">
    <h1 property="dc:title">My World Cup 2010 Review</h1>

    <p>About: 
      <a rel="dc:subject"
        href="http://example.org/resource/2010_FIFA_World_Cup">
        The 2010 World Cup
      </a>
    </p>

    <p>Date last modified: 
      <span property="dc:modified"
        datatype="xsd:date">2010-07-04</span>
    </p>

  </body>
  
</html>

As in the case of term mappings, the issue arises of what happens in the case that two profiles provide different prefix-URI mappings for the same prefix. I think the CURIE datatype is based on the notion that at a point in a document, for prefix p, a single prefix-URI mapping is in force for that prefix, so I assume there are precedence rules for establishing which of the profile prefix mappings is to be applied.

Access to profiles and changes to triples?

Although the RDFa 1.1 profile mechanism is a powerful mechanism, it also introduces a new element of complexity for consumers of RDFa. In RDFa 1.0, an XHTML+RDFa document is "self-contained", by which I mean an RDFa processor can construct an interpretation of the document as a set of RDF triples using only the content of the document itself. In RDFa 1.1, however, the interpretation of terms and prefixes may be determined by the term mappings and prefix mappings specified in profiles external to the document containing the RDFa markup.

Consider my last example above. When the processor encounters the @profile attribute it retrieves the profile and obtains a list of prefix-URI mappings to be applied in subsequent processing, and when it encounters the CURIE "dc:title" it generates the URI http://purl.org/dc/terms/title

But if for some reason, the processor is unable to dereference the URI, and doesn't have a cached copy of the referenced profile, then it does not have those mappings available. In that case, for my example above, when the processor encounters the CURIE "dc:title" it would not have a mapping for the "dc" prefix, and (I think?) would instead (with the new "URI everywhere" rules in force) treat the string "dc:title" as a URI? (See e.g. the section on CURIE and URI Processing)

In the case where two profiles are referenced, and both provide a mapping for the same prefix, then it seems possible that the prefix mapping in force might change depending on the availability of access to the profiles.

I lurk on the RDFa WG list, and I've seen various discussions of how these sort of issues should be handled - see, for example, this thread on "What happens when you can't dereference a profile document?", though related issues surface in other discussions too. I suspect the current draft is far from the "last word" in this area, and these are the sort of issues on which the authors are seeking feedback.

Summary

I've focused here only on a few "highlights" of the RDFa 1.1 drafts, and Ivan's post covers a couple more which I won't discuss here (the use of the @prefix attribute to provide CURIE prefix mappings and the ability to use URIs in contexts where previously CURIEs were required), but I hope they give a flavour of the sort of functionality which is being introduced. The examples here are based on my understanding of the current drafts, but I may have made mistakes, so please do check out the drafts rather than relying on my interpretations.

It seems to me the WG is trying hard to address some of the criticisms made of RDFa 1.0, and to provide mechanisms that make the provision of RDFa markup simpler while retaining the power and flexibility of the syntax and ensuring that RDFa 1.0 data remains compatible. In particular, it seems to me the "term mapping" feature of RDFa profiles may be very useful in "shielding" data providers from some of the complexity of name-URI mappings and prefixed names, especially once the owners of commonly used RDF vocabularies start to make such profiles available.

However, such flexibility doesn't come without its own challenges. and it also seems that the profile mechanism in particular introduces some complexity which I imagine will become a focus of some discussion during the comment period for these drafts. Comments on the drafts themselves should be sent to the RDFa Working Group list.

April 22, 2010

Document metadata using DC-HTML and using RDFa

In the context of various bits and pieces of work recently (more of which I'll write about in some upcoming posts), I've been finding myself describing how document metadata that can be represented using DCMI's DC-HTML meta data profile, described in Expressing Dublin Core metadata using HTML/XHTML meta and link elements, might also be represented using RDFa. (N.B. Here I'm considering only the current RDFa in XHTML W3C Recommendation, not the newly announced drafts for RDFa 1.1). So I thought I'd quickly list some examples here. Please note: I don't intend this to be a complete tutorial on using RDFa. Far from it; here I focus only on the case of "document metadata" whereas of course RDFa can be used to represent data "about" any resources. And these are really little more than a few rough notes which one day I might reuse somewhere else.

I really just wanted to illustrate that:

  • in terms of its use with the XHTML meta and link elements, RDFa has many similarities to the DC-HTML profile - unsurprisingly, as the RDF model underlies both; and
  • RDFa also provides the power and flexibility to represent data that can not be expressed using the DC-HTML profile.

The main differences between using RDFa in XHTML and using the DC-HTML profile are:

  • RDFa supports the full RDF model, not just the particular subset supported by DC-HTML
  • RDFa introduces some new XML attributes (@about, @property, @resource, @datatype, @typeof)
  • RDFa uses a datatype called CURIE for the abbreviation of URIs; DC-HTML uses a prefixed name convention which is essentially specific to that profile (though it was also adopted by the Embedded RDF profile)
  • Perhaps most significantly, RDFa can be used anywhere in an XHTML document, so the same syntactic conventions can be used both for document metadata and for data ("about" any resources) embedded in the body of the document

I'm presenting these examples following the description set model of the DCMI Abstract Model, and in more or less the same order that the DC-HTML specification presents the same set of concepts.

For each example, I present the data:

  • using DC-Text
  • using Turtle
  • in XHTML using DC-HTML
  • in XHTML+RDFa, using meta and link elements
  • in XHTML+RDFa, using block and inline elements (to illustrate that the same data could be embedded in the body of an XHTML document, rather than only in the head)

As an aside, it is possible to use the DC-HTML profile alongside RDFa in the same document, but I haven't bothered to show that here.

Footnote: Hmmm. Considering that I said to myself at the start of the year that I was rather tired of thinking/writing about syntax, I still seem to be doing an awful lot of it! Will try to write about other things soon....

1. Literal Value Surrogates

See DC-HTML 4.5.1.2.

1.1 Plain Value String

1.1.1 DC-Text:

@prefix dc: <http://purl.org/dc/terms/> .
DescriptionSet (
  Description (
    Statement (
      PropertyURI ( dc:title )
      LiteralValueString ( "My World Cup 2010 Review" )
    )
  )
)

1.1.2 Turtle:

@prefix dc: <http://purl.org/dc/terms/> .
<> dc:title "My World Cup 2010 Review" .

1.1.3 XHTML using DC-HTML:

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">

  <head 
    profile="http://dublincore.org/documents/2008/08/04/dc-html/">
    <title>My World Cup 2010 Review</title>
    <link rel="schema.DC" href="http://purl.org/dc/terms/" />
    <meta name="DC.title" content="My World Cup 2010 Review" />
  </head>

</html>

1.1.4 XHTML+RDFa using meta and link

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN"
    "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/terms/"
      version="XHTML+RDFa 1.0">
      
  <head>
    <title>My World Cup 2010 Review</title>
    <meta property="dc:title" content="My World Cup 2010 Review" />
  </head>

</html>

In this example, it would also be possible to simply add an attribute to the title element, instead of introducing the meta element:

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN"
    "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/terms/"
      version="XHTML+RDFa 1.0">
      
  <head>
    <title property="dc:title">My World Cup 2010 Review</title>
  </head>

</html>

1.1.5 XHTML+RDFa in body

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN"
    "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/terms/"
      version="XHTML+RDFa 1.0">
      
  <head>
    <title>My World Cup 2010 Review</title>
  </head>

  <body>
    <h1 property="dc:title">My World Cup 2010 Review</h1>
  </body>
  
</html>

1.2 Plain Value String with Language Tag

1.2.1 DC-Text:

@prefix dc: <http://purl.org/dc/terms/> .
DescriptionSet (
  Description (
    Statement (
      PropertyURI ( dc:title )
      LiteralValueString ( "My World Cup 2010 Review" 
        Language ( en )
      )
    )
  )
)

1.2.2 Turtle:

@prefix dc: <http://purl.org/dc/terms/> .
<> dc:title "My World Cup 2010 Review"@en .

1.2.3 XHTML using DC-HTML:

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">

  <head 
    profile="http://dublincore.org/documents/2008/08/04/dc-html/">
    <title>My World Cup 2010 Review</title>
    <link rel="schema.DC" href="http://purl.org/dc/terms/" />
    <meta name="DC.title" 
      xml:lang="en" content="My World Cup 2010 Review" />
  </head>

</html>

1.2.4 XHTML+RDFa using meta and link

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN"
    "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/terms/"
      version="XHTML+RDFa 1.0">
      
  <head>
    <title>My World Cup 2010 Review</title>
    <meta property="dc:title"
      xml:lang="en" content="My World Cup 2010 Review" />
  </head>

</html>

1.2.5 XHTML+RDFa in body

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN"
    "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/terms/"
      version="XHTML+RDFa 1.0">
      
  <head>
    <title>My World Cup 2010 Review</title>
  </head>

  <body>
    <h1 property="dc:title" xml:lang="en">My World Cup 2010 Review</h1>
  </body>
  
</html>

1.3 Typed Value String

1.3.1 DC-Text:

@prefix dc: <http://purl.org/dc/terms/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
DescriptionSet (
  Description (
    Statement (
      PropertyURI ( dc:modified )
      LiteralValueString ( "2010-07-04"
        SyntaxEncodingSchemeURI ( xsd:date )
      )
    )
  )
)

1.3.2 Turtle:

@prefix dc: <http://purl.org/dc/terms/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
<> dc:modified "2010-07-04"^^xsd:date .

1.3.3 XHTML using DC-HTML:

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">

  <head 
    profile="http://dublincore.org/documents/2008/08/04/dc-html/">
    <title>My World Cup 2010 Review</title>
    <link rel="schema.DC" href="http://purl.org/dc/terms/" />
    <link rel="schema.XSD" href="http://www.w3.org/2001/XMLSchema#" >
    <meta name="DC.modified" 
      scheme="XSD.date" content="2010-07-04" />
  </head>

</html>

1.3.4 XHTML+RDFa using meta and link

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN"
    "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/terms/"
      xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
      version="XHTML+RDFa 1.0">
      
  <head>
    <title>My World Cup 2010 Review</title>
    <meta property="dc:modified" 
      datatype="xsd:date" content="2010-07-04" />
  </head>

</html>

1.3.5 XHTML+RDFa in body

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN"
    "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/terms/"
      xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
      version="XHTML+RDFa 1.0">
      
  <head>
    <title>My World Cup 2010 Review</title>
  </head>

  <body>
    <p>Date last modified: 
      <span property="dc:modified"
        datatype="xsd:date">2010-07-04</span>
    </p>
  </body>
  
</html>

2. Non-Literal Value Surrogates

See DC-HTML 4.5.2.2.

2.1 Value URI

2.1.1 DC-Text:

@prefix dc: <http://purl.org/dc/terms/> .
@prefix ex: <http://example.org/resource/> .
DescriptionSet (
  Description (
    Statement (
      PropertyURI ( dc:subject )
      ValueURI ( ex:2010_FIFA_World_Cup )
      )
    )
  )
)

2.1.2 Turtle:

@prefix dc: <http://purl.org/dc/terms/> .
@prefix ex: <http://example.org/resource/> .
<> dc:subject ex:2010_FIFA_World_Cup .

2.1.3 XHTML using DC-HTML:

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">

  <head 
    profile="http://dublincore.org/documents/2008/08/04/dc-html/">
    <title>My World Cup 2010 Review</title>
    <link rel="schema.DC" href="http://purl.org/dc/terms/" />
    <link rel="DC.subject"
      href="http://example.org/resource/2010_FIFA_World_Cup" />
  </head>

</html>

2.1.4 XHTML+RDFa using meta and link

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN"
    "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/terms/"
      version="XHTML+RDFa 1.0">
      
  <head>
    <title>My World Cup 2010 Review</title>
    <link rel="dc:subject"
      href="http://example.org/resource/2010_FIFA_World_Cup" />
  </head>

</html>

2.1.5 XHTML+RDFa in body

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN"
    "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/terms/"
      version="XHTML+RDFa 1.0">
      
  <head>
    <title>My World Cup 2010 Review</title>
  </head>

  <body>
    <p>About: 
      <a rel="dc:subject"
        href="http://example.org/resource/2010_FIFA_World_Cup">
        The 2010 World Cup
      </a>
    </p>
  </body>
  
</html>

2.2 Value URI with Plain Value String

2.2.1 DC-Text:

@prefix dc: <http://purl.org/dc/terms/> .
@prefix ex: <http://example.org/resource/> .
DescriptionSet (
  Description (
    Statement (
      PropertyURI ( dc:subject )
      ValueURI ( ex:2010_FIFA_World_Cup )
      ValueString ( "2010 FIFA World Cup" )
      )
    )
  )
)

2.2.2 Turtle:

@prefix dc: <http://purl.org/dc/terms/> .
@prefix ex: <http://example.org/resource/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
<> dc:subject ex:2010_FIFA_World_Cup .
ex:2010_FIFA_World_Cup rdf:value "2010 FIFA World Cup" .

2.2.3 XHTML using DC-HTML:

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">

  <head 
    profile="http://dublincore.org/documents/2008/08/04/dc-html/">
    <title>My World Cup 2010 Review</title>
    <link rel="schema.DC" href="http://purl.org/dc/terms/" />
    <link rel="DC.subject"
      href="http://example.org/resource/2010_FIFA_World_Cup"
      title="2010 FIFA World Cup" />
  </head>

</html>

2.2.4 XHTML+RDFa using meta and link

Here the single DCAM statement is made up of two RDF triples, and in RDFa both a link and a meta element are used:

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN"
    "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/terms/"
      xmlns:ex="http://example.org/resource/"
      xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
      version="XHTML+RDFa 1.0">
      
  <head>
    <title>My World Cup 2010 Review</title>
    <link rel="dc:subject"
      href="http://example.org/resource/2010_FIFA_World_Cup" />
    <meta about="[ex:2010_FIFA_World_Cup]"
      property="rdf:value" content="2010 FIFA World Cup" />
  </head>

</html>

2.2.5 XHTML+RDFa in body

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN"
    "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/terms/"
      xmlns:ex="http://example.org/resource/"
      xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
      version="XHTML+RDFa 1.0">
      
  <head>
    <title>My World Cup 2010 Review</title>
  </head>

  <body>
    <p>About: 
      <a rel="dc:subject"
        href="http://example.org/resource/2010_FIFA_World_Cup">
        <span property="rdf:value">2010 FIFA World Cup</span>
      </a>
    </p>
  </body>
  
</html>

2.3 Value URI with Plain Value String with Language Tag

2.3.1 DC-Text:

@prefix dc: <http://purl.org/dc/terms/> .
@prefix ex: <http://example.org/resource/> .
DescriptionSet (
  Description (
    Statement (
      PropertyURI ( dc:subject )
      ValueURI ( ex:2010_FIFA_World_Cup )
      ValueString ( "2010 FIFA World Cup" 
        Language ( en )
        )
      )
    )
  )
)

2.3.2 Turtle:

@prefix dc: <http://purl.org/dc/terms/> .
@prefix ex: <http://example.org/resource/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
<> dc:subject ex:2010_FIFA_World_Cup .
ex:2010_FIFA_World_Cup rdf:value "2010 FIFA World Cup"@en .

2.3.3 XHTML using DC-HTML:

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">

  <head 
    profile="http://dublincore.org/documents/2008/08/04/dc-html/">
    <title>My World Cup 2010 Review</title>
    <link rel="schema.DC" href="http://purl.org/dc/terms/" />
    <link rel="DC.subject"
      href="http://example.org/resource/2010_FIFA_World_Cup"
      xml:lang="en" title="2010 FIFA World Cup" />
  </head>

  </html>

2.3.4 XHTML+RDFa using meta and link

Again, the single DCAM statement is made up of two RDF triples, and in RDFa both a link and a meta element are used:

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN"
    "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/terms/"
      xmlns:ex="http://example.org/resource/"
      xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
      version="XHTML+RDFa 1.0">
      
  <head>
    <title>My World Cup 2010 Review</title>
    <link rel="dc:subject"
      href="http://example.org/resource/2010_FIFA_World_Cup" />
    <meta about="[ex:2010_FIFA_World_Cup]"
      property="rdf:value" 
      xml:lang="en" content="2010 FIFA World Cup" />
  </head>

</html>

With RDFa, multiple value strings might be provided, using multiple meta elements (which is not supported in DC-HTML):

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN"
    "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/terms/"
      xmlns:ex="http://example.org/resource/"
      xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
      version="XHTML+RDFa 1.0">
      
  <head>
    <title>My World Cup 2010 Review</title>
    <link rel="dc:subject"
      href="http://example.org/resource/2010_FIFA_World_Cup" />
    <meta about="[ex:2010_FIFA_World_Cup]"
      property="rdf:value" 
      xml:lang="en" content="2010 FIFA World Cup" />
    <meta about="[ex:2010_FIFA_World_Cup]"
      property="rdf:value" 
      xml:lang="es" content="Copa Mundial de Fútbol de 2010" />
  </head>

</html>

2.3.5 XHTML+RDFa in body

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN"
    "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/terms/"
      xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
      version="XHTML+RDFa 1.0">
      
  <head>
    <title>My World Cup 2010 Review</title>
  </head>

  <body>
    <p>About: 
      <a rel="dc:subject" 
        href="http://example.org/resource/2010_FIFA_World_Cup">
        <span property="rdf:value" 
	  xml:lang="en">2010 FIFA World Cup</span>
      </a>
    </p>
  </body>
  
</html>

2.4 Value URI with Typed Value String

2.4.1 DC-Text:

@prefix dc: <http://purl.org/dc/terms/> .
@prefix ex: <http://example.org/resource/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
DescriptionSet (
  Description (
    Statement (
      PropertyURI ( dc:language )
      ValueURI ( ex:English )
      ValueString ( "en"
        SyntaxEncodingSchemeURI ( xsd:language )
      )
    )
  )
)

2.4.2 Turtle:

@prefix dc: <http://purl.org/dc/terms/> .
@prefix ex: <http://example.org/resource/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
<> dc:language ex:English .
ex:English rdf:value "en"^^xsd:language .

2.4.3 XHTML using DC-HTML:

Not supported by DC-HTML.

2.4.4 XHTML+RDFa using meta and link

Again, the single DCAM statement is made up of two RDF triples, and in RDFa both a link and a meta element are used:

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN"
    "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/terms/"
      xmlns:ex="http://example.org/resource/"
      xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
      xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
      version="XHTML+RDFa 1.0">
      
  <head>
    <title>My World Cup 2010 Review</title>
    <link rel="dc:language"
      href="http://example.org/resource/English" />
    <meta about="[ex:English]"
      property="rdf:value" datatype="xsd:language" content="en" />
  </head>

</html>

2.4.5 XHTML+RDFa in body

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN"
    "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/terms/"
      xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
      xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
      version="XHTML+RDFa 1.0">
      
  <head>
    <title>My World Cup 2010 Review</title>
  </head>

  <body>
    <p>Language:
      <a rel="dc:language"
        href="http://example.org/resource/English">
        <span property="rdf:value" 
          datatype="xsd:language" content="en">English</span>
      </a>
    </p>
  </body>

</html>

2.5 Value URI with Vocabulary Encoding Scheme URI

2.5.1 DC-Text:

@prefix dc: <http://purl.org/dc/terms/> .
@prefix ex: <http://example.org/resource/> .
DescriptionSet (
  Description (
    Statement (
      PropertyURI ( dc:subject )
      ValueURI ( ex:2010_FIFA_World_Cup )
      VocabularyEncodingSchemeURI ( ex:MyScheme )
      )
    )
  )
)

2.5.2 Turtle:

@prefix dc: <http://purl.org/dc/terms/> .
@prefix dcam: <http://purl.org/dc/dcam/> .
@prefix ex: <http://example.org/resource/> .
<> dc:subject ex:2010_FIFA_World_Cup .
ex:2010_FIFA_World_Cup dcam:memberOf ex:MyScheme .

2.5.3 XHTML using DC-HTML:

Not supported by DC-HTML.

2.5.4 XHTML+RDFa using meta and link

Again, the single DCAM statement is made up of two RDF triples, and in XHTML using RDFa two link elements are used:

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN"
    "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/terms/"
      xmlns:dcam="http://purl.org/dc/dcam/"
      xmlns:ex="http://example.org/resource/"
      version="XHTML+RDFa 1.0">
      
  <head>
    <title>My World Cup 2010 Review</title>
    <link rel="dc:subject"
      href="http://example.org/resource/2010_FIFA_World_Cup" />
    <link about="[ex:2010_FIFA_World_Cup]"
      rel="dcam:memberOf"
      href="http://example.org/resource/MyScheme" />
  </head>

</html>

2.5.5 XHTML+RDFa in body

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN"
    "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/terms/"
      xmlns:dcam="http://purl.org/dc/dcam/"
      version="XHTML+RDFa 1.0">
      
  <head>
    <title>My World Cup 2010 Review</title>
  </head>

  <body>
    <p>About: 
      <a rel="dc:subject"
        href="http://example.org/resource/2010_FIFA_World_Cup">
	<span rel="dcam:memberOf"
	  resource="http://example.org/resource/MyScheme" />
	The 2010 World Cup
      </a>
    </p>
  </body>
  
</html>

April 13, 2010

A small GRDDL (XML, really) gotcha

I've written previously here about DCMI's use of an HTML meta data profile for document metadata, and the use of a GRDDL profile transformation to extract RDF triples from an XHTML document. DCMI has had made use of an HTML profile for many years, but providing a "GRDDL-enabled" version is a more recent development - and it is one which I admit I was quietly quite pleased to see put in place, as I felt it illustrated rather neatly how DCMI was trying to implement some of the "follow your nose" principles of Web Architecture.

A little while ago, I noticed that the Web-based tools which I usually use to test GRDDL processing (the W3C GRDDL service and the librdf parser demonstrator) were generating errors when I tried to process documents which reference the profile. I've posted a more detailed account of my investigations to the dc-architecture Jiscmail list, and I won't repeat them all here, but in short it comes down to the use of the entity references (&nbsp; and &copy;) in the profile document, which itself is subject to a GRDDL transformation to extract the pointer to the profile transformation.

The problem arises because XHTML defines those entity references in the XHTML DTD, i.e. externally to the document itself, and a non-validating XML processor is not required to read that DTD when parsing the document, with the consequence that it fails to resolve the references - and there's no guarantee that a GRDDL processor will employ a validating parser. There's a more extended discussion of these issues in a post by Lachlan Hunt from 2005 which concludes:

Character entity references can be used in HTML and in XML; but for XML, other than the 5 predefined entities, need to be defined in a DTD (such as with XHTML and MathML). The 5 predefined entities in XML are: &amp;, &lt;, &gt;, &quot; and &apos;. Of these, you should note that &apos; is not defined in HTML. The use of other entities in XML requires a validating parser, which makes them inherently unsafe for use on the web. It is recommended that you stick with the 5 predefined entity references and numeric character references, or use a Unicode encoding.

And the GRDDL specification itself cautions :

Document authors, particularly XHTML document authors, who wish their documents to be unambiguous when used with GRDDL should avoid dependencies on an external DTD subset; specifically:

  • Explicitly include the XHTML namespace declaration in an XHTML document, or an appropriate namespace in an XML document.
  • Avoid use of entity references, except those listed in section 4.6 of the XML specification.
  • And, more generally, follow the rules listed for the standalone document validity constraint.

A note will be added to the DC-HTML profile document to emphasise this point (and the offending references removed).

I guess I was surprised that no-one else had reported the error, particularly as it potentially affects the processing of all instance documents. The fact that they hadn't does rather lends weight to the suspicion that I voiced here a few weeks ago that it may well be that few implementers are actually making use of the DC-HTML GRDDL profile transformation.

March 23, 2010

JISC Linked Data call possibilities

Hmmm... I should have written this post about two weeks ago but my blog-writing ain't what it used to be :-).

On that basis, this is just a quick note to say that we (Eduserv) are interested in strand B (the Linked Data strand) of the JISC Grant Funding Call 2/10: Deposit of research outputs and Exposing digital content for education and research.

What can we offer? A pretty decent level of expertise in Linked Data, RDF, RDFa and the Dublin Core and and some experience of modelling data. This comes in the form of your favorite eFoundations contributors, Pete Johnston and Andy Powell. In this instance we are not offering either hosting space or software development.

I appreciate that this comes late in the day and that people's planning will already be quite advanced. But we are keen to get involved in this programme somewhere so if you have a vacancy for the kind of expertise above, please get in touch.

February 26, 2010

The 2nd Linked Data London Meetup & trying to bridge a gap

On Wednesday I attended the 2nd London Linked Data Meetup, organised by Georgi Kobilarov and Silver Oliver and co-located with the JISC Dev8D 2010 event at ULU.

The morning session featured a series of presentations:

  • Tom Heath from Talis started the day with Linked Data: Implications and Applications. Tom introduced the RDF model, and planted the idea that the traditional "document" metaphor (and associated notions like the "desktop" and the "folder") were inappropriate and unnecessarily limiting in the context of Linked Data's Web of Things. Tom really just scratched the surface of this topic, I think, with a few examples of the sort of operations we might want to perform, and there was probably scope for a whole day of exploring it.
  • Tom Scott from the BBC on the Wildlife Finder, the ontology beind it, and some of the issues around generating and exposing the data. I had heard Tom speak before, about the BBC Programmes and Music sites, and again this time I found myself admiring the way he covered potentially quite complex issues very clearly and concisely. The BBC examples provide great illustrations of how linked data is not (or at least should not be) something "apart from" a "Web site", but rather is an integral part of it: they are realisations of the "your Web site is your API" maxim. The BBC's use of Wikipedia as a data source also led into some interesting discussion of trust and provenance, and dealing with the risk of, say, an editor of a Wikipedia page creating malicious content which was then surfaced on the BBC page. At the time of the presentation, the wildlife data was still delivered only in HTML, but Tom announced yesterday that the RDF data was now being exposed, in a similar style to that of the Programmes and Music sites.
  • John Sheridan and Jeni Tennison described their work on initiatives to open up UK government data. This was really something of a whirlwind (or maybe given the presenters' choice of Wild West metaphors, that should be a "twister") tour through a rapidly evolving landscape of current work, but I was impressed by the way they emphasised the practical and pragmatic nature of their approaches, from guidance on URI design through work on provenance, to the current work on a "Linked Data API" (on which more below)
  • Lin Clark of DERI gave a quick summary of support for RDFa in Drupal 7. It was necessarily a very rapid overview, but it was enough to make me make a mental note to try to find the time to explore Drupal 7 in more detail.
  • Georgi Kobilarov and Silver Oliver presented Uberblic, which provides a single integrated point of access to a set of data sources. One of the very cool features of Uberblic is that updates to the sources (e.g. a Wikipedia edit) are reflected in the aggregator in real time.

The morning closed with a panel session chaired by Paul Miller, involving Jeni Tennison, Tom Scott, Ian Davis (Talis) and Timo Hannay (Nature Publishing) which picked up many of the threads from the preceding sessions. My notes (and memories!) from this session seem a bit thin (in my defence, it was just before lunch and we'd covered a lot of ground...), but I do recall discussion of the trade-offs between URI readability and opacity, and the impact on managing persistence, which I think spilled out into quite a lot of discussion on Twitter. IIRC, this session also produced my favourite quote of the day, from Tom Scott, which was something along the lines of, "The idea that doing linked data is really hard is a myth".

Perhaps the most interesting (and timely/topical) session of the day was the workshop at the end of the afternoon by Jeni Tennison, Leigh Dodds and Dave Reynolds, in which they introduced a proposal for what they call a "Linked Data API".

This defines a configurable "middleware" layer that sits in front of a SPARQL endpoint to support the provision of RESTful access to the data, including not only the provision of descriptions of individual identified resources, but also selection and filtering based on simple URI patterns rather than on SPARQL, and the delivery of multiple output formats (including a serialisation of RDF in JSON - and the ability to generate HTML or XHTML). (It only addresses read access, not updates.)

This initiative emerged at least in part out of responses to the data.gov.uk work, and comments on the UK Government Data Developers Google Group and elsewhere by developers unfamiliar with RDF and related technologies. It seeks to try to address the problem that the provision of queries only through SPARQL requires the developers of applications to engage directly with the SPARQL query language, the RDF model and the possibly unfamiliar formats provided by SPARQL. At the same time, this approach also seeks to retain the "essence" of the RDF model in the data - and to provide clients with access to the underlying queries if required: it complements the SPARQL approach, rather than replaces it.

The configurability offers a considerable degree of flexibility in the interface that can be provided - without the requirement to create new application code. Leigh made the important point that the API layer might be provided by the publisher of the SPARQL endpoint, or it might be provided by a third party, acting as an intermediary/proxy to a remote SPARQL endpoint.

IIRC, mentions were made of work in progress on implementations in Ruby, PHP and Java(?).

As a non-developer myself, I hope I haven't misrepresented any of the technical details in my attempt to summarise this. There was a lot of interest in this session at the meeting, and it seems to me this is potentially an important contribution to bridging the gap between the world of Linked Data and SPARQL on the one hand and Web developers on the other, both in terms of lowering immediate barriers to access and in terms of introducing SPARQL more gradually. There is now a Google Group for discussion of the API.

All in all it was an exciting if somewhat exhausting day. The sessions I attended were all pretty much full to capacity and generated a lot of discussion, and it generally felt like there is a great deal of excitement and optimism about what is becoming possible. The tools and infrastructure around linked data are still evolving, certainly, but I was particularly struck - through initiatives like the API project above - by the sense of willingness to respond to comments and criticisms and to try to "build bridges", and to do so in very real, practical ways.

February 15, 2010

VCard in RDF

Via a post to the W3C Semantic Web Interest Group mailing list from Renato Iannella, I noticed that the W3C has published an updated specification for expressing vCard data in RDF, Representing vCard Objects in RDF.

A comment from the W3C Team provides some of the background to the document, and explains that it represents a consolidation of an earlier (2001) formulation of vCard in RDF by Renato published by the W3C and a subsequent ontology created by Norman Walsh, Brian Suda and Harry Halpin, the latter also used, at least in part, by Yahoo SearchMonkey (see Local, Event, Product, Discussion).

The new document also provides an answer to a question which I've been unsure about for years: whether the class of vCards was disjoint with the class of "agents" (e.g. persons or organisations). Or, to put it another way, I wasn't sure whether the properties of a vCard could be used to describe persons or organisations, e.g. "this document was created by the agent with the vCard name 'John Smith'":

@prefix dc: <http://purl.org/dc/terms/> .
@prefix v: <http://www.w3.org/2006/vcard/ns#> .
 <> dc:creator
  [
    v:fn "John Smith"
  ] .

(This, I think, is the pattern recommended by Yahoo SearchMonkey: see, for example, the extended "product" example where the manufacturer of a product is an instance of both the v:VCard class and the commerce:Business class.)

The alternative would be that those properties had to be used to describe a vCard-as-"document" (or as something other than agent, at least), which was in turn related to a person or organisation, e.g. "this document has created by the agent who "has a vCard" with the vCard name 'John Smith'" (I invented an example "hasVCard" property here):

@prefix dc: <http://purl.org/dc/terms/> .
@prefix v: <http://www.w3.org/2006/vcard/ns#> .
@prefix ex: <http://example.org/terms/> .
 <> dc:creator
  [
    ex:hasVCard
     [
       v:fn "John Smith"
     ]
 ] .

To keep the examples brief I used blank nodes in the above examples, but URIs might have been used to refer to the vCard and agent resources too.

In its description of the VCard class, the new document says: Resources that are vCards and the URIs that denote these vCards can also be the same URIs that denote people/orgs. That phrasing seems a bit awkward, but the intent seems pretty clear: the classes are not disjoint, a single resource can be both a vCard and a person, and the properties of a vCard can be applied to a person. So I can use the pattern in my first example above without creating any sort of contradiction, and the second pattern is also permitted.

One consequence of this is that consumers of data need to allow for both patterns - in the general case, at least, though it may be that they have additional information that within a particular dataset only a single pattern is in use.

In the example above, I used an example.org URI for the property that relates an agent to a vCard. The discussion on the list highlights that there are a couple of contenders for properties to meet this requirement: (ov:businessCard from Michael Hausenblas and hcterms:hasCard from Toby Inkster. A proposal to add such a property to the FOAF vocabulary has been made on the FOAF community wiki.

February 01, 2010

HTML5, document metadata and Dublin Core

I recently received a query about the encoding of Dublin Core metadata in HTML5, the revision of the HTML language being developed jointly by the W3C HTML Working Group and the Web Hypertext Application Technology Working Group (WHATWG). It has also been the topic of some recent discussion on the dc-general mailing list. While I've been aware of some of the discussions around metadata features in HTML5, until now I haven't looked in much detail at the current drafts.

There are various versions of the specification(s), all drafts under development and still changing (at times, quite quickly):

  • The WHATWG has a working draft titled HTML5 (including next generation additions still in development). This document is constantly changing; the content at the time I'm writing is dated 30 January 2010, but will no doubt have changed by the time you read this. Of this spec, the WHATWG says: This draft is a superset of the HTML5 work that is being published at the W3C: everything that is in HTML5 is also in the WHATWG HTML spec. Some new experimental features are being added to the WHATWG HTML draft, to continue developing extensions to the language while the W3C work through their milestones for HTML5. In other words, the WHATWG HTML specification is the next generation of the language, while HTML5 is a more limited subset with a narrower scope.
  • The W3C has a "latest public version" of HTML 5: A vocabulary and associated APIs for HTML and XHTML currently the version dated 25 August 2009. (The content of that "date-stamped" version should continue to be available.)
  • The W3C always has a "latest editor's draft" of that document, which at the time of writing is dated 30 January 2010, but also continues to change at frequent intervals. Note that, compared to the previous "latest public version", this draft incorporates some element of restructuring of the content, with some content separated out into "modules".

I can't emphasise too strongly that HTML5 is still a draft and liable to change; as the spec itself says in no uncertain terms: Implementors should be aware that this specification is not stable. Implementors who are not taking part in the discussions are likely to find the specification changing out from under them in incompatible ways..

For the purposes of this discussion I've looked primarily at the third document above, the W3C latest editor's draft. This post is really an attempt to raise some initial questions (and probably to expose my own confusion) rather than to provide any definitive answers. It is based on my (incomplete and very probably imperfect) reading of the drafts as they stand at this point in time - and it represents a personal view only, not a DCMI view.

1. Dublin Core metadata in HTML4 and XHTML

(This section covers DCMI's current recommendations for embedding metadata in X/HTML, so feel free to skip it if you are already familiar with this.)

To date, DCMI's specifications for embedding metadata in X/HTML documents have concerned themselves with representing metadata "about" the document as a whole, "document metadata", if you like. And in HTML4/XHTML, the principal source of document metadata is the head element (HTML4, 7.4). Within the head element:

  • the meta element (HTML4, 7.4.4.2) provides for the representation of "property name" (the value of the @name attribute)/"property value" (the value of the @content attribute) pairs which apply to the document. It also offers the ability to supplement the value with the name of a "scheme" (the value of the @scheme attribute) which is used "to interpret the property's value".
  • the link element (HTML4, 12.3) provides a means of representing a relationship between the document and another resource. It also - in attributes like @hreflang, @title, - suppports the provision of some metadata "about" that second resource.

(I should note here that the above text uses the terminology of the HTML4 specification, not of RDF or the DCMI Abstract Model (DCAM).)

The current DCMI recommendation for embedding document metadata in X/HTML, Expressing Dublin Core metadata using HTML/XHTML meta and link elements - which from here on I'll just refer to as "DC-HTML". Although the current recommendation is dated 2008, that version is only a minor "modernisation" of conventions that DCMI has recommended since the late 1990s. The specification describes a convention for representing what the DCAM calls a description (of the document) - a set of RDF triples - using the HTML meta and link elements and their attributes (and conversely, for interpreting a sequence of HTML meta and link elements as a set of RDF triples/DCAM description set). Contrary to some misconceptions, the convention is not limited to the use of DCMI-owned "terms"; indeed it does not assume the use of any DCMI-owned terms at all.

Consider the example of the following two RDF triples:

@prefix dc: <http://purl.org/dc/terms/> .
@prefix ex: <http://example.org/terms/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
<> dc:modified "2007-07-22"^^xsd:date ;
   ex:commentsOn <http://example.org/docs/123> .

Aside: from the perspective of the DCMI Abstract Model, these would be equivalent to the following description set, expresssed using the DC-Text syntax, but for the rest of this post, to keep things simple, I'll just refer to the RDF triples.

@prefix dc: <http://purl.org/dc/terms/> .
@prefix ex: <http://example.org/terms/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
DescriptionSet (
  Description (
    Statement (
      PropertyURI ( dc:modified )
      LiteralValueString ( "2007-07-22"
        SyntaxEncodingSchemeURI ( xsd:date )
      )
    )
    Statement (
      PropertyURI ( ex:commentsOn )
      ValueURI ( <http://example.org/docs/123> )
      )
    )
  )
)

Following the conventions of DC-HTML, those triples are represented in XHTML as:

Example 1: DC-HTML profile in XHTML

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head profile="http://dublincore.org/documents/2008/08/04/dc-html/">
    <title>Document 001</title>
    <link rel="schema.DC"
          href="http://purl.org/dc/terms/" />
    <link rel="schema.EX"
          href="http://example.org/terms/" />
    <link rel="schema.XSD"
          href="http://www.w3.org/2001/XMLSchema#" />
    <meta name="DC.modified"
          scheme="XSD.date" content="2007-07-22" />
    <link rel="EX.commentsOn"
          href="http://example.org/docs/123" />
  </head>
</html>

A few points to highlight:

  • The example is provided in XHTML but essentially the same syntax would be used in HTML4.
  • The triple with literal object is represented using a meta element.
  • The triple with the URI as object is represented using a link element
  • The predicate (property URI) may be any URI; the DC-HTML convention is not limited to DCMI-owned URIs, i.e. DC-HTML seeks to support the sort of URI-based vocabulary extensibility provided by RDF. There is no "registry" of a bounded set of terms to be used in metadata represented using DC-HTML; or, rather, "the Web is the registry". All an implementer requires to introduce a new property is the authority to assign a URI in some URI space they own (or in which they have been delegated rights to assign URIs).
  • A convention for representing property URIs and datatype URIs as "prefixed names" is used, and in this example three other link elements (with @rel="schema.{prefix})" are introduced to act as "namespace declarations" to support the convention. When a document using DC-HTML is processed, no RDF triples are generated for those link elements (Aside: I have occasionally wondered whether this is abusing the rel attribute, which is intended to capture the type of relationship between the document and the target resource, i.e. it is using a mechanism which does carry semantics for an essentially syntactic end (the abbreviation of URIs). But I'll suspend those misgivings for now...).
  • The prefixes used in these "prefixed names" are arbitrary, and DC-HTML does not specify the use/interpretation of a fixed set of @name or @rel attribute values. In the example above, I chose to associate the "DC" prefix with the "namespace URI" http://purl.org/dc/terms/, though "traditionally" it has been more commonly associated with the "namespace URI" http://purl.org/dc/elements/1.1/. Another document creator might associate the same prefix with a quite different URI again.
  • The DC-HTML profile generates triples only for those meta and link elements where the values of the @name and @rel attributes contain a prefixed name with a prefix for which there is a corresponding "namespace declaration".
  • The datatype of the typed literal is represented by the value of the meta/@scheme attribute.
  • There is no support for RDF blank nodes.

For the purposes of this discussion, perhaps the main point to make is that this use/interpretation of meta and link elements is specific to DC-HTML, not a general interpretation defined by the HTML4 specification. The mapping of prefixed names to URIs using link[@rel="schema.ABC"] "namespace declarations" is a DCMI convention, not part of X/HTML. And this is made possible through the use of a feature of HTML4 and XHTML called a "meta data profile": the document creator signals (by providing a specific URI as value of the head/@profile attribute) that they are applying the DC-HTML conventions and the presence of that attribute value licences a consumer to apply that interpretation of the data in a document. And, further, under that profile, as I noted for the example of the "DC" prefix, there is no single "meaning" assigned to meta/@name or link/@rel values.

In the XHTML case, the profile-dependent interpretation is made accessible in machine-processable form through the use of GRDDL, more specifically of a GRDDL profile transformation. i.e. a GRDDL processor uses the profile URI to access an XHTML "profile document" which provides a pointer to an XSLT transform which, when applied to an XHTML document using the DC-HTML profile, generates an RDF/XML document representing the appropriate RDF triples.

It may also be worth noting at this point that the profile attribute actually supports not just a single URI as value but a space-separated list of URIs i.e. within a single document, multiple profiles may be "applicable". And, potentially, those multiple profiles may specify different interpretations of a single @name or @rel value. I think the intent is that in that case all the interpretations should be applied - and in the case that multiple GRDDL profile transformations are provided, the output should be the result of merging the RDF graphs output from each individual transformation.

Now then, having laboured the point about the importance of the concept of the profile, I strongly suspect - though I don't have concrete evidence to support my suspicion - that it is not widely used by applications that provide and consume data using the other conventions described in the DC-HTML document.

It is certainly easy to find many providers of document metadata in X/HTML that follow some of the syntactic conventions of DC-HTML but do not include the @profile attribute. This is (currently, at least) the case even for many documents on DCMI's own site. And I suspect only a (small?) minority of applications consuming/processing DC metadata embedded in X/HTML documents do so by applying the DC-HTML GRDDL profile transform in this way. I suspect the majority of DC metadata embedded in X/HTML documents is processed without reference to the GRDDL transform, probably without using the @profile attribute value as a "trigger", possibly without generating RDF triples, and perhaps even without applying the "prefixed name"-to-URI mapping at all - i.e. these applications are "on level 1" in terms of the DC "interoperability levels" document. I suspect there are tools which use meta elements to generate simple property/(literal) value indexes, and do so on the basis of a fixed set of meta/@name values, i.e. they index on the basis that the expected values of the meta/@name attribute are "DC.title", "DC.date" (etc) and those tools would ignore values like "ABC.title", even if the "ABC" prefix was associated (via a link[@rel="schema.ABC"] "namespace declaration") with the URI http://purl.org/dc/elements/1.1/ (or http://purl.org/dc/terms/). But yes, I'm entering the realms of speculation here, and we really need some concrete evidence of how applications process such data.

2. RDFa in XHTML and HTML4

Since that DCMI document was published, the W3C has published the RDFa in XHTML specification, RDFa in XHTML: Syntax and Processing. as a W3C Recommendation. RDFa provides a syntax for embedding RDF triples in an XHTML document using attributes (a combination of pre-existing XHTML attributes and additional RDFa-specific attributes.) Unlike the conventions defined by DC-HTML, RDFa supports the representation of any RDF triple, not only triples "about" the document (i.e. with the document URI as subject), and RDFa attributes can be used anywhere in an XHTML document.

Any "document metadata" that could be encoded using the DC-HTML profile could also be represented using RDFa. DCMI has not yet published any guidance on the use of RDFa - not because it doesn't consider RDFa important, I hasten to add, but only because of a lack of available effort. Having said that, (it seems to me) it isn't an area where DCMI would need a new "recommendation", but it may be useful to have some primer-/tutorial-style materials and examples highlighting the use of common constructs used in Dublin Core metadata.

I don't intend to provide a full summary of RDFa, but it is worth noting that, at the syntax level, RDFa introduces the use of a datatype called CURIE which supports the abbreviation of URI references as prefixed names. In XHTML, at least, the prefixes are associated with URIs via XML Namespace declarations. The use of CURIEs in RDFa might be seen as a more generalised, standardised approach to the problem that DC-HTML seeks to address through its own "prefixed name"/"namespace declaration" convention.

It is perhaps worth highlighting one aspect of the RDFa in XHTML processing model here. In RDFa the XHTML link/@rel attribute is treated as supporting both XHTML link types and CURIEs, and any value that matches an entry in the list of link types in the section The rel attribute, MUST be treated as if it was a URI within the XHTML vocabulary, and all other values must be CURIEs. So, the XHTML link types are treated as "reserved keywords", if you like, and a @rel attribute value of "next" is mapped to an RDF predicate of http://www.w3.org/1999/xhtml/vocab#next. For the case of XHTML, those "reserved keywords" are defined as part of the XHTML+RDFa document. They are also listed in the "namespace document" http://www.w3.org/1999/xhtml/vocab, which itself is an XHTML+RDFa document (though, N.B., there are other terms "in that namespace" which are not intended for use as link/@rel values). For a @rel value that is neither a member of the list nor a valid CURIE (e.g. rel="foobar" or rel="DC.modified" or rel="schema.DC"), no RDF triple is generated by an RDFa processor. As a consequence, RDFa "co-exists" well with the DC-HTML profile, by which I mean that an RDFa processor should generate no unanticipated triples from DC-HTML data in an XHTML+RDFa document.

Using RDFa in XHTML, then, the two example triples above could be represented as follows:

Example 2: RDFa in XHTML

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN"
    "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/terms/"
      xmlns:ex="http://example.org/terms/"
      xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
      version="XHTML+RDFa 1.0">
  <head>
    <title>Document 001</title>
    <meta property="dc:modified"
          datatype="xsd:date" content="2007-07-22" />
    <link rel="ex:commentsOn"
          href="http://example.org/docs/123" />
  </head>
</html>

And of course document metadata could be embedded elsewhere in the XHTML+RDFa document, e.g. instead of using the meta and link elements, the data above could be represented in the body of the document:

Example 3: RDFa in XHTML (2)

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN"
    "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/terms/"
      xmlns:ex="http://example.org/terms/"
      xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
      version="XHTML+RDFa 1.0">
  <head>
    <title>Document 001</title>
  </head>
  <body>
    <p>
      Last modified on:
      <span property="dc:modified"
            datatype="xsd:date" content="2007-07-22">22 July 2007</span>
    </p>
    <p>
      Comments on:
      <a rel="ex:commentsOn"
          href="http://example.org/docs/123">Document 123</a>
    </p>
  </body>
</html>

These examples do not make use of a head/@profile attribute. According to Appendix C of the RDFa in XHTML specification, the use of @profile is optional: a @profile value of http://www.w3.org/1999/xhtml/vocab may be included to support a GRDDL-based transform, but it is not required by an RDFa processor. (Having said that, looking at the profile document http://www.w3.org/1999/xhtml/vocab, I can't see a reference to a GRDDL profile transformation in that document.)

The initial RDFa in XHTML specification covered the case of XHTML only. But RDFa is intended as an approach to be used with other markup languages too, and recently a working draft HTML+RDFa has been published. Again, this is a draft which is liable to change. This document describes how RDFa can be used in HTML5 (in both the XML and non-XML syntax), but the rules are also intended to be applicable to HTML4 documents interpreted through the HTML5 parsing rules. For the most part, it provides a set of minor changes to the syntax and processing rules specified in the RDFa in XHTML document.

I think (but I'm not sure!) the above example in HTML4 would look like the following, the only differences (for this example) being the change in the empty element syntax and the use of a different DTD for validation:

Example 4: RDFa in HTML4

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01+RDFa 1.0//EN"
    "http://www.w3.org/MarkUp/DTD/html401-rdfa-1.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/terms/"
      xmlns:ex="http://example.org/terms/"
      xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
      version="HTML+RDFa 1.0">
  <head>
    <title>Document 001</title>
    <meta property="dc:modified"
          datatype="xsd:date" content="2007-07-22" >
    <link rel="ex:commentsOn"
          href="http://example.org/docs/123" >
  </head>
</html>

3. HTML5

The document HTML5 differences from HTML4 offers a summary of the principal differences between HTML4 and HTML5. One general point to note here is that HTML5 is defined as an "abstract language" - it is defined in terms of the HTML Document Object Model - which can be serialised in a format which is compatible with HTML4 and also in an XML format. The "differences" document has little to say on issues specifically related to "document metadata", but it does highlight the removal from the language of some elements and attributes, a topic I'll return to below.

As I mentioned above, the current editor's draft version of HTML5 separates some content out into modules. In the current drafts, three items would seem to be of interest when considering conventions for representing metadata "about" a document:

I'll discuss each of these sources in turn (though I think there is some interdependency in the first two).

3.1. Document Metadata in HTML5

The "Document metadata" section defines the meta and link elements in HTML5. In terms of evaluating how the DC-HTML conventions might be used within HTML5, the following points seem significant:

  • For the @name attribute of the meta element, the spec defines some values, and it provides for a wiki-based registry of other values (HTML5ED, 4.2.5.2).
  • The @scheme attribute of the meta element has been made obsolete and "must not be used by authors".
  • In the property/value pairs represented by meta elements, the value must not be a URI.
  • For the @rel attribute of the link element, the spec defines some values - strictly speaking, tokens that can occur , and it provides for a wiki-based registry of other values (HTML5ED, 5.12.3.19).
  • The @profile attribute of the head element has been made obsolete and "must not be used by authors"

On the validation of values for the meta/@name attribute, HTML5 says:

Conformance checkers must use the information given on the WHATWG Wiki MetaExtensions page to establish if a value is allowed or not: values defined in this specification or marked as "proposed" or "ratified" must be accepted, whereas values marked as "discontinued" or not listed in either this specification or on the aforementioned page must be rejected as invalid. Conformance checkers may cache this information (e.g. for performance reasons or to avoid the use of unreliable network connectivity).

When an author uses a new metadata name not defined by either this specification or the Wiki page, conformance checkers should offer to add the value to the Wiki, with the details described above, with the "proposed" status.

So I think this means that, in order to pass this conformance check as valid, all values of the meta/@name attribute must be registered. The registry currently contains an entry (with status "proposed") for all names beginning "DC.", though I'm not sure whether the registration process is really intended to support such "wildcard" entries. The entry does not indicate whether the intent is that the names correspond to properties of the Dublin Core Metadata Element Set (i.e. with URIs beginning http://purl.org/dc/elements/1.1/) or of the DC Terms collection (i.e. with URIs beginning http://purl.org/dc/terms/). Further, as noted above, the current DC-HTML spec does not prescribe a bounded set of @name values; rather it allows for an open-ended set of prefixed name values, not just names referring to the "terms" owned by DCMI. In HTML5, the expectation seems to be that all such values should be registered. So, for example, when DCMI worked with the Library of Congress to make available a set of RDF properties corresponding to the MARC Relator Codes, identified by LoC-owned URIs, I think the expectation would be that, for data using those properties to be encoded, a corresponding set of @name values would need to be registered. Similarly if an implementer group coins a new URI for a property they require, a new @name value would be required.

If the registration process for HTML5 turns out to be relatively "permissive" (which the text above suggests it may be), it may be that this is not an issue, but it does seem to create a new requirement not present in HTML4/XHTML. However, I note that the registration page currently includes a note that suggests a "high bar" for terms to be "Accepted": For the "Status" section to be changed to "Accepted", the proposed keyword must be defined by a W3C specification in the Candidate Recommendation or Recommendation state. If it fails to go through this process, it is "Unendorsed".

Having said that, the microdata specification refers to the possibility that @name values are URIs, and I think that the implication is that such URI values are exempt from the registration requirement (though this does not seem clear from the discussion of registration in the "core" HTML5 spec).

The meta/@scheme attribute, used in DC-HTML to represent datatype URIs for typed literals, is no longer permitted in HTML5. Section 10.2, which offers suggestions for alternatives for some of the features that have been made obsolete, suggests Use only one scheme per field, or make the scheme declaration part of the value., which I think is suggesting either using a different meta/@name value for each potential scheme value (e.g. "date-W3CDTF", "date-someOtherDateFormat") or using some sort of structured string for the @content value with the scheme name embedded (e.g. "2007-07-22|http://purl.org/dc/terms/W3CDTF")

The section on the registration of meta/@name attribute values includes the paragraph:

Metadata names whose values are to be URLs must not be proposed or accepted. Links must be represented using the link element, not the meta element

This constraint appears to rule out the use of meta/@name to represent the property in cases where (in RDF terms) the object is a literal URI. (This is different from the case where the object is an RDF URI reference, which in DC-HTML is covered by the use of the link element.) For example, the DCMI dc:identifier and dcterms:identifier properties may be used in this way to provide a URI which identifies the document - that may be the document URI itself, or it may be another URI which identifies the same document.

A similar issue to that above for the registration of meta/@name attribute values arises for the case of the link/@rel attribute, for which HTML5 says:

Conformance checkers must use the information given on the WHATWG Wiki RelExtensions page to establish if a value is allowed or not: values defined in this specification or marked as "proposed" or "ratified" must be accepted when used on the elements for which they apply as described in the "Effect on..." field, whereas values marked as "discontinued" or not listed in either this specification or on the aforementioned page must be rejected as invalid. Conformance checkers may cache this information (e.g. for performance reasons or to avoid the use of unreliable network connectivity).

When an author uses a new type not defined by either this specification or the Wiki page, conformance checkers should offer to add the value to the Wiki, with the details described above, with the "proposed" status.

AFAICT, the registry currently contains no entries related specifically to DC-HTML or the DCMI vocabularies.

As for the case of name, the microdata specification refers to the possibility that @rel values are URIs, and again I think the implication is that such URI values are exempt from the registration requirement (though, again, this does not seem clear from the discussion in the "core" HTML5 spec).

Finally, the head/@profile attribute is no longer available in HTML5. and Section 10.2 says:

When used for declaring which meta terms are used in the document, unnecessary; omit it altogether, and register the names.

When used for triggering specific user agent behaviors: use a link element instead.

I think DC-HTML's use of head/@profile places it into the second of these categories: the profile doesn't "declare" a bounded set of terms, but it specifies how a (potentially "open-ended") set of attribute values are to be interpreted/processed.

Furthermore, the draft HTML+RDFa document proposes the (optional) use of a link/@rel value of "profile", and there is a corresponding entry in the registry for @rel values. This seems to be a mechanism for (re-)introducing the HTML4 concept of the meta data profile, using a different syntactic form i.e. using link/@rel in place of the profile attribute. I'm not clear about the extent to which this has support within the wider HTML5 community. If it was adopted, I imagine the GRDDL specification would also evolve to use this mechanism, but that is guesswork on my part.

Julian Retschke summarises most of these issues related to DC-HTML in a recent message to the public-html mailing list here.

3.2. Microdata

Microdata is a new specification, specific to HTML5. The "latest editors draft" version is described as "a module that forms part of the HTML5 series of specifications published at the W3C". The content was previously a part of the "core" HTML5 specification, but the decision was taken recently to separate it from the main spec.

Microdata offers similar functionality to that offered by RDFa in that it allows for the embedding of data anywhere in an HTML5 document. Like RDFa, microdata is a generalised mechanism, not one tied to any particular set of terms, and also like RDFa, microdata introduces a new set of attributes, to be used in combination with existing HTML5 attributes. The syntactic conventions used in microdata are inspired principally by the conventions used in various microformats.

As for the case of RDFa, my purpose here is not to provide a full description of microdata, but to examine whether and how microdata can express the data that in HTML4/XHTML is expressed using the conventions of the DC-HTML profile.

The model underlying microdata is one of nested lists of name-value pairs:

The microdata model consists of groups of name-value pairs known as items.

Each group is known as an item. Each item can have an item type, a global identifier (if the item type supports global identifiers for its items), and a list of name-value pairs. Each name in the name-value pair is known as a property, and each property has one or more values. Each value is either a string or itself a group of name-value pairs (an item).

The microdata model is independent of the RDF model, and is not designed to represent the full RDF model. In particular, microdata does not require the use of URIs as identifiers for properties, though it does allow for the use of URIs. Microdata does not offer - as many RDF syntaxes, including RDFa, do - a mechanism for abbreviating property URIs. But the microdata spec does include an algorithm that provides a (partial, I think?) mapping from microdata to a set of RDF triples.

Probably the main feature of RDF that has no correspondence in microdata is literal datatyping - see the discussion by Jeni Tennison here - though there is a distinct element/attribute for date-time values.

Given this constraint, I don't think it is possible to express the first triple of my example above using microdata. If the typed literal is replaced with a plain literal (i.e. the object is "2007-07-22", rather than "2007-07-22"^^xsd:date), then I think the two triples could be encoded (using the XML serialisation) as follows, i.e. the property URIs appear in full as attribute values:

Example 5: Microdata in HTML5

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE HTML>
<html>
  <head>
    <title>Document 001</title>
    <meta name="http://purl.org/dc/terms/modified"
          content="2007-07-22" />
    <link rel="http://example.org/terms/commentsOn"
          href="http://example.org/docs/123" />
  </head>
</html>

As for the case of RDFa, microdata supports the embedding of data in the body of the document, so the triples could (I think!) also be represented as:

Example 6: Microdata in HTML5 (2)

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE HTML>
<html>
  <head>
    <title>Document 001</title>
  </head>
  <body>
    <div itemscope="" itemid="http://example.org/doc/001">
      <p>
        Last modified on:
        <time itemprop="http://purl.org/dc/terms/modified"
              datetime="2007-07-22">22 July 2007</time>
      </p>
      <p>
        Comments on:
        <a rel="http://example.org/terms/commentsOn"
           href="http://example.org/docs/123">Document 123</a>
      </p>
  </body>
</html>

My understanding is that the itemid attribute is necessary to set the subject of the triple to the URI of the document, but I could be wrong on this point.

Also I think it's worth noting that the microdata-to-RDF algorithm specifies an RDF interpretation for some "core" HTML5 elements and attributes. For example:

  • the head/title element is mapped to a triple with the predicate http://purl.org/dc/terms/title and the element content as literal object
  • meta elements with name and content attributes are mapped to triples where the predicate is either (if the name attribute value is a URI) the value of the name attribute, or the concatenation of the string "http://www.w3.org/1999/xhtml/vocab#" and the value of the name attribute. So, e.g., a name attribute value of "DC.modified" would generate a predicate http://www.w3.org/1999/xhtml/vocab#DC.modified.
  • A similar rule applies for the link element. So, e.g., a rel attribute value of "EX.commentsOn" would generate a predicate http://www.w3.org/1999/xhtml/vocab#EX.commentsOn and a rel attribute value of "schema.DC" would generate a predicate http://www.w3.org/1999/xhtml/vocab#schema.DC

As far as I can tell, these are rules to be applied to any HTML5 document - there is no "flag" to say that they apply to document A but not to document B - so would need to be taken into consideration in any DCMI convention for using meta/@name and link/@rel attributes in HTML5. For example, given the following HTML5 document (and leaving aside for a moment the registration issue, and assuming that "EX.commentsOn", "schema.DC" and "schema.EX" are registered values for @name and @rel)

Example 7: Microdata in HTML5 (3)

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE HTML>
<html>
  <head>
    <title>Document 001</title>
    <link rel="schema.DC"
          href="http://purl.org/dc/terms/" />
    <link rel="schema.EX"
          href="http://example.org/terms/" />
    <meta name="DC.modified"
          content="2007-07-22" />
    <link rel="EX.commentsOn"
          href="http://example.org/docs/123" />
  </head>
</html>

the microdata-to-RDF algorithm would generate the following five RDF triples:

@prefix dc: <http://purl.org/dc/terms/> .
@prefix xhv: http://www.w3.org/1999/xhtml/vocab#> .
<> dc:title "Document 001" ;
   xhv:schema.DC <http://purl.org/dc/terms/> ;
   xhv:schema.EX <http://example.org/terms/> ;
   xhv:DC.modified "2007-07-22" ;
   xhv:EX.commentsOn <http://example.org/docs/123> .

It's probably worth emphasising that although the URIs generated here are not-DCMI-owned URIs, it would be quite possible to assert an "equivalence" between a property with a URI beginning http://www.w3.org/1999/xhtml/vocab# and a corresponding DCMI-owned URI, which would imply a second triple using that DCMI-owned URI (e.g. <http://www.w3.org/1999/xhtml/vocab#DC.modified> owl:equivalentProperty <http://purl.org/dc/terms/modified>) - though, AFAICT, no such equivalence is suggested at the moment.

3.3. RDFa in HTML5

I noted above that, although the initial RDFa syntax specification had focused on the case of XHTML, a recent draft sought to extend this by describing the use of RDFa in HTML, including the case of HTML5.

As I already discussed, using RDFa, it is quite possible to represent any data that could be represented in HTML4/XHTML using the DC-HTML profile. So, using RDFa in HTML5, my two example triples could be represented (using the XML serialisation) as:

Example 8: RDFa in HTML5

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE HTML>
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/terms/"
      xmlns:ex="http://example.org/terms/"
      xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
      version="XHTML+RDFa 1.0">
  <head>
    <title>Document 001</title>
    <meta property="dc:modified"
          datatype="xsd:date" content="2007-07-22" />
    <link rel="ex:commentsOn"
          href="http://example.org/docs/123" />
  </head>
</html>

Note that the RDFa in HTML draft requires the use of the html/@version attribute which, in the current draft, is obsoleted by the "core" HTML5 specification. As noted above, RDFa in HTML also proposes the (optional) use of a link/@rel value of "profile".

In the initial discussion of RDFa above, I noted the existence of a list of "reserved keyword" values for the link/@rel attribute in XHTML, to which an RDFa processor prefixes the URI to generate RDF predicate URIs. In HTML5, that list of reserved values is defined not by the HTML5 specification but by the WHATWG registry of @rel values. So there may be cases where a value used in an link/@rel attribute in an HTML4/XHTML document does not result in an RDFa processor generating a triple (because that value is not included in the HTML4/XHTML reserved list), but the same value in an link/@rel attribute in an HTML5 document does cause an RDFa processor to generate a triple (because that value is included in the HTML5 @rel registry). My understanding is that the RDFa/CURIE processing model is designed to cope with such host-language-specific variations, but it is something document creators will need to be aware of.

4. Some concluding thoughts

DCMI's specifications for embedding metadata in X/HTML have focused on "document metadata", data describing the document. The current DCMI Recommendation for encoding DC metadata in HTML was created in 2008, and is based on the DCMI Abstract Model and on RDF. The syntactic conventions are largely those developed by DCMI in the late 1990s. The current document was developed with reference to HTML4 and XHTML only, and it does not take into consideration the changes introduced by HTML5. The conventions described are not limited to the use of a fixed set of DCMI-owned properties, but support the representation of document metadata using any RDF properties.

Looking at the HTML5 drafts raises various issues:

  1. HTML5 removes some of the syntactic components used by the DC-HTML profile in HTML4/XHTML, namely the scheme and profile attributes.
  2. HTML5 introduces a requirement for the registration of meta/@name and link/@rel attribute values; the current DC-HTML specification makes the assumption that an "open-ended" set of values is available for those attributes.
  3. The status of the concept of the meta data profile in HTML5 seems unclear. On the one hand, the profile attribute has been removed, but the proposed registration of a link/@rel value suggests that the profile approach is still available in HTML5.
  4. The microdata specification provides "global" RDF interpretations for meta and link elements in HTML5.
  5. Microdata offers a mechanism for embedding data in HTML5 documents, and that mechanism can be used for embedding some RDF data in HTML5 documents. Microdata has some limitations (the absence of support for typed literals), but microdata could be used to express a large subset of the data that (in HTML4/XHTML) is expressed using the DC-HTML profile. The microdata specification is still a working draft and liable to change.
  6. RDFa also offers a mechanism for embedding RDF data in HTML5 documents. RDFa is designed to support the RDF model, and RDFa could be used in HTL5 to express the same data that (in HTML4/XHTML) is expressed using the DC-HTML profile. The specification for using RDFa in HTML5 is still a working draft and liable to change.

There seems to be a possible tension between HTML5's requirement for the registration of meta/@name and link/@rel values and the assumption in DC-HTML that an "open-ended" set of values is available. Also, the microdata specification's mapping by simple concatenation of registered meta/@name and link/@rel values to URIs differs from DC-HTML's use of a prefix-to-URI mapping. However, as I suggested above, it seems quite probable that at least some applications using Dublin Core metadata in HTML do indeed operate on the basis of a small set of invariant meta/@name and link/@rel values corresponding to (a subset of the) DCMI-owned properties, i.e. they use a subset of the conventions of DC-HTML to represent document metadata using only a limited set of DCMI-owned properties - to represent data conforming to what DCMI would call a single "description set profile". With the addition of assertions of equivalence between properties (see above), then it would be possible to represent data conforming to the version of "Simple Dublin Core" that I described a little while ago - i.e. using only the properties of the Dublin Core Metadata Element Set with plain literal values - using HTML5 meta/@name (with a registered set of 15 values) and meta/@content attributes.

Both microdata and RDFa, on the other hand, are extensions to HTML5 that are designed to provide generalised, "vocabulary-agnostic" conventions for embedding data in HTML5 documents. Using both microdata and RDFa, data may include, but is not limited to, "document metadata". RDFa is designed to represent RDF data; and microdata can be used to represent some RDF data (it lacks support for typed literals). RDFa includes abbreviation mechanisms for URIs that are broadly similar to those used in the DC-HTML profile (in the sense that they both use a sort of "prefixed name" to abbreviate URIs); microdata does not provide such a mechanism and (I think) would require the use of URIs in full.

Both microdata and RDFa address the problem that DCMI seeks to address via the DC-HTML profile, in the context of a more generalised mechanism for embedding data in HTML5 documents. Both microdata and RDFa could be used in HTML5 to represent document metadata that in HTML4/XHTML is represented using the DC-HTML profile (partially, for the case of microdata because of the absence of datatyping support). Currently, the documents describing RDFa in HTML5 and microdata are both still in draft and the subjects of vigorous debate, and it remains to be seen how/whether they progress through W3C process, and how implementers respond. But it seems to me the use of either would offer an opportunity for DCMI to move away from the maintenance of a DCMI-defined convention and to align with more generalised approaches.

January 31, 2010

Readability and linkability

In July last year I noted that the terminology around Linked Data was not necessarily as clear as we might wish it to be.  Via Twitter yesterday, I was reminded that my colleague, Mike Ellis, has a very nice presentation, Don't think websites, think data, in which he introduces the term MRD - Machine Readable Data.

It's worth a quick look if you have time:

We also used the 'machine-readable' phrase in the original DNER Technical Architecture, the work that went on to underpin the JISC Information Environment, though I think we went on to use both 'machine-understandable' and 'machine-processable' in later work (both even more of a mouthful), usually with reference to what we loosely called 'metadata'.  We also used 'm2m - machine to machine' a lot, a phrase introduced by Lorcan Dempsey I think.  Remember that this was back in 2001, well before the time when the idea of offering an open API had become as widespread as it is today.

All these terms suffer, it seems to me, from emphasising the 'readability' and 'processability' of data over its 'linkedness'. Linkedness is what makes the Web what it is. With hindsight, the major thing that our work on the JISC Information Environment got wrong was to play down the importance of the Web, in favour of a set of digital library standards that focused on sharing 'machine-readable' content for re-use by other bits of software.

Looking at things from the perspective of today, the terms 'Linked Data' and 'Web of Data' both play up the value in content being inter-linked as well as it being what we might call machine-readable.

For example, if we think about open access scholarly communication, the JISC Information Environment (in line with digital libraries more generally) promotes the sharing of content largely through the harvesting of simple DC metadata records, each of which typically contains a link to a PDF copy of the research paper, which, in turn, carries only human-readable citations to other papers.  The DC part of this is certainly MRD... but, overall, the result isn't very inter-linked or Web-like. How much better would it have been to focus some effort on getting more Web links between papers embedded into the papers themselves - using what we would now loosely call a 'micro format'?  One of the reasons I like some of the initiatives around the DOI (though I don't like the DOI much as a technology), CrossRef springs to mind, is that they potentially enable a world where we have the chance of real, solid, persistent Web links between scholarly papers.

RDF, of course, offers the possibility of machine-readability, machine-processable semantics, and links to other content - which is why it is so important and powerful and why initiatives like data.gov.uk need to go beyond the CSV and XML files of this world (which some people argue are good enough) and get stuff converted into RDF form.

As an aside, DCMI have done some interesting work on Interoperability Levels for Dublin Core Metadata. While this work is somewhat specific to DC metadata I think it has some ideas that could be usefully translated into the more general language of the Semantic Web and Linked Data (and probably to the notions of the Web of Data and MRD).

Mike, I think, would probably argue that this is all the musing of a 'purist' and that purists should be ignored - and he might well be right.  I certainly agree with the main thrust of the presentation that we need to 'set our data free', that any form of MRD is better than no MRD at all, and that any API is better than no API.  But we also need to remember that it is fundamentally the hyperlink that has made the Web what it is and that those forms of MRD that will be of most value to us will be those, like RDF, that strongly promote the linkability of content, not just to other content but to concepts and people and places and everything else.

The labels 'Linked Data' and 'Web of Data' are both helpful in reminding us of that.

December 21, 2009

Scanning horizons for the Semantic Web in higher education

The week before last I attended a couple of meetings looking at different aspects of the use of Semantic Web technologies in the education sector.

On the Wednesday, I was invited to a workshop of the JISC-funded ResearchRevealed project at ILRT in Bristol. From the project weblog:

ResearchRevealed [...] has the core aim of demonstrating a fine-grained, access controlled, view layer application for research, built over a content integration repository layer. This will be tested at the University of Bristol and we aim to disseminate open source software and findings of generic applicability to other institutions.

ResearchRevealed will enhance ways in which a range of user stakeholder groups can gain up-to-date, accurate integrated views of research information and thus use existing institutional, UK and potentially global research information to better effect.

I'm not formally part of the project, but Nikki Rogers of ILRT mentioned it to me at the recent VoCamp Bristol meeting, and I expressed a general interest in what they were doing; they were also looking for some concrete input on the use of Dublin Core vocabularies in some of their candidate approaches.

This was the third in a series of small workshops, attended by representatives of the project from Bristol, Oxford and Southampton, and the aim was to make progress on defining a "core Research ontology". The morning session circled mainly around usage scenarios (support for the REF (and other "impact" assessment exercises), building and sustaining cross-institutional collaboration etc), and the (somewhat blurred) boundaries between cross-institutional requirements and institution-specific ones; what data might be aggregated, what might be best "linked to"; and the costs/benefits of rich query interfaces (e.g. SPARQL endpoints) v simpler literal- or URI-based lookups. In the afternoon, Nick Gibbins from the University of Southampton walked through a draft mapping of the CERIF standard to RDF developed by the dotAC project. This focused attention somewhat and led to some - to me - interesting technical discussions about variant ways of expressing information with differing degrees of precision/flexibility. I had to leave before the end of the meeting, but I hope to be able to continue to follow the project's progress, and contribute where I can.

A long train journey later, the following day I was at a meeting in Glasgow organised by the CETIS Semantic Technologies Working Group to discuss the report produced by the recent JISC-funded Semtech project, and to try to identify potential areas for further work in that area by CETIS and/or JISC. Sheila MacNeill from CETIS liveblogged proceedings here. Thanassis Tiropanis from the University of Southampton presented the project report, with a focus on its "roadmap for semantic technology adoption". The report argues that, in the past, the adoption of semantic technologies may have been hindered by a tendency towards a "top-down" approach requiring the widespread agreement on ontologies; in contrast the "linked data" approach encourages more of a "bottom-up" style in which data is first made available as RDF, and then later application-specific or community-wide ontologies are developed to enable more complex reasoning across the base data (which may involve mapping that initial data to those ontologies as they emerge). While I think there's a slight risk of overstating the distinction - in my experience many "linked data" initiatives do seem to demonstrate a good deal of thinking about the choice of RDF vocabularies and compatibility with other datasets - and I guess I see rather more of a continuum, it's probably a useful basis for planning. The report recommends a graduated approach which focusses initially on the development of this "linked data field" - in particular where there are some "low-hanging fruit" cases of data already made available in human-readable form which could relatively easily be made available in RDF, especially using RDFa.

One of the issues I was slightly uneasy with in the Glasgow meeting was that occasionally there were mentions of delivering "interoperability" (or "data interoperability") without really saying what was meant by that - and I say this as someone who used to have the I-word in my job title ;-) I feel we probably need to be clearer, and more precise, about what different "semantic technologies" (for want of a better expression) enable. What does the use of RDF provide that, say, XML typically doesn't? What does, e.g., RDF Schema add to that picture? What about convergence on shared vocabularies? And so on. Of course, the learners, teachers, researchers and administrators using the systems don't need to grapple with this, but it seems to me such aspects do need to be conveyed to the designers and developers, and perhaps more importantly - as Andy highlighted in his report of related discussions at the CETIS conference - to those who plan and prioritise and fund such development activity. (As an aside, I this is also something of an omission in the current version of the DCMI document on "Interoperability Levels": it tells me what characterises each level, and how I can test for whether an application meets the requirements of the level, but it doesn't really tell me what functionality each level provides/enables, or why I should consider level n+1 rather than level n.)

Rather by chance, I came across a recent presentation by Richard Cyganiak to the Vienna Linked Data Camp, which I think addresses some similar questions, albeit from a slightly different starting point: Richard asks the questions, "So, if we have linked data sources, what's stopping the development of great apps? What else do we need?", and highlights various dimensions of "heterogeneity" which may exist across linked data sources (use of identifiers, differences in modelling, differences in RDF vocabularies used, differences in data quality, differences in licensing, and so on).

Finally, I noticed that last Friday, Paul Miller (who was also at the CETIS meeting) announced the availability of a draft of a "Horizon Scan" report on "Linked Data" which he has been working on for JISC, as part of the background for a JISC call for projects in this area some time early in 2010. It's a relatively short document (hurrah for short reports!) but I've only had time for a quick skim through. It aims for some practical recommendations, ranging from general guidance on URI creation and the use of RDFa to more specific actions on particular resources/datasets. And here I must reiterate what Paul says in his post - it's a draft on which he is seeking comments, not the final report, and none of those recommendations have yet been endorsed by JISC. (If you have comments on the document, I suggest that you submit them to Paul (contact details here or comment on his post) rather than commenting on this post.)

In short, it's encouraging to see the active interest in this area growing within the HE sector. On reading Paul's draft document, I was struck by the difference between the atmosphere now (both at the Semtech meeting, and more widely) and what Paul describes as the "muted" conclusions of Brian Matthews' 2005 survey report on Semantic Web Technologies for JISC Techwatch. Of course, many of the challenges that Andy mentioned in his report of the CETIS conference session remain to be addressed, but I do sense that there is a momentum here - an excitement, even - which I'm not sure existed even eighteen months ago. It remains to be seen whether and how that enthusiasm translates into applications of benefit to the educational community, but I look forward to seeing how the upcoming JISC call, and the projects it funds, contribute to these developments.

December 03, 2009

On being niche

I spoke briefly yesterday at a pre-IDCC workshop organised by REPRISE.  I'd been asked to talk about Open, social and linked information environments, which resulted in a re-hash of the talk I gave in Trento a while back.

My talk didn't go too well to be honest, partly because I was on last and we were over-running so I felt a little rushed but more because I'd cut the previous set of slides down from 119 to 6 (4 really!) - don't bother looking at the slides, they are just images - which meant that I struggled to deliver a very coherent message.  I looked at the most significant environmental changes that have occurred since we first started thinking about the JISC IE almost 10 years ago.  The resulting points were largely the same as those I have made previously (listen to the Trento presentation) but with a slightly preservation-related angle:

  • the rise of social networks and the read/write Web, and a growth in resident-like behaviour, means that 'digital identity' and the identification of people have become more obviously important and will remain an important component of provenance information for preservation purposes into the future;
  • Linked Data (and the URI-based resource-oriented approach that goes with it) is conspicuous by its absence in much of our current digital library thinking;
  • scholarly communication is increasingly diffusing across formal and informal services both inside and outside our institutional boundaries (think blogging, Twitter or Google Wave for example) and this has significant implications for preservation strategies.

That's what I thought I was arguing anyway!

I also touched on issues around the growth of the 'open access' agenda, though looking at it now I'm not sure why because that feels like a somewhat orthogonal issue.

Anyway... the middle bullet has to do with being mainstream vs. being niche.  (The previous speaker, who gave an interesting talk about MyExperiment and its use of Linked Data, made a similar point).  I'm not sure one can really describe Linked Data as being mainstream yet, but one of the things I like about the Web Architecture and REST in particular is that they describe architectural approaches that haven proven to be hugely successful, i.e. they describe the Web.  Linked data, it seems to me, builds on these in very helpful ways.  I said that digital library developments often prove to be too niche - that they don't have mainstream impact.  Another way of putting that is that digital library activities don't spend enough time looking at what is going on in the wider environment.  In other contexts, I've argued that "the only good long-term identifier, is a good short-term identifier" and I wonder if that principle can and should be applied more widely.  If you are doing things on a Web-scale, then the whole Web has an interest in solving any problems - be that around preservation or anything else.  If you invent a technical solution that only touches on scholarly communication (for example) who is going to care about it in 50 or 100 years - answer, not all that many people.

It worries me, for example, when I see an architectural diagram (as was shown yesterday) which has channels labelled 'OAI-PMH', XML' and 'the Web'!

After my talk, Chris Rusbridge asked me if we should just get rid of the JISC IE architecture diagram.  I responded that I am happy to do so (though I quipped that I'd like there to be an archival copy somewhere).  But on the train home I couldn't help but wonder if that misses the point.  The diagram is neither here nor there, it's the "service-oriented, we can build it all", mentality that it encapsulates that is the real problem.

Let's throw that out along with the diagram.

December 01, 2009

On "Creating Linked Data"

In the age of Twitter, short, "hey, this is cool" blog posts providing quick pointers have rather fallen out of fashion, but I thought this material was worth drawing attention to here. Jeni Tennison, who is contributing to the current work with Linked Data in UK government, has embarked on a short series of tutorial-style posts called "Creating Linked Data", in which she explains the steps typically involved in reformulating existing data as linked data, and discusses some of the issues arising.

Her "use case" is the scenario in which some data is currently available in CSV format, but I think much of the discussion could equally be applied to the case where the provider is making data available for the first time. The opening post on the sequence ("Analysing and Modelling") provides a nice example of working through the sort of "things or strings?" questions which we've tried to highlight in the context of designing DC Application Profiles. And as Jeni emphasises, this always involves design choices:

It’s worth noting that this is a design process rather than a discovery process. There is no inherent model in any set of data; I can guarantee you that someone else will break down a given set of data in a different way from you. That means you have to make decisions along the way.

And further on in the piece, she rationalises her choices for this example in terms of what those choices enable (e.g. "whenever there’s a set of enumerated values it’s a good idea to consider turning them into things, because to do so enables you to associate extra information about them").

The post on URI design offers some tips, not only on designing new URIs but also on using existing URIs where appropriate: I admit I tend to forget about useful resources like placetime.com "a URI space containing URIs that represent places and times" (and provides redirects to descriptions in various formats).

On a related note, the post on choosing/coining properties, classes and datatypes includes a pointer to the OWL Time ontology. This is something I was aware of, but only looked at in any detail relatively recently. At first glance it can seem rather complex; Ian Davis has a summary graphic which I found helpful in trying to get my head round the core concepts of the ontology.

It seems to me these sort of very common areas like time data are those around which some shared practice will emerge, and articles like these, by "hands-on" practitioners, are important contributions to that process.

November 20, 2009

COI guidance on use of RDFa

Via a post from Mark Birbeck, I notice that the UK Central Office for Information has published some guidelines called Structuring information on the Web for re-usability which include some guidance on the use of RDFa to provide specific types of information, about government consultations and about job vacancies.

This is exciting news as, as far as I know, this is the first formal document from UK central government to provide this sort of quite detailed, resource-type-specific guidance with recommendations on the use of particular RDF vocabularies - guidance of the sort I think will be an essential component in the effective deployment of RDFa and the Linked Data approach. It's also the sort of thing that is of considerable interest to Eduserv, as a developer of Web sites for several government agencies. The document builds directly on the work Mark has been doing in this area, which I mentioned a while ago.

As Mark notes in his post, the document is unequivocal in its expression of the government's commitment to the Linked Data approach:

Government is committed to making its public information and data as widely available as possible. The best way to make structured information available online is to publish it as Linked Data. Linked Data makes the information easier to cut and combine in ways that are relevant to citizens.

Before the announcement of these guidelines, I recently had a look at the "argot" for consultations - "argot" is Mark's term for a specification of how a set of terms from multiple RDF vocabularies is used to meet some application requirement; as I noted in that earlier post, I think it is similar to what DCMI calls an "application profile" - , and I had intended to submit some comments. I fear it is now somewhat late in the day for me to be doing this, but the release of this document has prompted me to write them up here. My comments are concerned primarily with the section titled "Putting consultations into Linked Data"

The guidelines (correctly, I think) establish a clear distinction between the consultation on the one hand and the Web page describing the consultation on the other by (in paragraphs 30/31) introducing a fragment identifier for the URI of the consultation (via the about="#this" attribute). The consultation itself is also modelled as a document, an instance of the class foaf:Document, which in turn "has as parts" the actual document(s) on which comment is being sought, and for which a reply can be sent to some agent.

I confess that my initial "instinctive" reaction to this was that this seemed a slightly odd choice, as a "consultation" seemed to me to be more akin to an event or a process, taking place during an interval of time, which had a as "inputs" to that process a set of documents on which comments were sought, and (typically at least) resulted in the generation of some other document as a "response". And indeed the page describing the Consultation argot introduces the concept as follows (emphasis added):

A consultation is a process whereby Government departments request comments from interested parties, so as to help the department make better decisions. A consultation will usually be focused on a particular topic, and have an accompanying publication that sets the context, and outlines particular questions on which feedback is requested. Other information will include a definite start and end date during which feedback can be submitted and contact details for the person to submit feedback to.

I admit I find it difficult to square this with the notion of a "document". And I think a "consultation-as-event" (described by a Web page) could probably be modelled quite neatly using the Event Ontology or the similar LODE ontology (with some specialisation of classes and properties if required).

Anyway, I appreciate that aspect may be something of a "design choice". So for the remainder of the comments here, I'll stick to the actual approach described by the guidelines (consultation as document).

The RDF properties recommended for the description of the consultation are drawn mainly from Dublin Core vocabularies, and more specifically from the "DC Terms" vocabulary.

The first point to note is that, as Andy noted recently, DCMI recently made some fairly substantive changes to the DC Terms vocabulary, as a result of which the majority of the properties are now the subject of rdfs:range assertions, which indicate whether the value of the property is a literal or a non-literal resource. The guidelines recommend the use of the publisher (paragraphs 32-37), language(paragraphs 38-39), and audience (paragraph 46) properties, all with literal values, e.g.

<span property="dc:publisher" content="Ministry of Justice"></span>

But according to the term descriptions provided by DCMI, the ranges of these properties are the classes dcterms:Agent, dcterms:LinguisticSystem and dcterms:AgentClass respectively. So I think that would require the use of an XHTML-RDFa construct something like the following, introducing a blank node (or a URI for the resource if one is available):

<div rel="dc:publisher"><span property="foaf:name" content="Ministry of Justice"></span></div>

Second, I wasn't sure about the recommendation for the use of the dcterms:source property (paragraph 37). This is used to "indicate the source of the consultation". For the case where this is a distinct resource (i.e. distinct from the consultation and this Web page describing it), this seems OK, but the guidelines also offer the option of referring to the current document (i.e. the Web page) as the source of the consultation:

<span rel="dc:source" resource=""></span>

DCMI's definition of the property is "A related resource from which the described resource is derived", but it seems to me the Web page is acting as a description of the consultation-as-document, rather than a source of it.

Third, the guidelines recommend the use of some of the DCMI date properties (paragraph 42):

  • dcterms:issued for the publication date of the consultation
  • dcterms:available for the start date for receiving comments
  • dcterms:valid for the closing date ("a date through which the consultation is 'valid'")

I think the use of dcterms:valid here is potentially confusing. DCMI's definition is "Date (often a range) of validity of a resource", so on this basis I think the implication of the recommended usage is that the consultation is "valid" only on that date, which is not what is intended. The recommendations for dcterms:issued and dcterms:available are probably OK - though I do think the event-based approach might have helped make the distinction between dates related to documents and dates related to consultation-as-process rather clearer!

Oh dear, this must read like an awful lot of pedantic nitpicking on my part, but my intent is to try to ensure that widely used vocabularies like those provided by DCMI are used as consistently as possible. As I said at the start I'm very pleased to see this sort of very practical guidance appearing (and I apologise to Mark for not submitting my comments earlier!)

November 16, 2009

The future has arrived

Cetis09

About 99% of the way thru Bill Thompson's closing keynote to the CETIS 2009 Conference last week I tweeted:

great technology panorama painted by @billt in closing talk at #cetis09

And it was a great panorama - broad, interesting and entertainingly delivered. It was a good performance and I am hugely in awe of people who can give this kind of presentation. However, what the talk didn't do was move from the "this is where technology has come from, this is where it is now and this is where it is going" kind of stuff to the "and this is what it means for education in the future". Which was a shame because in questioning after his talk Thompson did make some suggestions about the future of print news media (not surprising for someone now at the BBC) and I wanted to hear similar views about the future of teaching, learning and research.

As Oleg Liber pointed out in his question after the talk, universities, and the whole education system around them, are lumbering beasts that will be very slow to change in the face of anything. On that basis, whilst it is interesting to note that (for example) we can now just about store a bit on an atom (meaning that we can potentially store a digital version of all human output on something the weight of a human body), that we can pretty much wire things directly into the human retina, and that Africa will one-day overtake 'digital' Britain in the broadband stakes are interesting individual propositions in their own right, there comes a "so what?" moment where one is left wondering what it actually all means.

As an aside, and on a more personal note, I suggest that my daughter's experience of university (she started at Sheffield Hallam in September) is not actually going to be very different to my own, 30-odd years ago. Lectures don't seem to have changed much. Project work doesn't seem to have changed much. Going out drinking doesn't seem to have changed much. She did meet all her 'hall' flat-mates via Facebook before she arrived in Sheffield I suppose :-) - something I never had the opportunity to do (actually, I never even got a place in hall). There is a big difference in how it is all paid for of course but the interesting question is how different university will be for her children. If the truth is, "not much", then I'm not sure why we are all bothering.

At one point, just after the bit about storing a digital version of all human output I think, Thompson did throw out the question, "...and what does that mean for copyright law?". He didn't give us an answer. Well, I don't know either to be honest... though it doesn't change the fact that creative people need to be rewarded in some way for their endeavours I guess. But the real point here is that the panorama of technological change that Thompson painted for us, interesting as it was, begs some serious thinking about what the future holds.  Maybe Thompson was right to lay out the panorama and leave the serious thinking to us?

He was surprisingly positive about Linked Data, suggesting that the time is now right for this to have a significant impact.  I won't disagree because I've been making the same point myself in various fora, though I tend not to shout it too loudly because I know that the Semantic Web has a history of not quite making it.  Indeed, the two parallel sessions that I attended during the conference, University API and the Giant Global Graph both focused quite heavily on the kinds of resources that universities are sitting on (courses, people/expertise, research data, publications, physical facilities, events and so on) that might usefully be exposed to others in some kind of 'open' fashion.  And much of the debate, particularly in the second session (about which there are now some notes), was around whether Linked Data (i.e. RDF) is the best way to do this - a debate that we've also seen played out recently on the uk-government-data-developers Google Group.

The three primary issues seemed to be:

  • Why should we (universities) invest time and money exposing our data in the hope that people will do something useful/interesting/of value with it when we have many other competing demands on our limited resources?
  • Why should we take the trouble to expose RDF when it's arguably easier for both the owner and the consumer of the data to expose something simpler like a CSV file?
  • Why can't the same ends be achieved by offering one or more services (i.e. a set of one or more APIs) rather than the raw data itself?

In the ensuing debate about the why and the how, there was a strong undercurrent of, "two years ago SOA was all the rage, now Linked Data is all the rage... this is just a fashion thing and in two years time there'll be something else".  I'm not sure that we (or at least I) have a well honed argument against this view but, for me at least, it lies somewhere in the fit with resource-orientation, with the way the Web works, with REST, and with the Web Architecture.

On the issue of the length of time it is taking for the Semantic Web to have any kind of mainstream impact, Ian Davis has an interesting post, Make or Break for the Semantic Web?, arguing that this is not unusual for standards track work:

Technology, especially standards track work, takes years to cross the chasm from early adopters (the technology enthusiasts and visionaries) to the early majority (the pragmatists). And when I say years, I mean years. Take CSS for example. I’d characterise CSS as having crossed the chasm and it’s being used by the early majority and making inroads into the late majority. I don’t think anyone would seriously argue that CSS is not here to stay.

According to this semi-official history of CSS the first proposal was in 1994, about 13 years ago. The first version that was recognisably the CSS we use today was CSS1, issued by the W3C in December 1996. This was followed by CSS2 in 1998, the year that also saw the founding of the Web Standards Project. CSS 2.1 is still under development, along with portions of CSS3.

Paul Walk has also written an interesting post, Linked, Open, Semantic?, in which he argues that our discussions around the Semantic Web and Linked Data tend to mix up three memes (open data, linked data and semantics) in rather unhelpful ways. I tend to agree, though I worry that Paul's proposed distinction between Linked Data and the Semantic Web is actually rather fuzzier than we may like.

On balance, I feel a little uncomfortable that I am not able to offer a better argument against the kinds of anti-Linked Data views expressed above. I think I understand the issues (or at least some of them) pretty well but I don't have them to hand in a kind of this is why Linked Data is the right way forward 'elevator pitch'.

Something to work on I guess!

[Image: a slide from Bill Thompson's closing keynote to the CETIS 2009 Conference]

October 19, 2009

Helpful Dublin Core RDF usage patterns

In the beginning [*] there was the HTML meta element and we used to write things like:

<meta name="DC.Creator"content="Andy Powell">
<meta name="DC.Subject" content="something, something else, something else again">
<meta name="DC.Date.Available" scheme="W3CDTF" content="2009-10-19">
<meta name="DC.Rights" content="Open Database License (ODbL) v1.0">

Then came RDF and a variety of 'syntax' guidance from DCMI and we started writing:

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns:dcterms="http://purl.org/dc/terms/"
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="http://example.net/something">
    <dc:creator>Andy Powell</dc:creator>
    <dcterms:available>2009-10-19</dcterms:available>
    <dc:subject>something</dc:subject>
    <dc:subject>something else</dc:subject>
    <dc:subject>something else again</dc:subject>
    <dc:rights>Open Database License (ODbL) v1.0</dc:rights>
  </rdf:Description>
</rdf:RDF>

Then came the decision to add 15 new properties to the DC terms namespace which reflected the original 15 DC elements but which added a liberal smattering of domains and ranges.  So, now we write:

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
  xmlns:dcterms="http://purl.org/dc/terms/"
  xmlns:foaf="http://xmlns.com/foaf/0.1/"
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="http://example.net/something">
    <dcterms:creator>
      <dcterms:Agent>
        <rdf:value>Andy Powell</rdf:value>
        <foaf:name>Andy Powell</foaf:name>
      </dcterms:Agent>
    </dcterms:creator>
    <dcterms:available
rdf:datatype="http://purl.org/dc/terms/W3CDTF">2009-10-19</dcterms:available>
    <dcterms:subject>
      <rdf:Description>
        <rdf:value>something</rdf:value>
      </rdf:Description>
    </dcterms:subject>
    <dcterms:subject>
      <rdf:Description>
        <rdf:value>something else</rdf:value>
      </rdf:Description>
    </dcterms:subject>
    <dcterms:subject>
      <rdf:Description>
        <rdf:value>something else again</rdf:value>
      </rdf:Description>
    </dcterms:subject>
    <dcterms:rights
rdf:resource="http://opendatacommons.org/licenses/odbl/1.0/" />
  </rdf:Description>
</rdf:RDF>

Despite the added verbosity and rather heavy use of blank nodes in the latter, I think there are good reasons why moving towards this kind of DC usage pattern is a 'good thing'.  In particular, this form allows the same usage pattern to indicate a subject term by URI or literal (or both - see addendum below) meaning that software developers only need to code to a single pattern. It also allows for the use of multiple literals (e.g. in different languages) attached to a single value resource.

The trouble is, a lot of existing usage falls somewhere between the first two forms shown here.  I've seen examples of both coming up in discussions/blog posts about both open government data and open educational resources in recent days.

So here are some useful rules of thumb around DC RDF usage patterns:

  • DC properties never, ever, start with an upper-case letter (i.e. dcterms:Creator simply does not exist).
  • DC properties never, ever, contain a full-stop character (i.e. dcterms:date.available does not exist either).
  • If something can be named by its URI rather than a literal (e.g. the ODbL licence in the above examples) do so using rdf:resource.
  • Always check the range of properties before use.  If the range is anything other than a literal (as is the case with both dc:subject and dc:creator for example) and you don't know the URI of the value, use a blank or typed node to indicate the value and rdf:value to indicate the value string.
  • Do not provide lists of separate keywords as a single dc:subject value.  Repeat the property multiple times, as necessary.
  • Syntax encoding schemes, W3CDTF in this case, are indicated using rdf:datatype.

See the Expressing Dublin Core metadata using the Resource Description Framework (RDF) DCMI Recommendation for more examples and guidance.

[*] The beginning of Dublin Core metadata obviously! :-)

Addendum

As Bruce notes in the comments below, the dcterms:subject pattern that I describe above applies in those situations where you do not know the URI of the subject term. In cases where you do know the URI (as is the case with LCSH for example), the pattern becomes:

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
  xmlns:dcterms="http://purl.org/dc/terms/"
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="http://example.net/something">
    <dcterms:subject>
      <rdf:Description rdf:about="http://id.loc.gov/authorities/sh85101653#concept">
        <rdf:value>Physics</rdf:value>
      </rdf:Description>
    </dcterms:subject>
  </rdf:Description>
</rdf:RDF>

October 14, 2009

Open, social and linked - what do current Web trends tell us about the future of digital libraries?

About a month ago I travelled to Trento in Italy to speak at a Workshop on Advanced Technologies for Digital Libraries organised by the EU-funded CACOA project.

My talk was entitled "Open, social and linked - what do current Web trends tell us about the future of digital libraries?" and I've been holding off blogging about it or sharing my slides because I was hoping to create a slidecast of them. Well... I finally got round to it and here is the result:

Like any 'live' talk, there are bits where I don't get my point across quite as I would have liked but I've left things exactly as they came out when I recorded it. I particularly like my use of "these are all very bog standard... err... standards"! :-)

Towards the end, I refer to David White's 'visitors vs. residents' stuff, about which I note he has just published a video. Nice one.

Anyway... the talk captures a number of threads that I've been thinking and speaking about for the last while. I hope it is of interest.

October 12, 2009

Description Set Profiles for DC-2009

For the first time in a few years, I'm not attending the Dublin Core conference, which is taking place in Seoul, Korea this week. I've been actively trying to cut back on the long-haul travel, mainly to shrink my personal carbon footprint, which for several years was probably the size of a medium-sized Hebridean island. But also I have to admit that increasingly I've found it pretty exhausting to try to engage with a long meeting or deliver a presentation after a long flight, and on occasions I've found myself questioning whether it's the "best" use of my own energy reserves!

More broadly, I suppose I think we - the research community generally, not just the DC community - really have to think carefully about the sustainability of the "traditional" model of large international conferences with hundreds of people, some of them travelling thousands of miles to participate.

But that's a topic for another time, and of course, this week I will miss catching up with friends, practising the fine art of waving my hands around animatedly while also trying to maintain control of a full lunch plate, and hatching completely unrealistic plans over late-night beers. Good luck and safe travels to everyone who is attending the conference in Seoul.

Anyway, Liddy Nevile, who is chairing the Workshop Committee (and has over recent years made efforts to enable and encourage remote participation in the conference), invited Karen Coyle and me to contribute a pre-recorded presentation on Description Set Profiles. We only have a ten minute slot so it's necessarily a rather hasty sprint through the topic, but the results are below.

I think this is the first time I've recorded my own voice since I was in a language lab in an A-Level French class in 1990! To date, I've done my best to avoid anything resembling podcasting, or retrospectively adding audio to any of my presentation slide decks on Slideshare, mostly because I hate listening to the sound of my own voice, perhaps all the more so because I know I tend to "umm" and "err" quite a lot. And even if I write myself a "script" (which in most cases I don't) I seem to find it hard to resist the temptation to change bits and insert additional comments on the fly, and then I realise I'm altering the structure of a sentence and repeating things, and I "umm" and "err" even more... Argh. I'm sure I recorded every sentence of this in Audacity at least three times over! :-)

October 05, 2009

QNames, URIs and datatypes

I posted a version of this on the dc-date Jiscmail list recently, but I thought it was worth a quick posting here too, as I think it's a slightly confusing area and the differences in naming systems is something which I think has tripped up some DC implementers in the past.

The Library of Congress has developed a (very useful looking) XML schema datatype for dates, called the Extended Date Time Format (EDTF). It appears to address several of the issues which were discussed some time ago by the DCMI Date working group. It looks like the primary area of interest for the LoC is the use of the data type in XML and XML Schema, and the documentation for the new datatype illustrates its usage in this context.

However, as far as I can see, it doesn't provide a URI for the datatype, which is a prerequisite for referencing it in RDF. I think sometimes we assume that providing what the XML Namespaces spec calls an expanded name is sufficient, and/or that doing that automatically also provides a URI, but I'm not sure that is the case.

In XML Schema, a datatype is typically referred to using an XML Qualified Name (QName), which is essentially a "shorthand" for an "expanded name". An expanded name has two parts: a (possibly null-valued) "namespace name" and a "local name".

Take the case of the built-in datatypes defined as part of the XML Schema specification. In XML/XML Schema, these datatypes are referenced using these two part "expanded names", typically represented in XML documents as QNames e.g.

<count xsi:type="xsd:int">42</count>

where the prefix xsd is bound to the XML namespace name "http://www.w3.org/2001/XMLSchema". So the two-part expanded name of the XML Schema integer datatype is

( http://www.w3.org/2001/XMLSchema , int )

Note: no trailing "#" on that namespace name.

The key point here is that the expanded name system is a different naming system from the URI system. True, the namespace name component of the expanded name is a URI (if it isn't null), but the expanded name as a whole is not. And furthermore, there is no generalised mapping from expanded name to URI.

To refer to a datatype in RDF, I need a URI, and for the built-in datatypes provided, the XML Schema Datayping spec tells me explicitly how to construct such a URI:

Each built-in datatype in this specification (both *primitive* and *derived*) can be uniquely addressed via a URI Reference constructed as follows:

  1. the base URI is the URI of the XML Schema namespace
  2. the fragment identifier is the name of the datatype

For example, to address the int datatype, the URI is:

* http://www.w3.org/2001/XMLSchema#int

The spec tells me that for the names of this set of datatypes, to generate a URI from the expanded name, I use the local part as the fragment id - but some other mechanism might have been applied.

And this is different from, say, the way RDF/XML maps the QNames of some XML elements to URIs, where the mechanism is simple concatenation of "namespace name" and "local name"

(And, as an aside, I think this makes for a bit of "gotcha" for XML folk coming to RDF syntaxes like Turtle where datatype URIs can be represented as prefixed names, because the "namespace URI" you use in Turtle, where the simple concatenation mechanism is used, needs to include the trailing "#" (http://www.w3.org/2001/XMLSchema#), whereas in XML/XML Schema people are accustomed to using the "hash-less" XML namespace name (http://www.w3.org/2001/XMLSchema)).

But my understanding is that that the mapping defined by the XML Schema datatyping document is specific to the datatypes defined by that document ("Each built-in datatype in this specification... can be uniquely addressed..." (my emphasis)), and the spec is silent on URIs for other user-defined XML Schema datatypes.

So it seems to me that if a community defines a datatype, and they want to make it available for use in both XML/XML Schema and in RDF, they need to provide both an expanded name and a URI for the datatype.

The LoC documentation for the datatype has plenty of XML examples so I can deduce what the expanded name is:

(info:lc/xmlns/edtf-v1 , edt)

But that doesn't tell me what URI I should use to refer to the datatype.

I could make a guess that I should follow the same mapping convention as that provided for the XML Schema built-in datatypes, and decide that the URI to use is

info:lc/xmlns/edtf-v1#edt

But given that there is no global expanded name-URI mapping, and the XML Schema specs provide a mapping only for the names of the built-in types, I think I'd be on shaky ground if I did that, and really the owners of the datatype need to tell me explicitly what URI to use - as the XML Schema spec does for those built-in types.

Douglas Campbell responded to my message pointing to a W3C TAG finding which provides this advice:

If it would be valuable to directly address specific terms in a namespace, namespace owners SHOULD provide identifiers for them.

Ray Denenberg has responded with a number of suggestions: I'm less concerned about the exact form of the URI than that a URI is explicitly provided.

P.S. I didn't comment on the list on the choice of URI scheme, but of course I agree with Andy's suggestion that value would be gained from the use of the http URI scheme.... :-)

September 22, 2009

VoCamp Bristol

At the end of the week before last, I spent a couple of days (well, a day and a half as I left early on Friday) at the VoCamp Bristol meeting, at ILRT, at the University of Bristol.

To quote the VoCamp wiki:

VoCamp is a series of informal events where people can spend some dedicated time creating lightweight vocabularies/ontologies for the Semantic Web/Web of Data. The emphasis of the events is not on creating the perfect ontology in a particular domain, but on creating vocabs that are good enough for people to start using for publishing data on the Web.

I admit that I went into the event slightly unprepared, as I didn't have any firm ideas about any specific vocabulary I wanted to work on, but happy to join in with anyone who was working on anything of interest. Some of the outputs of the various groups are listed on the wiki page.

As well as work on specific vocabularies, the opening discussions highlighted an interest in a small set of more general issues, which included the expression of "structural constraints" and "validation"; broader questions of collecting and interpreting vocabulary usage; representing RDF data using JSON; and the features available in OWL 2. Friday morning was set aside for those topics, which meant I had an opportunity to talk a little bit about the work being done within the Dublin Core Metadata Initiative on "Description Set Profiles", which I've mentioned in some recent posts here. I did hastily knock up a few slides, mainly as an aide memoire to make sure I mentioned various bits and pieces:

There was a useful discussion around various different approaches for representing such patterns of constraints at the level of the RDF graph, either based on query patterns, or on the use of OWL (with a "closed-world" assumption that the "world" in question is the graph at hand). Some of the new features in OWL 2, such as capabilities for expressing restrictions on datatypes seem to make it quite an attractive candidate for this sort of task.

I was asked about whether we had considered the use of OWL in the DCMI context. IIRC, we decided against it mostly because we wanted an approach that built explicitly on the description model of the DCMI Abstract Model (i.e. talked in terms of "descriptions" and "statements" and patterns of use of those particular constructs), though I think the "open-world" considerations were also an issue (See this piece for a discussion of some of the "gotchas" that can arise).

Having said that, it would seem a good idea to explore to what extent the constraint types permitted by the DSP model might be mapped into other form(s) of expressing constraints which might be adopted.

All in all, it was a very enjoyable couple of days: a fairly low-key, thoughtful, gentle sort of gathering - no "pitches", no prizes, no "dragons" in their "dens", or other cod-"bizniz" memes :-) - just an opportunity for people to chat and work together on topics that interested them. Thank you to Tom & Damian & Libby for doing the organisation (and introducing me to a very nice Chinese restaurant in Bristol on the Thursday night!)

August 20, 2009

A socio-political analysis of the history of Dublin Core terms

No... this isn't one. Sorry about that! But reading Eric Hellman's recent blog post, Can Librarians Be Put Directly Onto the Semantic Web?, made me wonder if such a thing would be interesting (in a "how much metadata can you fit on the head of a pin" kind of way!).

Eric's post discusses the need for, or not, inverse properties in the Semantic Web and the necessary changes of mindset in moving from thinking about 'metadata' to thinking about 'ontologies':

In many respects, the most important question for the library world in examining semantic web technologies is whether librarians can successfully transform their expertise in working with metadata into expertise in working with ontologies or models of knowledge. Whereas traditional library metadata has always been focused on helping humans find and make use of information, semantic web ontologies are focused on helping machines find and make use of information. Traditional library metadata is meant to be seen and acted on by humans, and as such has always been an uncomfortable match with relational database technology. Semantic web ontologies, in contrast, are meant to make metadata meaningful and actionable for machines. An ontology is thus a sort of computer program, and the effort of making an RDF schema is the first step of telling a computer how to process a type of information.

I think there's probably some interesting theorising to be done about the history of the Dublin Core metadata properties, in particular about the way they have been named over the years and the way some have explicit inverse properties but others don't.

So, for example, dcterms:creator and dcterms:hasVersion use different naming styles ('creator' rather than 'hasCreator') and dcterms:hasVersion has an explicit inverse (dcterms:isVersionOf) whereas dcterms:creator does not (there is no dcterms:isCreatorOf).

Unfortunately, I don't recall much of the detail of why these changes in attitude to naming occured over the years. My suspicion is that it has something to do with the way our understanding of 'web' metadata has evolved over time.  Two things in particular I guess.  Firstly, the way in which there has been a gradual change from understanding properties as being 'attributes with string values' (very much the view when dc:creator was invented) to understanding properties as 'the meat between two resources in an RDF triple'.  And, secondly, a change in thinking first and foremost about 'card catalogues' and/or relational databases to thinking about triple stores (perhaps better characterised (as Eric did) as a transition between thinking about metadata as something that is viewed by humans to something that is acted upon by software).

I strongly suspect that both these changes in attitude are very much ongoing (at least in the DC community - possibly elsewhere?).

Note also the difference in naming between dcterms:valid and dcterms:dateCopyrighted (both of which are refinements of dcterms:date). The former emerged at a time when the prefered encoding syntaxes tended to prefix 'valid' with 'DC.date.' to give 'DC.date.valid' whereas the latter emerged at a time when properties where recognised as being stand-alone entities (i.e. after the emergence of Semantic Web thinking).

If nothing else, working with the Dublin Core community over the years has served as a very useful reminder about the challenges 'ordinary' (I don't mean that in any way negatively) people face in understanding what some 'geeks' might perceive to be very simple Semantic Web concepts.  I've lost track of the number of 'strings vs. things' type discussions I've been involved in!  And to an extent, part of the reason for developing the DCMI Abstract Model was to try to bridge the gap between a somewhat 'old-skool' (dare I say, 'traditional librarian'?) view of the world and the Semantic Web view of the world.  Of course, one can argue about whether we succeeded in that aim.

July 29, 2009

Enhanced PURL server available

A while ago, I posted about plans to revamp the PURL server software to (amongst other things) introduce support for a range of HTTP response codes. This enables the use of PURLs for the identification of a wider range of resources than "Web documents" using the interaction patterns specified by current guidelines provided by the W3C in the TAG's httpRange-14 resolution and the Cool URIs for the Semantic Web document.

Lorcan Dempsey posted on Twitter yesterday to indicate that OCLC have deployed the new software, developed by Zepheira, on the OCLC purl.org service. Although I've just had time for a quick poke around, and I need to read the documentation more carefully to understand all the new features, it looks like it does the job quite nicely.

This should mean that existing PURL owners who have used PURLs to identify things other than "Web documents" - like DCMI, who use PURLs like http://purl.org/dc/terms/title to identify their "metadata terms", "conceptual" resources - should be able to adjust the appropriate entries on the purl.org server so that interactions follow those guidelines. It also offers a new option for those who wish to set up such redirects but perhaps don't have suitable levels of access to configure their own HTTP server to perform those redirects.

July 21, 2009

Linked data vs. Web of data vs. ...

On Friday I asked what I thought would be a pretty straight-forward question on Twitter:

is there an agreed name for an approach that adopts the 4 principles of #linkeddata minus the phrase, "using the standards (RDF, SPARQL)" ??

Turns out not to be so straight-forward, at least in the eyes of some of my Twitter followers. For example, Paul Miller responded with:

@andypowe11 well, personally, I'd argue that Linked Data does NOT require that phrase. But I know others disagree... ;-)

and

@andypowe11 I'd argue that the important bit is "provide useful information." ;-)

Paul has since written up his views more thoughtfully in his blog, Does Linked Data need RDF?, a post that has generated some interesting responses.

I have to say I disagree with Paul on this, not in the sense that I disagree with his focus on "provide useful information", but in the sense that I think it's too late to re-appropriate the "Linked Data" label to mean anything other than "use http URIs and the RDF model".

To back this up I'd go straight to the horses mouth, Tim Berners-Lee, who gave us his personal view way back in 2006 with his 'design issues' document on Linked Data. This gave us the 4 key principles of Linked Data that are still widely quoted today:

  1. Use URIs as names for things.
  2. Use HTTP URIs so that people can look up those names.
  3. When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL).
  4. Include links to other URIs. so that they can discover more things.

Whilst I admit that there is some wriggle room in the interpretation of the 3rd point - does his use of "RDF, SPARQL" suggested these as possible standards or is the implication intended to be much stronger? - more recent documents indicate that the RDF model is mandated. For example, in Putting Government Data online Tim Berners-Lee says (refering to Linked Data):

The essential message is that whatever data format people want the data in, and whatever format they give it to you in, you use the RDF model as the interconnection bus. That's because RDF connects better than any other model.

So, for me, Linked Data implies use of the RDF model - full stop. If you put data on the Web in other forms, using RSS 2.0 for example, then you are not doing Linked Data, you're doing something else! (Addendum: I note that Ian Davis makes this point rather better in The Linked Data Brand).

Which brings me back to my original question - "what do you call a Linked Data-like approach that doesn't use RDF?" - because, in some circumstances, adhering to a slightly modified form of the 4 principles, namely:

  1. use URIs as names for things
  2. use HTTP URIs so that people can look up those names
  3. when someone looks up a URI, provide useful information
  4. include links to other URIs. so that they can discover more things

might well be a perfectly reasonable and useful thing to do. As purists, we can argue about whether it is as good as 'real' Linked Data but sometimes you've just got to get on and do whatever you can.

A couple of people suggested that the phrase 'Web of data' might capture what I want. Possibly... though looking at Tom Coates' Native to a Web of Data presentation it's clear that his 10 principles go further than the 4 above.  Maybe that doesn't matter? Others suggested "hypermedia" or "RESTful information systems" or "RESTful HTTP" none of which strikes me as quite right.

I therefore remain somewhat confused. I quite like Bill de hÓra's post on "links in content", Snowflake APIs, but, again, I'm not sure it gets us closer to an agreed label?

In a comment on a post by Michael Hausenblas, What else?, Dan Brickley says:

I have no problem whatsoever with non-RDF forms of data in “the data Web”. This is natural, normal and healthy. Stastical information, geographic information, data-annotated SVG images, audio samples, JSON feeds, Atom, whatever.

We don’t need all this to be in RDF. Often it’ll be nice to have extracts and summaries in RDF, and we can get that via GRDDL or other methods. And we’ll also have metadata about that data, again in RDF; using SKOS for indicating subject areas, FOAF++ for provenance, etc.

The non-RDF bits of the data Web are – roughly – going to be the leaves on the tree. The bit that links it all together will be, as you say, the typed links, loose structuring and so on that come with RDF. This is also roughly analagous to the HTML Web: you find JPEGs, WAVs, flash files and so on linked in from the HTML Web, but the thing that hangs it all together isn’t flash or audio files, it’s the linky extensible format: HTML. For data, we’ll see more RDF than HTML (or RDFa bridging the two). But we needn’t panic if people put non-RDF data up online…. it’s still better than nothing. And as the LOD scene has shown, it can often easily be processed and republished by others. People worry too much! :)

Count me in as a worrier then!

I ask because, as a not-for-profit provider of hosting and Web development solutions to the UK public sector, Eduserv needs to start thinking about the implications of Tim Berners-Lee's appointment as an advisor to the UK government on 'open data' issues on the kinds of solutions we provide.  Clearly, Linked Data is going to feature heavily in this space but I fully expect that lots of stuff will also happen outside the RDF fold.  It's important for us to understand this landscape and the impact it might have on future services.

April 24, 2009

More RDFa in UK government

It's quite exciting to see various initiatives within UK government starting to make use of Semantic Web technologies, and particularly of RDFa. At the recent OKCon conference, I heard Jeni Tennison talk about her work on using RDFa in the London Gazette. Yesterday, Mark Birbeck published a post outlining some of his work with the Central Office of Information.

The example Mark focuses on is that of a job vacancy, where RDFa is used to provide descriptions of various related resources: the vacancy, the job for which the vacancy is available, a person to contact, and so on. Mark provides an example of a little display app built on the Yahoo SearchMonkey platform which processes this data.

As a a footnote (a somewhat lengthy one, now that I've written it!), I'd just draw attention to Mark's description of developing what he calls an RDF "argot" for constructing such descriptions:

The first vocabularies -- or argots -- that I defined were for job vacancies, but in order to make the terminology usable in other situations, I broke out argots for replying to the vacancy, the specification of contact details, location information, and so on.

An argot doesn't necessarily involve the creation of new terms, and in fact most of the argots use terms from Dublin Core, FOAF and vCard. So although new terms have been created if they are needed, the main idea behind an argot is to collect together terms from various vocabularies that suit a particular purpose.

I was struck by some of the parallels between this and DCMI's descriptions of developing what it calls an "DC application profile" - with the caveat that DCMI typically talks in terms of the DCMI Abstract Model rather than directly of the RDF model. e.g. the Singapore Framework notes:

In a Dublin Core Application Profile, the terms referenced are, as one would expect, terms of the type described by the DCMI Abstract Model, i.e. a DCAP describes, for some class of metadata descriptions, which properties are referenced in statements and how the use of those properties may be constrained by, for example, specifying the use of vocabulary encoding schemes and syntax encoding schemes. The DC notion of the application profile imposes no limitations on whether those properties or encoding schemes are defined and managed by DCMI or by some other agency

And in the draft Guidelines for Dublin Core Application Profiles:

the entities in the domain model -- whether Book and Author, Manifestation and Copy, or just a generic Resource -- are types of things to be described in our metadata. The next step is to choose properties for describing these things. For example, a book has a title and author, and a person has a name; title, author, and name are properties.

The next step, then, is to scan available RDF vocabularies to see whether the properties needed already exist. DCMI Metadata Terms is a good source of properties for describing intellectual resources like documents and web pages; the "Friend of a Friend" vocabulary has useful properties for describing people. If the properties one needs are not already available, it is possible to declare one's own

And indeed the Job Vacancy argot which Mark points to would, I think, probably be fairly recognisable to those familiar with the DCAP notion: compare, for example, with the case of the Scholarly Works Application Profile. The differences are that (I think) an "argot" focuses on the description of a single resource type, and I don't think it goes as far as a formal description of structural constraints in quite the same way DCMI's Description Set Profile model does.

January 14, 2009

Resource List Management on the Semantic Web

Via a post by Ivan Herman of the W3C, I came across a W3C case study titled A Linked Open Data Resource List Management Tool for Undergraduate Students, based on work done between Talis and the University of Plymouth.

Andy and I visited Talis, well, I was going to say a few months ago, but it was probably the middle of last year, and Rob Styles, Chris Clarke & other Talisians talked to us a little bit then about this work, but at that point I don't think they had a live system to show.

This looks pretty neat stuff. It's an RDF application, based on the Talis Platform. They make use of a number of existing ontologies (SIOC, BIBO) and have designed a simple ontology for the Reading Lists themselves and also one for the organisational structure of an academic institution, the AIISO ontology - which I imagine may be of interest to other projects working in this area.

Intelligent "bookmarking" tools for adding items to lists use a variety of techniques to extract metadata from Web pages (in a similar way to the Zotero citation manager tool); the metadata is exposed as RDFa in XHTML representations of the lists, which makes it available to systems like Yahoo's SearchMonkey; other RDF formats are available via content negotiation (following the Linked Data/Cool URIs for the Semantic Web principles); and a SPARQL endpoint for the dataset is available (though I'm not sure whether this is public). The system also allows students to provide annotations, which are also stored as RDF data, but in a separate data store from the "primary" reading list data, allowing different access controls.

About

Search

Loading
eFoundations is powered by TypePad