January 10, 2012

Introducing Bricolage

My last post here (two months ago - yikes, must do better...) was an appeal to anyone who might be interested in my making a contribution to a project under the JISC call 16/11: JISC Digital infrastructure programme to get in touch. I'm pleased to say that Jasper Tredgold of ILRT at the University of Bristol contacted me about a proposal he was putting together for a project called Bricolage, with the prospect of my doing some consultancy. The proposal was to work with the University Library's Special Collections department and the University Geology Museum to make available metadata for two collections - the archive of Penguin Ltd. and the specimen collection of the Geology Museum - as Linked Open Data.

And just before I went off for the Christmas break, Jasper let me know that the proposal had been accepted and the project was being funded. I'm very pleased to have another opportunity to carry on applying some of what I've learned in the other JISC-funded projects I've contributed to recently, and also to explore some new categories of data. It's also nice to be working with a local university - I worked on a few projects with folks from ILRT during my time at UKOLN, and from a selfish perspective I look forward to project meetings which involve a twenty-minute walk up the hill for me rather than a 7am start and a three or four hour train journey!

The project will start in February and run through to July. I'm sure there'll be a project blog once we get going and I'll add a URI here when it is available.

November 04, 2011

JISC Digital Infrastructure programme call

JISC currently has a call, 16/11: JISC Digital infrastructure programme, open for project proposals in a number of "strands"/areas, including the following:

  • Resource Discovery: "This programme area supports the implementation of the resource discovery taskforce vision by funding higher education libraries archives and museums to make open metadata about their collections available in a sustainable way. The aim of this programme area is to develop more flexible, efficient and effective ways to support resource discovery and to make essential resources more visible and usable for research and learning."

This strand advances the work of the UK Discovery initiative, and is similar to the "Infrastructure for Resource Discovery" strand of the JISC 15/10 call under which the SALDA project (in which I worked with the University of Sussex Library on the Mass Observation Archive data) was funded. There is funding for up to ten projects of between £25,000 and £75,000 per project in this strand.

First, I should say this is a great opportunity to explore this area of work and I think we're fortunate that JISC is able to fund this sort of activity. A few particular things I noticed about the current call:

  • a priority for "tools and techniques that can be used by other institutions"
  • a focus on unique resources/collections not duplicated elsewhere
  • should build on lessons of earlier projects, but must avoid duplication/reinvention
  • a particular mention of "exploring the value of integrating structured data into webpages using microformats, microdata, RDFa and similar technologies" as an area in scope
  • an emphasis on sharing the experience/lessons learned: "The lessons learned by projects funded under this call are expected to be as important as the open metadata produced. All projects should build sharing of lessons into their plans. All project reporting will be managed by a project blog. Bidders should commit to sharing the lessons they learn via a blog"

Re that last point, as I've said before, one of the things I most enjoyed about the SALDA and LOCAH projects was the sense that we were interested in sharing the ideas as well as getting the data out there.

I'm conscious the clock is ticking towards the submission deadline, and I should have posted this earlier, but if anyone reading is considering a proposal and thinks that I could make a useful contribution, I'd be interested to hear from you. My particular areas of experience/interest are around Linked Data, and are probably best reflected by the posts I made on the LOCAH and SALDA blogs, i.e. data modelling, URI pattern design, identification/selection of useful RDF vocabularies, identification of potential relationships with things described in other datasets, construction of queries using SPARQL, etc. I do have some familiarity with RDFa, rather less with microdata and microformats. I'm not a software developer, but I can do a little bit of XSLT (and maybe enough PHP to be dangerous, i.e. to hack together rather flakey basic demonstrators). And I'm not a technical architect, but I did get some experience of working with triple stores in those recent projects.

My recent work has been mainly with archival metadata, and I'd be particularly interested in further work which complements that. I'm conscious of the constraint in the call of not repeating earlier work, so I don't think "reapplying" the sort of EAD to RDF work I did with LOCAH and SALDA would fit the bill. (I'd love to do something around the event/narrative/storytelling angle that I wrote about recently here, for example.) Having said that, I certainly don't mean to limit myself to archival data. Anyway, if you think I might be able to help, please do get in touch (pete.johnston@eduserv.org.uk).

October 05, 2011

Storytelling, archives and linked data

Yesterday on Twitter I saw Owen Stephens (@ostephens) post a reference to a presentation titled "Every Story has a Beginning", by Tim Sherratt (@wragge), "digital historian, web developer and cultural data hacker" from Canberra, Australia.

The presentation content is available here, and the text of the talk is here. I think you really need to read the text in one window and click through the presentation in another. I found it fascinating, and pretty inspiring, from several perspectives.

First, I enjoyed the form of the presentation itself. The content is built up incrementally on the screen, with an engaging element of "dynamism" but kept simple enough to avoid the sort of vertiginous barrage that seems to characterise the few Prezi presentations I've witnessed. And perhaps most important of all, the presentation itself is very much "a thing of the Web": many of the images are hyperlinked through to the "live" resources pictured, providing not only a record of "provenance" for the examples, but a direct gateway into the data sources themselves, allowing people to explore the broader context of those individual records or fragments or visualisations.

Second, it provides some compelling examples of how digitised historical materials and data extracted or derived from them can be brought together in new combinations and used to uncover and (re)tell stories - and stories not just of the "famous", the influential and powerful, but of ordinary people whose life events were captured in historical records of various forms. (Aside: Kate Bagnall has some thoughtful posts looking at some of the ethical issues of making people who were "invisible" "visible").

Finally, what really intrigued me from the technical perspective was that - if I understand correctly - the presentation is being driven by a set of RDF data. (Tim said on Twitter he'd post more explaining some of the mechanics of what he has done, and I admit I'm jumping the gun somewhat in writing this post, so I apologise for any misinterpretations.) In his presentation, Tim says:

What we need is a data framework that sits beneath the text, identifying people, dates and places, and defining relationships between them and our documentary sources. A framework that computers could understand and interpret, so that if they saw something they knew was a placename they could head off and look for other people associated with that place. Instead of just presenting our research we’d be creating a whole series of points of connection, discovery and aggregation.

Sounds a bit far-fetched? Well it’s not. We have it already — it’s called the Semantic Web.

The Semantic Web exposes the structures that are implicit in our web pages and our texts in ways that computers can understand. The Linked Data movement takes the basic ideas of the Semantic Web and turns them into a collaborative activity. You share vocabularies, so that other people (and computers) know when you’re talking about the same sorts of things. You share identifiers, so that other people (and computers) know that you’re talking about a specific person, place, object or whatever.

Linked Data is Storytelling 101 for computers. It doesn’t have the full richness, complexity and nuance that we invest in our narratives, but it does at least help computers to fit all the bits together in meaningful ways. And if we talk nice to them, then they can apply their newly-acquired interpretative skills to the things that they’re already good at — like searching, aggregating, or generating the sorts of big pictures that enable us to explore the contexts of our stories.

So, if we look at the RDF data for Tim's presentation, it includes "descriptions" of many different "things", including people, like Alexander Kelley, the subject of his first "story" (to save space, I've skipped the prefix declarations in these snippets but I hope they convey the sense of the data):

story:kelley a foaf1:Person ;
     bio:death story:kelley_death ;
     bio:event
         story:kelley_cremation,
         story:kelley_discharge,
         story:kelley_enlistment,
         story:kelley_reunion,
         story:kelley_wounded_1,
         story:kelley_wounded_2 ;
     foaf1:familyName "Kelley"@en-US ;
     foaf1:givenName "Alexander"@en-US ;
     foaf1:isPrimaryTopicOf story:kelley_moa ;
     foaf1:name "Alexander Dewar Kelley"@en-US ;
     foaf1:page 
       <http://discontents.com.au/shoebox/every-story-has-a-beginning> . 

There is data about events in his life:

story:kelley_discharge a bio:Event ;
     rdfs:label 
       "Discharged from the Australian Imperial Force."@en-US ;
     dc:date "1918-11-22"@en-US . 

story:kelley_enlistment a bio:Event ;
     rdfs:label 
       "Enlistment in the Australian Imperial Force for 
        service in the First World War."@en-US ;
     dc:date "1916-01-22"@en-US . 
     
story:kelley_ww1_service a bio:Interval ;
     bio:concludingEvent story:kelley_discharge ;
     bio:initiatingEvent story:kelley_enlistment ;
     foaf1:isPrimaryTopicOf story:kelley_ww1_record . 

and about the archival materials that record/describe those events:

story:kelley_ww1_record a locah:ArchivalResource ;
     locah:accessProvidedBy 
       <http://dbpedia.org/resource/National_Archives_of_Australia> ;
     dc:identifier "B2455, KELLEY ALEXANDER DEWAR"@en-US ;
     bibo:uri 
       "http://www.aa.gov.au/cgi-bin/Search?O=I&Number=7336927"@en-US . 

The presentation itself, the conference at which it was presented, various projects and researchers mentioned - all of these are also "things" described in the data.
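Data like this also lends itself to simple querying. As a sketch (assuming the prefix bindings skipped in the snippets above, with bio:, rdfs: and dc: bound to the usual BIO, RDF Schema and Dublin Core element namespaces), a SPARQL query to list each person's life events in date order might look like:

```sparql
PREFIX foaf1: <http://xmlns.com/foaf/0.1/>
PREFIX bio:   <http://purl.org/vocab/bio/0.1/>
PREFIX rdfs:  <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dc:    <http://purl.org/dc/elements/1.1/>

# For each person, list the labelled events in their life, earliest first
SELECT ?person ?label ?date
WHERE {
  ?person a foaf1:Person ;
          bio:event ?event .
  ?event rdfs:label ?label .
  OPTIONAL { ?event dc:date ?date }
}
ORDER BY ?person ?date
```

Run against the data above, that should return Kelley's enlistment, woundings, discharge and so on, each with its label and (where the data provides one) its date.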

I'd be interested in hearing more about how this data was created, the extent to which it was possible to extract the description of people, events, archival resources etc directly from existing data sources and the extent to which it was necessary to "hand craft" parts of it.

But I get very excited when I think about the potential in this sort of area if (when!?) we do have the data for historical records available as linked data (and available under open licences that support its free use).

Imagine having a "story building tool" which enables a "narrator" to visit a linked data page provided by the National Archives of Australia or the Archives Hub or one of the other projects Tim refers to, and to select and "intelligently clip" a chunk of data which you can then arrange into the "story" you are constructing - in much the way that bookmarklets for tools like Tumblr and Posterous enable you to do for text and images now. That "clipped chunk of data" could include a description of a person and some of their life events and metadata about digitised archival resources, including URIs of images - as in Tim's examples. You might follow pointers to other datasets from which additional data could be pulled. You might supplement the "clipped" data with your own commentary. Then imagine doing the same with data from the BBC describing a TV programme or radio broadcast related to the same person or events, or with data from a research repository describing papers about the person or events. The tool could generate some "provenance data" for each "chunk" saying "this subgraph was part of that graph over there, which was created by agency A on date D" in much the way that the blogging bookmarklets provide backlinks to their sources.

And the same components might be reorganised, or recombined with others, to tell different stories, or variants of the same story.

Now, yes, I'm sure there are some thorny issues to grapple with here, and coming up with an interface that balances usability and the required degree of control may be a challenge - so maybe I'm getting carried away with my enthusiasm, but it doesn't seem to be an entirely unreasonable proposition.

I think it's important here that, as Tim emphasises towards the end of his text, it is the human "narrator", not an algorithm, who decides on the structure of the story and selects its components and brings them into (possibly new) relationships with each other.

I'm aware that there's other work in this area of narratives and stories, particularly from some of the people at the BBC, but I admit I haven't been able to keep up with it in detail. See e.g. Tristan Ferne on "The Mythology Engine" and Michael Smethurst's thoughtful "Storytellin'".

For me, Tim's very concrete examples made the potential of these approaches seem very real. They suggest a vision of Linked Data not as one more institutional "output", but as a basis for individual and collective creativity and empowerment, for the (re)telling of stories that have been at least partly concealed - stories which may even challenge the "dominant" stories told by the powerful. It seems all too infrequent these days that I come across something that reminds me why I bothered getting interested in metadata and data on the Web in the first place: Tim's presentation was one of those things.

September 19, 2011

Things & their conceptualisations: SKOS, foaf:focus & modelling choices

I thought I'd try to write up some thoughts around an issue which I've come across in a few different contexts recently, and which as a shorthand I sometimes think of as "the foaf:focus question". It was prompted mainly by:

  • my work on modelling the Archives Hub data during the LOCAH project, and looking at datasets to which we wanted to make links, such as VIAF;
  • looking at the data model for the recent Linked Open BNB dataset from the British Library, and how some Dublin Core properties were being used, and some email discussions I had around that;
  • a recent message by Dan Brickley to the foaf-dev mailing list, explaining how the design of some FOAF properties was conditioned by the context at the time, and reflecting on how that context had changed, and what the implications of those changes might be.

Rather by chance on Friday evening, just as I was about to try to tie up what had become a rather long and rambling post, I noticed a conversation on Twitter, initiated by John Goodwin (@gothwin), which I think addressed the broader issue which I'd been circling around without quite addressing it: the use of different "modelling styles" and the issues which arise as a result when we try to link or merge data.

After much chopping and changing, the post adopts a very roughly "chronological" approach. The initial parts cover areas and activities that I wasn't directly involved in at the time, so I am providing my own retrospective interpretation based on my reading of the sources rather than a first-hand account, and I apologise in advance for any omissions or misrepresentations.

FOAF and "interests"

The FOAF Project was an initiative launched by a community of Semantic Web enthusiasts back in 2000, which explored the use of the - then newly emerging - Resource Description Framework specification to express information about individuals, their contact details and their interests and projects - the sort of information that was typically presented on a "personal home page" - and also some of the practical considerations in providing and consuming such information on the Web as RDF. The principal "deliverable" of the project is the Friend of a Friend (FOAF) RDF vocabulary, which continues to evolve and is now very widely used.

As Dan Brickley notes in his recent post, when looking at some of the FOAF properties from the perspective of 2011, their design may seem slightly "unwieldy" to the newcomer, and this is in part because their design was shaped by the context of how the Web was being used at the time of their creation, perhaps nine or ten years ago. At that point, as Dan notes, while there were URIs available for Web documents, and a growing recognition of the importance of maintaining the stability of those URIs, the use of http URIs to identify things other than documents was much less widely adopted, and we often lacked stable URIs for those things of other types that we wanted to "talk about".

One of the use cases covered by FOAF is to express the "interests" of an individual - where that "interest" might be an idea, a place, a person, an event or series of events, anything at all. To work around the issue of the availability of a URI for that thing, FOAF adopted a convention of "indirection" in some of its early properties. So, for example, the foaf:interest property expresses a relation, not between agent and idea/place/person (etc), but between agent and document: it says "this agent is interested in the topic of this page - the thing the page is 'about'" - where that might be anything at all. Using this convention, the topic itself is not explicitly referenced, so the question of its URI does not arise.

So, for example, to express an interest in the Napoleonic Wars, one might make use of the URI of the Wikipedia page 'about' that thing, and say:

@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix person: <http://example.org/id/person/> .

person:fred 
        foaf:interest
          <http://en.wikipedia.org/wiki/Napoleonic_Wars> .

A second property, foaf:topic_interest, does allow for the expression of that "direct" relationship, linking the agent to the thing of interest - which again might be anything at all. (I'm not sure whether these two properties were created at the same time or whether one preceded the other). Even in the absence of URIs for concepts and people and places, RDF allows for the use of "blank nodes" to refer to such things.

@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix person: <http://example.org/id/person/> .

person:fred 
        foaf:topic_interest [ rdfs:label "Napoleonic Wars"@en ] .

However, a blank node is limited in its scope as an identifier to the graph within which it is found: if Fred provides a blank node for the notion of the Napoleonic Wars in his graph and Freda provides a blank node for an interest in her graph, I can't tell (from that information alone) whether those two nodes are references to the same thing or to two different things (e.g. the historical event and a book about the event). Again, historically, one solution to this problem was to introduce the URI of a document, together with some inferencing based on OWL:

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix person: <http://example.org/id/person/> .

person:fred 
        foaf:topic_interest 
          [ rdfs:label "Napoleonic Wars"@en ;
            foaf:isPrimaryTopicOf 
              <http://en.wikipedia.org/wiki/Napoleonic_Wars> ] .

According to the FOAF documentation for foaf:isPrimaryTopicOf:

The isPrimaryTopicOf property is inverse functional: for any document that is the value of this property, there is at most one thing in the world that is the primary topic of that document. This is useful, as it allows for data merging

i.e. if Freda's graph says:

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix person: <http://example.org/id/person/> .

person:freda 
        foaf:topic_interest 
          [ rdfs:label "Napoleonic Wars"@en ;
            foaf:isPrimaryTopicOf 
              <http://en.wikipedia.org/wiki/Napoleonic_Wars> ] .

then my application can conclude that they are indeed both interested in the same thing - though that does depend on that application having some "built-in knowledge" of the OWL inferencing rules (or access to another service which does).
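To spell out that inference: the FOAF vocabulary declares foaf:isPrimaryTopicOf to be an owl:InverseFunctionalProperty, so an OWL-aware application merging the two graphs is entitled to conclude that any two nodes sharing a value for that property denote one and the same resource. Roughly (a sketch, with the entailment shown as a comment):

```turtle
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .

# The axiom, from the FOAF vocabulary itself:
foaf:isPrimaryTopicOf a owl:InverseFunctionalProperty .

# Fred's blank node and Freda's blank node share the same value for
# foaf:isPrimaryTopicOf (the Wikipedia page URI), so the merged graph
# entails:
#   _:fredsInterest owl:sameAs _:fredasInterest .
```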

http URIs for Things

The "httpRange-14 resolution" by the W3C Technical Architecture Group, the publication of the W3C Note on Cool URIs for the Semantic Web, the adoption of the principles of Linked Data and the emergence of a large number of datasets based on those principles has, of course, changed the landscape considerably, and the use of http URIs to identify things other than documents has become commonplace - even if there remain concerns about the practical challenges of implementing some of the recommended techniques.

So, now DBpedia assigns a distinct http URI for the thing the Wikipedia page http://en.wikipedia.org/wiki/Napoleonic_Wars "is about", and provides a description of that thing in which it says (amongst other things):

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

<http://dbpedia.org/resource/Napoleonic_Wars> 
        rdfs:label "Napoleonic Wars"@en ;
        a <http://dbpedia.org/ontology/Event> ,
          <http://dbpedia.org/ontology/MilitaryConflict> ,
          <http://umbel.org/umbel/rc/Event> ,
          <http://umbel.org/umbel/rc/ConflictEvent> ;
        foaf:page 
          <http://en.wikipedia.org/wiki/Napoleonic_Wars> .

i.e. that thing, the topic of the page, is an event, a military conflict etc.

We could substitute this new DBpedia URI for the blank nodes in our foaf:topic_interest data above:

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix person: <http://example.org/id/person/> .

person:fred 
        foaf:topic_interest 
          <http://dbpedia.org/resource/Napoleonic_Wars> .

<http://dbpedia.org/resource/Napoleonic_Wars>
        rdfs:label "Napoleonic Wars"@en ;
        foaf:isPrimaryTopicOf 
          <http://en.wikipedia.org/wiki/Napoleonic_Wars> .

and, similarly, Freda's graph becomes:

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix person: <http://example.org/id/person/> .

person:freda 
        foaf:topic_interest
          <http://dbpedia.org/resource/Napoleonic_Wars> .

<http://dbpedia.org/resource/Napoleonic_Wars>
        rdfs:label "Napoleonic Wars"@en ;
        foaf:isPrimaryTopicOf 
          <http://en.wikipedia.org/wiki/Napoleonic_Wars> .

When those two graphs are merged, the use of the common URI now makes it trivial to determine that Fred and Freda share the same interest.
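At that point no OWL machinery is needed at all: a plain SPARQL query over the merged graphs will surface the shared interest. A minimal sketch, using the example.org person URIs from the snippets above:

```sparql
PREFIX foaf:   <http://xmlns.com/foaf/0.1/>
PREFIX person: <http://example.org/id/person/>

# Find pairs of people who share a topic of interest
SELECT ?personA ?personB ?interest
WHERE {
  ?personA foaf:topic_interest ?interest .
  ?personB foaf:topic_interest ?interest .
  FILTER (?personA != ?personB)
}
```

Against the Fred and Freda graphs above, the shared value of ?interest is the common DBpedia URI.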

Concept Schemes, SKOS and Document Metadata

The other factor Dan mentions in his message was the emergence of the Simple Knowledge Organisation System (SKOS) RDF vocabulary, which after a long evolution became a W3C Recommendation in 2009.

SKOS is designed to provide an RDF representation of the various flavours of "knowledge organisation systems" and "controlled vocabularies" which information managers have traditionally used to organise information about various resources (books in libraries, objects in museums etc etc etc).

The core class in SKOS is that of the concept (skos:Concept). Each concept can be labelled with one or more names; documented with notes of various types; grouped into collections; related to other concepts through relationships such as "broader"/"narrower"/"related"; or mapped to concepts in other schemes.

The Library of Congress has published several library thesauri/classification schemes/controlled vocabularies as SKOS RDF data, including the Library of Congress Subject Headings, which includes a concept named "Napoleonic Wars, 1800-1815" with the URI http://id.loc.gov/authorities/subjects/sh85089767 (this is a subset of the actual data provided):

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix lcsh: <http://id.loc.gov/authorities/subjects/> .

lcsh:sh85089767
        a skos:Concept ;
        rdfs:label "Napoleonic Wars, 1800-1815"@en ;
        skos:prefLabel "Napoleonic Wars, 1800-1815"@en ;
        skos:altLabel "Napoleonic Wars, 1800-1814"@en ;
        skos:broader lcsh:sh85045703 ;
        skos:narrower lcsh:sh85144863 ;
        skos:inScheme 
          <http://id.loc.gov/authorities/subjects> .

A metadata creator coming from a bibliographic background and providing Dublin Core-based metadata for the Wikipedia page http://en.wikipedia.org/wiki/Napoleonic_Wars might well use this concept URI to provide the "subject" of that page:

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix lcsh: <http://id.loc.gov/authorities/subjects/> .

<http://en.wikipedia.org/wiki/Napoleonic_Wars>
        a foaf:Document ;
        rdfs:label "Napoleonic Wars"@en ;
        dcterms:subject lcsh:sh85089767 .

And indeed the concept URI could (I think) also be used with the foaf:topic or foaf:primaryTopic properties:

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix lcsh: <http://id.loc.gov/authorities/subjects/> .

<http://en.wikipedia.org/wiki/Napoleonic_Wars>
        a foaf:Document ;
        rdfs:label "Napoleonic Wars"@en ;
        foaf:topic lcsh:sh85089767 ;
        foaf:primaryTopic lcsh:sh85089767 .

Note that all three of these properties (dcterms:subject, foaf:topic, and foaf:primaryTopic) are defined in such a way that they are not limited to being used with concepts. We've seen this above where the DBpedia data includes:

@prefix foaf: <http://xmlns.com/foaf/0.1/> .

<http://dbpedia.org/resource/Napoleonic_Wars> 
        foaf:page 
          <http://en.wikipedia.org/wiki/Napoleonic_Wars> .

which (since foaf:page and foaf:topic are inverse properties) implies:

@prefix foaf: <http://xmlns.com/foaf/0.1/> .

<http://en.wikipedia.org/wiki/Napoleonic_Wars> 
        foaf:topic 
          <http://dbpedia.org/resource/Napoleonic_Wars> .

The same is true of the dcterms:subject property. Although traditionally the Dublin Core community has highlighted the use of formal classification schemes like LCSH, and it may well be true that the dcterms:subject property is often used to link things to concepts, it is not limited to taking concepts as values, and one could also link the document to the event using dcterms:subject:

@prefix dcterms: <http://purl.org/dc/terms/> .

<http://en.wikipedia.org/wiki/Napoleonic_Wars> 
        dcterms:subject 
          <http://dbpedia.org/resource/Napoleonic_Wars> .

I should acknowledge here that some in the Dublin Core community might disagree with my last example above, and argue that the values of dcterms:subject should be concepts. I think my position is backed up by the current DCMI documentation, and particularly by the fact that when they assigned ranges to the DCMI Terms properties in 2008, the DCMI Usage Board did not specify a range for the dcterms:subject property i.e. the intention is that dcterms:subject may link to a resource of any type.

I also note in passing that when DCMI created the DCMI Abstract Model in its attempt to reflect the "classical view" of Dublin Core (perhaps best expressed in Tom Baker's "A Grammar of Dublin Core") in an RDF-based model, the notion of a "Vocabulary Encoding Scheme" was defined as a set of things of any type, not specifically as a set of concepts.

Things and their Conceptualisations: foaf:focus

The SKOS approach and the SKOS Concept class introduce a new sort of "indirection" from our "things-in-the-world". As Dan puts it in a message to the W3C public-esw-thes list:

a SKOS "butterflies" concept is a social and technological artifact designed to help interconnect descriptions of butterflies, documents (and data) about butterflies, and people with interest or expertise relating to butterflies. I'm quite consciously avoiding saying what a "butterflies" concept in SKOS "refers to", because theories of reference are hard to choose between. Instead, I prefer to talk about why we bother building SKOS and what we hope can be achieved by it.

So, although both the DBpedia URI http://dbpedia.org/resource/Napoleonic_Wars and the Library of Congress URI http://id.loc.gov/authorities/subjects/sh85089767 may be used in the triple patterns shown above, those two URIs identify two different resources - and both of them are distinct from the Wikipedia page which we cited in the examples back at the very start of this post.

i.e. we now have three separate URIs identifying three separate resources:

  1. a Wikipedia page, a document, created and modified by Wikipedia contributors between 2002 and the present identified by the URI http://en.wikipedia.org/wiki/Napoleonic_Wars
  2. the Napoleonic Wars as event taking place between 1800 and 1815, something with a duration in time, which occurred in physical locations, and in which human beings participated, identified by the DBpedia URI http://dbpedia.org/resource/Napoleonic_Wars
  3. a "conceptualisation of" the Napoleonic Wars, "a social and technological artifact designed to help interconnect", an "abstraction" created by the editors of LCSH for the purposes of classifying works; it has "semantic" relationships to other concepts, and is identified by the Library of Congress URI http://id.loc.gov/authorities/subjects/sh85089767

As we've seen, properties like dcterms:subject, foaf:topic/foaf:page, foaf:primaryTopic/foaf:isPrimaryTopicOf provide the vocabulary to express the relationships between the first and second of these resources, and between the first and third. But what about the relationship between the second and third, between the "thing in the world" and its conceptualisation in a classification scheme? Or to make the issue concrete, what happens if, in their "interest" graphs, Fred cites the DBpedia event URI and Freda cites the LCSH concept URI? How do we establish that their interests are indeed related? Can a publisher of a SKOS concept scheme indicate a relationship between a concept and a "conceptualised thing" (person, place, event etc)?

Dan provides a rather neat diagram illustrating the issue, using the example of Ronald Reagan. The arcs Dan labels "it" represent this "missing" (at the time the diagram was drawn) relationship type/property.

The resolution was to create a new property in the FOAF vocabulary, called foaf:focus. (See this page on the FOAF Project Wiki for some of the discussion of its name.)

The FOAF vocabulary specification says of the property:

The focus property relates a conceptualisation of something to the thing itself. Specifically, it is designed for use with W3C's SKOS vocabulary, to help indicate specific individual things (typically people, places, artifacts) that are mentioned in different SKOS schemes (eg. thesauri).

W3C SKOS is based around collections of linked 'concepts', which indicate topics, subject areas and categories. In SKOS, properties of a skos:Concept are properties of the conceptualization (see 2005 discussion for details); for example administrative and record-keeping metadata. Two schemes might have an entry for the same individual; the foaf:focus property can be used to indicate the thing in the world that they both focus on. Many SKOS concepts don't work this way; broad topical areas and subject categories don't typically correspond to some particular entity. However, in cases when they do, it is useful to link both subject-oriented and thing-oriented information via foaf:focus.

It's worth emphasising the point made in the penultimate sentence: not all concepts "have a focus"; some concepts are "just concepts" (poetry, slavery, conscientious objection, anarchism etc etc etc).
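To make that concrete with the Fred and Frida example: if the publisher of the concept scheme (or indeed any third party) asserts a foaf:focus link from the concept to the "thing in the world", the two "interest" graphs become connected. A minimal sketch of the pattern - the person and concept URIs here are invented placeholders, and only the DBpedia URI is real:

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

# Fred cites the "thing in the world" directly
<http://example.org/fred#me>
        foaf:topic_interest <http://dbpedia.org/resource/Battle_of_Waterloo> .

# Frida cites a concept from a classification scheme
<http://example.org/frida#me>
        foaf:topic_interest <http://example.org/scheme/concept/battle-of-waterloo> .

# the bridge: the concept's focus is the thing itself
<http://example.org/scheme/concept/battle-of-waterloo>
        a skos:Concept ;
        foaf:focus <http://dbpedia.org/resource/Battle_of_Waterloo> .

Given the third triple, an application can infer that Fred's and Frida's interests coincide, even though they cited different URIs.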

Dan summarises how he sees the new property being used in a message to the W3C public-esw-thes list:

The addition of foaf:focus is intended as a modest and pragmatic bridge between SKOS-based descriptions of topics, and other more entity-centric RDF descriptions. When a SKOS Concept stands for a person or agent, FOAF and its extensions are directly applicable; however we expect foaf:focus to also be used with places, events and other identifiable entities that are covered both by SKOS vocabularies as well as by factual datasets like wikipedia/dbpedia and Freebase.

A single "thing in the world" may be "the focus of" multiple concepts: e.g. several different library classification schemes may include concepts for the Napoleonic Wars or Ronald Reagan or Paris. Even within a single scheme, it may be that there are multiple concepts each reflecting different facets or aspects of a single entity.

VIAF

This aspect of the relationship between conceptualisation and "thing in the world" is illustrated in VIAF, the Virtual International Authority File, a service provided by OCLC. VIAF aggregates library "authority records" from multiple library "name authority files" maintained mainly by national libraries. Each record provides a "preferred form" of the name of a person or corporate entity, and multiple "alternate forms" - though that preferred form may vary from one file to the next. VIAF analyses and collates the aggregated data to establish which records refer to the same person or corporate entity, and presents the results as Linked Data.

Jeff Young of OCLC summarises the VIAF model in a post on the Outgoing blog. The post actually describes the transition between an earlier, slightly more complex model and the current model. For the purposes of this discussion, the thing to look at is the "Example (After)" graphic at the top right of the post (direct link).

Consider Dan's example of Ronald Reagan, identified by the VIAF URI http://viaf.org/viaf/76321889. The RDF description provided shows that there are eleven concepts linked to the person resource by a foaf:focus link. Each of those concepts (I think!) corresponds to a record in an authority file harvested by VIAF. Each concept has a preferred label (the preferred form of the name in that authority file) and may have a number of alternate labels. An abridged version of the VIAF description is below:

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> . 

<http://viaf.org/viaf/76321889> a foaf:Person ;
        foaf:name "Reagan, Ronald" ,
                  "Reagan, Ronald W." , 
                  "Reagan, Ronald, 1911-2004" , 
                  "Reagan, Ronald W. 1911-2004" , 
                  "Reagan, Ronald Wilson 1911-2004" , 
                  "Reagan, Elvis 1911-2004" ;
# plus various other names!
        owl:sameAs 
          <http://d-nb.info/gnd/118598724> ,
          <http://dbpedia.org/resource/Ronald_Reagan> , 
          <http://libris.kb.se/resource/auth/237204> , 
          <http://www.idref.fr/027091775/id> .

<http://viaf.org/viaf/sourceID/BNE%7CXX1025345#skos:Concept>
        a skos:Concept ;
        skos:prefLabel "Reagan, Ronald, 1911-2004" ;
        skos:altLabel "Reagan, Elvis 1911-2004" , 
                      "Reagan, Ronald W. 1911-2004", 
                      "Reagan, Ronald Wilson 1911-2004" ;
        skos:inScheme 
          <http://viaf.org/authorityScheme/BNE> ;
        foaf:focus 
          <http://viaf.org/viaf/76321889> .

<http://viaf.org/viaf/sourceID/BNF%7C11921304#skos:Concept>
        a skos:Concept ;
        skos:prefLabel "Reagan, Ronald, 1911-2004" ;
        skos:altLabel "Reagan, Ronald Wilson 1911-2004" ;
        skos:inScheme 
          <http://viaf.org/authorityScheme/BNF> ;
        foaf:focus 
          <http://viaf.org/viaf/76321889> .

# plus nine other concepts

(There really is an alternate label of "Reagan, Elvis 1911-2004" in the actual data!)

Fig1

Note that the owl:sameAs links here are between the VIAF person resource and person resources in external datasets.

LOCAH and index terms

My own first engagement with foaf:focus came during the LOCAH project. In deciding how to represent the content of the Archives Hub EAD documents as RDF, we had to decide how to model the use of "index terms" provided using the EAD <controlaccess> element. Within the Hub EAD documents, those index terms are names of one of the following categories of resource:

  • Concepts
  • Persons
  • Families
  • Organisations
  • Places
  • Genres or Forms
  • Functions

The names are sometimes (but not always) drawn from some sort of "controlled list", which is also named in the data. In other cases, they are constructed using some specified set of rules, again named in the data.

For some of these categories (Concepts, Genres/Forms, Functions), the "thing" named is simply a concept, an abstraction; for others (Persons, Families, Organisations and Places), there is a second "non-abstract" entity "out there in the world". And for this second case, we chose to represent the two distinct things, each with their own distinct URI, linked by a triple using the foaf:focus property.

The LOCAH data model is illustrated in the diagram in this post. The Concept entity type is in the lower centre; directly below are four boxes for the related "conceptualised" types (Person, Family, Organisation and Place), each linked from the Concept by the foaf:focus property.

As in the VIAF case, for a single person/family/organisation/place, there may be multiple distinct "conceptualisations", reflecting the fact that different data providers have referred to the same "thing in the world" by citing entries from different "authority files". The nature of the process by which the LOCAH RDF data is generated - the EAD documents are processed on a "document by document" basis - means that in this case, multiple URIs for the person are generated, and the "reconciliation" of these URIs as co-references to a single entity is performed as a subsequent step.
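So, for a hypothetical index term naming a person, the LOCAH pattern looks something like the following sketch - all the URIs here are invented for illustration rather than being actual Linked Archives Hub URIs:

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

# the concept derived from the <controlaccess> index term
<http://example.org/id/concept/person/ncarules/brunelisambardkingdom>
        a skos:Concept ;
        skos:prefLabel "Brunel, Isambard Kingdom, 1806-1859" ;
        foaf:focus <http://example.org/id/person/ncarules/brunelisambardkingdom> .

# the person "out there in the world"
<http://example.org/id/person/ncarules/brunelisambardkingdom>
        a foaf:Person ;
        foaf:name "Isambard Kingdom Brunel" .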

The British Library Linked Data

The British Library recently announced the release of Linked Open BNB, a new Linked Data dataset covering a subset of the British National Bibliography. The approach taken is described in a post by Richard Wallis of Talis, who worked with the BL as consultants in preparing the dataset.

The data model for the BNB data shows quite extensive use of the Concept-foaf:focus-Thing pattern.

For the "subjects" of Bibliographic Resources, a dcterms:subject link is made to a skos:Concept, reflecting an authority file or classification scheme entry, which is in turn linked using foaf:focus to a Person, Family, Organisation or Place. In other cases, the Bibliographic Resource is linked directly to the "thing-in-the-world", and a corresponding Concept is also provided, linking to the "thing-in-the-world" using foaf:focus. This is the case for languages, for persons as creators of or contributors to the bibliographic resource, and for the Dublin Core "spatial coverage" property.

So for the example http://bnb.data.bl.uk/id/resource/009436036 (again, this is a very stripped-down version of the actual data):

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix wgs84_pos: <http://www.w3.org/2003/01/geo/wgs84_pos#> .

<http://bnb.data.bl.uk/id/resource/009436036>
        a dcterms:BibliographicResource ;
        dcterms:creator 
          <http://bnb.data.bl.uk/id/person/KingGRD%28GeoffreyRD%29> ; 
        dcterms:subject
          <http://bnb.data.bl.uk/id/concept/place/lcsh/Aden%28Yemen%29> ;
        dcterms:spatial
          <http://bnb.data.bl.uk/id/place/Aden%28Yemen%29> ;
        dcterms:language 
          <http://lexvo.org/id/iso639-3/eng> .

<http://bnb.data.bl.uk/id/concept/place/lcsh/Aden%28Yemen%29> 
        a skos:Concept ;
        foaf:focus 
          <http://bnb.data.bl.uk/id/place/Aden%28Yemen%29> .

<http://bnb.data.bl.uk/id/place/Aden%28Yemen%29>        
        a dcterms:Location, wgs84_pos:SpatialThing .

In this case, the subject concept and the spatial-coverage place happen to be linked to each other, but the point I wanted to illustrate is that the object of the dcterms:subject triple is the URI of a concept, which is the subject of an "outgoing" foaf:focus link to a "thing", while the object of the dcterms:spatial triple is the URI of a location/place, which is the object of an "incoming" foaf:focus link from a concept.

The two cases are perhaps best illustrated using the graph representation:

Fig2

Authority Files, Concepts, foaf:focus and Dublin Core

As discussed above, the dcterms:subject property is defined in such a way that it can be used to link to a thing of any type, although there may be a preference amongst some implementers to use dcterms:subject to link only to concepts.

For the other four properties I highlighted in the BL model, DCMI specifies an rdfs:range:

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dcterms: <http://purl.org/dc/terms/> .

dcterms:creator rdfs:range dcterms:Agent .
dcterms:contributor rdfs:range dcterms:Agent .
dcterms:spatial rdfs:range dcterms:Location .
dcterms:language rdfs:range dcterms:LinguisticSystem .

Those three classes (dcterms:Agent, dcterms:LinguisticSystem, dcterms:Location) are described using RDFS, and although there is no formal statement that they are disjoint from the class skos:Concept, I think their human-readable descriptions, taken together with those of the properties, carry a fairly strong suggestion that instances of these classes are the "things in the world" rather than their conceptualisations - certainly for the first three cases at least. A dcterms:Agent is "A resource that acts or has the power to act", which a concept cannot; a dcterms:Location is "A spatial region or named place", which again seems distinct from a concept.

The case of dcterms:LinguisticSystem, "A system of signs, symbols, sounds, gestures, or rules used in communication", seems a bit less clear, as this is itself a "conceptual thing", but I think one can argue that the actual linguistic system practised by a community of speakers is distinct from the "conceptualisation" created within a classification scheme.

And this is, I think, reflected in the patterns used in the BL data model.

As I noted earlier, the Library of Congress has published SKOS representations of a number of controlled vocabularies, and these include:

  • language code lists (e.g. ISO 639-1, ISO 639-2 and the MARC languages list)
  • the MARC countries list
  • the MARC geographic areas list

In each case, the entries/members of the vocabularies are modelled as instances of skos:Concept. Following the argument I constructed above, then, to use these vocabularies with the dcterms:spatial and dcterms:language properties, strictly speaking one should adopt the patterns used in the BL model, where the concept URI is not the direct object, but is linked to a thing (location, linguistic system) URI by a foaf:focus link.
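For example, a publisher wanting to use the Library of Congress countries vocabulary with dcterms:spatial would, on this argument, mint (or reuse) a URI for the place itself and bridge from the concept with foaf:focus. A sketch of that pattern - the document and place URIs below are invented, and only the id.loc.gov URI is real:

@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

<http://example.org/doc/5678>
        a dcterms:BibliographicResource ;
        dcterms:spatial <http://example.org/id/place/Spain> .

<http://example.org/id/place/Spain>
        a dcterms:Location .

# the bridge from the concept to the place
<http://id.loc.gov/vocabulary/countries/sp>
        a skos:Concept ;
        foaf:focus <http://example.org/id/place/Spain> .

This keeps the object of dcterms:spatial within the property's stated range (dcterms:Location), at the cost of an extra resource and an extra triple.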

Finally, I'd draw attention to lexvo.org, which also provides Linked Data representations for ISO639-3 languages and ISO 3166-1 / UN M.49 geographical regions. In contrast to the Library of Congress SKOS representations, lexvo.org models "the things themselves", the languages and geographical regions, e.g. (again, a subset of the actual data)

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix lvont: <http://lexvo.org/ontology#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .

<http://lexvo.org/id/iso639-3/spa>
        a lvont:Language ;
        rdfs:label "Spanish"@en ;
        lvont:usedIn 
          <http://lexvo.org/id/iso3166/ES> ;
        owl:sameAs 
          <http://dbpedia.org/resource/Spanish_language> .
        
<http://lexvo.org/id/iso3166/ES>
        a lvont:GeographicRegion ;
        rdfs:label "Spain"@en ;
        lvont:memberOf 
          <http://lexvo.org/id/un_m49/039> ;
        owl:sameAs 
          <http://sws.geonames.org/2510769> .
        
<http://lexvo.org/id/un_m49/039>
        a lvont:GeographicRegion ;
        rdfs:label "Southern Europe"@en ;
        lvont:hasMember 
          <http://lexvo.org/id/iso3166/ES> ;
        lvont:memberOf 
          <http://lexvo.org/id/un_m49/150> .
 
<http://lexvo.org/id/un_m49/150>
        a lvont:GeographicRegion ;
        rdfs:label "Europe"@en ;
        lvont:hasMember 
          <http://lexvo.org/id/un_m49/039> ;
        lvont:memberOf 
          <http://lexvo.org/id/un_m49/001> .

In the BL data one finds lexvo.org language URIs used as objects of dcterms:language (as we do in the LOCAH data too).

The lexvo.org language URIs are the subjects of properties such as lvont:usedIn, linking a language to a place where it is used or spoken, and of owl:sameAs triples linking to language URIs in other datasets. And the geographic region URIs are the subjects of properties such as lvont:memberOf, linking one region to another region of which it is part, and of owl:sameAs triples linking to place URIs in other datasets.

Compare this with the relationships between the SKOS-based "conceptualisations of languages" and "conceptualisations of geographic regions" in the Library of Congress dataset (again, a subset of the actual data):

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .

<http://id.loc.gov/vocabulary/iso639-2/spa>
        a skos:Concept ;
        skos:prefLabel "Spanish | Castilian"@en ;
        skos:altLabel "Spanish"@en ,
                      "Castilian"@en ;
        skos:note "Bibliographic Code"@en ;
        skos:exactMatch 
          <http://id.loc.gov/vocabulary/languages/spa> ,
          <http://id.loc.gov/vocabulary/iso639-1/es> ;
        skos:inScheme 
          <http://id.loc.gov/vocabulary/iso639-2> .
                        
<http://id.loc.gov/vocabulary/countries/sp>
        a skos:Concept ;
        skos:prefLabel "Spain"@en ;
        skos:altLabel "Balearic Islands"@en ,
                      "Canary Islands"@en ;
        skos:exactMatch 
          <http://id.loc.gov/vocabulary/geographicAreas/e-sp> ;
        skos:broadMatch 
          <http://id.loc.gov/vocabulary/geographicAreas/e> ;
        skos:inScheme 
          <http://id.loc.gov/vocabulary/countries> .
        
<http://id.loc.gov/vocabulary/geographicAreas/e-sp>
        a skos:Concept ;        
        skos:prefLabel "Spain"@en ;
        skos:exactMatch 
          <http://id.loc.gov/vocabulary/countries/sp> ;
        skos:broader 
          <http://id.loc.gov/vocabulary/geographicAreas/e> ;
        skos:inScheme 
          <http://id.loc.gov/vocabulary/geographicAreas> .

<http://id.loc.gov/vocabulary/geographicAreas/e>
        a skos:Concept ;        
        skos:prefLabel "Europe"@en ;
        skos:narrower 
          <http://id.loc.gov/vocabulary/geographicAreas/e-sp> ;
        skos:narrowMatch 
          <http://id.loc.gov/vocabulary/countries/sp> ;
        skos:inScheme 
          <http://id.loc.gov/vocabulary/geographicAreas> .
        

Here the types of relationships involved are the SKOS "semantic relations" and "mapping properties" between concepts: e.g. Spain-as-concept has-broader-concept Europe-as-concept, and so on.

Conclusions

If anyone has read this far, I can imagine it is not without some rolling of eyes at the pedantic distinctions I seem to be unpicking!

Given my understanding of SKOS, FOAF and Dublin Core, I do think the designers of the BL data have done a sterling job in trying to "get it right", carefully observing the way terms have been described by their owners and seeking to use those terms in ways that are consistent with those descriptions.

At the same time, I admit I can well imagine that to many Dublin Core implementers who see the Dublin Core properties as providing a relatively simple approach, this will seem over-complicated.

And I rather expect that we will see uses of the dcterms:language and dcterms:spatial properties which simply link directly to the concept:

@prefix dcterms: <http://purl.org/dc/terms/> .

<http://example.org/doc/1234>
        a dcterms:BibliographicResource ;
        dcterms:spatial 
          <http://id.loc.gov/vocabulary/countries/sp> ;
        dcterms:language 
          <http://id.loc.gov/vocabulary/iso639-2/spa> .

I think this is particularly likely for dcterms:spatial, since it is perceived as "similar" to dcterms:subject - which, amongst other things, covers "aboutness" in the case where the topic is a place - and since, as in the BL example above, the concept URI may be used with dcterms:subject in the very same graph.

Returning to the recent post by Dan which in part prompted me to start this post, he makes a similar point, focusing on the FOAF properties with which I introduced the post. He recognises that "in some contexts having careful names for all these inter-relations is very important" but suggests

We should consider making foaf:interest vaguer, such that any of those three are acceptable. Publishers aren't going to go looking up obscure distinctions between foaf:interest and foaf:topic_interest and that other pattern using SKOS ... they just want to mark that this is a relationship between someone and a URI characterising one of their interests.

So I suggest we become more tolerant about the URIs acceptable as values of foaf:interest

(See the message in full for Dan's examples.)

Part of me is tempted to suggest that similar reasoning might be applied to the Dublin Core Terms properties, i.e. that consideration might be given to "relaxing" the range of dcterms:spatial to allow the use of either the place or the concept, resolving the dilemma that, with the current design, the patterns differ for dcterms:subject and dcterms:spatial. But where do we stop? Do we also relax dcterms:language? I really don't know. And I think I'd be quite uneasy making the suggestion for the "agent" properties.

The fundamental issue here is that the thesaurus-based approach introduces a layer of abstraction - Dan's "social and technological artifact[s] designed to help interconnect descriptions" - and that is reflected in SKOS's notion of the skos:Concept. Outside the "knowledge organisation" community, however, many RDFS/OWL modellers "model the world directly": they consider the language as system used by a community of speakers (as modelled by lexvo.org), the place as thing in space (as modelled by Geonames), and so on. Constructs such as foaf:focus help us bridge the two approaches - but not without some complexity.

On Friday evening, I noticed a (slightly provocative!) comment on Twitter from John Goodwin (@gothwin) which I think is related:

coming to the conclusion that #SKOS is being waaay over used #linkeddata

It attracted some sympathetic responses, e.g. from Rob Styles (@mmmmmmrob) 1, 2:

@gothwin @juansequeda if you're not publishing an existing subject heading scheme on the cheap then SKOS is the wrong tool.

@johnlsheridan @juansequeda @gothwin Paris is not "narrower than" France, it's "capital of". That's the problem.

which I think echoes my examples above of Spain and Europe.

And from Leigh Dodds (@ldodds): 1, 2:

@gothwin that's all it was designed for: converting one specific class of datasets into RDF. It's just been mistaken for a modelling tool

@juansequeda @gothwin @kendall only use #skos if you're CONVERTING a taxonomy. Otherwise, just model the domain

These comments struck home with me, and I think I may have made exactly this sort of mistake in some other work I've been doing recently, and I need to revisit it, to check whether what I've done is really necessary or useful. As John suggests, I think sometimes I reach for SKOS as a tool for something that has a "controlled list" feel to it without thinking hard enough whether it is really the appropriate tool for the task at hand.

Having said that, I do also think we will need to manage the "mixed" cases - particularly in datasets coming from communities such as libraries where the use of thesauri is commonplace, and a growing number are available as SKOS - where we end up needing the sort of bridge which foaf:focus provides, so some of this complexity may be unavoidable.

In any case, I think it's an area where some guidance/best practice notes - with lots of examples! - would be very helpful.

August 15, 2011

Two ends and one start

The end of July formally marked the end of two projects I've been contributing to recently:

  • the LOCAH project
  • the SALDA project

There are still some things to finish for LOCAH, particularly the publication of the Copac data.

A few (not terribly original or profound) thoughts, prompted mainly by my experience of working with the Archives Hub data:

  • Sometimes data is "clean" and consistent and "normalised", but more often than not it has content in inconsistent forms or is incomplete: it is "messy". Data aggregated from multiple sources over a long period of time is probably going to be messier. (With an XML format like EAD, which has what I think of as a somewhat "hybrid" document/data character, the potential for messiness increases.)
  • Doing things with data requires some sort of processing by software, and while there are off-the-shelf apps and various configurable tools and frameworks that can provide some functions, some element of software development is usually required.
  • It may be relatively easy to identify in advance the major tasks where developer effort is required, and to plan for that, but sometimes there are additional tasks which are more difficult to anticipate; rather, they emerge as you attempt some of the other processes (and I think messy data probably makes that more likely).
  • Even with developers on board, development work has to be specified, and that is a task in itself, and one that can be quite complex and time-consuming (all the more so if you find yourself trying to think through and describe what to do with a range of different inputs from a messy data source).

It's worth emphasising that most of the above is not specific to generating Linked Data: it applies to data in all sorts of formats, and it applies to all sorts of tasks, whether you're building a plain old Web site or exposing RSS feeds or creating some other Web application to do something with the data.

Sort of related to all of the above, I was reminded that my limited programming skills often leave me in the position where I can identify what needs doing but I'm not able to "make it happen", and that is something I'd like to try to change. I can "get by" in XSLT, and I can do a bit of PHP, but I'd like to get to the point where I can do more of the simpler data manipulation tasks and/or pull together simple presentation prototypes.

I've enjoyed working on both the projects. I'm pleased that we've managed to deliver some Linked Data datasets, and I'm particularly pleased to have been able to contribute to doing this for archival description data, as it's a field in which I worked quite a long time ago.

Both projects gave me opportunities to learn by actually doing stuff, rather than by reading about it or prodding at other people's data. And perhaps more than anything, I've enjoyed the experience of working together as a group. We had a lot of discussion and exchange of ideas, and I'd like to think that this process was valuable in itself. In an early post on the SALDA blog, Karen Watson noted:

it is perhaps not the end result that is most important but the journey to get there. What we hope to achieve with SALDA is skills and knowledge to make our catalogues Linked Data and use those skills and that knowledge to inform decisions about whether it would be beneficial to make all our data Linked Data.

Of course it wasn't all plain sailing, and particularly near the start there were probably times when we ran up against differences of perception and terminology. Jane Stevenson has written about some of these issues from her point of view on the LOCAH blog (e.g. here, here and here). As the projects progressed, I think we moved closer towards more of a shared understanding - and I think that is a valuable "output" in its own right, even if it is one which it may be rather hard to "measure".

So, a big thank you to everyone I worked with on both projects.

Looking forward, I'm very pleased to be able to say that Jane prepared a proposal for some further work with the Archives Hub data, under JISC's Capital Funded Service Enhancements initiative, that the bid has been successful, and that I'll be contributing some work to the project as a consultant. The project is called "Linking Lives" and is focused on providing interfaces for researchers to explore the data (as well as making any enhancements/extensions to the data and the "back-end" processes required to enable that). More on that work to come once we get moving.

Finally, as I'm sure many of you are aware, JISC recently issued some new calls for projects, including call 13/11 for projects "to develop services that aim to make content from museums, libraries and archives more visible and to develop innovative ways to reuse those resources". If any institutions out there are considering proposals and think that I could make a useful contribution - despite my lamenting my limitations above, I hope my posts on the LOCAH and SALDA blogs give an idea of the sorts of areas I can contribute to! - please do get in touch.

May 11, 2011

LOCAH releases Linked Archives Hub dataset

The LOCAH project, one of the two JISC-funded projects to which I've been contributing, this week announced the availability of an initial batch of data derived from a small subset of the Archives Hub EAD data as linked data. The homepage for what we have called the "Linked Archives Hub" dataset is http://data.archiveshub.ac.uk/

The project manager, Adrian Stevenson of UKOLN, provides an overview of the dataset, and yesterday I published a post which provides a few example SPARQL queries.

I'm very pleased we've got this data "out there": it feels as if we've been at the point of it being "nearly ready" for a few weeks now, but a combination of ironing out various technical wrinkles (I really must remember to look at pages in Internet Explorer now and again) and short working weeks/public holidays held things up a little. It is very much a "first draft": we have additional work planned on making more links with other resources, and there are various things which could be improved (and it seems to be one of those universal laws that as soon as you open something up, you spot more glitches...). But I hope it's enough to demonstrate the approach we are taking to the data, and to provide a small testbed that people can poke at and experiment with.

(If you have any specific comments on content of the LOCAH dataset, it's probably better to post them over on the LOCAH blog where other members of the project team can see and respond to them).

March 25, 2011

RDTF metadata guidelines - next steps

A few weeks ago I blogged about the work that Pete and I have been doing on metadata guidelines as part of the JISC/RLUK Resource Discovery Task Force, RDTF metadata guidelines - Limp Data or Linked Data?.

In discussion with the JISC we have agreed to complete our current work in this area by:

  • delivering a single summary document of the consultation process around the current draft guidelines, incorporating the original document and all the comments made using the JISCpress site during the comment period; and
  • suggesting some recommendations about any resulting changes that we would like to see made to the draft guidelines.

For the former, a summary view of the consultation is now available. It's not 100% perfect (because the links between the comments and individual paragraphs are not retained in the summary) but I think it is good enough to offer a useful overview of the draft and the comments in a single piece of text. Furthermore, the production of this summary was automated (by post-processing the export 'dump' from WordPress), so the good news is that a similar view can be obtained for any future (or indeed past) JISCpress consultations.

For the latter, this blog post forms our recommendations.

As noted previously, there were 196 comments during the comment period (which is not bad!), many of which were quite detailed in terms of particular data models, formats and so on. On the basis that we do not know quite what form any guidelines might take from here on (that is now the responsibility of the RDTF team at MIMAS I think), it doesn't seem sensible to dig into the details too much. Rather, we will make some comments on the overall shape of the document and suggest some areas where we think it might be useful for JISC and RLUK to undertake additional work.

You may recall that our original draft proposed three approaches to exposing metadata, which we referred to as the community formats approach, the RDF data approach and the Linked Data approach. In light of comments (particularly those from Owen Stephens and Paul Walk) we have been putting some thought into how the shape of the whole document might be better conceptualised. The result is the following four-quadrant model:

[Figure: the four-quadrant model]
Like any simple conceptualisation, there is some fuzziness in this but we think it's a useful way of thinking about the space.

Traditionally (in the world of libraries, museums and archives at least), most sharing of metadata has happened in the bottom-left quadrant - exchanging bulk files of MARC records, for example. And, to an extent, this approach continues now, even outside those sectors. Look at the large amount of 'open data' shared as CSV files on sites like data.gov.uk, for example. Broadly speaking, this is what we referred to as the community formats approach (though I think our inclusion of the OAI-PMH in that area probably muddied the waters a little - see below).

One can argue that moving left to right across the quadrants offers semantically richer metadata in a 'small pieces, loosely joined' kind of way (though this quickly becomes a religious argument with no obvious point of closure! :-) ) and that moving bottom to top offers the ability to work with individual item descriptions rather than whole collections of them - and that, in particular, it allows for the assignment of 'http' URIs to those descriptions and the dereferencing of those URIs to serve them.

Our three approaches covered the bottom-left, bottom-right and top-right quadrants. The web, at least in the sense of serving HTML pages about things of interest in libraries, museums and archives, sits in the top-left quadrant (though any trend towards embedded RDFa in HTML pages moves us firmly towards the top-right).

Interestingly, especially in light of the RDTF mantra to "build better websites", our guidelines managed to miss that quadrant. In their comments, Owen and Paul argued that moving from bottom to top is more important than moving left to right - and, on balance, we tend to agree.

So, what does this mean in terms of our recommendations?

We think that the guidelines need to cover all four quadrants and that, in particular, much greater emphasis needs to be placed on the top-left quadrant. Any guidance needs to be explicit that the 'http' URIs assigned to descriptions served on the web are not URIs for the things being described; that, typically, multiple descriptions may be served for the things being described (an HTML page and an XML document for example, each of which will have separate URIs) and that mechanisms such as '<link rel="alternate" ...>' can be used to tie them together; and that Google sitemaps (on the left) and semantic sitemaps (on the right) can be used to guide robots to the descriptions (either individually or in collections).

Which leaves the issue of the OAI-PMH. In a sense, this sits alongside the top-left quadrant - which is why, I think, it didn't fit particularly well with our previous three approaches. If you think about a typical repository, for example, it is making descriptions of the content it holds available as HTML 'splash' pages (sometimes with embedded links to descriptions in other formats). In that sense it is functioning in top-left, "page per thing", mode. What the OAI-PMH does is give you a protocol mechanism for getting at those descriptions in a way that is useful for harvesting.

Several people noted that Atom and RSS might be used as an alternative to both sitemaps and the OAI-PMH, and we agree - though it may be that some additional work is needed to specify the exact mechanisms for doing so.

There were some comments on our suggestion to follow the UK government guidelines on assigning URIs. On reflection, we think it would make more sense to recommend only the W3C guidelines on Cool URIs for the Semantic Web, particularly on the separation of things from the descriptions of things, and to suggest that it may be sensible to fund (or find) further work in this area making specific recommendations around persistent URIs (for both things and their descriptions).

Finally, there were a lot of comments on the draft guidelines about our suggested models and formats - notably on FRBR, with many commenters suggesting that this was premature given significant discussion around FRBR elsewhere. We think it would make sense to separate out any guidance on conceptual models and associated vocabularies, probably (again) as a separate piece of work.

To summarise then, we suggest:

  • that the four-quadrant model above is used to frame the guidelines - we think all four quadrants are useful, and that there should probably be some guidance on each area;
  • that specific guidance be developed for serving an HTML page description per 'thing' of interest (possibly with associated, and linked, alternative formats such as XML);
  • that guidance be developed (or found) about how to sensibly assign persistent 'http' URIs to everything of interest (including both things and descriptions of things);
  • that the definition of 'open' needs more work (particularly in the context of whether commercial use is allowed) but that this needs to be sensitive to not stirring up IPR-worries in those domains where they are less of a concern currently;
  • that mechanisms for making statements of provenance, licensing and versioning be developed where RDF triples are being made available (possibly in collaboration with Europeana work); and
  • that a fuller list of relevant models that might be adopted, the relationships between them, and any vocabularies commonly associated with them be maintained separately from the guidelines themselves (I'm trying desperately not to use the 'registry' word here!).

March 10, 2011

Term-based thesauri and SKOS (Part 4): Change over time (ii)

This is the fourth in a series of posts (previously: part 1, part 2, part 3) on making a thesaurus available as linked data using the SKOS and SKOS-XL RDF vocabularies. In the previous post, I examined some of the ways the thesaurus can change over time, and problems that arose with my proposed mapping to RDF. Here I'll outline one potential solution to those problems.

The last three cases I described in the previous post, where an existing preferred term loses that status and is "relegated" to a non-preferred term, all present a problem for my suggested simple mapping, because the URI for a concept disappears from the generated RDF graph - and this creates a conflict with the principles of URI stability and reliability I advocated at the start of that post.

My first thoughts on a solution circled around generating concept URIs, not just for the preferred term, but also for all the non-preferred terms, and using owl:sameAs (or skos:exactMatch?) to indicate that the concept URIs derived from the terms associated with a single preferred term were synonyms, i.e. each of them identified the same concept. That way the change from preferred term to non-preferred term would not result in the loss of a concept URI. But the proliferation of URIs here feels fundamentally flawed - the problem is not one that is solved by having multiple URIs for a single concept; the issue is the persistence of a single URI. Introducing the multiple URIs also seems like a recipe for a lot of practical difficulties in managing the impact of changes on external applications, particularly if URIs which were once synonyms cease to be so.

After some searching, I found a couple of useful pages on the W3C wiki: some notes on versioning (which as far as I know didn't make it into the final SKOS specifications) and particularly this page on "Concept Evolution" in SKOS. The latter is rather more a collection of pointers than the concrete set of examples and guidelines I was hoping for, but one of those pointers is to a thread on the W3C public-esw-thes mailing list, starting with this message from Rob Tice, which I think describes (in his point 2) exactly the situation I'm dealing with in the problem cases in the previous post:

How should we identify and manage change between revisions of concept schemes as this 'seems' to result in imprecision. e.g. a concept 'a' is currently in thes 'A' and only has a preferred label. A new revision of thes 'A' is published and what was concept 'a' is now a non preferred concept and thus becomes simply a non preferred label for a new concept 'b'.

It seems to me that this operation loses some of the semantic meaning of the change as all references to the concept id of 'concept a' would be lost as it now is only a non preferred label of a different concept with a different id (concept 'b').

The suggested approach emerging from that discussion has two elements:

  1. A notion that a concept can be marked as "deprecated" (using e.g. a "status" property with a value of "deprecated" or a "deprecated" property with Boolean (yes/no) values) or as being "valid" or "applicable" only for a specified bounded period of time (see the messages from Johan De Smedt and from Margarita Sini)
  2. Such a "deprecated" concept can be the subject of a "replaced by" relationship linking it to the "preferred term" concept (see the message from Michael Panzer)

The application of these two elements in combination is illustrated in this example by Joachim Neubert (again, I think, addressing the same scenario).

I wasn't aware of the owl:deprecated property before, but as far as I can tell, it would be appropriate for this case.

Joachim's message highlights the question of what to do about skos:prefLabel/skosxl:prefLabel or skos:altLabel/skosxl:altLabel properties for the deprecated concept. In the term-based thesaurus, the term has become a non-preferred term for another term: in the SKOS model, the term is now the alternate label for a different concept, and the preferred label for no concept. So on that basis, I'm inclined to follow Joachim's suggestion that the deprecated concept should be the subject of neither skos:prefLabel/skosxl:prefLabel nor skos:altLabel/skosxl:altLabel properties, though it could, as Joachim's example shows, retain an rdfs:label property. And similarly it is no longer the subject or object of semantic relationships.

I did wonder about the option of introducing a set of properties, parallel to the SKOS ones, to indicate those former relationships, e.g. ex:hadPrefLabel, ex:hadAltLabel, ex:hadRelated, ex:hadBroader, ex:hadNarrower, essentially as documentation. But I'm really not sure how useful this is: the semantic relationships in which those other target concepts are involved may themselves change. And I suppose in principle (though it seems unlikely in practice) a single concept may itself go through several status changes (e.g. from active to deprecated to active to deprecated) and accrue various different "former" relationships in the course of that. If this level of information is required, then I think it probably has to be provided using some other approach - like the use of a set of date-stamped graphs/documents that reflect the state of a concept at a point in time.

So applying Joachim's approach to Case 8 from the examples in the previous post, where the current preferred term "Political violence" is to become a non-preferred term for "Collective violence", we end up with the concept con:C2 as a "deprecated" concept with a Dublin Core dcterms:isReplacedBy relationship to concept con:C6 (and the inverse from con:C6 to con:C2):

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix con: <http://example.org/id/concept/polthes/> .
@prefix term: <http://example.org/id/term/polthes/> .

con:C2 a skos:Concept;
       rdfs:label "Political violence"@en;
       owl:deprecated "true"^^xsd:boolean;
       dcterms:isReplacedBy con:C6 .
       
term:T2 a skosxl:Label;
        skosxl:literalForm "Political violence"@en.

con:C6 a skos:Concept;
       skos:prefLabel "Collective violence"@en;
       skosxl:prefLabel term:T6;
       skos:altLabel "Political violence"@en;
       skosxl:altLabel term:T2; 
       dcterms:replaces con:C2 .       

term:T6 a skosxl:Label;
        skosxl:literalForm "Collective violence"@en.

Using this approach then, the full output graph for Case 8 would be as follows (the highlighting indicates the difference between this graph and that for Case 8 in the previous post):

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix con: <http://example.org/id/concept/polthes/> .
@prefix term: <http://example.org/id/term/polthes/> .

term:T1 a skosxl:Label;
        skosxl:literalForm "Civil violence"@en.
        
con:C6 a skos:Concept;
       skos:prefLabel "Collective violence"@en;
       skosxl:prefLabel term:T6;
       skos:altLabel "Political violence"@en;
       skosxl:altLabel term:T2;
       skos:altLabel "Civil violence"@en;
       skosxl:altLabel term:T1;
       skos:altLabel "Violent protest"@en;
       skosxl:altLabel term:T5;
       skos:broader con:C4;
       skos:narrower con:C3; 
       dcterms:replaces con:C2 .       

term:T6 a skosxl:Label;
        skosxl:literalForm "Collective violence"@en.

con:C2 a skos:Concept;
       rdfs:label "Political violence"@en;
       owl:deprecated "true"^^xsd:boolean;
       dcterms:isReplacedBy con:C6 .
       
term:T2 a skosxl:Label;
        skosxl:literalForm "Political violence"@en.

con:C3 a skos:Concept;
       skos:prefLabel "Terrorism"@en;
       skosxl:prefLabel term:T3;
       skos:broader con:C6 .

term:T3 a skosxl:Label;
        skosxl:literalForm "Terrorism"@en.
       
con:C4 a skos:Concept;
       skos:prefLabel "Violence"@en;
       skosxl:prefLabel term:T4;
       skos:narrower con:C6 .

term:T4 a skosxl:Label;
        skosxl:literalForm "Violence"@en.

term:T5 a skosxl:Label;
        skosxl:literalForm "Violent protest"@en.

Now our graph retains the URI con:C2 and provides a description of that resource as a "deprecated concept".

And for Case 9 (again the highlighting indicates the difference from the initial graph for Case 9):

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix con: <http://example.org/id/concept/polthes/> .
@prefix term: <http://example.org/id/term/polthes/> .

term:T1 a skosxl:Label;
        skosxl:literalForm "Civil violence"@en.

con:C1 a skos:Concept;
       skos:prefLabel "Civil violence"@en;
       skosxl:prefLabel term:T1;
       skos:altLabel "Political violence"@en;
       skosxl:altLabel term:T2;
       skos:altLabel "Violent protest"@en;
       skosxl:altLabel term:T5;
       skos:broader con:C4;
       skos:narrower con:C3; 
       dcterms:replaces con:C2 .       
        
con:C6 a skos:Concept;
       skos:prefLabel "Collective violence"@en;
       skosxl:prefLabel term:T6.

term:T6 a skosxl:Label;
        skosxl:literalForm "Collective violence"@en.

con:C2 a skos:Concept;
       rdfs:label "Political violence"@en;
       owl:deprecated "true"^^xsd:boolean;
       dcterms:isReplacedBy con:C1 .

term:T2 a skosxl:Label;
        skosxl:literalForm "Political violence"@en.

con:C3 a skos:Concept;
       skos:prefLabel "Terrorism"@en;
       skosxl:prefLabel term:T3;
       skos:broader con:C1 .

term:T3 a skosxl:Label;
        skosxl:literalForm "Terrorism"@en.
       
con:C4 a skos:Concept;
       skos:prefLabel "Violence"@en;
       skosxl:prefLabel term:T4;
       skos:narrower con:C1 .

term:T4 a skosxl:Label;
        skosxl:literalForm "Violence"@en.

term:T5 a skosxl:Label;
        skosxl:literalForm "Violent protest"@en.

In the (unlikely?) event that the (previously preferred) non-preferred term is once again restored to the state of preferred term, then the concept con:C2 loses its deprecated status and the dcterms:isReplacedBy relationship, and acquires skos:prefLabel/skos:altLabel properties as normal.

Generating these graphs does, however, imply a change to the process of generating the RDF representation. As I noted at the start of the previous post, my first cut at this was based on being able to process a snapshot of the thesaurus "stand-alone", without knowledge of previous versions. But the capacity to detect deprecated concepts depends on knowledge of the current state of the thesaurus, i.e. when the transformation process encounters a non-preferred term x, it needs to behave differently depending on whether:

  1. concept con:Cx exists in the current thesaurus dataset (as either an "active" or "deprecated" concept), in which case a "deprecated concept" con:Cx should be output, as well as term:Tx (as alternate label for some other concept, con:Cy); or
  2. concept con:Cx does not exist in the current thesaurus dataset, in which case only term:Tx (as alternate label for a concept con:Cy) is required

I think that test has to be made against the current RDF thesaurus dataset rather than against the previous XML snapshot, as the "deprecation" may have taken place several snapshots ago.
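A minimal sketch of that branching logic, assuming the current dataset can be queried for the set of concept URIs it contains (the helper name and the plain-tuple representation of triples are mine, not part of the actual transform):

```python
# Sketch of the branch described above: when the transform meets a
# non-preferred term x (now an alternate label for concept y), it checks
# whether con:Cx already exists in the current published dataset.

CON = "http://example.org/id/concept/polthes/C"

def output_for_non_preferred(x, y, current_concept_uris):
    """x, y are term numbers; current_concept_uris is the set of concept
    URIs (active or deprecated) found in the current RDF dataset."""
    con_x, con_y = CON + str(x), CON + str(y)
    triples = []  # term:Tx as alternate label is emitted in both cases
    if con_x in current_concept_uris:
        # case 1: con:Cx was once a concept, so keep it as "deprecated"
        triples += [
            (con_x, "owl:deprecated", "true"),
            (con_x, "dcterms:isReplacedBy", con_y),
            (con_y, "dcterms:replaces", con_x),
        ]
    # case 2: con:Cx never existed; only the label resource is needed
    return triples

current = {CON + "2", CON + "6"}                # dataset knows C2 and C6
dep = output_for_non_preferred(2, 6, current)   # "Political violence"
none = output_for_non_preferred(5, 6, current)  # "Violent protest"
```

Here "Political violence" (term 2) yields a deprecated con:C2, while "Violent protest" (term 5), which never had a concept URI, yields nothing beyond its label.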

I have to admit this does make the transformation process rather more complicated than I had hoped. The only alternative would be if it were somehow possible to distinguish the "deprecation" case from the "static" non-preferred term case from the input data alone, but as far as I know this isn't possible.

Summary

The previous post highlighted that for one particular category of change, where an existing preferred term is "relegated" to the status of a non-preferred term, the results of the suggested simple mapping into SKOS had problematic consequences.

Based on some investigation of how others approach similar scenarios (and here I should note I'm very grateful to the contributors to the wiki page on concept evolution and to those discussions linked from it, as I was struggling to see clearly how to deal with these scenarios), I've sketched above an approach to representing a concept which has been "deprecated", or is no longer applicable, and is replaced by another concept. I'm sure it isn't the only way of addressing the problem, but it seems a reasonable one to try.

Implementing this approach creates new challenges for the transformation process, and I need to work on that to test it, but I think it is achievable. I would also be very grateful for any comments, particularly if there are gaping holes in this which I haven't spotted!

Term-based thesauri and SKOS (Part 3): Change over time (i)

This is the third in a series of posts (previously: part 1, part 2) on making a thesaurus available as linked data using the SKOS and SKOS-XL RDF vocabularies. In this post, I'll examine some of the ways the thesaurus can change over time, and how such changes are reflected when applying the mapping I described earlier.

A note on "workflow"

In the case I'm working on, the term-based thesaurus is managed in a purpose-built application, from which a snapshot is exported (as an XML document) at regular intervals. These XML documents are the inputs to a transformation process which generates an SKOS/SKOS-XL RDF version, to be exposed as linked data.

Currently at least, each "run" of that transformation operates on a single snapshot of the thesaurus "stand-alone", i.e. the transform process has no "knowledge" of the previous snapshot, and the expectation is that the output generated from processing will replace the output of the previous run (either in full, or through a process of establishing the differences and then removing some triples and adding others). This "stand-alone" approach may be something I have to revisit.
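That replace-by-differences step amounts to simple set arithmetic over triples; a minimal sketch, representing triples as plain tuples rather than using a real RDF store:

```python
# Sketch: compute the triples to remove and add when a new snapshot's
# output replaces the previous run's output.

def diff(previous, current):
    """Both arguments are iterables of (subject, predicate, object) tuples."""
    prev, curr = set(previous), set(current)
    return prev - curr, curr - prev   # (to_remove, to_add)

old = {("con:C6", "a", "skos:Concept")}
new = {("con:C6", "a", "skos:Concept"),
       ("con:C6", "skos:broader", "con:C4")}
to_remove, to_add = diff(old, new)
```

In this toy example nothing is removed and one new skos:broader triple is added.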

The mapping

To summarise the transformation described in the previous post, a single preferred term and its set of zero or more non-preferred terms are treated as labels for a single concept. For each such set:

  • a single SKOS concept is created with a URI based on the term number of the preferred term
  • the concept is related to the literal form of the preferred term by an skos:prefLabel property
  • an SKOS-XL label is created with a URI based on the term number of the preferred term
  • the label is related to the literal form of the preferred term by an skosxl:literalForm property
  • the concept is related to the label by an skosxl:prefLabel property
  • the "hierarchical" (broader term, narrower term) and "associative" (related term) relationships between preferred terms are represented as "semantic" relationships between concepts
  • And for each non-preferred term in the set
    • the concept is related to the literal form of the non-preferred term by an skos:altLabel property
    • an SKOS-XL label is created with a URI based on the term number of the non-preferred term
    • the label is related to the literal form of the non-preferred term by an skosxl:literalForm property
    • the concept is related to the label by an skosxl:altLabel property
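As a rough illustration (not the actual transform; the input structure and the prefixed-name shorthand for properties are assumed here for readability), the steps above might be sketched in Python as:

```python
# Sketch of the term-set -> SKOS/SKOS-XL mapping described above.
# URI patterns follow the examples; triples are plain tuples.

CON = "http://example.org/id/concept/polthes/C"
TERM = "http://example.org/id/term/polthes/T"

def map_term_set(pref, non_prefs, broader=(), narrower=(), related=()):
    """pref and the items of non_prefs are (term_number, literal) pairs;
    the relation arguments hold term numbers of other preferred terms."""
    n, literal = pref
    con, term = CON + str(n), TERM + str(n)
    triples = [
        (con, "rdf:type", "skos:Concept"),
        (con, "skos:prefLabel", literal),
        (term, "rdf:type", "skosxl:Label"),
        (term, "skosxl:literalForm", literal),
        (con, "skosxl:prefLabel", term),
    ]
    # "semantic" relationships between concepts
    for prop, targets in (("skos:broader", broader),
                          ("skos:narrower", narrower),
                          ("skos:related", related)):
        triples += [(con, prop, CON + str(t)) for t in targets]
    # one label per non-preferred term, attached as alternate label
    for n2, lit2 in non_prefs:
        t2 = TERM + str(n2)
        triples += [
            (con, "skos:altLabel", lit2),
            (t2, "rdf:type", "skosxl:Label"),
            (t2, "skosxl:literalForm", lit2),
            (con, "skosxl:altLabel", t2),
        ]
    return triples

# e.g. "Political violence" (TNR 2) with its two non-preferred terms:
g = map_term_set((2, "Political violence"),
                 [(1, "Civil violence"), (5, "Violent protest")],
                 broader=[4], narrower=[3])
```

The output for this call corresponds to the con:C2/term:T2 portion of the graph shown below.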

In the discussion below, I'll take the following "snapshot" of a notional thesaurus - it's another version of the example used in the previous posts, extended with an additional preferred term - as a starting point:

Civil violence
USE Political violence
TNR 1

Collective violence
TNR 6

Political violence
UF Civil violence
UF Violent protest
BT Violence
NT Terrorism
TNR 2

Terrorism
BT Political violence
TNR 3

Violence
NT Political violence
TNR 4

Violent protest
USE Political violence
TNR 5

Using the mapping above, it is represented as follows in RDF using SKOS/SKOS-XL:

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix con: <http://example.org/id/concept/polthes/> .
@prefix term: <http://example.org/id/term/polthes/> .

term:T1 a skosxl:Label;
        skosxl:literalForm "Civil violence"@en.
        
con:C6 a skos:Concept;
       skos:prefLabel "Collective violence"@en;
       skosxl:prefLabel term:T6.

term:T6 a skosxl:Label;
        skosxl:literalForm "Collective violence"@en.

con:C2 a skos:Concept;
       skos:prefLabel "Political violence"@en;
       skosxl:prefLabel term:T2;
       skos:altLabel "Civil violence"@en;
       skosxl:altLabel term:T1;
       skos:altLabel "Violent protest"@en;
       skosxl:altLabel term:T5;
       skos:broader con:C4;
       skos:narrower con:C3 .

term:T2 a skosxl:Label;
        skosxl:literalForm "Political violence"@en.

con:C3 a skos:Concept;
       skos:prefLabel "Terrorism"@en;
       skosxl:prefLabel term:T3;
       skos:broader con:C2 .

term:T3 a skosxl:Label;
        skosxl:literalForm "Terrorism"@en.
       
con:C4 a skos:Concept;
       skos:prefLabel "Violence"@en;
       skosxl:prefLabel term:T4;
       skos:narrower con:C2 .

term:T4 a skosxl:Label;
        skosxl:literalForm "Violence"@en.

term:T5 a skosxl:Label;
        skosxl:literalForm "Violent protest"@en.

"Versioning" and change over time

Once our resource URIs are generated and published, they will be used/cited by other agencies in their data - in other linked data datasets, in other thesauri, or in simple Web documents which reference terms or concepts using those URIs. From the linked data perspective, it is important that once generated and published the resource URIs, which will be http: URIs, remain stable and reliable. I'm using the terms "stable" and "reliable" as they are used by Henry Thompson and Jonathan Rees in their note Guidelines for Web-based naming, which I've found very helpful in breaking down the various aspects of what we tend to call "persistence". And for "stability", I'm thinking particularly of what they call "resource stability". So

  • once a URI is created, we should continue to use that URI to denote/identify the same resource
  • it should continue to be possible to obtain some information "about" the identified resource using the HTTP protocol - though that information obtained may change over time

For our particular case, the requirement is only that the "current version" of the thesaurus is available at any point in time, i.e. for each concept and for each term/label, at any point in time, it is necessary to serve only a description of the current state of that resource.

So, in my previous post, I mentioned that the Cabinet Office guidelines Designing URI Sets for the UK Public Sector allow for the case of creating a set of "date-stamped" document URIs, to provide variant descriptions of a resource at different points in time. I don't think that is required for this case, so for each term and concept, we'll have a URI for that "thing", a "Document URI" for a "generic document" (current) description of that thing, and "Representation URIs" for each "specific document" in a particular format.

The formats provided will include a human-readable HTML version, an RDF/XML version and possibly other RDF formats. Over time, additional formats can be added as required through the addition of new "Representation URIs".
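The relationship between the three kinds of URI can be sketched as a simple derivation, assuming the id/-to-doc/ convention of the example URIs and (hypothetically) file extensions to distinguish the Representation URIs:

```python
# Sketch: deriving Document and Representation URIs from a "thing" URI.
# The /id/ -> /doc/ substitution follows the example URIs in this post;
# the file-extension convention for representations is an assumption.

def description_uris(thing_uri, formats=("html", "rdf")):
    doc_uri = thing_uri.replace("/id/", "/doc/", 1)
    return doc_uri, [doc_uri + "." + f for f in formats]

doc, reps = description_uris("http://example.org/id/concept/polthes/C2")
```

Adding a new format later means only minting a further Representation URI; the thing URI and Document URI stay put.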

My primary focus here is on changes to the thesaurus content. Over time, various changes are possible. New terms may be added, and the relationships between terms may change. Terms are not deleted from the thesaurus, however.

The most common type of change is the "promotion" of an existing non-preferred term to the status of a preferred term, but all of the following types of change can occur, even if some are infrequent:

  1. Addition of new semantic relationships between existing preferred terms
  2. Removal of existing semantic relationships between existing preferred terms
  3. Addition of new preferred terms
  4. Addition of new non-preferred terms (for existing preferred terms)
  5. An existing non-preferred term becomes a new preferred term
  6. An existing non-preferred term becomes a non-preferred term for a different existing preferred term
  7. An existing non-preferred term becomes a non-preferred term for a newly-added preferred term
  8. An existing preferred term becomes a non-preferred term for another existing preferred term
  9. An existing preferred term becomes a non-preferred term for a term which is currently a non-preferred term for it (and vice versa)
  10. An existing preferred term becomes a non-preferred term for a newly added preferred term

Below, I'll try to walk through an example of each of those changes in turn, starting from the example thesaurus above, showing the results using the mapping suggested above, and examining any issues which arise.

Case 1: Addition of new semantic relationship

The addition of new broader term (BT), narrower term (NT) or related term (RT) relationships is straightforward, as it involves only the creation of additional assertions of relationships between concepts, using the skos:broader, skos:narrower or skos:related properties, not the creation of new resources.

So if the example above is extended to add a BT relation between the "Collective violence" (term no 6) and "Violence" (term no 4) terms (and the inverse NT relation):

Civil violence
USE Political violence
TNR 1

Collective violence
BT Violence
TNR 6

Political violence
UF Civil violence
UF Violent protest
BT Violence
NT Terrorism
TNR 2

Terrorism
BT Political violence
TNR 3

Violence
NT Political violence
NT Collective violence
TNR 4

Violent protest
USE Political violence
TNR 5

resulting in the RDF graph

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix con: <http://example.org/id/concept/polthes/> .
@prefix term: <http://example.org/id/term/polthes/> .

term:T1 a skosxl:Label;
        skosxl:literalForm "Civil violence"@en.
        
con:C6 a skos:Concept;
       skos:prefLabel "Collective violence"@en;
       skosxl:prefLabel term:T6;
       skos:broader con:C4 .

term:T6 a skosxl:Label;
        skosxl:literalForm "Collective violence"@en.

con:C2 a skos:Concept;
       skos:prefLabel "Political violence"@en;
       skosxl:prefLabel term:T2;
       skos:altLabel "Civil violence"@en;
       skosxl:altLabel term:T1;
       skos:altLabel "Violent protest"@en;
       skosxl:altLabel term:T5;
       skos:broader con:C4;
       skos:narrower con:C3 .

term:T2 a skosxl:Label;
        skosxl:literalForm "Political violence"@en.

con:C3 a skos:Concept;
       skos:prefLabel "Terrorism"@en;
       skosxl:prefLabel term:T3;
       skos:broader con:C2 .

term:T3 a skosxl:Label;
        skosxl:literalForm "Terrorism"@en.
       
con:C4 a skos:Concept;
       skos:prefLabel "Violence"@en;
       skosxl:prefLabel term:T4;
       skos:narrower con:C2 ;
       skos:narrower con:C6 .

term:T4 a skosxl:Label;
        skosxl:literalForm "Violence"@en.

term:T5 a skosxl:Label;
        skosxl:literalForm "Violent protest"@en.

i.e. two new triples are added to the RDF graph

con:C6 skos:broader con:C4 .
con:C4 skos:narrower con:C6 .

The addition of the triples means that, from a linked data perspective, the graphs served as descriptions of the resources con:C6 and con:C4 change. They each include one additional triple for the concise bounded description case, and two triples for the symmetric bounded description case (see the previous post for the discussion of different forms of bounded description). So the contents of the representations of the documents http://example.org/doc/concept/polthes/C4 and http://example.org/doc/concept/polthes/C6 change - but no new resources are generated, and no new URIs are required.
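For illustration, the two forms of bounded description can be sketched as filters over a set of triples (a simplification: a real concise bounded description also recurses through blank nodes, which the graphs in these examples don't contain):

```python
# Sketch: concise vs symmetric bounded description of a resource,
# over triples represented as (subject, predicate, object) tuples.

def cbd(graph, uri):
    """Concise bounded description: triples with the resource as subject."""
    return {t for t in graph if t[0] == uri}

def scbd(graph, uri):
    """Symmetric CBD: triples with the resource as subject or object."""
    return {t for t in graph if t[0] == uri or t[2] == uri}

g = {("con:C6", "a", "skos:Concept"),
     ("con:C6", "skos:broader", "con:C4"),
     ("con:C4", "skos:narrower", "con:C6")}
```

Adding the broader/narrower pair thus grows the CBD of con:C6 by one triple and its SCBD by two, as described above.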

Case 2: Removal of existing semantic relationship

The removal of existing broader term (BT), narrower term (NT) or related term (RT) relationships is similarly straightforward, as it involves only the deletion of assertions of relationships between concepts, using the skos:broader, skos:narrower or skos:related properties, without the removal of existing resources.

I won't bother writing out an example in full for this case, but imagine the case of the previous example reverting to its initial state.

Again, from a linked data perspective, the graphs served as descriptions of the resources con:C6 and con:C4 change, with each containing one triple less for the CBD case or two triples less for the SCBD case, but we still have the same set of term URIs and concept URIs.

Case 3: Addition of new preferred terms

The addition of a new preferred term is again a matter of extending the graph with new information, though in this case some new URIs are also introduced.

Suppose a new preferred term "Revolution" (term no 7) is added to our initial example:

Civil violence
USE Political violence
TNR 1

Collective violence
TNR 6

Political violence
UF Civil violence
UF Violent protest
BT Violence
NT Terrorism
TNR 2

Revolution
TNR 7

Terrorism
BT Political violence
TNR 3

Violence
NT Political violence
TNR 4

Violent protest
USE Political violence
TNR 5

resulting in the following graph:

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix con: <http://example.org/id/concept/polthes/> .
@prefix term: <http://example.org/id/term/polthes/> .

term:T1 a skosxl:Label;
        skosxl:literalForm "Civil violence"@en.
        
con:C6 a skos:Concept;
       skos:prefLabel "Collective violence"@en;
       skosxl:prefLabel term:T6.

term:T6 a skosxl:Label;
        skosxl:literalForm "Collective violence"@en.

con:C2 a skos:Concept;
       skos:prefLabel "Political violence"@en;
       skosxl:prefLabel term:T2;
       skos:altLabel "Civil violence"@en;
       skosxl:altLabel term:T1;
       skos:altLabel "Violent protest"@en;
       skosxl:altLabel term:T5;
       skos:broader con:C4;
       skos:narrower con:C3 .

term:T2 a skosxl:Label;
        skosxl:literalForm "Political violence"@en.

con:C7 a skos:Concept;
       skos:prefLabel "Revolution"@en;
       skosxl:prefLabel term:T7 .

term:T7 a skosxl:Label;
        skosxl:literalForm "Revolution"@en.
        
con:C3 a skos:Concept;
       skos:prefLabel "Terrorism"@en;
       skosxl:prefLabel term:T3;
       skos:broader con:C2 .

term:T3 a skosxl:Label;
        skosxl:literalForm "Terrorism"@en.
       
con:C4 a skos:Concept;
       skos:prefLabel "Violence"@en;
       skosxl:prefLabel term:T4;
       skos:narrower con:C2 .

term:T4 a skosxl:Label;
        skosxl:literalForm "Violence"@en.

term:T5 a skosxl:Label;
        skosxl:literalForm "Violent protest"@en.

The following triples are added:

con:C7 a skos:Concept;
       skos:prefLabel "Revolution"@en;
       skosxl:prefLabel term:T7.

term:T7 a skosxl:Label;
        skosxl:literalForm "Revolution"@en.

The RDF representation now includes an additional concept and label, each with a new URI. So now there are two new resources, with new URIs (con:C7 and term:T7), and a corresponding set of new Document URIs and Representation URIs for descriptions of those resources.

It is quite probable that the addition of a new preferred term is accompanied by the assertion of semantic relationships with other existing preferred terms. This is equivalent to following this step with a second step of the type shown in Case 1.

Case 4: Addition of new non-preferred term (for existing preferred term)

The addition of a new non-preferred term is, again, a matter of adding new information, and new URIs.

Suppose a new term "Assault" (term no 8) is added as a new non-preferred term for "Violence" (term no 4):

Assault
USE Violence
TNR 8

Civil violence
USE Political violence
TNR 1

Collective violence
TNR 6

Political violence
UF Civil violence
UF Violent protest
BT Violence
NT Terrorism
TNR 2

Terrorism
BT Political violence
TNR 3

Violence
UF Assault
NT Political violence
TNR 4

Violent protest
USE Political violence
TNR 5

which results in the graph

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix con: <http://example.org/id/concept/polthes/> .
@prefix term: <http://example.org/id/term/polthes/> .

term:T8 a skosxl:Label;
        skosxl:literalForm "Assault"@en.

term:T1 a skosxl:Label;
        skosxl:literalForm "Civil violence"@en.
        
con:C6 a skos:Concept;
       skos:prefLabel "Collective violence"@en;
       skosxl:prefLabel term:T6.

term:T6 a skosxl:Label;
        skosxl:literalForm "Collective violence"@en.

con:C2 a skos:Concept;
       skos:prefLabel "Political violence"@en;
       skosxl:prefLabel term:T2;
       skos:altLabel "Civil violence"@en;
       skosxl:altLabel term:T1;
       skos:altLabel "Violent protest"@en;
       skosxl:altLabel term:T5;
       skos:broader con:C4;
       skos:narrower con:C3 .

term:T2 a skosxl:Label;
        skosxl:literalForm "Political violence"@en.

con:C3 a skos:Concept;
       skos:prefLabel "Terrorism"@en;
       skosxl:prefLabel term:T3;
       skos:broader con:C2 .

term:T3 a skosxl:Label;
        skosxl:literalForm "Terrorism"@en.
       
con:C4 a skos:Concept;
       skos:prefLabel "Violence"@en;
       skosxl:prefLabel term:T4;
       skos:altLabel "Assault"@en;
       skosxl:altLabel term:T8;
       skos:narrower con:C2 .

term:T4 a skosxl:Label;
        skosxl:literalForm "Violence"@en.

term:T5 a skosxl:Label;
        skosxl:literalForm "Violent protest"@en.

i.e. the following triples are added:

term:T8 a skosxl:Label;
        skosxl:literalForm "Assault"@en.

con:C4 skos:altLabel "Assault"@en;
       skosxl:altLabel term:T8 .

So from a linked data perspective, there is a new resource with a new URI (term:T8) (and its own new description with a new Document URI), and the existing URI con:C4 is the subject of two new triples: a skos:altLabel for the literal, and a skosxl:altLabel link to the new label. The graph served as the description of that existing resource therefore changes to include the additional triples.
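
The "triples added" for a case like this can be computed mechanically as a set difference between the old and new graphs. Below is a minimal sketch (not part of the original workflow), with triples hand-abbreviated as tuples using the con:/term: prefixes from the graphs above:

```python
# Old graph: the con:C4 / term:T4 portion before "Assault" is added.
old_graph = {
    ("con:C4", "rdf:type", "skos:Concept"),
    ("con:C4", "skos:prefLabel", '"Violence"@en'),
    ("con:C4", "skosxl:prefLabel", "term:T4"),
    ("con:C4", "skos:narrower", "con:C2"),
}

# New graph: the old triples plus the new label and its links.
new_graph = old_graph | {
    ("term:T8", "rdf:type", "skosxl:Label"),
    ("term:T8", "skosxl:literalForm", '"Assault"@en'),
    ("con:C4", "skos:altLabel", '"Assault"@en'),
    ("con:C4", "skosxl:altLabel", "term:T8"),
}

added = new_graph - old_graph    # triples present only in the new graph
removed = old_graph - new_graph  # empty for this case: nothing is deleted

for s, p, o in sorted(added):
    print(s, p, o)
```

The same two set differences give the "added"/"removed" lists for each of the later cases as well.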

Case 5: Existing non-preferred term becomes new preferred term

Suppose the existing term "Civil violence", initially a non-preferred term for "Political violence", is "promoted" and made a preferred term in its own right:

Civil violence
BT Violence
TNR 1

Collective violence
TNR 6

Political violence
UF Violent protest
BT Violence
NT Terrorism
TNR 2

Terrorism
BT Political violence
TNR 3

Violence
NT Civil violence
NT Political violence
TNR 4

Violent protest
USE Political violence
TNR 5

resulting in the following graph

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix con: <http://example.org/id/concept/polthes/> .
@prefix term: <http://example.org/id/term/polthes/> .

con:C1 a skos:Concept;
       skos:prefLabel "Civil violence"@en;
       skosxl:prefLabel term:T1;
       skos:broader con:C4 .

term:T1 a skosxl:Label;
        skosxl:literalForm "Civil violence"@en.
        
con:C6 a skos:Concept;
       skos:prefLabel "Collective violence"@en;
       skosxl:prefLabel term:T6.

term:T6 a skosxl:Label;
        skosxl:literalForm "Collective violence"@en.

con:C2 a skos:Concept;
       skos:prefLabel "Political violence"@en;
       skosxl:prefLabel term:T2;
       skos:altLabel "Violent protest"@en;
       skosxl:altLabel term:T5;
       skos:broader con:C4;
       skos:narrower con:C3 .

term:T2 a skosxl:Label;
        skosxl:literalForm "Political violence"@en.

con:C3 a skos:Concept;
       skos:prefLabel "Terrorism"@en;
       skosxl:prefLabel term:T3;
       skos:broader con:C2 .

term:T3 a skosxl:Label;
        skosxl:literalForm "Terrorism"@en.
       
con:C4 a skos:Concept;
       skos:prefLabel "Violence"@en;
       skosxl:prefLabel term:T4;
       skos:narrower con:C2;
       skos:narrower con:C1 .

term:T4 a skosxl:Label;
        skosxl:literalForm "Violence"@en.

term:T5 a skosxl:Label;
        skosxl:literalForm "Violent protest"@en.

For this case, the following new triples are added

con:C1 a skos:Concept;
       skos:prefLabel "Civil violence"@en;
       skosxl:prefLabel term:T1;
       skos:broader con:C4 .

con:C4 skos:narrower con:C1 .

and also the following existing triples are removed

con:C2 skos:altLabel "Civil violence"@en;
       skosxl:altLabel term:T1 .

So from a linked data perspective, there is a new resource with a new URI (con:C1) (and its own new description with a new Document URI), and the graphs served as descriptions of the existing resources con:C2 and con:C4 both change: the former loses the skos:altLabel and skosxl:altLabel triples and the latter gains a new skos:narrower triple. If symmetric bounded descriptions are used, the description of term:T1 changes too.

Case 6: Existing non-preferred term becomes non-preferred term for a different existing preferred term

Suppose we decide that "Civil violence", initially a non-preferred term for "Political violence", is to become a non-preferred term for "Collective violence".

Civil violence
USE Collective violence
TNR 1

Collective violence
UF Civil violence
TNR 6

Political violence
UF Violent protest
BT Violence
NT Terrorism
TNR 2

Terrorism
BT Political violence
TNR 3

Violence
NT Political violence
TNR 4

Violent protest
USE Political violence
TNR 5

This generates the following graph:

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix con: <http://example.org/id/concept/polthes/> .
@prefix term: <http://example.org/id/term/polthes/> .

term:T1 a skosxl:Label;
        skosxl:literalForm "Civil violence"@en.
        
con:C6 a skos:Concept;
       skos:prefLabel "Collective violence"@en;
       skosxl:prefLabel term:T6;
       skos:altLabel "Civil violence"@en;
       skosxl:altLabel term:T1.

term:T6 a skosxl:Label;
        skosxl:literalForm "Collective violence"@en.

con:C2 a skos:Concept;
       skos:prefLabel "Political violence"@en;
       skosxl:prefLabel term:T2;
       skos:altLabel "Violent protest"@en;
       skosxl:altLabel term:T5;
       skos:broader con:C4;
       skos:narrower con:C3 .

term:T2 a skosxl:Label;
        skosxl:literalForm "Political violence"@en.

con:C3 a skos:Concept;
       skos:prefLabel "Terrorism"@en;
       skosxl:prefLabel term:T3;
       skos:broader con:C2 .

term:T3 a skosxl:Label;
        skosxl:literalForm "Terrorism"@en.
       
con:C4 a skos:Concept;
       skos:prefLabel "Violence"@en;
       skosxl:prefLabel term:T4;
       skos:narrower con:C2 .

term:T4 a skosxl:Label;
        skosxl:literalForm "Violence"@en.

term:T5 a skosxl:Label;
        skosxl:literalForm "Violent protest"@en.

For this case, the following new triples are added

con:C6 skos:altLabel "Civil violence"@en;
       skosxl:altLabel term:T1.

and also the following existing triples are removed

con:C2 skos:altLabel "Civil violence"@en;
       skosxl:altLabel term:T1 .

The graphs served as descriptions of the existing resources con:C2 and con:C6 both change: the former loses the skos:altLabel and skosxl:altLabel triples and the latter gains skos:altLabel and skosxl:altLabel triples. If symmetric bounded descriptions are used then the description of term:T1 also changes.

Case 7: Existing non-preferred term becomes non-preferred term for a newly added preferred term

I think this case is just a combination of Case 3 (addition of new preferred term) and Case 6 (existing non-preferred term becomes non-preferred term for a different existing preferred term) in sequence. We've seen above that these changes can be made without problems, so the "composite" case should be OK too, and I won't bother working through a full example here.

Case 8: An existing preferred term becomes a non-preferred term for another existing preferred term

Suppose the current preferred term "Political violence" is to be "relegated" to become a non-preferred term for "Collective violence", with the latter becoming the participant in hierarchical relations previously involving the former. (I appreciate that these two terms probably don't constitute a great example, but let’s suppose it works, for the sake of the discussion!)

Civil violence
USE Collective violence
TNR 1

Collective violence
UF Civil violence
UF Political violence
UF Violent protest
BT Violence
NT Terrorism
TNR 6

Political violence
USE Collective violence
TNR 2

Terrorism
BT Collective violence
TNR 3

Violence
NT Collective violence
TNR 4

Violent protest
USE Collective violence
TNR 5

This maps to the rather substantially changed RDF graph

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix con: <http://example.org/id/concept/polthes/> .
@prefix term: <http://example.org/id/term/polthes/> .

term:T1 a skosxl:Label;
        skosxl:literalForm "Civil violence"@en.
        
con:C6 a skos:Concept;
       skos:prefLabel "Collective violence"@en;
       skosxl:prefLabel term:T6;
       skos:altLabel "Political violence"@en;
       skosxl:altLabel term:T2;
       skos:altLabel "Civil violence"@en;
       skosxl:altLabel term:T1;
       skos:altLabel "Violent protest"@en;
       skosxl:altLabel term:T5;
       skos:broader con:C4;
       skos:narrower con:C3 .

term:T6 a skosxl:Label;
        skosxl:literalForm "Collective violence"@en.

term:T2 a skosxl:Label;
        skosxl:literalForm "Political violence"@en.

con:C3 a skos:Concept;
       skos:prefLabel "Terrorism"@en;
       skosxl:prefLabel term:T3;
       skos:broader con:C6 .

term:T3 a skosxl:Label;
        skosxl:literalForm "Terrorism"@en.
       
con:C4 a skos:Concept;
       skos:prefLabel "Violence"@en;
       skosxl:prefLabel term:T4;
       skos:narrower con:C6 .

term:T4 a skosxl:Label;
        skosxl:literalForm "Violence"@en.

term:T5 a skosxl:Label;
        skosxl:literalForm "Violent protest"@en.

The following RDF triples have been added

con:C6 skos:altLabel "Political violence"@en;
       skosxl:altLabel term:T2;
       skos:altLabel "Civil violence"@en;
       skosxl:altLabel term:T1;
       skos:altLabel "Violent protest"@en;
       skosxl:altLabel term:T5;
       skos:broader con:C4;
       skos:narrower con:C3 .

con:C3 skos:broader con:C6 .

con:C4 skos:narrower con:C6 .

And the following RDF triples have been removed

con:C2 a skos:Concept;
       skos:prefLabel "Political violence"@en;
       skosxl:prefLabel term:T2;
       skos:altLabel "Civil violence"@en;
       skosxl:altLabel term:T1;
       skos:altLabel "Violent protest"@en;
       skosxl:altLabel term:T5;
       skos:broader con:C4;
       skos:narrower con:C3 .

con:C3 skos:broader con:C2 .

con:C4 skos:narrower con:C2 .

So the graphs served as descriptions of the concepts con:C3 and con:C4 change (with the removal of a triple and the addition of a new one); and that for concept con:C6 changes with the addition of several triples.

So far, so good.

However, the URI con:C2 has now completely disappeared from the graph. If this new graph simply replaces the previous graph, then there will be no description available for resource con:C2.
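
A publisher could detect this situation mechanically before replacing the old graph, by checking which previously described subjects no longer appear anywhere in the new graph. A hypothetical sketch, with triples abbreviated as tuples and only the concept-level triples of the Case 8 graphs included:

```python
# Concept URIs that had descriptions in the old graph.
old_concepts = {"con:C2", "con:C3", "con:C4", "con:C6"}

# Concept-level triples of the new (Case 8) graph.
new_graph = {
    ("con:C6", "skos:altLabel", '"Political violence"@en'),
    ("con:C6", "skosxl:altLabel", "term:T2"),
    ("con:C6", "skos:broader", "con:C4"),
    ("con:C6", "skos:narrower", "con:C3"),
    ("con:C3", "skos:broader", "con:C6"),
    ("con:C4", "skos:narrower", "con:C6"),
}

# Nodes still mentioned anywhere in the new graph, as subject or object.
mentioned = {s for s, p, o in new_graph} | {o for s, p, o in new_graph}
vanished = old_concepts - mentioned

print(vanished)  # con:C2 no longer appears anywhere in the graph
```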

Case 9: An existing preferred term becomes a non-preferred term for a term which is currently a non-preferred term for it (and vice versa)

Suppose that the current non-preferred term "Civil violence" is "promoted" to become the preferred term, and "Political violence" is "relegated" to become a non-preferred term for it - both changes taking place together, if you like.

Civil violence
UF Political violence
UF Violent protest
BT Violence
NT Terrorism
TNR 1

Collective violence
TNR 6

Political violence
USE Civil violence
TNR 2

Terrorism
BT Civil violence
TNR 3

Violence
NT Civil violence
TNR 4

Violent protest
USE Civil violence
TNR 5
which results in the graph

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix con: <http://example.org/id/concept/polthes/> .
@prefix term: <http://example.org/id/term/polthes/> .

term:T1 a skosxl:Label;
        skosxl:literalForm "Civil violence"@en.

con:C1 a skos:Concept;
       skos:prefLabel "Civil violence"@en;
       skosxl:prefLabel term:T1;
       skos:altLabel "Political violence"@en;
       skosxl:altLabel term:T2;
       skos:altLabel "Violent protest"@en;
       skosxl:altLabel term:T5;
       skos:broader con:C4;
       skos:narrower con:C3 .
        
con:C6 a skos:Concept;
       skos:prefLabel "Collective violence"@en;
       skosxl:prefLabel term:T6.

term:T6 a skosxl:Label;
        skosxl:literalForm "Collective violence"@en.

term:T2 a skosxl:Label;
        skosxl:literalForm "Political violence"@en.

con:C3 a skos:Concept;
       skos:prefLabel "Terrorism"@en;
       skosxl:prefLabel term:T3;
       skos:broader con:C1 .

term:T3 a skosxl:Label;
        skosxl:literalForm "Terrorism"@en.
       
con:C4 a skos:Concept;
       skos:prefLabel "Violence"@en;
       skosxl:prefLabel term:T4;
       skos:narrower con:C1 .

term:T4 a skosxl:Label;
        skosxl:literalForm "Violence"@en.

term:T5 a skosxl:Label;
        skosxl:literalForm "Violent protest"@en.

The following RDF triples have been added

con:C1 a skos:Concept;
       skos:prefLabel "Civil violence"@en;
       skosxl:prefLabel term:T1;
       skos:altLabel "Political violence"@en;
       skosxl:altLabel term:T2;
       skos:altLabel "Violent protest"@en;
       skosxl:altLabel term:T5;
       skos:broader con:C4;
       skos:narrower con:C3 .
       
con:C3 skos:broader con:C1 .

con:C4 skos:narrower con:C1 .

And the following RDF triples have been removed

con:C2 a skos:Concept;
       skos:prefLabel "Political violence"@en;
       skosxl:prefLabel term:T2;
       skos:altLabel "Civil violence"@en;
       skosxl:altLabel term:T1;
       skos:altLabel "Violent protest"@en;
       skosxl:altLabel term:T5;
       skos:broader con:C4;
       skos:narrower con:C3 .

con:C3 skos:broader con:C2 .

con:C4 skos:narrower con:C2 .

The outcome here is similar to that of the previous case.

The graphs served as descriptions of the concepts con:C3 and con:C4 change (with the removal of a triple and the addition of a new one), and a new concept con:C1 is created. But again the URI con:C2 has completely disappeared from the graph, with the same consequence that no description will be available for it.

Case 10: An existing preferred term becomes a non-preferred term for a newly added preferred term

I think this case is just a combination of Case 3 (addition of new preferred term) and Case 8 (existing preferred term becomes a non-preferred term for another existing preferred term) in sequence.

The same problem will arise with the URI of the existing concept disappearing from the new output graph.

Summary

I've walked through in detail the different types of changes which can occur to the content of the thesaurus. This highlighted that for one particular category of change, where an existing preferred term is "relegated" to the status of a non-preferred term, exemplified by my cases 8, 9 and 10 above, the results of the suggested simple mapping into SKOS had problematic consequences: the URI for a concept disappears from the generated RDF graph - and this creates a conflict with the principles of URI stability and reliability I advocated at the start of this post.

In the next post, I'll suggest one way of (I hope!) addressing this problem.

March 01, 2011

Term-based thesauri and SKOS (Part 2): Linked Data

In my previous post on this topic I outlined how I was approaching making a thesaurus available using the SKOS and SKOS-XL RDF vocabularies. In that post I focused on:

  • how the thesaurus content is modelled using a "concept-based" approach - what are the "things of interest", their attributes, and the relationships between them;
  • how those things (concepts, terms/labels) are named/identified using http URIs;
  • how those things can be "described" using the simple "triple" statement model of RDF, and using the SKOS and SKOS-XL RDF vocabularies;
  • an example of how an expression of the thesaurus using the term-based model is mapped or transformed into a SKOS RDF expression.

What I didn't really address in that post is how that resulting RDF data is made available and accessed on the Web - which is more specifically where the "Linked Data" principles articulated by Tim Berners-Lee come into play.

(A good deal of the content of this post is probably familiar stuff for those of you already working with Linked Data, but I thought it was worth documenting it both to fill out the picture of some of the "design choices" to be made in this particular example, and to provide some more background to others less familiar with the approaches.)

Linked Data, URIs, things, documents and HTTP

The use of http URIs as identifiers provides two features:

  • a global naming system, and a set of processes by which authority for assigning names can be delegated/distributed;
  • through the HTTP protocol, a well understood and widely deployed mechanism for providing access to information "about", or descriptions of, the things identified by those URIs (in our case, the concepts and terms/labels).

As a user/consumer of an http URI, given only the URI, I can "look up" that URI using the HTTP protocol, i.e. I can provide it to a tool (like a Web browser) and that tool can issue a request to obtain a description of the thing identified by the URI. And conversely as the owner/provider of a URI, I can configure my server to respond to such requests with a document providing a description of the thing.

And the HTTP protocol incorporates features which we can use to "optimise" this process. So, for example, the "content negotiation" feature allows a client to specify a preference for the format in which it wishes to receive data, and allows a server to select - from amongst the several it may have available - the format which it determines is most appropriate for the client. In the terminology of the Web Architecture, the description can have multiple "representations", each of which can vary by format (or by other criteria). In the context of Linked Data, this technique is typically used to support the provision of document representations in formats suitable for a human reader (HTML, XHTML) and in one or more RDF syntaxes (usually, at least as RDF/XML). (The emergence of the RDFa syntax, which enables the embedding of RDF data in HTML/XHTML documents, and the growing support for RDFa in RDF tools, offers the possibility, in principle at least, of a single format serving both purposes.)
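
As an illustration only - real content negotiation also involves q-values and wildcards, which this sketch ignores - the server-side selection might look something like the following, where the available representations and their paths are hypothetical examples:

```python
# Hypothetical map of media types to the Representation URIs the
# server can produce for one particular description document.
AVAILABLE = {
    "application/rdf+xml": "/doc/concept/polthes/C2/doc.rdf",
    "text/html": "/doc/concept/polthes/C2/doc.html",
}

def choose_representation(accept_header, available=AVAILABLE):
    """Pick the first media type in the Accept header that the server
    can produce; this simplified version treats the header as an
    ordered preference list and ignores q-values."""
    for media_type in [m.split(";")[0].strip() for m in accept_header.split(",")]:
        if media_type in available:
            return available[media_type]
    return available["text/html"]  # fall back to the human-readable form

print(choose_representation("application/rdf+xml,text/html"))
```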

The widespread use of the HTTP protocol and of tools that support it means that these techniques are widely available (in theory at least; experience suggests that in practice the ability - or authority - to set up things like HTTP redirects (more below) can create something of a barrier). It also means that the "Web of Linked Data" is part of the existing Web of documents that we are accustomed to navigating using a Web browser.

One of the principles underpinning RDF's use of URIs as names is that we should try to avoid ambiguity in our use of those names, i.e. we should use different URIs for different things, and avoid using the same URI as a name for two different things. One of the issues I've slightly glossed over in the last few paragraphs is the distinction between a "thing" and a document describing that thing as two different resources. After all, if I provide a page describing the Mona Lisa, both the page and the Mona Lisa have creators, creation dates, terms of use, but they have different creators, creation dates and terms of use. And if I want to provide such information in RDF, then I need to take care to avoid confusing the two objects - by using two different URIs, one for my document and one for the painting, and citing those URIs in my RDF statements accordingly.

However, as I emphasise above, we also want to be in a position where, given only a "thing URI", I can obtain a document describing that thing: I shouldn't need to know in advance a second URI, the URI of a document about that thing.

The W3C Note Cool URIs for the Semantic Web describes some possible approaches to addressing this issue, broadly using two different techniques:

  • the use of URIs containing "fragment identifiers" ("hash URIs") (i.e. URIs of the form http://example.org/doc/123#thing). In this case, the "fragment identifier" part of the URI is always "trimmed" from the URI when the client makes a request to the server, and this allows the use of the URI with fragment identifier as "thing URI", leaving the trimmed URI without fragment id as a document URI.
  • the use of a convention of HTTP "redirects". In this case, when a server receives a request for a URI which it "knows" is a "thing URI" rather than a document URI, it returns a response which provides a second URI as a source of information about the thing, and the client then sends a second request for that second URI. Formally, the initial response uses the HTTP "303 See Other" status code, which sometimes leads to these being called colloquially "303 URIs", even though there's nothing special about the URIs themselves.

I'm conscious that I'm skipping over some of the details here; for a more detailed description, particularly of the "flow" of the interactions involved, and some consideration of the pros and cons of the two approaches, see Cool URIs for the Semantic Web.

URI Sets for the UK Public Sector

The Cool URIs note focuses mainly on the patterns of "interaction" for handling the two approaches to moving from "thing URI" to document URI. Its examples include example URIs, but the exact form of those URIs is intended to be illustrative rather than prescriptive. I think it's important to note that in the redirect case, it is the server's notification to the client of the second URI that provides the client with that information. There is no technical requirement for a structural similarity in the forms of the "thing URI" and the document URI, and consumers of the URIs should rely on the information provided to them by the server rather than making assumptions about URI structure.

Having said that, the use of a shared, consistent set of URI patterns within a community can provide some useful "cues" to human readers of those URIs. It can also simplify the work of data providers - for example by facilitating the use of similar HTTP server configurations or the reuse of scripts/tools for serving "Linked Data" documents. With this (and other factors such as URI stability) in mind, the UK Cabinet Office has provided a set of guidelines, Designing URI Sets for the UK Public Sector which build on the W3C Cool URIs note, but offer more specific guidance, particularly on the design of URIs.

For the purposes of this discussion, of particular interest is the document's specification (in the "Definitions, frameworks and principles" section) of several distinct "types of URI", or perhaps more accurately, URIs for different categories of resource, and (in the "The path structure for URIs" section) of suggested structural patterns for each:

  • Identifier URIs (what I have been calling above "thing URIs") name "real-world things" and should use patterns like:
    • http://{domain}/{concept}/{reference}#id or
    • http://{domain}/id/{concept}/{reference}
    where:
    • {concept} is "a word or string to capture the essence of the real-world 'Thing' that the set names e.g. school". (This is roughly what I think of as the name of a "resource type" - note that this is a more generic use of the word "concept" than the SKOS sense.)
    • {reference} is "a string that is used by the set publisher to identify an individual instance of concept".
    The document allows for the use of a hierarchy of concept-reference pairs in a single URI if appropriate, so for a specific class within a specific school, the path might be /id/school/123/class/5
  • Document URIs name the documents that provide information about, descriptions of, "real-world things". The suggested pattern is
    • http://{domain}/doc/{concept}/{reference}
    These documents are, I think, what Berners-Lee calls Generic Resources. For each such document, multiple representations may be available, each in a different format, and each of those "more specific" concrete formats may be available as a separate resource in its own right (see "Representation URIs" below). If descriptions vary over time, and those variants are to be exposed, then a series of "date-stamped" URIs can be used, with the pattern
    • http://{domain}/doc/{concept}/{reference}/{yyyy-mm-dd}
  • Representation URIs name a document in a specific format. The suggested pattern is
    • http://{domain}/doc/{concept}/{reference}/{doc.file-extension}
    This can also be applied to a date-stamped version:
    • http://{domain}/doc/{concept}/{reference}/{yyyy-mm-dd}/{doc.file-extension}

The guidelines also distinguish a category of "Ontology URIs" which use the pattern http://{domain}/def/{concept}. I had interpreted "Ontology URIs" as applying to the identification of classes and properties, and I was treating the terms/concepts of a thesaurus as "conceptual things" which would fall under the /id/ case. But I do notice that in an example in which she refers to these guidelines, Jeni Tennison uses the /def/ pattern for a SKOS example. I don't think it's really much of an issue - and pretty much all the other points I discuss apply anyway - but any advice on this point would be appreciated.

So, applying these general rules for the thesaurus case, where, as I discussed in the previous post, the primary types of thing of interest in our SKOS-modelled thesaurus are "concepts" and "terms":

  • Term URI Pattern: http://example.org/id/term/T{termid}
  • Concept URI Pattern: http://example.org/id/concept/C{termid}

However, if we bear in mind that within the URI space of the example.org domain, we're likely to want to represent, and coin URIs for the components of, multiple thesauri, and our "termid" references (drawn from the term numbers in the input) are unique only within the scope of a single thesaurus, then we should include some sort of thesaurus-specific component in the path to "qualify" those term numbers. Let's use the token "polthes" for this example:

  • Term URI Pattern: http://example.org/id/term/{schemename}/T{termid}
    Example: http://example.org/id/term/polthes/T2
  • Concept URI Pattern: http://example.org/id/concept/{schemename}/C{termid}
    Example: http://example.org/id/concept/polthes/C2

We should also include a URI for the thesaurus as a whole. The SKOS model provides a generic class of "concept scheme" to cover aggregations of concepts:

  • Concept Scheme URI Pattern: http://example.org/id/concept-scheme/{schemename}
    Example: http://example.org/id/concept-scheme/polthes

where each concept and term in the thesaurus is linked to this concept scheme by a triple using the skos:inScheme property. (I omitted this from the example in the previous post so that it was easier to focus on the concept-term and concept-concept relationships, and to try to keep the already rather complex diagrams slightly readable!)
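
For illustration, the links omitted from the earlier examples would look something like this in Turtle (reusing the con: and term: prefixes from the graphs above, plus an assumed sch: prefix for the concept scheme URI space):

```turtle
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix con: <http://example.org/id/concept/polthes/> .
@prefix term: <http://example.org/id/term/polthes/> .
@prefix sch: <http://example.org/id/concept-scheme/> .

sch:polthes a skos:ConceptScheme .

con:C2 skos:inScheme sch:polthes .
term:T2 skos:inScheme sch:polthes .
```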

Aside: An alternative for the concept and term URI patterns would be to use the "hierarchical concept-reference" approach and use patterns like:

  • Term URI Pattern: http://example.org/id/concept-scheme/{schemename}/term/T{termid}
    Example: http://example.org/id/concept-scheme/polthes/term/T2
  • Concept URI Pattern: http://example.org/id/concept-scheme/{schemename}/concept/C{termid}
    Example: http://example.org/id/concept-scheme/polthes/concept/C2

My only slight misgiving about this approach is that (bearing in mind that strictly speaking the URIs should be treated as opaque and such information obtained from the descriptions provided by the server) in the (non-hierarchical) form I suggested initially, the string indicating the resource type ("concept", "term") is fairly clear to the human reader from its position following the "/id/" component in the URI (e.g. http://example.org/id/concept/polthes/C2). But with the hierarchical form, it perhaps becomes slightly less clear (e.g. http://example.org/id/concept-scheme/polthes/concept/C2). But that is a minor gripe, and really the hierarchical form would serve just as well. For the remainder of this document, in the examples, I'll continue with the initial non-hierarchical pattern I suggested above, but it may be something to revisit if the hierarchical form is more in line with the intent - and current usage - of the guidelines. (So again, comments are welcome on this point.)

For each of these "Identifier URIs", there should be a corresponding "Document URI" naming a document describing the thing, and following the /doc/ pattern:

  • Description of Concept Scheme: http://example.org/doc/concept-scheme/polthes
  • Description of Term: http://example.org/doc/term/polthes/T{termid}
  • Description of Concept: http://example.org/doc/concept/polthes/C{termid}

And for each format in which the description is available, a corresponding "Representation URI":

  • Description of Concept Scheme (HTML): http://example.org/doc/concept-scheme/polthes/doc.html
  • Description of Concept Scheme (RDF/XML): http://example.org/doc/concept-scheme/polthes/doc.rdf
  • Description of Concept (HTML): http://example.org/doc/concept/polthes/C{termid}/doc.html
  • Description of Concept (RDF/XML): http://example.org/doc/concept/polthes/C{termid}/doc.rdf
  • Description of Term (HTML): http://example.org/doc/term/polthes/T{termid}/doc.html
  • Description of Term (RDF/XML): http://example.org/doc/term/polthes/T{termid}/doc.rdf
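
Under these patterns the Document URI can be derived from the Identifier URI by swapping the leading path segment - a design choice of this URI set, not a requirement of HTTP, so clients should rely on the 303 redirect rather than rewriting URIs themselves. A minimal sketch of the server-side behaviour, with a hypothetical respond() function standing in for the HTTP server:

```python
def respond(path):
    """Simulate the /id/ -> /doc/ convention: a request for an
    Identifier ("thing") URI gets a 303 redirect to the corresponding
    Document URI; a request for a document path is served directly."""
    if path.startswith("/id/"):
        # 303 See Other, with the document path as the Location header
        return 303, "/doc/" + path[len("/id/"):]
    return 200, path

status, location = respond("/id/concept/polthes/C2")
print(status, location)  # 303 /doc/concept/polthes/C2
```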

Descriptions and "boundedness"

The three documents I've mentioned so far (Berners-Lee's Linked Data Design Issues note, the W3C Cool URIs document, and the Cabinet Office URI patterns document) don't have a great deal to say on the topic of the content of the document which is returned as a description of a "thing". This is discussed briefly in the "Linked Data Tutorial" document by Chris Bizer, Richard Cyganiak and Tom Heath, How to Publish Linked Data on the Web.

In principle at least, it is quite possible to provide a single document which describes several resources. This approach has been quite common in association with the use of "hash URIs" in a pattern where a number of "thing URIs" differ only by fragment identifier, and share the same "non-fragment" part (http://example.org/school#1, http://example.org/school#2, ... http://example.org/school#99), and a number of common ontologies make use of this sort of approach. One consequence is that a client interested only in a single resource always retrieves the full set of descriptions. If my thesaurus really did consist only of the half-dozen concepts and terms I described in the example in my previous post, retrieving a document describing them all would probably not be a problem, but for the "real world" case where there are several thousand terms involved, it would represent a significant overhead if every request results in the transfer of several megabytes of data.

Generally, the approach taken is for the data provider to generate some set of "useful information" "about" the requested resource - though saying that rather begs the question of what constitutes "useful" (and whether there is a single answer to that question that is applicable across different datasets dealing with different resource types). Typically the generation of a description is based on some set of rules which, for a specified node in the dataset RDF graph (a specified "thing URI"), selects a "nearby" subgraph of the graph, representing a "bounded description" made up of triples/statements "about" the thing itself and maybe also "about" closely related resources.

Various algorithms for generating such descriptions are possible and I don't intend to attempt any sort of rigorous analysis or comparison of them here - for further discussion see e.g. Patrick Stickler's CBD - Concise Bounded Description or Bounded Descriptions in RDF from the Talis Platform wiki. But there is one aspect which I think is worth mentioning in the context of the thesaurus example. One of the key differences between the algorithms is how they treat the "directionality" of arcs in the RDF graph, i.e. whether they base the description only on arcs "outbound from" the selected node (RDF triples with that URI as subject), or whether they take into account both arcs "outbound from" and "inbound to" the node (triples with the URI as either subject or object).
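
In terms of a naive implementation, the distinction is simply which selection test is applied to each triple. A hypothetical sketch over hand-abbreviated triples (a small fragment of the example graph; real implementations also have to handle blank nodes and recursion):

```python
# A fragment of the example dataset, as (subject, predicate, object) tuples.
graph = {
    ("con:C2", "skos:prefLabel", '"Political violence"@en'),
    ("con:C2", "skosxl:altLabel", "term:T1"),
    ("con:C2", "skos:broader", "con:C4"),
    ("con:C4", "skos:narrower", "con:C2"),
    ("con:C3", "skos:broader", "con:C2"),
}

def outbound(node, g):
    """Triples with the node as subject only ("outbound" arcs)."""
    return {t for t in g if t[0] == node}

def symmetric(node, g):
    """Triples with the node as subject or object (both directions)."""
    return {t for t in g if t[0] == node or t[2] == node}

print(len(outbound("con:C2", graph)))   # 3 outbound triples
print(len(symmetric("con:C2", graph)))  # 5: adds the two inbound arcs
```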

That probably sounds like a very abstract point, and the significance is perhaps best illustrated through a concrete example. Let's take the graph for the example from my previous post (tweaked to use the slightly amended URI patterns above - I've continued to leave out the concept scheme links to keep things simple) and suppose this is the dataset to which I'm applying the rules.

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix con: <http://example.org/id/concept/polthes/> .
@prefix term: <http://example.org/id/term/polthes/> .

term:T1 a skosxl:Label;
        skosxl:literalForm "Civil violence"@en.

con:C2 a skos:Concept;
       skos:prefLabel "Political violence"@en;
       skosxl:prefLabel term:T2;
       skos:altLabel "Civil violence"@en;
       skosxl:altLabel term:T1;
       skos:altLabel "Violent protest"@en;
       skosxl:altLabel term:T5;
       skos:broader con:C4;
       skos:narrower con:C3 .

term:T2 a skosxl:Label;
        skosxl:literalForm "Political violence"@en.

con:C3 a skos:Concept;
       skos:prefLabel "Terrorism"@en;
       skosxl:prefLabel term:T3;
       skos:broader con:C2 .

term:T3 a skosxl:Label;
        skosxl:literalForm "Terrorism"@en.
       
con:C4 a skos:Concept;
       skos:prefLabel "Violence"@en;
       skosxl:prefLabel term:T4;
       skos:narrower con:C2 .

term:T4 a skosxl:Label;
        skosxl:literalForm "Violence"@en.

term:T5 a skosxl:Label;
        skosxl:literalForm "Violent protest"@en.

And in graphical form (as before with the rdf:type triples omitted):

Fig1

(In the figures below, I've tried to represent the idea that a subgraph is being selected by "fading out" the parts which aren't selected, and leaving the selected part fully visible. I hope the images are sufficiently clear for this to be effective!)

Let's first take the approach known as the "concise bounded description (CBD)" - formally defined here, but essentially based on "outbound" links. For the concept C2 (http://example.org/id/concept/polthes/C2), the CBD would consist of the following subgraph (i.e. the document http://example.org/doc/concept/polthes/C2 would contain this data):

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix con: <http://example.org/id/concept/polthes/> .
@prefix term: <http://example.org/id/term/polthes/> .

con:C2 a skos:Concept;
       skos:prefLabel "Political violence"@en;
       skosxl:prefLabel term:T2;
       skos:altLabel "Civil violence"@en;
       skosxl:altLabel term:T1;
       skos:altLabel "Violent protest"@en;
       skosxl:altLabel term:T5;
       skos:broader con:C4;
       skos:narrower con:C3 .
Fig2

For the term T2 (http://example.org/id/term/polthes/T2), corresponding to the "preferred label" (i.e. the document http://example.org/doc/term/polthes/T2 would contain):

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix con: <http://example.org/id/concept/polthes/> .
@prefix term: <http://example.org/id/term/polthes/> .

term:T2 a skosxl:Label;
        skosxl:literalForm "Political violence"@en.
Fig3

For the term T5 (http://example.org/id/term/polthes/T5), corresponding to the "alternate label" (i.e. the document http://example.org/doc/term/polthes/T5 would contain):

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix con: <http://example.org/id/concept/polthes/> .
@prefix term: <http://example.org/id/term/polthes/> .

term:T5 a skosxl:Label;
        skosxl:literalForm "Violent protest"@en.
Fig4

Note that for the two terms, the "concise bounded description" is quite minimal (though remember I've simplified it a bit): in particular, it does not include the relationship between the term and the concept. This is because, using the SKOS-XL vocabulary, that relationship is expressed as a triple in which the concept URI is the subject and the term URI is the object - an "inbound arc" to the term URI in the graph - which the CBD approach does not take into account when constructing the description of the term.

But the fact that the relationship is represented only in this way - a link from concept to term, without an inverse term to concept link - is arguably slightly arbitrary.

An alternative approach, the "symmetric bounded description", seeks to address this sort of issue by taking into account both "outbound" and "inbound" arcs. For the same three cases, it produces the following results:

Concept C2:

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix con: <http://example.org/id/concept/polthes/> .
@prefix term: <http://example.org/id/term/polthes/> .

con:C2 a skos:Concept;
       skos:prefLabel "Political violence"@en;
       skosxl:prefLabel term:T2;
       skos:altLabel "Civil violence"@en;
       skosxl:altLabel term:T1;
       skos:altLabel "Violent protest"@en;
       skosxl:altLabel term:T5;
       skos:broader con:C4;
       skos:narrower con:C3 .

con:C3 skos:broader con:C2 .

con:C4 skos:narrower con:C2 .
Fig5

Term T2:

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix con: <http://example.org/id/concept/polthes/> .
@prefix term: <http://example.org/id/term/polthes/> .

term:T2 a skosxl:Label;
        skosxl:literalForm "Political violence"@en.

con:C2 skosxl:prefLabel term:T2 .
Fig6

Term T5:

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix con: <http://example.org/id/concept/polthes/> .
@prefix term: <http://example.org/id/term/polthes/> .

term:T5 a skosxl:Label;
        skosxl:literalForm "Violent protest"@en.

con:C2 skosxl:altLabel term:T5 .
Fig7

For the concept case, the difference is relatively minor (for the skos:broader and skos:narrower relationships, the inverse triples are now also included). But for the term cases, the relationship between concept and term is now included.

So (rather long-windedly, I fear!), I'm trying to illustrate that it's worth thinking a little bit about the content of descriptions and how they work as "stand-alone" documents (albeit linked to others). And for this dataset, I think there's an argument that generating "symmetric" descriptions that include inbound links as well as outbound ones probably results in more "useful information" for the consumer of the data.

(Again, I'm simplifying things slightly here to illustrate the point: I've omitted type information and the links to indicate concept scheme membership. Typically the descriptions might (depending on the algorithm) include labels for related resources mentioned, rather than just the URIs, and would include some metadata about the document - its publisher, last modification date, licence/rights information, a link to the dataset of which it is a member, and so on.)

Summary

What I've tried to do in this post is expand on some of the "linked data"-specific aspects of the project, and to examine some of the design choices to be made in applying some of those general rules to this particular case, shaped both by external factors (like the Cabinet Office guidelines on URIs) and by characteristics of the data itself (like the directionality of links made using SKOS-XL). In the next post, I'll move on, as promised, to the questions of how the data changes over time, and any implications of that.

February 25, 2011

RDTF metadata guidelines - Limp Data or Linked Data?

Having just finished reading thru the 196 comments we received on the draft metadata guidelines for the UK RDTF I'm now in the process of wondering where we go next. We (Pete and I) have relatively little effort to take this work forward (a little less than 5 days to be precise) so it's not clear to me how best we use that effort to get something useful out for both RDTF and the wider community.

By the way... many thanks to everyone who took the time to comment. There are some great contributions and, if nothing else, the combination of the draft and the comments form a useful snapshot of the current state of issues around library, museum and archival metadata in the context of the web.

Here's my brief take on what the comments are saying...

Firstly, there were several comments asking about the target audience for the guidelines and whether, as written, they will be meaningful to... well... anyone I guess! It's worth pointing out that my understanding is that any guidelines we come up with thru the current process will be taken forward as part of other RDTF work. What that means is that the guidelines will get massaged into a form (or forms) that are digestible by the target audience (or audiences), as determined by other players within the RDTF activity. What we have been tasked with are the guidelines themselves - not how they are presented. We perhaps should have made this clearer in the draft guidelines. In short, I don't think the document, as written, will be put directly in front of anyone who doesn't go to the trouble of searching it out explicitly.

Secondly, there were quite a number of detailed comments on particular data formats, data models, vocabularies and so on. This is great and I'm hopeful that as a result we can either extend the list of examples given at various points in the guidelines or, in some cases, drop back to not having examples and simply say, "do whatever is the emerging norm here in your community".

Thirdly, there were some concerns about what we meant by "open". As we tried to point out in the draft, we do not consider this to be our problem - it is for other activity in RDTF to try and work out what "open" means - we just felt the need to give that word a concrete definition, so that people could understand where we were coming from for the purposes of these guidelines.

Finally, there were some bigger concerns - these are the things that are taxing me right now - that broadly fell into two, related, camps. Firstly, that the step between the community formats approach and the RDF data approach is too large (though no-one really suggested what might go in the middle). And secondly, that we are missing a trick by not encouraging the assignment of 'http' URIs to resources as part of the community formats approach.

As it stands, we have, on the one hand, what one might call Limp Data (MARC records, XML files, CSV, EAD and the rest) and, on the other, Linked Data and all that entails, with a rather odd middle ground that we are calling RDF data (in the current guidelines).

I was half hoping that someone would simply suggest collapsing our RDF data and Linked Data approaches into one - on the basis that separating them into two is somewhat confusing (but as far as I can tell no-one did... OK, I'm doing it now!). That would leave a two-pronged approach - community formats and Linked Data - to which we could add a 'community formats with http URIs' middle ground. My gut feel is that there is some attraction in such an approach but I'm not sure how feasible it is given the characteristics of many existing community formats.

As part of his commentary around encouraging http URIs (building a 'better web' was how he phrased it), Owen Stephens suggested that every resource should have a corresponding web page. I don't disagree with this... well, hang on... actually I do (at least in part)! One of the problems faced by this work is the fundamental difference between the library world and museums and archives. The former is primarily dealing with non-unique resources (at the item level), the latter with unique resources. (I know that I'm simplifying things here but bear with me). Do I think that resource discovery will be improved if every academic library in the UK (or indeed in the world) creates an http URI for every book they hold at which they serve a human-readable copy of their catalogue record? No, I don't. If the JISC and RLUK really want to improve web-scale resource discovery of books in the library sector, they would be better off spending their money encouraging libraries to sign up to OCLC WorldCat and contributing their records there. (I'm guessing that this isn't a particularly popular viewpoint in the UK - at least, I'm not sure that I've ever heard anyone else suggest it - but it seems to me that WorldCat represents a valuable shared service approach that will, in practice, be hard to beat in other ways.) Doing this would both improve resource discovery (e.g. thru Google) and provide a workable 'appropriate copy' solution (for books). Clearly, doing this wouldn't help build a more unified approach across the GLAM domains but, as at least one commenter pointed out, it's not clear that the current guidelines do either. Note: I do agree with Owen that every unique museum and archival resource should have an http URI and a web page.

All of which, as I say, leaves us with a headache in terms of how we take these guidelines forward. Ah well... such is life I guess.

February 22, 2011

SALDA

As I've mentioned here before, I'm contributing to a project called LOCAH, funded by the JISC under its 02/10 call Deposit of research outputs and Exposing digital content for education and research (JISCexpo), working with MIMAS and UKOLN on making available bibliographic and archival metadata as linked data.

Partly as a result of that work, I was approached by Chris Keene from the University of Sussex to be part of a bid they were preparing to another recent JISC call, 15/10: Infrastructure for education and research programme, under the "Infrastructure for resource discovery" strand, which seeks to implement some of the approaches outlined by the Resource Discovery Task Force.

The proposal was to make available metadata records from the Mass Observation Archive, data currently managed in a CALM archival management system, as linked data. I'm pleased to report that the bid was successful, and the project, Sussex Archive Linked Data Application (SALDA), has been funded. It's a short project, running for six months from now (February) to July 2011. There's a brief description on the JISC Web site here, a SALDA project blog has just been launched, and the project manager Karen Watson provides more details of the planned work in her initial post there.

I'm looking forward to working with the Sussex team to adapt and extend some of the work we've done with LOCAH for a new context. I expect most information will appear over on the SALDA blog, but I'll try to post the occasional update on progress here, particularly on any aspects of general interest.

February 11, 2011

Term-based thesauri and SKOS (Part 1)

I'm currently doing a piece of work on representing a thesaurus as linked data. I'm working on the basis that the output will make use of the SKOS model/RDF vocabulary. Some of the questions I'm pondering are probably SKOS Frequently Asked Questions, but I thought it was worth working through my proposed solution here, partly just to document my own thought processes and partly in the hope that SKOS implementers with more experience than me might provide some feedback or pointers.

SKOS adopts a "concept-based" approach (i.e. the primary focus is on the description of "concepts" and the relationships between them); the source thesaurus uses a "term-based" approach based on the ISO 2788 standard. I found the following sources provided helpful summaries of the differences between these two approaches:

Briefly (and simplifying slightly for the purposes of this discussion - I've omitted discussion of documentary attributes like "scope notes"), in the term-based model (and here I'm dealing only with the case of a simple monolingual thesaurus):

  • the only entities considered are "terms" (words or sequences of words)
  • terms can be semantically equivalent, each expressing a single concept, in which case they are distinguished as "preferred terms"/descriptors or "non-preferred terms"/non-descriptors, using USE (non-preferred to preferred) and UF (use for) (preferred to non-preferred) relations. Each non-preferred term is related to a single preferred term; a preferred term may have many non-preferred terms
  • a preferred term can be related by ("hierarchical") BT (broader term) and NT (narrower term) relations to indicate that it is more specific in meaning, or more general in meaning, than another term
  • a preferred term can also be related by an ("associative") RT (related term) relation to a second term, where there is some other relationship which may be useful for retrieval (e.g. overlapping meanings)

In the SKOS (concept-based) model:

  • the primary entities considered are "concepts", "units of thought"
  • a concept is associated with labels, of which at most one (in the monolingual case) is the preferred label, the others alternative labels or hidden labels
  • a concept can be related by "has broader", "has narrower" and "has related" relations to other concepts

The concept-based model, then, is explicit that the thesaurus "contains" two distinct types of thing: concepts and labels. In the base SKOS model, the labels are modelled as RDF literals, so the expression of relationships between labels is not supported. SKOS-XL provides an extension to SKOS which models labels as RDF resources in their own right, which can be identified, described and linked.

To represent a term-based thesaurus in SKOS, then, it is necessary to make a mapping from the term-based model with its term-term relationships to the concept-based model with its concept-label and concept-concept relationships. (An alternative approach would be to represent the thesaurus in RDF using a model/vocabulary which expresses directly the "native" "term-based" model of the source thesaurus. I'm working on the basis that using the SKOS/concept-based approach will facilitate interoperability with other thesauri/vocabularies published on the Web.)

Core Mapping

The key principle underlying the mapping is the notion that, in the term-based approach, a preferred term and its multiple related non-preferred terms are acting as labels for a single concept.

Each set of terms related by USE/UF relations in the term-based approach, then, is mapped to a single SKOS concept, with the single "preferred term" becoming the "preferred label" for that concept and the (possibly multiple) "non-preferred terms" each becoming "alternative labels" for that same concept.

Consider the following example, where the term "Political violence" is a preferred term for "Civil violence" and "Violent protest", and has a broader term "Violence" and a narrower term "Terrorism":


Civil violence
USE Political violence

Political violence
UF Civil violence
UF Violent protest
BT Violence
NT Terrorism

Terrorism
BT Political violence

Violence
NT Political violence

Violent protest
USE Political violence

(Leaving aside for a moment the question of how the URIs might be generated) a SKOS representation takes the form:


@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix con: <http://example.org/id/concept/> .

con:C2 a skos:Concept;
       skos:prefLabel "Political violence"@en;
       skos:altLabel "Civil violence"@en;
       skos:altLabel "Violent protest"@en;
       skos:broader con:C4;
       skos:narrower con:C3 .

con:C3 a skos:Concept;
       skos:prefLabel "Terrorism"@en;
       skos:broader con:C2 .

con:C4 a skos:Concept;
       skos:prefLabel "Violence"@en;
       skos:narrower con:C2 .

A graphical representation perhaps makes the links clearer (to keep things simple, I've omitted the arcs and nodes corresponding to the rdf:type triples here):

Fig1

One of the consequences of this approach is that some of the "direct" relationships between terms are reflected in SKOS as "indirect" relationships. In the term-based model, a non-preferred term is linked to a preferred term by a simple "USE" relationship. In the SKOS example, to find the "preferred term" for the term "Violent protest", one finds the node for the concept to which it is linked via the skos:altLabel property, and then locates the literal which is linked to that concept via a skos:prefLabel property.
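That indirect lookup amounts to two filter steps. Here is a hypothetical Python sketch (my own illustration, with triples simplified to tuples of strings and language tags ignored):

```python
# Simplified triples from the SKOS example above.
triples = [
    ("con:C2", "skos:prefLabel", "Political violence"),
    ("con:C2", "skos:altLabel", "Civil violence"),
    ("con:C2", "skos:altLabel", "Violent protest"),
    ("con:C3", "skos:prefLabel", "Terrorism"),
]

def preferred_for(non_preferred):
    # Step 1: find the concept(s) linked to the literal via skos:altLabel...
    concepts = [s for s, p, o in triples
                if p == "skos:altLabel" and o == non_preferred]
    # Step 2: ...then the literal(s) linked from there via skos:prefLabel.
    return [o for s, p, o in triples
            if p == "skos:prefLabel" and s in concepts]

print(preferred_for("Violent protest"))  # -> ['Political violence']
```

In SPARQL terms this would be a two-triple-pattern query, but the point is the same: the "USE" relation has become a two-hop path through the concept node.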

SKOS-XL and terms as Labels

As mentioned above, the SKOS specification defines an optional extension to SKOS which supports the representation of labels as resources in their own right, typically alongside the simple literal representation.

Using this extension, the example above might be expressed as:


@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix con: <http://example.org/id/concept/> .
@prefix term: <http://example.org/id/term/> .

term:T1 a skosxl:Label;
        skosxl:literalForm "Civil violence"@en.

con:C2 a skos:Concept;
       skos:prefLabel "Political violence"@en;
       skosxl:prefLabel term:T2;
       skos:altLabel "Civil violence"@en;
       skosxl:altLabel term:T1;
       skos:altLabel "Violent protest"@en;
       skosxl:altLabel term:T5;
       skos:broader con:C4;
       skos:narrower con:C3 .

term:T2 a skosxl:Label;
        skosxl:literalForm "Political violence"@en.

con:C3 a skos:Concept;
       skos:prefLabel "Terrorism"@en;
       skosxl:prefLabel term:T3;
       skos:broader con:C2 .

term:T3 a skosxl:Label;
        skosxl:literalForm "Terrorism"@en.
       
con:C4 a skos:Concept;
       skos:prefLabel "Violence"@en;
       skosxl:prefLabel term:T4;
       skos:narrower con:C2 .

term:T4 a skosxl:Label;
        skosxl:literalForm "Violence"@en.

term:T5 a skosxl:Label;
        skosxl:literalForm "Violent protest"@en.

And in graphical form (as above with the rdf:type triples omitted):

Fig2

That may be rather a lot to take in, so below I've shown a subgraph, including only the labels related to a single concept:

Fig3

Using the SKOS-XL extension in this way does introduce some additional complexity. On the other hand:

  1. it introduces a set of resources (of type skosxl:Label) that have a one-to-one correspondence to the terms in the source thesaurus, so perhaps makes the mapping between the two models more explicit and easier for a human reader to understand.
  2. it makes the labels themselves URI-identifiable resources which can be referenced in this data and in data created by others. So, it becomes possible to make assertions about relationships between the labels, or between labels and other things.

Coincidentally, as I was writing this post, I note that Bob duCharme has posted on the use of SKOS-XL for providing metadata "about" the individual labels, distinct from the concepts. So I might add a triple to indicate, say, the date on which a particular label was created. I don't think there is an immediate requirement for that in the case I'm working on, but there may be in the future.

Another possible use case is the ability for other parties to make links specifically to a label, rather than to the concept which it labels. A document could be "tagged", for example, by association with a particular label from a particular thesaurus, rather than just with the plain literal.

Relationships between Labels

The SKOS-XL vocabulary defines only a single generic relationship between labels (the skosxl:labelRelation property) with the expectation that implementers define subproperties to handle more specific relationship types.

The example given of such a relationship in the SKOS spec is that of a case where one label is an acronym for another label.

One thing I wondered about was introducing properties here to reflect the USE/UF relations in the term-based model e.g. in the above example:


@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix con: <http://example.org/id/concept/> .
@prefix term: <http://example.org/id/term/> .
@prefix ex: <http://example.org/prop/> .

term:T1 a skosxl:Label;
        skosxl:literalForm "Civil violence"@en;
        ex:use term:T2 .

term:T2 a skosxl:Label;
        skosxl:literalForm "Political violence"@en;
        ex:useFor term:T1 ;
        ex:useFor term:T5 .
       
term:T5 a skosxl:Label;
        skosxl:literalForm "Violent protest"@en;
        ex:use term:T2 .

This wouldn't really add any new information; rather it just provides a "shortcut" for the information that is already present in the form of the "indirect" preferred label/alternative label relationships, as noted above.

I hesitate slightly about adding this data, partly on the grounds that it seems slightly redundant, but also on the grounds that this seems like a rather different "category" of relationship from the "is acronym for" relationship. This point is illustrated in figure 4 in the SWAD-E Review of Thesaurus Work document, which divides the concept-based model into three layers:

  • the (upper in the diagram) "conceptual" layer, with the broader/narrower/related relations between concepts
  • the (lower) "lexical" layer, with terms/labels and the relations between them
  • and a middle "indication"/"implication" layer for the preferred/non-preferred relations between concepts and terms/labels

In that diagram, the example lexical relationships exist independently of the preferred/non-preferred relationships e.g. "AIDS" is an abbreviation for "Acquired Immunodeficiency Syndrome", regardless of which is considered the preferred term in the thesaurus. With the use/useFor relationships here, this would not be the case; they would indeed vary with the preferred/non-preferred relationships.

So, having thought initially that it might be useful to include these relationships, I'm becoming rather less sure.

And without them, I'm not sure whether, for this case, the use of SKOS-XL would be justified - though it may well be, for the other purposes I mentioned above. So, any thoughts on practice in this area would be welcome.

URI Patterns

Following the linked data principles, I want to assign http URIs for all the resources, so that descriptions of those resources can be provided using HTTP. If we go with the SKOS-XL approach, that means assigning http URIs for both concepts and labels.

The input data includes, for each term within the thesaurus, an identifier code which is unique within the thesaurus and remains stable over time, i.e. as changes are made to the thesaurus a single term remains associated with the same code. (I'll explore the question of how the thesaurus changes over time in a follow-up post.) So in fact the example above looks something like:


Civil violence
USE Political violence
TNR 1

Political violence
UF Civil violence
UF Violent protest
BT Violence
NT Terrorism
TNR 2

Terrorism
BT Political violence
TNR 3

Violence
NT Political violence
TNR 4

Violent protest
USE Political violence
TNR 5

The proposal is to use that identifier as the basis of URIs, for both labels/terms and concepts:

  • Term URI Pattern: http://example.org/id/term/T{termid}
  • Concept URI Pattern: http://example.org/id/concept/C{termid}

The generation of label/term URIs, then, is straightforward, as each term (with identifier code) in the input data maps to a single label/term in the SKOS RDF data. The concept case is a little more complicated. As I noted above, the mapping between the term-based model and the concept-based model means that there is a many-to-one relationship between terms and concepts: several terms are related to a single concept, one as preferred label, others as alternative labels. In my first cut at this at least (more on this in a follow-up post), I've generated the URI of the concept based on the identifier code associated with the preferred/descriptor term.

So in my example above, the three terms "Civil violence", "Political violence", and "Violent protest" are mapped to labels for a single concept, and the URI of that concept is constructed from the identifier code of the preferred/descriptor term ("Political violence").
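A hypothetical sketch of that URI generation in Python (the record structure and field names here are my invention for illustration, not the actual input format):

```python
BASE = "http://example.org/id"

# One record per term: its TNR identifier code, and - for non-preferred
# terms - the preferred term it points to via USE.
records = [
    {"term": "Civil violence",     "tnr": 1, "use": "Political violence"},
    {"term": "Political violence", "tnr": 2, "use": None},
    {"term": "Terrorism",          "tnr": 3, "use": None},
    {"term": "Violence",           "tnr": 4, "use": None},
    {"term": "Violent protest",    "tnr": 5, "use": "Political violence"},
]

tnr_by_term = {r["term"]: r["tnr"] for r in records}

def term_uri(record):
    # Every term, preferred or not, gets its own label/term URI.
    return f"{BASE}/term/T{record['tnr']}"

def concept_uri(record):
    # The concept URI is based on the *preferred* term's identifier code,
    # so for a non-preferred term it is derived via the USE link.
    preferred = record["use"] or record["term"]
    return f"{BASE}/concept/C{tnr_by_term[preferred]}"

print(term_uri(records[0]))     # http://example.org/id/term/T1
print(concept_uri(records[0]))  # http://example.org/id/concept/C2
```

Note how all three terms in the "Political violence" USE/UF cluster yield the same concept URI (C2), which is exactly the many-to-one mapping described above.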

Summary

I've tried to cover the basic approach I'm taking to generating the SKOS/SKOS-XL RDF data from the term-based thesaurus input. In this post, I've focused only on the thesaurus as a static entity. In a follow-up post, I'll look in some detail at the changes which can occur over time, and examine any implications for the choices made here, particularly for our choices of URI patterns.

February 03, 2011

Metadata guidelines for the UK RDTF - please comment

As promised last week, our draft metadata guidelines for the UK Resource Discovery Taskforce are now available for comment in JISCPress. The guidelines are intended to apply to UK libraries, museums and archives in the context of the JISC and RLUK Resource Discovery Taskforce activity.

The comment period will last two weeks from tomorrow and we have seeded JISCPress with a small number of questions (see below) about issues that we think are particularly worth addressing. Of course, we welcome comments on all aspects of the guidelines, not just where we have raised issues. (Note that you don't have to leave public comments in JISCPress if you don't want to - an email to me or Pete will suffice. Or you can leave a comment here.)

The guidelines recommend three approaches to exposing metadata (to be used individually or in combination), referred to as:

  1. the community formats approach;
  2. the RDF data approach;
  3. the Linked Data approach.

We've used words like 'must' and 'should' but it is worth noting that at this stage we are not in a position to say how these guidelines will be applied - if at all. Nor whether there will be any mechanisms for compliance put in place. On that basis, treat phrases like 'must do this' as meaning, 'you must do this for your metadata to comply with one or other approach as recommended by these guidelines' - no more, no less. I hope that's clear.

When we started this work, we began by trying to think about functional requirements - always a good place to start. In this case however, that turned out not to make much sense. We are not starting from a green field here. Lots of metadata formats are already in use and we are not setting out with the intent of changing current cataloguing practice across libraries, museums and archives. What we can say is that:

  1. we have tried to keep as many people happy as possible (hence the three approaches), and
  2. we want to help libraries, museums and archives expose existing metadata (and new metadata created using existing practice) in ways that support the development of aggregator services and that integrate well with the web (of data).

As mentioned previously, the three approaches correspond roughly to the 3-star, 4-star and 5-star ratings in the W3C's Linked Data Star Ratings Scheme. To try and help characterise them, we prepared the following set of bullet points for a meeting of the RDTF Technical Advisory Group earlier this week:

The community formats approach

  • the “give us what you’ve got” bit
  • share existing community formats (MARC, MODS, BibTeX, DC, SPECTRUM, EAD, XML, CSV, JSON, RSS, Atom, etc.) over RESTful HTTP or OAI-PMH
  • for RESTful HTTP, use sitemaps and robots.txt to advertise availability and GZip for compression
  • for CSV, give us a column called ‘label’ or ‘title’ so we’ve got something to display and a column called 'identifier' if you have them
  • provide separate records about separate resources
  • simples!
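As a sketch of what an aggregator might do with CSV shared along those lines (hypothetical data; the fallback from 'label' to 'title' mirrors the bullet above):

```python
import csv
import io

# A minimal community-format CSV, as the guidelines suggest:
# an 'identifier' column plus something displayable.
data = """identifier,title
12345,Political violence in the twentieth century
12346,Terrorism: a very short introduction
"""

for row in csv.DictReader(io.StringIO(data)):
    # Prefer a 'label' column if present, fall back to 'title'.
    label = row.get("label") or row.get("title") or "(untitled)"
    print(row.get("identifier", ""), label)
```

The point of the recommendation is exactly this: with a predictable column name, a consuming service has something to display and something to link on without per-supplier configuration.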

The RDF data approach

  • use RDF
  • model according to FRBR, CIDOC CRM or EAD and ORE where you can
  • re-use existing vocabularies where you can
  • assign URIs to everything of interest
  • make big buckets of RDF (e.g. as RDF/XML, N-Tuples, N-Quads or RDF/Atom) available for others to play with
  • use Semantic Sitemaps and the Vocabulary of Interlinked Datasets (VoID) to advertise availability of the buckets

The Linked Data approach

Dunno if that is a helpful summary but we look forward to your comments on the full draft. Do your worst!

For the record, the issues we are asking questions about mainly fall into the following areas:

  • is offering a choice of three approaches helpful?
  • for the community formats approach, are the example formats we list correct, are our recommendations around the use of CSV useful and are JSON and Atom significant enough that they should be treated more prominently?
  • does the suggestion to use FRBR and CIDOC CRM as the basis for modeling in RDF set the bar too high for libraries and museums?
  • where people are creating Linked Data, should we be recommending particular RDF datasets/vocabularies as the target of external links?
  • do we need to be more prescriptive about the ways that URIs are assigned and dereferenced?

Note that a printable version of the draft is also available from Google Docs.

January 26, 2011

Metadata guidelines for the UK Resource Discovery Taskforce

We (Pete and I) have been asked by the JISC and RLUK to develop some metadata guidelines for use by the UK Resource Discovery Taskforce as it rolls out its vision [PDF].

This turns out to be a non-trivial task. The vision covers libraries, museums and archives and is intended to:

focus on defining the requirements for the provision of a shared UK infrastructure for libraries, archives, museums and related resources to support education and research. The focus will be on catalogues/metadata that can assist in access to objects/resources. With a special reference to serials, books, archives/special collections, museum collections, digital repository content and other born digital content. This will interpret the shared UK infrastructure as part of global information provision.

(Taken from the Resource Discovery Taskforce Terms of Reference)

The vision itself talks of a "collaborative, aggregated and integrated resource discovery and delivery framework" which implies an approach based on harvesting metadata (and other content) rather than cross-searching.

If the last 15 years or so have taught me anything, it's not to expect much coming together of metadata practice across those three sectors! Add to that a wide spectrum of attitudes to Linked Data and its potential value in this space, an unclear picture about the success of Europeana and its ESE [PDF] and EDM [PDF] metadata formats, the apparent success of somewhat "permissive" metadata-based initiatives such as Digital NZ and you are left with a range of viewpoints from "Keep calm and carry on" thru to "Throw it all away and use Linked Data" and everything in between.

At this point in time, we are taking the view that Tim Berners-Lee's star rating system for linked open data provides a useful framework for this work. However, as I have indicated elsewhere, Mugging up on the linked open data star ratings, it is rather unhelpful that the definition of each of the stars seems to be somewhat up for grabs at the moment (more or less in line with the ongoing, and quite probably endless, debate about the centrality of RDF and SPARQL to Linked Data). On that basis, we will almost certainly have to provide our own definitions for the purposes of these guidelines. Note that using this star rating system does not mean that everything has to use RDF.

Anyway... all of that is currently our problem, so I won't burden you with it :-)

The real purpose of this post is simply to say that we hope to make a draft of our metadata guidelines available during next week (I'm not willing to commit to a specific day at this point in time!), at which point we hope that people will share their thoughts on what we've come up with. That said, time is reasonably tight so I don't expect to be able to give people more than a couple of weeks (at most) to comment.

November 29, 2010

Still here & some news from LOCAH

Contrary to appearances, I haven't completely abandoned eFoundations, but recently I've mostly been working on the JISC-funded LOCAH project which I mentioned here a while ago, and my recent scribblings have mostly been over on the project blog.

LOCAH is working on making available some data from the Archives Hub (a collection of archival finding-aids i.e. metadata about archival collections and their constituent items) and from Copac (a "union catalogue" of bibliographic metadata from major research & specialist libraries) as linked data.

So far, I've mostly been working with the EAD data, with Jane Stevenson and Bethan Ruddock from the Archives Hub. I've posted a few pieces on the LOCAH blog, on the high-level architecture/workflow, on the model for the archival description data (also here), and most recently on the URI patterns we're using for the archival data.

I've got an implementation of this as an XSLT transform that reads an EAD XML document and outputs RDF/XML, and have uploaded the results of applying that to a small subset of data to a Talis Platform instance. We're still ironing out some glitches but there'll be information about that on the project blog coming in the not too distant future.
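The real transform is XSLT, but its basic shape can be sketched in a few lines of Python. The `unitid` and `unittitle` elements are real EAD markup; the URI pattern, the choice of dcterms:title and the sample content are invented for illustration and are not the patterns LOCAH actually uses:

```python
import xml.etree.ElementTree as ET

# A tiny EAD fragment; unitid and unittitle are genuine EAD elements,
# but this sample content is invented.
ead = """<ead><archdesc><did>
  <unitid>GB 123 ABC</unitid>
  <unittitle>Papers of an Example Collection</unittitle>
</did></archdesc></ead>"""

def ead_to_ntriples(xml_text):
    """Sketch of an EAD -> RDF conversion: mint a subject URI from the
    unitid and emit one N-Triples statement carrying the title. The URI
    pattern and the use of dcterms:title here are illustrative only."""
    did = ET.fromstring(xml_text).find("./archdesc/did")
    unitid = did.findtext("unitid").strip()
    title = did.findtext("unittitle").strip()
    subject = "http://example.org/id/findingaid/" + unitid.replace(" ", "")
    return '<%s> <http://purl.org/dc/terms/title> "%s" .' % (subject, title)

print(ead_to_ntriples(ead))
```

A real transform obviously emits many triples per finding-aid and has to cope with the variability mentioned below, but the pattern - pick out elements, mint URIs, emit statements - is the same.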

On a personal note, I'm quite enjoying the project. It gives me a chance to sit down and try to actually apply some of the principles that I read about and talk about, and I'm working through some of the challenges of "real world" data, with all its variability and quirks. I worked in special collections and archives for a few years back in the 1990s, when the institutions where I was working were really just starting to explore the potential of the Web, so it's interesting to see how things have changed (or not! :-)), and to see the impact of and interest in some current technological (and other) trends within those communities. It also gives me a concrete incentive to explore the use of tools (like the Talis Platform) that I've been aware of but have only really tinkered with: my efforts in that space inevitably bring me face to face with the limited scope of my development skills, though it's also nice to find that the availability of a growing range of tools has enabled me to get some results even with my rather stumbling efforts.

It's also an opportunity for me to discuss the "linked data" approach with the archivists and librarians within the project - in very concrete ways based on actual data - and to try to answer their questions and to understand what aspects are perceived as difficult or complex - or just different from existing approaches and practices.

So while some of my work necessarily involves me getting my head down and analysing input data or hacking away at XSLT or prodding datasets with SPARQL queries, I've been doing my best to discuss the principles behind what I'm doing with Jane and Bethan as I go along, and they in turn have reflected on some of the challenges as they perceive them in posts like Jane's here.

One of the project's tasks is to:

Explore and report on the opportunities and barriers in making content structured and exposed on the Web for discovery and use. Such opportunities and barriers may coalesce around licensing implications, trust, provenance, sustainability and usability.

I think we're trying to take a broad view of this aspect of the project, so that it extends not just to the "technical" challenges in cranking out data and how we address them, but also incorporates some of these "softer" elements of how we, as individuals with backgrounds in different "communities", with different practices and experiences and perspectives, share our ideas, get to grips with some of the concepts and terminology and so on. Where are the "pain points" that cause confusion in this particular context? Which means of explaining or illustrating things work, and which don't? What (if any!) is the value of the "linked data" approach for this sort of data? How is that best demonstrated? What are the implications, if any, for information management practices within this community? It may not be the case that SPARQL becomes a required element of archivists' training any time soon, but having these conversations, and reflecting on them, is, I think, an important part of the LOCAH experience.

November 09, 2010

Is (a debate about) 303 really necessary?

At the tail end of last week Ian Davis of Talis wrote a blog post entitled, Is 303 Really Necessary?, which gave various reasons why the widely adopted (within the Linked Data community) 303-redirect pattern for moving from the URI for a non-Information Resource (NIR) to the URI for an Information Resource (IR) was somewhat less than optimal and offering an alternative approach based on the simpler, and more direct, use of a 200 response. For more information about IRs, NIRs and the use of 303 in this context see the Web Architecture section of How to Publish Linked Data on the Web.
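For concreteness, the two patterns can be contrasted with a toy handler. The URIs and the /id/ vs /doc/ split below follow a common Linked Data convention but are invented for illustration:

```python
# Toy illustration of the two patterns for serving the URI of a
# non-information resource (NIR). All URIs here are invented.

def respond_303(path):
    """The widely adopted pattern: the NIR URI answers with a 303
    redirect to a separate document URI describing the thing."""
    if path.startswith("/id/"):
        return 303, {"Location": "/doc/" + path[len("/id/"):]}, None
    return 200, {}, "<description of " + path + ">"

def respond_200(path):
    """Ian Davis's proposal, roughly: answer the NIR URI directly with
    a 200 and a description, leaving the thing/document distinction to
    explicit statements in the returned representation."""
    return 200, {}, "<description of " + path + ">"

status, headers, _ = respond_303("/id/people/alice")
print(status, headers["Location"])   # 303 /doc/people/alice
status, _, _ = respond_200("/id/people/alice")
print(status)                        # 200
```

The extra round-trip in the first pattern is one of the costs Ian's post weighs against the convention's benefits.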

Since then the public-lod@w3.org mailing list has gone close to ballistic with discussions about everything from the proposal itself to what happens to the URI you have assigned to China (a NIR) if/when the Chinese border changes. One thing the Linked Data community isn't short of is the ability to discuss how much RDF you can get on the head of a pin ad infinitum.

But as John Sheridan pointed out in a post to the mailing list, there are down-sides to such a discussion, coming at a time when adoption of Linked Data is slowly growing.

debating fundamental issues like this is very destabilising for those of us looking to expand the LOD community and introduce new people and organisations to Linked Data. To outsiders, it makes LOD seem like it is not ready for adoption and use - which is deadly. This is at best the 11th hour for making such a change in approach (perhaps even 5 minutes to midnight?).

I totally agree. Unfortunately the '200 cat' is now well and truly out of the 'NIR bag' and I don't suppose any of us can put it back in again so from that perspective it's already too late.

As to the proposal itself, I think Ian makes some good points and, from my slightly uninformed perspective, I don't see too much wrong with what he is suggesting. I agree with him that being in the (current) situation where clients can infer semantics based on HTTP response codes is less than ideal. I would prefer to see all semantics carried explicitly in the various representations being served on the Web. In that sense, the debate about 303 vs 200 becomes one that is solely about the best mechanism for getting to a representation, rather than being directly about semantics. I also sense that some people are assuming that Ian is proposing that the current 303 mechanism needs to be deprecated at some point in the future, in favour of his 200 proposal. That wasn't my interpretation. I assume that both mechanisms can sit alongside each other quite happily - or, at least, I would hope that to be the case.

November 03, 2010

Google support for GoodRelations

Google have announced support for the GoodRelations vocabulary for product and price information in Web pages, Product properties: GoodRelations and hProduct. This is primarily of interest to ecommerce sites but is more generally interesting because it is likely to lead to a significant rise in the amount of RDF flowing around the Web. It therefore potentially represents a significant step forward for the adoption of the Semantic Web and Linked Data.

Martin Hepp, the inventor of the GoodRelations vocabulary, has written about this development, Semantic SEO for Google with GoodRelations and RDFa, suggesting a slightly modified form of markup which is compatible with that adopted by Google but that is also "understood by ALL RDFa-aware search engines, shopping comparison sites, and mobile services".

October 26, 2010

Attending meetings remotely - a cautionary tale

In these times of financial constraints and environmental concerns, attending meetings remotely (particularly those held overseas) is becoming increasingly common. Such was the case, for me, at 7pm UK time on Friday night last week... I should have been eating pizza in front of the TV with my family but instead was messing about with Skype, my house phone, IRC and Twitter in an attempt to join the joint meeting of the DC Architecture Forum and the W3C Library Linked Data Incubator Group (described in Pete's recent post, The DCMI Abstract Model in 2010).

The meeting started with Tom Baker summarising the history and current state of the DCMI Abstract Model (the DCAM) - a tad long perhaps but basically a sound introduction and overview. Unfortunately, my Skype connection dropped a couple of times during his talk (I have no idea why) and I resorted to using my house phone instead - using the W3C bridge in Bristol. This proved more stable but some combination of my phone and the microphone positioning in the meeting meant that sound, particularly from speakers in the audience, was rather poor.

By the time we got to the meat of the discussion about the future of the DCAM I was struggling to keep up :-(. I made a rather garbled contribution, trying to summarise my views on the possible ways forward but emphasising that all the interesting possibilities had the same end-game - that DCMI would stop using the language of its own world-view, the DCAM, and would instead work within the more widely accepted language of the RDF model and Linked Data - and that the options were really about how best we get there, rather than about where we want to go.

Unfortunately, this is a view that is open to some confusion because the DCAM itself uses the RDF model. So when I say that we should stop using the DCAM and start using RDF and Linked Data it's not like saying that we should stop using model A and start using model B. Rather, it's a case of carrying on with the current model (i.e. the RDF model) but documenting it and talking about it using the same language as everyone else, thus joining forces with more active communities elsewhere rather than silo'ing ourselves on the DC mailing lists by having a separate way of talking.

So, anyway, I don't know how well I put my point of view across - one of the problems of being remote is that the only feedback you get is from the person taking minutes, in this case in the W3C IRC channel:

18:50:56 [andypowe11] ok, i'd like to speak at some point

18:52:34 [markva] andypowe11: options 2b, 3 and 4: all work to RDF, which is where we want to get to

18:52:55 [markva] ... which of these is better to get to that end game, wrt time available

18:53:23 [markva] ... 4 seems not ideal, but less effort

18:54:01 [markva] ... lean to 3; 2b has political value by taking along community; but 3 better given time

Stu Weibel spoke after me - a rather animated contribution (or so it seemed from afar). No problem with that... DCMI could probably do with a bit more animation to be honest. I understood him to be saying that we should adopt the Web model and that Linked Data offered us a useful chance to re-align ourselves with other Web activity. As I say, I was struggling to hear properly, so I may have mis-understood him completely. I glanced at the IRC minutes:

18:54:56 [markva] Stu Weibel: frustrated; no productive outcomes all these years

18:55:10 [markva] ... adopt Web as the model

18:55:37 [markva] ... nobody understands DCAM

18:56:03 [markva] ... W3C published architecture document after actual implementation

18:56:45 [markva] ... revive effort: develop reference software; easily drop in data, generate linked data 

I responded positively, trying to make it clear that I was struggling to hear and that I may have mis-interpreted him but noting the reference to 'linked data', which I'd heard as 'Linked Data':

18:57:12 [markva] andypowe11: support Stu 

The minute is factually correct - I did support Stu - but in an 'economical with the truth' kind of way because I only really supported what I thought I'd heard him say - and quite possibly not what he actually said! With hindsight, I wonder if the minute-taker's use of 'linked data' (lower-case) actually reflected some subtlety in what Stu said that I didn't really pick up on at the time. If nothing else, this exchange highlights for me the potential problems caused by those who want to distinguish 'linked data' (lower-case) from 'Linked Data' (upper-case) - there is no mixed-case in conversation, particularly not where it is carried out over a bad phone connection.

So anyway... the meeting moved on to other things and, feeling somewhat frustrated by the whole experience, I dropped off the call and found my cold pizza.

My point here is not about DCMI at all, though I still have no real idea whether I was right or wrong to agree with Stu. My gut feeling is that I probably agreed with some of what he said and disagreed with the rest - and the lesson, for me, is that I should be more careful before opening my mouth! My point is really about the practical difficulties of engaging in quite challenging intellectual debates in the un-even environment of a hybrid meeting where some people are f2f in the same room and others are remote. Or, to mis-quote William Gibson:

The future of hybrid events is here, it's just not evenly distributed yet.

:-(

(Note: none of this is intended to be critical of the minute-taker for the meeting who actually seems to have done a fantastic job of capturing a complex exchange of views in what must have been a difficult environment).

September 17, 2010

On the length and winding nature of roads

I attended, and spoke at, the ISKO Linked Data - the future of knowledge organization on the Web event in London earlier this week. My talk was intended to have a kind of "what 10 years of working with the Dublin Core community has taught me about the challenges facing Linked Data" theme but probably came across more like "all librarians are stupid and stuck in the past". Oh well... apologies if I offended anyone in the audience :-).

Here are my slides:

They will hopefully have the audio added to them in due course - in the meantime, a modicum of explanation is probably helpful.

My fundamental point was that if we see Linked Data as the future of knowledge organization on the Web (the theme of the day), then we have to see Linked Data as the future of the Web, and (at the risk of kicking off a heated debate) that means that we have to see RDF as the future of the Web. RDF has been on the go for a long time (more than 10 years), a fact that requires some analysis and explanation - it certainly doesn't strike me as having been successful over that period in the way that other things have been successful. I think that Linked Data proponents have to be able to explain why that is the case rather better than simply saying that there was too much emphasis on AI in the early days, which seemed to be the main reason provided during this particular event.

My other contention was that the experiences of the Dublin Core community might provide some hints at where some of the challenges lie. DC, historically, has had a rather librarian-centric make-up. It arose from a view that the Internet could be manually catalogued for example, in a similar way to that taken to catalogue books, and that those catalogue records would be shipped between software applications for the purposes of providing discovery services. The notion of the 'record' has thus been quite central to the DC community.

The metadata 'elements' (what we now call properties) used to make up those records were semantically quite broad - the DC community used to talk about '15 fuzzy buckets' for example. As an aside, in researching the slides for my talk I discovered that the term fuzzy bucket now refers to an item of headgear, meaning that the DC community could quite literally stick its collective head in a fuzzy bucket and forget about the rest of the world :-). But I digress... these broad semantics (take a look at the definition of dcterms:coverage if you don't believe me) were seen as a feature, particularly in the early days of DC... but they become something of a problem when you try to transition those elements into well crafted semantic web vocabularies, with domains, ranges and the rest of it.

Couple that with an inherent preference for "strings" vs. "things", i.e. a reluctance to use URIs to identify the entities at the value end of a property relationship - indeed, couple it with a distinct scepticism about the use of 'http' URIs for anything other than locating Web pages - and a large dose of relatively 'flat' and/or fuzzy modelling and you have an environment which isn't exactly a breeding ground for semantic web fundamentalism.
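The "strings" vs. "things" point can be made concrete with two ways of stating the same fact in Turtle - the document and person URIs are invented for illustration:

```turtle
@prefix dcterms: <http://purl.org/dc/terms/> .

# "Strings": the value is a literal - nothing further can be said about it
<http://example.org/doc/1> dcterms:creator "Jane Smith" .

# "Things": the value is a URI-identified resource that can itself
# be described, disambiguated and linked to
<http://example.org/doc/1> dcterms:creator <http://example.org/people/jane-smith> .
```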

When we worked on the original DCMI Abstract Model, part of the intention was to come up with something that made sense to the DC community in their terms, whilst still being basically the RDF model and, thus, compatible with the Semantic Web. In the end, we alienated both sides - librarians (and others) saying it was still too complex and the RDF-crowd bemused as to why we needed anything other than the RDF model.

Oh well :-(.

I should note that a couple of things have emerged from that work that are valuable I think. Firstly, the notion of the 'record', and the importance of the 'record' as a mechanism for understanding provenance. Or, in RDF terms, the notion of bounded graphs. And, secondly, the notion of applying constraints to such bounded graphs - something that the DC community refers to as Application Profiles.

On the basis of the above background, I argued that some of the challenges for Linked Data lie in convincing people:

  • about the value of an open world model - open not just in the sense that data may be found anywhere on the Web, but also in the sense that the Web democratises expertise in a 'here comes everybody' kind of way.
  • that 'http' URIs can serve as true identifiers, of anything (web resources, real-world objects and conceptual stuff).
  • and that modelling is both hard and important. Martin Hepp, who spoke about GoodRelations just before me (his was my favorite talk of the day), indicated that the model that underpins his work has taken 10 years or so to emerge. That doesn't surprise me. (One of the things I've been thinking about since giving the talk is the extent to which 'models build communities', rather than the other way round - but perhaps I should save that as the topic of a future post).

There are other challenges as well - overcoming the general scepticism around RDF for example - but these things are what specifically struck me from working with the DC community.

I ended my talk by reading a couple of paragraphs from Chris Gutteridge's excellent blog post from earlier this month, The Modeller, which seemed to go down very well.

As to the rest of the day... it was pretty good overall. Perhaps a tad too long - the panel session at the end (which took us up to about 7pm as far as I recall) could easily have been dropped.  Ian Mulvany of Mendeley has a nice write up of all the presentations so I won't say much more here. My main concern with events like this is that they struggle to draw a proper distinction between the value of stuff being 'open', the value of stuff being 'linked', and the value of stuff being exposed using RDF. The first two are obvious - the last less-so. Linked Data (for me) implies all three... yet the examples of applications that are typically shown during these kinds of events don't really show the value of the RDFness of the data. Don't get me wrong - they are usually very compelling examples in their own right but usually it's a case of 'this was built on Linked Data, therefore Linked Data is wonderful' without really making a proper case as to why.

September 06, 2010

If I was a Batman villain I'd probably be...

The Modeller.

OK... not a real Batman villain (I didn't realise there were so many to choose from) but one made up by Chris Gutteridge in a recent blog post of the same name. It's very funny:

I’ve invented a new Batman villain. His name is “The Modeller” and his scheme is to model Gotham city entirely accurately in a way that is of no practical value to anybody. He has an OWL which sits on his shoulder which has the power to absorb huge amounts of time and energy.

...

Over the 3 issues there’s a running subplot about The Modeller's master weapon, the FRBR, which everyone knows is very very powerful but when the citizens of Gotham talk about it none of them can quite agree on exactly what it does.

...

While unpopular with the fans, issue two, “Batman vs the Protégé“, will later be hailed as a Kafkaesque masterpiece. Batman descends further into madness as he realises that every moment he’s the Batman of that second in time, and each requires a URI, and every time he considers a plan of action, the theoretical Batmen in his imagination also require unique distinct identifiers which he must assign before continuing.

I suspect there's a little bit of The Modeller in most of us - certainly those of us who have a predisposition towards Linked Data/the Semantic Web/RDF - and as I said before, I tend to be a bit of a purist, which probably makes me worse than most. I've certainly done my time with the FRBR. The trick is to keep The Modeller's influence under control as far as possible.

August 24, 2010

Resource discovery revisited...

...revisited for me that is!

Last week I attended an invite-only meeting at the JISC offices in London, notionally entitled a "JISC IE Technical Review" but in reality a kind of technical advisory group for the JISC and RLUK Resource Discovery Taskforce Vision [PDF], about which the background blurb says:

The JISC and RLUK Resource Discovery Taskforce was formed to focus on defining the requirements for the provision of a shared UK resource discovery infrastructure to support research and learning, to which libraries, archives, museums and other resource providers can contribute open metadata for access and reuse.

The morning session felt slightly weird (to me), a strange time-warp back to the kinds of discussions we had a lot of as the UK moved from the eLib Programme, thru the DNER (briefly) into what became known (in the UK) as the JISC Information Environment - discussions about collections and aggregations and metadata harvesting and ... well, you get the idea.

In the afternoon we were split into breakout groups and I ended up in the one tasked with answering the question "how do we make better websites in the areas covered by the Resource Discovery Taskforce?", a slightly strange question now I look at it but one that was intended to stimulate some pragmatic discussion about what content providers might actually do.

Paul Walk has written up a general summary of the meeting - the remainder of this post focuses on the discussion in the 'Making better websites' afternoon breakout group and my more general thoughts.

Our group started from the principles of Linked Data - assign 'http' URIs to everything of interest, serve useful content (both human-readable and machine-processable (structured according to the RDF model)) at those URIs, and create lots of links between stuff (internal to particular collections, across collections and to other stuff). OK... we got slightly more detailed than that but it was a fairly straight-forward view that Linked Data would help and was the right direction to go in. (Actually, there was a strongly expressed view that simply creating 'http' URIs for everything and exposing human-readable content at those URIs would be a huge step forward).

Then we had a discussion about what the barriers to adoption might be - the problems of getting buy-in from vendors and senior management, the need to cope with a non-obvious business model (particularly in the current economic climate), the lack of technical expertise (not to mention semantic expertise) in parts of those sectors, the endless discussions that might take place about how to model the data in RDF, the general perception that Semantic Web is permanently just over the horizon and so on.

And, in response, we considered the kinds of steps that JISC (and its partners) might have to undertake to build any kind of political momentum around this idea.

To cut a long story short, we more-or-less convinced ourselves out of a purist Linked Data approach as a way forward, instead preferring a 4 layer model of adoption, with increasing levels of semantic richness and machine-processability at each stage:

  1. expose data openly in any format available (.csv files, HTML pages, MARC records, etc.)
  2. assign 'http' URIs to things of interest in the data, expose it in any format available (.csv files, HTML pages, etc.) and serve useful content at each URI
  3. assign 'http' URIs to things of interest in the data, expose it as XML and serve useful content at each URI
  4. assign 'http' URIs to things of interest in the data and expose Linked Data (as per the discussion above).

These would not be presented as steps to go thru (do 1, then 2, then 3, ...) but as alternatives with increasing levels of semantic value. Good practice guidance would encourage the adoption of option 4, laying out the benefits of such an approach, but the alternatives would provide lower barriers to adoption and offer a simpler 'sell' politically.

The heterogeneity of data being exposed would leave a significant implementation challenge for the aggregation services attempting to make use of it and the JISC (and partners) would have to fund some pretty convincing demonstrators of what might usefully be achieved.

One might characterise these approaches as 'data.glam.uk' (echoing 'data.gov.uk' but where 'glam' is short for 'galleries, libraries, archives and museums') and/or Digital UK (echoing the pragmatic approaches being successfully adopted by the Digital NZ activity in New Zealand).

Despite my reservations about the morning session, the day ended up being quite a useful discussion. That said, I remain somewhat uncomfortable with its outcomes. I'm a purist at heart and the 4 levels above are anything but pure. To make matters worse, I'm not even sure that they are pragmatic. The danger is that people will adopt only the lowest, least semantic, option and think they've done what they need to do - something that I think we are seeing some evidence of happening within data.gov.uk.

Perhaps even more worryingly, having now stepped back from the immediate talking-points of the meeting itself, I'm not actually sure we are addressing a real user need here any more - the world is so different now than it was when we first started having conversations about exposing cultural heritage collections on the Web, particularly library collections - conversations that essentially pre-dated Google, Google Scholar, Amazon, WorldCat, CrossRef, ... the list goes on. Do people still get agitated by, for example, the 'book discovery' problem in the way they did way back then? I'm not sure... but I don't think I do. At the very least, the book 'discovery' problem has largely become an 'appropriate copy' problem - at least for most people? Well, actually, let's face it... for most people the book 'discovery' and 'appropriate copy' problems have been solved by Amazon!

I also find the co-location of libraries, museums and archives, in the context of this particular discussion, rather uncomfortable. If anything, this grouping serves only to prolong the discussion and put off any decision making?

Overall then, I left the meeting feeling somewhat bemused about where this current activity has come from and where it is likely to go.


July 29, 2010

legislation.gov.uk

I woke up this morning to find a very excited flurry of posts in my Twitter stream pointing to the launch by the UK National Archives of the legislation.gov.uk site, which provides access to all UK legislation, including revisions made over time. A post on the data.gov.uk blog provides some of the technical background and highlights the ways in which the data is made available in machine-processable forms. Full details are provided in the "Developer Zone" documents.

I don't for a second pretend to have absorbed all the detail of what is available, so I'll just highlight a couple of points.

First and foremost, this is being delivered with an eye firmly on the Linked Data principles. From the blog post I mentioned above:

For the web architecturally minded, there are three types of URI for legislation on legislation.gov.uk. These are identifier URIs, document URIs and representation URIs. Identifier URIs are of the form http://www.legislation.gov.uk/id/{type}/{year}/{number} and are used to denote the abstract concept of a piece of legislation - the notion of how it was, how it is and how it will be. These identifier URIs are designed to support the use of legislation as part of the web of Linked Data. Document URIs are for the document. Representation URIs are for the different types of possible rendition of the document, so htm, pdf or xml.

(Aside: I admit to a certain squeamishness about the notion of "representation URIs" and I kinda prefer to think in terms of URIs for Generic Documents and for Specific Documents, along the lines described by Tim Berners-Lee in his "Generic Resources" note, but that's a minor niggle of terminology on my part, and not at all a disagreement with the model.)
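The three URI types quoted above follow a regular pattern, which is worth making concrete. Below is a minimal sketch in Python: the identifier pattern is taken directly from the quoted blog post, while the document and representation forms shown are my own illustrative assumptions - check the Developer Zone documents for the actual scheme.

```python
# Sketch of the legislation.gov.uk URI pattern quoted above. The identifier
# pattern comes from the quoted post; the document and representation forms
# below are illustrative assumptions, not the documented scheme.

BASE = "http://www.legislation.gov.uk"

def identifier_uri(leg_type: str, year: int, number: int) -> str:
    """Identifier URI: denotes the abstract piece of legislation."""
    return f"{BASE}/id/{leg_type}/{year}/{number}"

def document_uri(leg_type: str, year: int, number: int) -> str:
    """Document URI: the document about the legislation (assumed form)."""
    return f"{BASE}/{leg_type}/{year}/{number}"

def representation_uri(leg_type: str, year: int, number: int, fmt: str) -> str:
    """Representation URI: a specific rendition - htm, pdf or xml (assumed form)."""
    return f"{BASE}/{leg_type}/{year}/{number}/data.{fmt}"

print(identifier_uri("ukpga", 2010, 24))
# → http://www.legislation.gov.uk/id/ukpga/2010/24
```

The RDF example cited below (`.../ukpga/2010/24/contents/data.rdf`) fits this general shape, with the format carried in the final path segment.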

A second aspect I wanted to highlight (given some of my (now slightly distant) past interests) is that, on looking at the RDF data (e.g. http://www.legislation.gov.uk/ukpga/2010/24/contents/data.rdf), I noticed that it appears to make use of a FRBR-based model to deal with the challenge of representing the various flavours of "versioning" relationships.

I haven't had time to look in any detail at the implementation, other than to observe that the data can get quite complex - necessarily so - when dealing with a lot of whole-part and revision-of/variant-of/format-of relationships. (There was one aspect where I wondered if the FRBR concepts were being "stretched" somewhat, but I'm writing in haste and I may well be misreading/misinterpreting the data, so I'll save that question for another day.)

It's fascinating to see the FRBR approach being deployed as a practical solution to a concrete problem, outside of the library community in which it originated.

Pretty cool stuff, and congratulations to all involved in providing it. I look forward to seeing how the data is used.

July 08, 2010

Going LOCAH: a Linked Data project for JISC

Recently I worked with Adrian Stevenson of UKOLN and Jane Stevenson and Joy Palmer of MIMAS, University of Manchester on a bid for a project under the JISC 2/10 call, Deposit of research outputs and Exposing digital content for education and research, and I'm very pleased to be able to say that the proposal has been accepted and the project has been funded.

The project is called "Linked Open Copac Archives Hub" (LOCAH). It aims to address the "expose" section of the call, and focuses on making available data from the Copac and Archives Hub services hosted by MIMAS - i.e. library catalogue data and data from archival finding aids - in the form of Linked Data; developing some prototype applications illustrating the use of that data; and analysing some of the issues arising from that work. The main partners in the work are UKOLN and MIMAS, with contributions from Eduserv, OCLC and Talis. The Eduserv contribution will take the form of some input from me, probably mostly in the area of working with Jane on modelling some of the archival finding aid data, currently held in the form of EAD-encoded XML documents, so that it can be represented in RDF - though I imagine I'll be sticking my oar in on various other aspects along the way.
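To give a flavour of what "EAD to RDF" means in the simplest case, here is a toy sketch. The element names (`unitid`, `unittitle`, `unitdate`) are real EAD, but the mapping and the choice of Dublin Core properties are purely illustrative assumptions on my part - they are emphatically not the LOCAH design, which hasn't been worked out yet.

```python
import xml.etree.ElementTree as ET

# A tiny, simplified EAD fragment (real finding aids are far richer).
ead = """<ead>
  <archdesc>
    <did>
      <unitid>GB 123 ABC</unitid>
      <unittitle>Papers of A. Person</unittitle>
      <unitdate>1901-1950</unitdate>
    </did>
  </archdesc>
</ead>"""

# Illustrative mapping from EAD elements to RDF properties; the property
# choices here are assumptions for the sketch, not the project's decisions.
PROPERTIES = {
    "unitid": "http://purl.org/dc/terms/identifier",
    "unittitle": "http://purl.org/dc/terms/title",
    "unitdate": "http://purl.org/dc/terms/date",
}

def ead_to_triples(xml_text: str, subject: str) -> list[str]:
    """Emit simple N-Triples lines for the <did> fields of a finding aid."""
    root = ET.fromstring(xml_text)
    triples = []
    for elem, prop in PROPERTIES.items():
        node = root.find(f".//did/{elem}")
        if node is not None and node.text:
            triples.append(f'<{subject}> <{prop}> "{node.text}" .')
    return triples

for t in ead_to_triples(ead, "http://example.org/archive/abc"):
    print(t)
```

The interesting modelling work, of course, is everything this sketch leaves out: the hierarchical whole-part structure of a finding aid, and deciding what the "things" with URIs actually are.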

UKOLN is managing the project and hosting a project weblog. I'm not sure at the moment how I'll divide up thoughts between here and there; I'll probably end up with a bit of duplication along the way.

May 05, 2010

The future of UK Dublin Core application profiles

I spent yesterday morning up at UKOLN (at the University of Bath) for a brief meeting about the future of JISC-funded Dublin Core application profile development in the UK.

I don't intend to report on the outcomes of the meeting here since it is not really my place to do so (I was just invited as an interested party and I assume that the outcomes of the meeting will be made public in due course). However, attending the meeting did make me think about some of the issues around the way application profiles have tended to be developed to date and these are perhaps worth sharing here.

By way of background, the JISC have been funding the development of a number of Dublin Core application profiles in areas such as scholarly works, images, time-based media, learning objects, GIS and research data over the last few years.  An application profile provides a model of some subset of the world of interest and an associated set of properties and controlled vocabularies that can be used to describe the entities in that model for the purposes of some application (or service) within a particular domain. The reference to Dublin Core implies conformance with the DCMI Abstract Model (which effectively just means use of the RDF model) and an inherent preference for the use of Dublin Core terms whenever possible.
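In code terms, an application profile boils down to a set of constraints on which terms a description must or may use. The toy validator below illustrates the idea; the constraint set is invented for illustration and doesn't correspond to any actual JISC profile.

```python
# Toy sketch of what a DC application profile amounts to in practice: a set
# of constraints on which properties a description must/may use. The
# constraint set below is invented for illustration only.

PROFILE = {
    "http://purl.org/dc/terms/title": {"required": True},
    "http://purl.org/dc/terms/creator": {"required": True},
    "http://purl.org/dc/terms/subject": {"required": False},
}

def validate(description: dict) -> list[str]:
    """Return a list of problems for a description checked against the profile."""
    problems = []
    for prop, rules in PROFILE.items():
        if rules["required"] and prop not in description:
            problems.append(f"missing required property {prop}")
    for prop in description:
        if prop not in PROFILE:
            problems.append(f"property {prop} not in profile")
    return problems

record = {"http://purl.org/dc/terms/title": "A scholarly work"}
print(validate(record))
# → ['missing required property http://purl.org/dc/terms/creator']
```

Real profiles also constrain value types, controlled vocabularies and the entity model itself, but the basic shape - a checkable contract over descriptions - is the same.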

The meeting was intended to help steer any future UK work in this area.

I think (note that this blog post is very much a personal view) that there are two key aspects of the DC application profile work to date that we need to think about.

Firstly, DC application profiles are often developed by a very small number of interested parties (sometimes just two or three people), and engagement in the process by the wider community is quite hard to achieve. This isn't just a problem with the UK JISC-funded work on application profiles by the way. Almost all of the work undertaken within the DCMI community on application profiles suffers from the same problem - mailing lists and meetings with very little active engagement beyond a small core set of people.

Secondly, whilst the importance of enumerating the set of functional requirements that the application profile is intended to meet has not been underestimated, it is true to say that DC application profiles are often developed in the absence of an actual 'software application'. Again, this is also true of the application profile work being undertaken by the DCMI. What I mean here is that there is not a software developer actually trying to build something based on the application profile at the time it is being developed. This is somewhat odd (to say the least) given that they are called application profiles!

Taken together, these two issues mean that DC application profiles often take on a rather theoretical status - and an associated "wouldn't it be nice if" approach. The danger is a growth in the complexity of the application profile and a lack of any real business drivers for the work.

Speaking from the perspective of the Scholarly Works Application Profile (SWAP) (the only application profile for which I've been directly responsible), in which we adopted the use of FRBR, there was no question that we were working to a set of perceived functional requirements (e.g. "people need to be able to find the latest version of the current item"). However, we were not driven by the concrete needs of a software developer who was in the process of building something. We were in the situation where we could only assume that an application would be built at some point in the future (a UK repository search engine in our case). I think that the missing link to an actual application, with actual developers working on it, directly contributed to the lack of uptake of the resulting profile. There were other factors as well of course - the conceptual challenge of basing the work on FRBR and the fact that existing repository software was not RDF-ready, for example - but I think that was the single biggest factor overall.

Oddly, I think JISC funding is somewhat to blame here because, in making funding available, JISC helps the community to side-step the part of the business decision-making that says, "what are the costs (in time and money) of developing, implementing and using this profile vs. the benefits (financial or otherwise) that result from its use?".

It is perhaps worth comparing current application profile work and other activities. Firstly, compare the progress of SWAP with the progress of the Common European Research Information Format (CERIF), about which the JISC recently reported:

EXRI-UK reviewed these approaches against higher education needs and recommended that CERIF should be the basis for the exchange of research information in the UK. CERIF is currently better able to encode the rich information required to communicate research information, and has the organisational backing of EuroCRIS, ensuring it is well-managed and sustainable.

I don't want to compare the merits of these two approaches at a technical level here. What is interesting however, is that if CERIF emerges as the mandated way in which research information is shared in the UK then there will be a significant financial driver to its adoption within systems in UK institutions. Research information drives a significant chunk of institutional funding which, in turn, drives compliance in various applications. If the UK research councils say, "thou shalt do CERIF", that is likely what institutions will do.  They'll have no real choice. SWAP has no such driver, financial or otherwise.

Secondly, compare the current development of Linked Data applications within the UK data.gov.uk initiative with the current application profile work. Current government policy in the UK effectively says, 'thou shalt do Linked Data' but isn't really any more prescriptive. It encourages people to expose their data as Linked Data and to develop useful applications based on that data. Ignoring any discussion about whether Linked Data is a good thing or not, what has resulted is largely ground-up. Individual developers are building stuff and, in the process, are effectively developing their own 'application profiles' (though they don't call them that) as part of exposing/using the Linked Data. This approach results in real activity. But it also brings with it the danger of redundancy, in that every application developer may model their Linked Data differently, inventing their own RDF properties and so on as they see fit.

As Paul Walk noted at the meeting yesterday, at some stage there will be a huge clean-up task to make any widespread sense of the UK government-related Linked Data that is out there. Well, yes... there will. Conversely, there will be no clean up necessary with SWAP because nobody will have implemented it.

Which situation is better!? :-)

I think the issue here is partly to do with setting the framework at the right level. In trying to specify a particular set of application profiles, the JISC is setting the framework very tightly - not just saying, "you must use RDF" or "you must use Dublin Core" but saying "you must use Dublin Core in this particular way". On the other hand, the UK government have left the field of play much more open. The danger with the DC application profile route is lack of progress. The danger with the government approach is too little consistency.

So, what are the lessons here? The first, I think, is that it is important to lobby for your preferred technical solution at a policy level as well as at a technical level. If you believe that a Linked Data-compliant Dublin Core application profile is the best technical way of sharing research information in the UK then it is no good just making that argument to software developers and librarians. Decisions made by the research councils (in this case) will be binding irrespective of technical merit and will likely trump any decisions made by people on the ground.

The second is that we have to understand the business drivers for the adoption, or not, of our technical solutions rather better than we do currently. Who makes the decisions? Who has the money? What motivates the different parties? Again, technically beautiful solutions won't get adopted if the costs of adoption are perceived to outweigh the benefits, or if the people who hold the purse strings don't see any value in spending their money in that particular way, or if people simply don't get it.

Finally, I think we need to be careful that centralised, top-down, initiatives (particularly those with associated funding) don't distort the environment to such an extent that the 'real' drivers, both financial and user-demand, can be ignored in the short term, leading to unsustainable situations in the longer term. The trick is to pump-prime those things that the natural drivers will support in the long term - not always an easy thing to pull off.

April 08, 2010

Linked Data & Web Search Engines

I seem to have fallen into the habit of half-writing posts and then either putting them to one side because I don't feel entirely happy with them or because I get diverted into other more pressing things. This is one of several that I seem to have accumulated over the last few weeks, and which I've resolved to try to get out there....

A few weekends ago I spotted a brief exchange on Twitter between Andy and our colleague Mike Ellis on the impact of exposing Linked Data on Google search ranking. Their conclusion seemed to be that the impact was minimal. I think I'd question this assessment, and here I'll try to explain why - though in the absence of empirical evidence, I admit this is largely speculation on my part, a "hunch", if you like. I admit I almost hesitate to write this post at all, as I am far from an expert in "search-engine optimisation", and, tbh, I have something of an instinctive reaction against a notion that a high Google search ranking is the "be all and end all" :-) But I recognise it is something that many content providers care about.

In this post, I'm not considering the ways search engines might use the presence of structured data in the documents they index to enhance result sets (or make that data available to developers to provide such enhancements); rather, I'm thinking about the potential impact of the Linked Data approach on ranking.

It is widely recognised that one of the significant factors in Google's page ranking algorithm is the weighting it attaches to the number of links made to the page in question from other pages ("incoming links"). Beyond that, the recommendations of the Google Webmaster guidelines seem to be largely "common sense" principles for providing well-formed X/HTML, enabling access to your document set for Google's crawlers, and not resorting to techniques that attempt to "game" the algorithm.

Let's go back to Berners-Lee's principles for Linked Data:

  1. Use URIs as names for things

  2. Use HTTP URIs so that people can look up those names.

  3. When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL)

  4. Include links to other URIs so that they can discover more things.

The How to Publish Linked Data on the Web and the W3C Note on Cool URIs for the Semantic Web elaborate on some of the mechanics of providing Linked Data. Both of these sources make the point that to "provide useful information" means to provide data both in RDF format(s) and in human-readable forms.
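Serving both forms from a single URI typically relies on HTTP content negotiation: the server inspects the Accept header and chooses a representation. Here is a deliberately minimal sketch of that decision; the format table is an assumption for illustration, and a real server would also handle q-values and 303 redirects as described in the Cool URIs note.

```python
# Minimal sketch of the content-negotiation decision a Linked Data server
# makes when "providing useful information" for a URI. The format table is
# illustrative; real servers also handle q-values and 303 redirects.

FORMATS = {
    "application/rdf+xml": "data.rdf",
    "text/turtle": "data.ttl",
    "text/html": "index.html",
}

def negotiate(accept_header: str) -> str:
    """Return the representation to serve for a (simplified) Accept header."""
    for media_type in accept_header.split(","):
        media_type = media_type.split(";")[0].strip()  # ignore q-values
        if media_type in FORMATS:
            return FORMATS[media_type]
    return "index.html"  # default to the human-readable form

print(negotiate("application/rdf+xml"))  # → data.rdf
print(negotiate("text/html,application/xhtml+xml"))  # → index.html
```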

So following those guidelines typically means that "exposing Linked Data" results in exposing a whole lot of new Web documents "about" the things featured in the dataset, in both RDF/XML (or another RDF format) and in XHTML/HTML - and indeed the use of XHTML+RDFa could meet both requirements in a single format. So this immediately increases what Leigh Dodds of Talis rather neatly refers to as the "surface area" of my pages which are available for Google to crawl and index.

The second aspect which is significant is that, by definition, Linked Data is about making links: I make links between items described in my own dataset, but I also make ("outgoing") links between those items and items described in other linked datasets made available by other parties elsewhere. And (hopefully!), at least in time, other people exposing Linked Data make ("incoming") links to items in my datasets.

And in the X/HTML pages at least, those are the very same hyperlinks that Google crawls and counts when calculating its pagerank.

The key point, I think, is that my pages are available, not just to other "Linked Data applications", but also for other people to reference, bookmark and make links to just as they do any page on any Web site. This is one of the points I was trying to highlight in my last post when I mentioned the BBC's Linked Data work: the pages generated as part of those initiatives are fairly seamlessly integrated within the collection of documents that make up the BBC Web site. They do not appear as some sort of separate "data area", something just for "client apps that want data", somehow "different from" the news pages or the iPlayer pages; on the contrary, they are linked to by those other pages, and the X/HTML pages are given a "look and feel" that emphasises their membership of this larger aggregation. And human readers of the BBC Web site encounter those pages in the course of routinely navigating the site.

Of course the key to increasing the rank of my pages in Google is whether other people actually make those links to pages I expose, and it may well be that for much of the data surfaced so far, such links are relatively small in number. But the Linked Data approach, and its emphasis on the use of URIs and links, helps me do my bit to make sure my resources are things "of (or in) the Web".

So I'd argue that the Linked Data approach is potentially rather a good fit with what we know of the way Google indexes and ranks pages - precisely because both approaches seek to "work with the grain of the Web". I'd stick my neck out and say that having a page about my event (project, idea, whatever) provides a rather better basis for making that information findable in Google than exposing that description only as the content of a particular row in an Excel spreadsheet, where it is difficult to reference as an individual target resource and where it is (typically at least) not a source of links to other resources.

As I was writing this, I saw a new post appear from Michael Hausenblas, in which he attempts to categorise some common formats and services according to what he calls their "Link Factor" ("the degree of how 'much' they are in the Web"). And more recently, I noticed the appearance of a post titled 10 Reasons Why News Organizations Should Use 'Linked Data' which, in its first two points, highlights the importance of Linked Data's use of hyperlinks and URIs to SEO - and points to the fact that the BBC's Wildlife Finder pages do tend to appear prominently in Google result sets.

Before I get carried away, I should add a few qualifiers, and note some issues which I can imagine may have some negative impact. And I should emphasise this is just my thinking out loud here - I think more work is necessary to examine the actual impact, if any.

  • Redirects: Many of the links in Linked Data are made between "things", rather than between the pages describing the things. And following the "Cool URIs" guidelines, these URIs would either be URIs with fragment identifiers ("hash URIs") or URIs for which an HTTP server responds with a 303 response providing the URI of a document describing the thing. For the first case, I think Google recognises these as links to the document with the URI obtained by stripping the fragment id; for the 303 case, I'm unsure about the impact of the use of the redirect on the ranking for the document which is the final target. (A related issue would be that some sources might cite the URI of the thing and other sources might cite the URI of the document describing the thing).
  • Synonyms: As the Publishing Linked Data tutorial highlights, one of the characteristics of Linked Data is that it often makes use of URI aliases, multiple URIs each referring to the same resource. If some users bookmark/cite URI A and some users bookmark/cite URI B, then that would result in a lower link-based ranking for each of the two pages describing the thing than if all users bookmarked/cited a single URI. To some extent, this is just part of the nature of the Web, and it applies similarly outside the Linked Data context, but the tendency to generate an increasing number of aliases is something which generates continued discussion in the LD community (see, for example, the recent thread on "annotation" on the public-lod mailing list, generated in response to Leigh Dodds' and Ian Davis' recent Linked Data Patterns document - which, I should add, from my very hasty skim reading so far, seems to provide an excellent combination of thoughtful discussion and clear practical suggestions).
  • "Caching"/"Resurfacing": As we are seeing Linked Data being deployed, we are seeing data aggregated by various agencies and resurfaced on the Web using new URIs. Similarly to the previous point, this may lead to a case where two users cite different URIs, with a corresponding reduction in the number of incoming links to any single document. I also note that Google's guidelines include the admonition: "Don't create multiple pages, subdomains, or domains with substantially duplicate content", which does make me wonder whether such resurfaced content may have a negative impact on ranking.
  • "Good XHTML": While links are important, they aren't the whole story, and attention still needs to be paid to ensuring that HTML pages generated by a Linked Data application follow the sort of general good practice for "good Web pages" described in the Google guidelines (provide well-structured XHTML, use title elements, use alt attributes, don't fill pages with irrelevant keywords, etc.).
  • Sitemaps: This is probably just a special case of the previous point, but Google emphasises the importance of using sitemaps to provide entry points for its crawlers. Although I'm aware of the Semantic Sitemap extension, I'm not sure whether the use of sitemaps is widespread in Linked Data deployments - though it is the sort of thing I'd expect to see happen as Linked Data moves further out of the preserve of the "research project" and towards more widespread deployment.
  • "Granularity": (I'm unsure whether this is a factor or not: I can imagine it might be, but it's probably not simple to assess exactly what the impact is.) How a provider decides to "chunk up" their descriptive data into documents might have an impact on the "density" of incoming links. If they expose a large number of documents each describing a single specific resource, does that result in each document receiving fewer incoming links than if they expose a smaller number of documents each describing several resources?
  • Integration: Although above I highlighted the BBC as an example of Linked Data being well-integrated into a "traditional" Web site, and so made highly visible to users of that Web site, I suspect this may - at the moment at least - be the exception rather than the rule. However, as with the previous point, this is something I'd expect to become more common.
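The two patterns described in the first bullet above - hash URIs and 303 redirects - both boil down to finding the document that describes a thing. A small sketch of that resolution step, from a client's point of view; the example URIs and the lookup table standing in for a live 303 exchange are invented for illustration.

```python
from urllib.parse import urldefrag

# Sketch of the two "Cool URIs" patterns discussed above: resolving a URI
# for a *thing* to the URI of the *document* describing it. The SEE_OTHER
# table stands in for a live HTTP 303 exchange and is purely illustrative.

SEE_OTHER = {  # thing URI -> document URI, as a 303 response would indicate
    "http://example.org/id/person/alice": "http://example.org/doc/person/alice",
}

def document_for(thing_uri: str) -> str:
    """Find the document URI for a thing URI (hash or 303 style)."""
    base, frag = urldefrag(thing_uri)
    if frag:
        # Hash URI: strip the fragment identifier - links to the thing
        # become links to this document.
        return base
    # 303 style: the server redirects to a separate describing document.
    return SEE_OTHER.get(thing_uri, thing_uri)

print(document_for("http://example.org/about#alice"))
# → http://example.org/about
```

The ranking question raised above is then whether the link "credit" from citations of the thing URI flows through the redirect to the describing document.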

Nevertheless, I still stand by my "hunch" that the LD approach is broadly "good" for ranking. I'm not claiming Linked Data is a panacea for search-engine optimisation, and I admit that some of what I'm suggesting here may be "more potential than actual". But I do believe the approach can make a positive contribution - and that is because both the Google ranking algorithm and Linked Data exploit the URI and the hyperlink: they "work with the grain of the Web".

April 07, 2010

Linked Data business models

The by-line is mine but the content comes from Pete (in a message to the Dublin Core Advisory Board mailing list)... a list of blog posts and other resources that articulate some of the business cases/models around Linked Data:

March 23, 2010

JISC Linked Data call possibilities

Hmmm... I should have written this post about two weeks ago but my blog-writing ain't what it used to be :-).

On that basis, this is just a quick note to say that we (Eduserv) are interested in strand B (the Linked Data strand) of the JISC Grant Funding Call 2/10: Deposit of research outputs and Exposing digital content for education and research.

What can we offer? A pretty decent level of expertise in Linked Data, RDF, RDFa and Dublin Core, and some experience of modelling data. This comes in the form of your favourite eFoundations contributors, Pete Johnston and Andy Powell. In this instance we are not offering either hosting space or software development.

I appreciate that this comes late in the day and that people's planning will already be quite advanced. But we are keen to get involved in this programme somewhere so if you have a vacancy for the kind of expertise above, please get in touch.

Federating purl.org ?

I suggested a while back that PURLs have become quite important, at least for some aspects of the Web (particularly Linked Data as it happens), and that the current service at purl.org may therefore represent something of a single point of failure.

I therefore note with some interest that Zepheira, the company developing the PURL software, have recently announced a PURL Federation Architecture:

A PURL Federation is proposed which will consist of multiple independently-operated PURL servers, each of which have their own DNS hostnames, name their PURLs using their own authority (different from the hostname) and mirror other PURLs in the federation. The authorities will be "outsourced" to a dynamic DNS service that will resolve to proxies for all authorities of the PURLs in the federation. The attached image illustrates and summarizes the proposed architecture.

Caching proxies are inserted between the client and federation members. The dynamic DNS service responds to any request with an IP address of a proxy. The proxy attempts to contact the primary PURL member via its alternative DNS name to fulfill the request and caches the response for future requests. In the case where the primary PURL member is not responsive, the proxy attempts to contact another host in the federation until it succeeds. Thus, most traffic for a given PURL authority continues to flow to the primary PURL member for that authority and not other members of the federation.
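The failover behaviour the quote describes can be sketched very simply: a proxy tries the primary member for an authority and walks through the mirrors until one answers. The hostnames and the fetch function below are made up for illustration; this is just the control flow, not Zepheira's implementation.

```python
# Sketch of the proxy failover logic described in the quoted architecture:
# try the primary PURL member for an authority, then fall back to other
# federation members. Hostnames and the fetch function are illustrative.

FEDERATION = ["purl-primary.example.org",
              "purl-mirror1.example.org",
              "purl-mirror2.example.org"]

def resolve_via_federation(path, fetch, members=FEDERATION):
    """Try each member in turn until one responds; mimics the proxy."""
    for host in members:
        try:
            return fetch(host, path)  # an HTTP GET in a real proxy
        except ConnectionError:
            continue  # member unresponsive; try the next mirror
    raise ConnectionError("no federation member responded")

# Simulate the primary being down:
def fake_fetch(host, path):
    if host == "purl-primary.example.org":
        raise ConnectionError("primary down")
    return f"200 OK from {host}{path}"

print(resolve_via_federation("/net/example", fake_fetch))
# → 200 OK from purl-mirror1.example.org/net/example
```

As the quote notes, in normal operation the primary answers and the mirrors see little traffic; the fallback path only matters when the primary is unresponsive.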

I don't know what is planned in this space, and I may not have read the architecture closely enough, but it seems to me that there is now a significant opportunity for OCLC to work with a small number of national libraries (the British Library, the Library of Congress and the National Library of Australia spring to mind as a usefully geographically-dispersed set) to federate the current service at purl.org?

February 26, 2010

The 2nd Linked Data London Meetup & trying to bridge a gap

On Wednesday I attended the 2nd London Linked Data Meetup, organised by Georgi Kobilarov and Silver Oliver and co-located with the JISC Dev8D 2010 event at ULU.

The morning session featured a series of presentations:

  • Tom Heath from Talis started the day with Linked Data: Implications and Applications. Tom introduced the RDF model, and planted the idea that the traditional "document" metaphor (and associated notions like the "desktop" and the "folder") were inappropriate and unnecessarily limiting in the context of Linked Data's Web of Things. Tom really just scratched the surface of this topic, I think, with a few examples of the sort of operations we might want to perform, and there was probably scope for a whole day of exploring it.
  • Tom Scott from the BBC on the Wildlife Finder, the ontology behind it, and some of the issues around generating and exposing the data. I had heard Tom speak before, about the BBC Programmes and Music sites, and again this time I found myself admiring the way he covered potentially quite complex issues very clearly and concisely. The BBC examples provide great illustrations of how linked data is not (or at least should not be) something "apart from" a "Web site", but rather is an integral part of it: they are realisations of the "your Web site is your API" maxim. The BBC's use of Wikipedia as a data source also led into some interesting discussion of trust and provenance, and dealing with the risk of, say, an editor of a Wikipedia page creating malicious content which was then surfaced on the BBC page. At the time of the presentation, the wildlife data was still delivered only in HTML, but Tom announced yesterday that the RDF data was now being exposed, in a similar style to that of the Programmes and Music sites.
  • John Sheridan and Jeni Tennison described their work on initiatives to open up UK government data. This was really something of a whirlwind (or maybe given the presenters' choice of Wild West metaphors, that should be a "twister") tour through a rapidly evolving landscape of current work, but I was impressed by the way they emphasised the practical and pragmatic nature of their approaches, from guidance on URI design through work on provenance, to the current work on a "Linked Data API" (on which more below)
  • Lin Clark of DERI gave a quick summary of support for RDFa in Drupal 7. It was necessarily a very rapid overview, but it was enough to make me make a mental note to try to find the time to explore Drupal 7 in more detail.
  • Georgi Kobilarov and Silver Oliver presented Uberblic, which provides a single integrated point of access to a set of data sources. One of the very cool features of Uberblic is that updates to the sources (e.g. a Wikipedia edit) are reflected in the aggregator in real time.

The morning closed with a panel session chaired by Paul Miller, involving Jeni Tennison, Tom Scott, Ian Davis (Talis) and Timo Hannay (Nature Publishing) which picked up many of the threads from the preceding sessions. My notes (and memories!) from this session seem a bit thin (in my defence, it was just before lunch and we'd covered a lot of ground...), but I do recall discussion of the trade-offs between URI readability and opacity, and the impact on managing persistence, which I think spilled out into quite a lot of discussion on Twitter. IIRC, this session also produced my favourite quote of the day, from Tom Scott, which was something along the lines of, "The idea that doing linked data is really hard is a myth".

Perhaps the most interesting (and timely/topical) session of the day was the workshop at the end of the afternoon by Jeni Tennison, Leigh Dodds and Dave Reynolds, in which they introduced a proposal for what they call a "Linked Data API".

This defines a configurable "middleware" layer that sits in front of a SPARQL endpoint to support the provision of RESTful access to the data, including not only the provision of descriptions of individual identified resources, but also selection and filtering based on simple URI patterns rather than on SPARQL, and the delivery of multiple output formats (including a serialisation of RDF in JSON - and the ability to generate HTML or XHTML). (It only addresses read access, not updates.)
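The heart of such an API is a translation from simple, RESTful URI patterns into SPARQL executed behind the scenes. The sketch below illustrates that idea only; the URI patterns and query templates are my own invention and bear no relation to the proposal's actual configuration syntax.

```python
import re

# Sketch of the core idea of the proposed "Linked Data API": translate a
# simple RESTful URI pattern into a SPARQL query behind the scenes. The
# patterns and query templates here are invented for illustration and are
# not the actual configuration syntax of the proposal.

ROUTES = [
    (re.compile(r"^/schools/(?P<id>\w+)$"),
     "DESCRIBE <http://example.org/id/school/{id}>"),
    (re.compile(r"^/schools\?district=(?P<district>\w+)$"),
     'SELECT ?s WHERE {{ ?s <http://example.org/def/district> "{district}" }}'),
]

def to_sparql(request_path: str) -> str:
    """Map an API request path to the SPARQL query it stands for."""
    for pattern, template in ROUTES:
        match = pattern.match(request_path)
        if match:
            return template.format(**match.groupdict())
    raise ValueError(f"no route for {request_path}")

print(to_sparql("/schools/s42"))
# → DESCRIBE <http://example.org/id/school/s42>
```

A developer using the API only ever sees the path on the left; the SPARQL on the right stays behind the middleware, although the proposal also allows clients to get at the underlying query if they want it.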

This initiative emerged at least in part out of responses to the data.gov.uk work, and comments on the UK Government Data Developers Google Group and elsewhere by developers unfamiliar with RDF and related technologies. It seeks to try to address the problem that the provision of queries only through SPARQL requires the developers of applications to engage directly with the SPARQL query language, the RDF model and the possibly unfamiliar formats provided by SPARQL. At the same time, this approach also seeks to retain the "essence" of the RDF model in the data - and to provide clients with access to the underlying queries if required: it complements the SPARQL approach, rather than replaces it.

The configurability offers a considerable degree of flexibility in the interface that can be provided - without the requirement to create new application code. Leigh made the important point that the API layer might be provided by the publisher of the SPARQL endpoint, or it might be provided by a third party, acting as an intermediary/proxy to a remote SPARQL endpoint.

IIRC, mentions were made of work in progress on implementations in Ruby, PHP and Java(?).

As a non-developer myself, I hope I haven't misrepresented any of the technical details in my attempt to summarise this. There was a lot of interest in this session at the meeting, and it seems to me this is potentially an important contribution to bridging the gap between the world of Linked Data and SPARQL on the one hand and Web developers on the other, both in terms of lowering immediate barriers to access and in terms of introducing SPARQL more gradually. There is now a Google Group for discussion of the API.

All in all it was an exciting if somewhat exhausting day. The sessions I attended were all pretty much full to capacity and generated a lot of discussion, and it generally felt like there is a great deal of excitement and optimism about what is becoming possible. The tools and infrastructure around linked data are still evolving, certainly, but I was particularly struck - through initiatives like the API project above - by the sense of willingness to respond to comments and criticisms and to try to "build bridges", and to do so in very real, practical ways.

February 03, 2010

More famous than Simon Cowell

I wrote a blog post on my other, Blipfoto, blog this morning, More famous than Simon Cowell, looking at some of the issues around persistent identifiers from the perspective of a non-technical audience. (You'll have to read the post to understand the title).

I used the identifier painted on the side of a railway bridge just outside Bath as my starting point.

It's certainly not an earth-shattering post, but it was quite interesting (for me) to approach things from a slightly different perspective:

What makes the bridge identifier persistent? It's essentially a social construct. It's not a technical thing (primarily). It's not the paint the number is written in, or the bricks of the bridge itself, or the computer system at head office that maps the number to a map reference. These things help... but it's mainly people that make it persistent.

I wrote the piece because the JISC have organised a meeting, taking place in London today, to consider their future requirements around persistent identifiers. For various reasons I was not able to attend - a situation that I'm pretty ambivalent about to be honest - I've sat thru a lot of identifier meetings in the past :-).

Regular readers will know that I've blown hot and cold (well, mainly cold!) about the DOI - an identifier that I'm sure will feature heavily in today's meeting. Just to be clear... I am not against what the DOI is trying to achieve, nor am I in any way negative about the kinds of services, particularly CrossRef, that have been able to grow up around it. Indeed, while I was at UKOLN I committed us to joining CrossRef and thus assigning DOIs to all UKOLN publications. (I have no idea if they are still members).  I also recognise that the DOI is not going to go away any time soon.

I am very critical of some of the technical decisions that the DOI people have made - particularly their decision to encourage multiple ways of encoding the DOI as a URI and the fact that the primary way (the 'doi' URI scheme) did not use an 'http' URI. Whilst persistence is largely a social issue rather than a technological one, I do think that badly used technology can get in the way of both persistence and utility. I also firmly believe in the statement that I have made several times previously... that "the only good long term identifier is a good short term identifier".  The DOI, in both its 'doi' URI and plain-old string of characters forms, is not a good short term identifier.

My advice to the JISC? Start from the principles of Linked Data, which very clearly state that 'http' URIs should be used. Doing so sidesteps many of the cyclic discussions that otherwise occur around the benefits of URNs and other URI schemes and allows people to focus on the question of, "how do we make http URIs work as well and as persistently as possible?" rather than always starting from, "http URIs are broken, what should we build instead?".
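To illustrate the point about the DOI's multiple encodings, here is a toy normaliser that maps the various ways a DOI gets written down onto a single 'http' URI, using the dx.doi.org resolver that CrossRef promoted at the time. This is my own illustration, not anything the DOI or CrossRef documentation prescribes.

```python
# Illustration only: collapse the several surface forms of a DOI
# ('doi:' URI, 'info:doi/' URI, resolver URL, bare string) into one
# http URI. The example DOI is arbitrary.
def doi_to_http(doi):
    for prefix in ("doi:", "info:doi/", "http://dx.doi.org/"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
            break
    return "http://dx.doi.org/" + doi

assert doi_to_http("doi:10.1000/182") == "http://dx.doi.org/10.1000/182"
assert doi_to_http("10.1000/182") == "http://dx.doi.org/10.1000/182"
```

That every consumer of DOIs ends up writing some version of this function is, in a nutshell, the problem: an 'http' URI form from the start would make the normalisation unnecessary.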

February 02, 2010

Data.gov.uk, Creative Commons and the public domain

In a blog post at Creative Commons, UK moves towards opening government data, Jane Park notes that the UK Government have taken a significant step towards the use of Creative Commons licences by making the terms and conditions for the data.gov.uk website compatible with CC-BY 3.0:

In a step towards openness, the UK has opened up its data to be interoperable with the Attribution Only license (CC BY). The National Archives, a department responsible for “setting standards and supporting innovation in information and records management across the UK,” has realigned the terms and conditions of data.gov.uk to accommodate this shift. Data.gov.uk is “an online point of access for government-held non-personal data.” All content on the site is now available for reuse under CC BY. This step expresses the UK’s commitment to opening its data, as they work towards a Creative Commons model that is more open than their former Click-Use Licenses.

This feels like a very significant move - and one that I hadn't fully appreciated in the recent buzz around data.gov.uk.

Jane Park ends her piece by suggesting that "the UK as well as other governments move in the future towards even fuller openness and the preferred standard for open data via CC Zero". Indeed, I'm left wondering about the current move towards CC-BY in relation to the work undertaken a while back by Talis to develop the Open Data Commons Public Domain Dedication and Licence.

As Ian Davis of Talis says in Linked Data and the Public Domain:

In general factual data does not convey any copyrights, but it may be subject to other rights such as trade mark or, in many jurisdictions, database right. Because factual data is not usually subject to copyright, the standard Creative Commons licenses are not applicable: you can’t grant the exclusive right to copy the facts if that right isn’t yours to give. It also means you cannot add conditions such as share-alike.

He suggests instead that waivers (of which CC Zero and the Public Domain Dedication and License (PDDL) are examples) are a better approach:

Waivers, on the other hand, are a voluntary relinquishment of a right. If you waive your exclusive copyright over a work then you are explictly allowing other people to copy it and you will have no claim over their use of it in that way. It gives users of your work huge freedom and confidence that they will not be persued for license fees in the future.

Ian Davis' post gives detailed technical information about how such waivers can be used.

January 31, 2010

Readability and linkability

In July last year I noted that the terminology around Linked Data was not necessarily as clear as we might wish it to be.  Via Twitter yesterday, I was reminded that my colleague, Mike Ellis, has a very nice presentation, Don't think websites, think data, in which he introduces the term MRD - Machine Readable Data.

It's worth a quick look if you have time.

We also used the 'machine-readable' phrase in the original DNER Technical Architecture, the work that went on to underpin the JISC Information Environment, though I think we went on to use both 'machine-understandable' and 'machine-processable' in later work (both even more of a mouthful), usually with reference to what we loosely called 'metadata'.  We also used 'm2m - machine to machine' a lot, a phrase introduced by Lorcan Dempsey I think.  Remember that this was back in 2001, well before the time when the idea of offering an open API had become as widespread as it is today.

All these terms suffer, it seems to me, from emphasising the 'readability' and 'processability' of data over its 'linkedness'. Linkedness is what makes the Web what it is. With hindsight, the major thing that our work on the JISC Information Environment got wrong was to play down the importance of the Web, in favour of a set of digital library standards that focused on sharing 'machine-readable' content for re-use by other bits of software.

Looking at things from the perspective of today, the terms 'Linked Data' and 'Web of Data' both play up the value in content being inter-linked as well as it being what we might call machine-readable.

For example, if we think about open access scholarly communication, the JISC Information Environment (in line with digital libraries more generally) promotes the sharing of content largely through the harvesting of simple DC metadata records, each of which typically contains a link to a PDF copy of the research paper, which, in turn, carries only human-readable citations to other papers.  The DC part of this is certainly MRD... but, overall, the result isn't very inter-linked or Web-like. How much better would it have been to focus some effort on getting more Web links between papers embedded into the papers themselves - using what we would now loosely call a 'micro format'?  One of the reasons I like some of the initiatives around the DOI (though I don't like the DOI much as a technology), CrossRef springs to mind, is that they potentially enable a world where we have the chance of real, solid, persistent Web links between scholarly papers.

RDF, of course, offers the possibility of machine-readability, machine-processable semantics, and links to other content - which is why it is so important and powerful and why initiatives like data.gov.uk need to go beyond the CSV and XML files of this world (which some people argue are good enough) and get stuff converted into RDF form.
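As a toy illustration of that difference, here is the same record as a CSV row and as RDF triples (serialised as N-Triples). The subject URI and the predicate vocabulary are invented for the example; a real conversion would reuse published vocabularies and agreed URI sets.

```python
# Sketch, with invented URIs: the CSV row carries bare strings; the
# triples give the record a subject URI that other datasets can link to.
import csv, io

csv_data = "id,name,opened\nbr042,Midford Viaduct,1874\n"

def row_to_ntriples(row, base="http://example.gov.uk/id/bridge/"):
    s = f"<{base}{row['id']}>"
    return [
        f'{s} <http://example.gov.uk/def/name> "{row["name"]}" .',
        f'{s} <http://example.gov.uk/def/opened> "{row["opened"]}" .',
    ]

triples = []
for row in csv.DictReader(io.StringIO(csv_data)):
    triples.extend(row_to_ntriples(row))
print("\n".join(triples))
```

The mechanical part of the conversion is trivial; the value (and the work) lies in choosing subject URIs and vocabularies that make the result linkable.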

As an aside, DCMI have done some interesting work on Interoperability Levels for Dublin Core Metadata. While this work is somewhat specific to DC metadata I think it has some ideas that could be usefully translated into the more general language of the Semantic Web and Linked Data (and probably to the notions of the Web of Data and MRD).

Mike, I think, would probably argue that this is all the musing of a 'purist' and that purists should be ignored - and he might well be right.  I certainly agree with the main thrust of the presentation that we need to 'set our data free', that any form of MRD is better than no MRD at all, and that any API is better than no API.  But we also need to remember that it is fundamentally the hyperlink that has made the Web what it is and that those forms of MRD that will be of most value to us will be those, like RDF, that strongly promote the linkability of content, not just to other content but to concepts and people and places and everything else.

The labels 'Linked Data' and 'Web of Data' are both helpful in reminding us of that.

January 22, 2010

The right and left hands of open government data in the UK

As I'm sure everyone knows by now, the UK Government's data.gov.uk site was formally launched yesterday to a significant fanfare on Twitter and elsewhere.  There's not much I can add other than to note that I think this initiative is a very good thing and I hope that we can contribute more in the future than we have done to date.

[Edit: I note that the video of the presentation by Tim Berners-Lee and Nigel Shadbolt is now available.]

I'd like to highlight two blog posts that hurtled past in my Twitter stream yesterday.  The first, by Brian Hoadley, rightly reminds us that Open data is not a panacea – but it is a start:

In truth, I’ve been waiting for Joe Bloggs on the street to mention in passing – “Hey, just yesterday I did ‘x’ online” and have it be one of those new ‘Services’ that has been developed from the release of our data. (Note: A Joe Bloggs who is not related to Government or those who encircle Government. A real true independent Citizen.)

It may be a long wait.

The reality is that releasing the data is a small step in a long walk that will take many years to see any significant value. Sure there will be quick wins along the way – picking on MP’s expenses is easy. But to build something sustainable, some series of things that serve millions of people directly, will not happen overnight. And the reality, as Tom Loosemore pointed out at the London Data Store launch, it won’t be a sole developer who ultimately brings it to fruition.

The second, from the Daily Mash, is rather more flippant, New website to reveal exactly why Britain doesn't work:

Sir Tim said ordinary citizens will be able to use the data in conjunction with Ordnance Survey maps to show the exact location of road works that are completely unnecessary and are only being carried out so that some lazy, stupid bastard with a pension the size of Canada can use up his budget before the end of March.

The information could also be used to identify Britain's oldest pothole, how much business it has generated for its local garage and why in the name of holy buggering fuck it has never, ever been fixed.

And, while we are on the subject of maps and so on, today's posting to the Ernest Marples Blog, Postcode Petition Response — Our Reply, makes for an interesting read about the government's somewhat un-joined-up response to a petition to "encourage the Royal Mail to offer a free postcode database to non-profit and community websites":

The problem is that the licence was formed to suit industry. To suit people who resell PAF data, and who use it to save money and do business. And that’s fine — I have no problem with industry, commercialism or using public data to make a profit.

But this approach belongs to a different age. One where the only people who needed postcode data were insurance and fulfilment companies. Where postcode data was abstruse and obscure. We’re not in that age any more.

We’re now in an age where a motivated person with a laptop can use postcode data to improve people’s lives. Postcomm and the Royal Mail need to confront this and change the way that they do things. They may have shut us down, but if they try to sue everyone who’s scraping postcode data from Google, they’ll look very foolish indeed.

Finally — and perhaps most importantly — we need a consistent and effective push from the top. Number 10’s right hand needs to wake up and pay attention to the fantastic things the left hand’s doing.

Without that, we won’t get anywhere.

Hear, hear.

December 21, 2009

Scanning horizons for the Semantic Web in higher education

The week before last I attended a couple of meetings looking at different aspects of the use of Semantic Web technologies in the education sector.

On the Wednesday, I was invited to a workshop of the JISC-funded ResearchRevealed project at ILRT in Bristol. From the project weblog:

ResearchRevealed [...] has the core aim of demonstrating a fine-grained, access controlled, view layer application for research, built over a content integration repository layer. This will be tested at the University of Bristol and we aim to disseminate open source software and findings of generic applicability to other institutions.

ResearchRevealed will enhance ways in which a range of user stakeholder groups can gain up-to-date, accurate integrated views of research information and thus use existing institutional, UK and potentially global research information to better effect.

I'm not formally part of the project, but Nikki Rogers of ILRT mentioned it to me at the recent VoCamp Bristol meeting, and I expressed a general interest in what they were doing; they were also looking for some concrete input on the use of Dublin Core vocabularies in some of their candidate approaches.

This was the third in a series of small workshops, attended by representatives of the project from Bristol, Oxford and Southampton, and the aim was to make progress on defining a "core Research ontology". The morning session circled mainly around usage scenarios (support for the REF (and other "impact" assessment exercises), building and sustaining cross-institutional collaboration etc), and the (somewhat blurred) boundaries between cross-institutional requirements and institution-specific ones; what data might be aggregated, what might be best "linked to"; and the costs/benefits of rich query interfaces (e.g. SPARQL endpoints) v simpler literal- or URI-based lookups. In the afternoon, Nick Gibbins from the University of Southampton walked through a draft mapping of the CERIF standard to RDF developed by the dotAC project. This focused attention somewhat and led to some - to me - interesting technical discussions about variant ways of expressing information with differing degrees of precision/flexibility. I had to leave before the end of the meeting, but I hope to be able to continue to follow the project's progress, and contribute where I can.

A long train journey later, the following day I was at a meeting in Glasgow organised by the CETIS Semantic Technologies Working Group to discuss the report produced by the recent JISC-funded Semtech project, and to try to identify potential areas for further work in that area by CETIS and/or JISC. Sheila MacNeill from CETIS liveblogged proceedings here. Thanassis Tiropanis from the University of Southampton presented the project report, with a focus on its "roadmap for semantic technology adoption". The report argues that, in the past, the adoption of semantic technologies may have been hindered by a tendency towards a "top-down" approach requiring the widespread agreement on ontologies; in contrast the "linked data" approach encourages more of a "bottom-up" style in which data is first made available as RDF, and then later application-specific or community-wide ontologies are developed to enable more complex reasoning across the base data (which may involve mapping that initial data to those ontologies as they emerge). While I think there's a slight risk of overstating the distinction - in my experience many "linked data" initiatives do seem to demonstrate a good deal of thinking about the choice of RDF vocabularies and compatibility with other datasets - and I guess I see rather more of a continuum, it's probably a useful basis for planning. The report recommends a graduated approach which focusses initially on the development of this "linked data field" - in particular where there are some "low-hanging fruit" cases of data already made available in human-readable form which could relatively easily be made available in RDF, especially using RDFa.

One of the issues I was slightly uneasy with in the Glasgow meeting was that occasionally there were mentions of delivering "interoperability" (or "data interoperability") without really saying what was meant by that - and I say this as someone who used to have the I-word in my job title ;-) I feel we probably need to be clearer, and more precise, about what different "semantic technologies" (for want of a better expression) enable. What does the use of RDF provide that, say, XML typically doesn't? What does, e.g., RDF Schema add to that picture? What about convergence on shared vocabularies? And so on. Of course, the learners, teachers, researchers and administrators using the systems don't need to grapple with this, but it seems to me such aspects do need to be conveyed to the designers and developers, and perhaps more importantly - as Andy highlighted in his report of related discussions at the CETIS conference - to those who plan and prioritise and fund such development activity. (As an aside, I think this is also something of an omission in the current version of the DCMI document on "Interoperability Levels": it tells me what characterises each level, and how I can test for whether an application meets the requirements of the level, but it doesn't really tell me what functionality each level provides/enables, or why I should consider level n+1 rather than level n.)

Rather by chance, I came across a recent presentation by Richard Cyganiak to the Vienna Linked Data Camp, which I think addresses some similar questions, albeit from a slightly different starting point: Richard asks the questions, "So, if we have linked data sources, what's stopping the development of great apps? What else do we need?", and highlights various dimensions of "heterogeneity" which may exist across linked data sources (use of identifiers, differences in modelling, differences in RDF vocabularies used, differences in data quality, differences in licensing, and so on).

Finally, I noticed that last Friday, Paul Miller (who was also at the CETIS meeting) announced the availability of a draft of a "Horizon Scan" report on "Linked Data" which he has been working on for JISC, as part of the background for a JISC call for projects in this area some time early in 2010. It's a relatively short document (hurrah for short reports!) but I've only had time for a quick skim through. It aims for some practical recommendations, ranging from general guidance on URI creation and the use of RDFa to more specific actions on particular resources/datasets. And here I must reiterate what Paul says in his post - it's a draft on which he is seeking comments, not the final report, and none of those recommendations have yet been endorsed by JISC. (If you have comments on the document, I suggest that you submit them to Paul (contact details here or comment on his post) rather than commenting on this post.)

In short, it's encouraging to see the active interest in this area growing within the HE sector. On reading Paul's draft document, I was struck by the difference between the atmosphere now (both at the Semtech meeting, and more widely) and what Paul describes as the "muted" conclusions of Brian Matthews' 2005 survey report on Semantic Web Technologies for JISC Techwatch. Of course, many of the challenges that Andy mentioned in his report of the CETIS conference session remain to be addressed, but I do sense that there is a momentum here - an excitement, even - which I'm not sure existed even eighteen months ago. It remains to be seen whether and how that enthusiasm translates into applications of benefit to the educational community, but I look forward to seeing how the upcoming JISC call, and the projects it funds, contribute to these developments.

December 08, 2009

UK government’s public data principles

The UK government has put down some pretty firm markers for open data in its recent document, Putting the Frontline First: smarter government. The section entitled Radically opening up data and promoting transparency sets out the agenda as follows:

  1. Public data will be published in reusable, machine-readable form
  2. Public data will be available and easy to find through a single easy to use online access point (http://www.data.gov.uk/)
  3. Public data will be published using open standards and following the recommendations of the World Wide Web Consortium
  4. Any 'raw' dataset will be represented in linked data form
  5. More public data will be released under an open licence which enables free reuse, including commercial reuse
  6. Data underlying the Government's own websites will be published in reusable form for others to use
  7. Personal, classified, commercially sensitive and third-party data will continue to be protected.

(Bullet point numbers added by me.)

I'm assuming that "linked data" in point 4 actually means "Linked Data", given reference to W3C recommendations in point 3.

There's also a slight tension between points 4 and 5, if only because the use of the phrase, "more public data will be released under an open licence", in point 5 implies that some of the linked data made available as a result of point 4 will be released under a closed licence. One can argue about whether that breaks the 'rules' of Linked Data, but it seems to me that it certainly runs counter to the spirit of both Linked Data and what the government says it is trying to do here.

That's a pretty minor point though and, overall, this is a welcome set of principles.

Linked Data, of course, implies URIs and good practice suggests Cool URIs as the basic underlying principle of everything that will be built here.  This applies to all government content on the Web, not just to the data being exposed thru this particular initiative.  One of the most common forms of uncool URI to be found on the Web in government circles is the technology-specific .aspx suffix... hey, I work for an organisation that has historically provided the technology to mint a great deal of these (though I think we do a better job now).  It's worth noting, for example, that the two URIs that I use above to cite the Putting the Frontline First document both end in .aspx - ironic huh?

I'm not suggesting that cool URIs are easy, but there are some easy wins and getting the message across about not embedding technology into URIs is one of the easier ones... or so it seems to me anyway.
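The "easy win" can be sketched in a few lines: publish technology-neutral URIs and keep the implementation detail server-side. The suffix list and example paths below are invented; in practice a rule like this would live in server rewrite configuration rather than application code.

```python
# Sketch only: strip technology-specific suffixes from published URL
# paths. Suffix list and example paths are hypothetical.
import re

def cool_path(path):
    """Remove .aspx/.php/.jsp/.cfm suffixes from a URL path."""
    return re.sub(r"\.(aspx|php|jsp|cfm)$", "", path)

assert cool_path("/news/frontline-first.aspx") == "/news/frontline-first"
assert cool_path("/data/bridges") == "/data/bridges"
```

The point is simply that the published URI should not change when the platform behind it does.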

December 03, 2009

On being niche

I spoke briefly yesterday at a pre-IDCC workshop organised by REPRISE.  I'd been asked to talk about Open, social and linked information environments, which resulted in a re-hash of the talk I gave in Trento a while back.

My talk didn't go too well to be honest, partly because I was on last and we were over-running so I felt a little rushed but more because I'd cut the previous set of slides down from 119 to 6 (4 really!) - don't bother looking at the slides, they are just images - which meant that I struggled to deliver a very coherent message.  I looked at the most significant environmental changes that have occurred since we first started thinking about the JISC IE almost 10 years ago.  The resulting points were largely the same as those I have made previously (listen to the Trento presentation) but with a slightly preservation-related angle:

  • the rise of social networks and the read/write Web, and a growth in resident-like behaviour, means that 'digital identity' and the identification of people have become more obviously important and will remain an important component of provenance information for preservation purposes into the future;
  • Linked Data (and the URI-based resource-oriented approach that goes with it) is conspicuous by its absence in much of our current digital library thinking;
  • scholarly communication is increasingly diffusing across formal and informal services both inside and outside our institutional boundaries (think blogging, Twitter or Google Wave for example) and this has significant implications for preservation strategies.

That's what I thought I was arguing anyway!

I also touched on issues around the growth of the 'open access' agenda, though looking at it now I'm not sure why because that feels like a somewhat orthogonal issue.

Anyway... the middle bullet has to do with being mainstream vs. being niche.  (The previous speaker, who gave an interesting talk about MyExperiment and its use of Linked Data, made a similar point).  I'm not sure one can really describe Linked Data as being mainstream yet, but one of the things I like about the Web Architecture and REST in particular is that they describe architectural approaches that have proven to be hugely successful, i.e. they describe the Web.  Linked data, it seems to me, builds on these in very helpful ways.  I said that digital library developments often prove to be too niche - that they don't have mainstream impact.  Another way of putting that is that digital library activities don't spend enough time looking at what is going on in the wider environment.  In other contexts, I've argued that "the only good long-term identifier is a good short-term identifier" and I wonder if that principle can and should be applied more widely.  If you are doing things on a Web-scale, then the whole Web has an interest in solving any problems - be that around preservation or anything else.  If you invent a technical solution that only touches on scholarly communication (for example) who is going to care about it in 50 or 100 years - answer, not all that many people.

It worries me, for example, when I see an architectural diagram (as was shown yesterday) which has channels labelled 'OAI-PMH', 'XML' and 'the Web'!

After my talk, Chris Rusbridge asked me if we should just get rid of the JISC IE architecture diagram.  I responded that I am happy to do so (though I quipped that I'd like there to be an archival copy somewhere).  But on the train home I couldn't help but wonder if that misses the point.  The diagram is neither here nor there, it's the "service-oriented, we can build it all", mentality that it encapsulates that is the real problem.

Let's throw that out along with the diagram.

December 01, 2009

On "Creating Linked Data"

In the age of Twitter, short, "hey, this is cool" blog posts providing quick pointers have rather fallen out of fashion, but I thought this material was worth drawing attention to here. Jeni Tennison, who is contributing to the current work with Linked Data in UK government, has embarked on a short series of tutorial-style posts called "Creating Linked Data", in which she explains the steps typically involved in reformulating existing data as linked data, and discusses some of the issues arising.

Her "use case" is the scenario in which some data is currently available in CSV format, but I think much of the discussion could equally be applied to the case where the provider is making data available for the first time. The opening post on the sequence ("Analysing and Modelling") provides a nice example of working through the sort of "things or strings?" questions which we've tried to highlight in the context of designing DC Application Profiles. And as Jeni emphasises, this always involves design choices:

It’s worth noting that this is a design process rather than a discovery process. There is no inherent model in any set of data; I can guarantee you that someone else will break down a given set of data in a different way from you. That means you have to make decisions along the way.

And further on in the piece, she rationalises her choices for this example in terms of what those choices enable (e.g. "whenever there’s a set of enumerated values it’s a good idea to consider turning them into things, because to do so enables you to associate extra information about them").
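That "strings into things" advice can be sketched very simply: an enumerated column of values like "open" and "closed" becomes a small set of URIs, each of which can then carry its own description. The URIs and property below are invented for the illustration, not taken from Jeni's posts.

```python
# Sketch, with hypothetical URIs: point at a thing we can describe,
# rather than repeating a bare string in every record.
STATUS_URIS = {
    "open": "http://example.org/id/consultation-status/open",
    "closed": "http://example.org/id/consultation-status/closed",
}

def status_triple(subject_uri, status_string):
    # The object is now a URI; further triples can describe it
    # (label, definition, translations, ordering, etc.).
    return (f"<{subject_uri}> <http://example.org/def/status> "
            f"<{STATUS_URIS[status_string]}> .")

t = status_triple("http://example.org/id/consultation/42", "open")
print(t)
```

Once the status is a thing rather than a string, extra information about each status value has somewhere to live.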

The post on URI design offers some tips, not only on designing new URIs but also on using existing URIs where appropriate: I admit I tend to forget about useful resources like placetime.com "a URI space containing URIs that represent places and times" (and provides redirects to descriptions in various formats).

On a related note, the post on choosing/coining properties, classes and datatypes includes a pointer to the OWL Time ontology. This is something I was aware of, but only looked at in any detail relatively recently. At first glance it can seem rather complex; Ian Davis has a summary graphic which I found helpful in trying to get my head round the core concepts of the ontology.

It seems to me that it is around these sorts of very common areas, like time data, that shared practice will emerge, and articles like these, by "hands-on" practitioners, are important contributions to that process.

November 20, 2009

COI guidance on use of RDFa

Via a post from Mark Birbeck, I notice that the UK Central Office for Information has published some guidelines called Structuring information on the Web for re-usability which include some guidance on the use of RDFa to provide specific types of information, about government consultations and about job vacancies.

This is exciting news as, as far as I know, this is the first formal document from UK central government to provide this sort of quite detailed, resource-type-specific guidance with recommendations on the use of particular RDF vocabularies - guidance of the sort I think will be an essential component in the effective deployment of RDFa and the Linked Data approach. It's also the sort of thing that is of considerable interest to Eduserv, as a developer of Web sites for several government agencies. The document builds directly on the work Mark has been doing in this area, which I mentioned a while ago.

As Mark notes in his post, the document is unequivocal in its expression of the government's commitment to the Linked Data approach:

Government is committed to making its public information and data as widely available as possible. The best way to make structured information available online is to publish it as Linked Data. Linked Data makes the information easier to cut and combine in ways that are relevant to citizens.

Before these guidelines were announced, I had a look at the "argot" for consultations ("argot" is Mark's term for a specification of how a set of terms from multiple RDF vocabularies is used to meet some application requirement; as I noted in that earlier post, I think it is similar to what DCMI calls an "application profile"), and I had intended to submit some comments. I fear it is now somewhat late in the day for me to be doing this, but the release of this document has prompted me to write them up here. My comments are concerned primarily with the section titled "Putting consultations into Linked Data".

The guidelines (correctly, I think) establish a clear distinction between the consultation on the one hand and the Web page describing the consultation on the other by (in paragraphs 30/31) introducing a fragment identifier for the URI of the consultation (via the about="#this" attribute). The consultation itself is also modelled as a document, an instance of the class foaf:Document, which in turn "has as parts" the actual document(s) on which comment is being sought, and for which a reply can be sent to some agent.

I confess that my initial "instinctive" reaction to this was that this seemed a slightly odd choice, as a "consultation" seemed to me to be more akin to an event or a process, taking place during an interval of time, which had as "inputs" a set of documents on which comments were sought, and (typically at least) resulted in the generation of some other document as a "response". And indeed the page describing the Consultation argot introduces the concept as follows (emphasis added):

A consultation is a process whereby Government departments request comments from interested parties, so as to help the department make better decisions. A consultation will usually be focused on a particular topic, and have an accompanying publication that sets the context, and outlines particular questions on which feedback is requested. Other information will include a definite start and end date during which feedback can be submitted and contact details for the person to submit feedback to.

I admit I find it difficult to square this with the notion of a "document". And I think a "consultation-as-event" (described by a Web page) could probably be modelled quite neatly using the Event Ontology or the similar LODE ontology (with some specialisation of classes and properties if required).
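
For what it's worth, the consultation-as-event alternative might be sketched at the triple level as follows. This is purely illustrative: the "ex:" terms are invented for the sketch (in practice one would specialise classes and properties from the Event or LODE ontologies), and plain Python tuples stand in for RDF triples:

```python
# A hypothetical event-based model of a consultation, as (s, p, o) tuples.
# All "ex:" names are invented; quoted strings stand in for literals.
consultation_event = [
    ("ex:consult1", "rdf:type",        "ex:Consultation"),  # a subclass of lode:Event, say
    ("ex:consult1", "ex:opens",        '"2009-10-01"'),     # start of the comment period
    ("ex:consult1", "ex:closes",       '"2009-12-01"'),     # end of the comment period
    ("ex:consult1", "ex:input",        "ex:contextPaper"),  # the document(s) commented on
    ("ex:consult1", "ex:output",       "ex:responseDoc"),   # the eventual response document
    ("ex:consult1", "ex:contactPoint", "ex:someOfficial"),  # where feedback goes
]

def objects_of(triples, subject, predicate):
    """All objects of a given subject/predicate pair in the graph."""
    return [o for (s, p, o) in triples if s == subject and p == predicate]
```

The point of the shape, rather than the invented names, is that the input documents, the response and the start/end dates all hang naturally off the event.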

Anyway, I appreciate that aspect may be something of a "design choice". So for the remainder of the comments here, I'll stick to the actual approach described by the guidelines (consultation as document).

The RDF properties recommended for the description of the consultation are drawn mainly from Dublin Core vocabularies, and more specifically from the "DC Terms" vocabulary.

The first point to note is that, as Andy noted recently, DCMI made some fairly substantive changes to the DC Terms vocabulary, as a result of which the majority of the properties are now the subject of rdfs:range assertions, which indicate whether the value of the property is a literal or a non-literal resource. The guidelines recommend the use of the publisher (paragraphs 32-37), language (paragraphs 38-39), and audience (paragraph 46) properties, all with literal values, e.g.

<span property="dc:publisher" content="Ministry of Justice"></span>

But according to the term descriptions provided by DCMI, the ranges of these properties are the classes dcterms:Agent, dcterms:LinguisticSystem and dcterms:AgentClass respectively. So I think that would require the use of an XHTML-RDFa construct something like the following, introducing a blank node (or a URI for the resource if one is available):

<div rel="dc:publisher"><span property="foaf:name" content="Ministry of Justice"></span></div>
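
The difference between the two constructs is easier to see at the triple level. Here is a minimal sketch, using plain Python tuples to stand in for triples (the node labels "doc" and "_:b0" are illustrative only):

```python
# Triples as (subject, predicate, object) tuples; strings beginning with
# "_:" stand in for blank nodes, quoted strings for literals.

# Pattern 1: literal value -- what the guidelines show.
# This conflicts with the declared rdfs:range of dcterms:publisher (dcterms:Agent).
literal_pattern = [
    ("doc", "dcterms:publisher", '"Ministry of Justice"'),
]

# Pattern 2: blank-node value -- consistent with the declared range.
# The publisher is a resource; its name is attached with foaf:name.
bnode_pattern = [
    ("doc", "dcterms:publisher", "_:b0"),
    ("_:b0", "foaf:name", '"Ministry of Justice"'),
]

def publisher_is_resource(triples):
    """True if every dcterms:publisher value is a bnode or URI, not a literal."""
    return all(
        not o.startswith('"')
        for (s, p, o) in triples
        if p == "dcterms:publisher"
    )
```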

Second, I wasn't sure about the recommendation for the use of the dcterms:source property (paragraph 37). This is used to "indicate the source of the consultation". For the case where this is a distinct resource (i.e. distinct from the consultation and this Web page describing it), this seems OK, but the guidelines also offer the option of referring to the current document (i.e. the Web page) as the source of the consultation:

<span rel="dc:source" resource=""></span>

DCMI's definition of the property is "A related resource from which the described resource is derived", but it seems to me the Web page is acting as a description of the consultation-as-document, rather than a source of it.

Third, the guidelines recommend the use of some of the DCMI date properties (paragraph 42):

  • dcterms:issued for the publication date of the consultation
  • dcterms:available for the start date for receiving comments
  • dcterms:valid for the closing date ("a date through which the consultation is 'valid'")

I think the use of dcterms:valid here is potentially confusing. DCMI's definition is "Date (often a range) of validity of a resource", so on this basis I think the implication of the recommended usage is that the consultation is "valid" only on that date, which is not what is intended. The recommendations for dcterms:issued and dcterms:available are probably OK - though I do think the event-based approach might have helped make the distinction between dates related to documents and dates related to consultation-as-process rather clearer!
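
One way to see the issue: the three dates carry an implicit ordering (published, then open for comments, then closed), which a consumer might reasonably want to check. A throwaway sketch (the property names are from the guidelines; the dates and the checking logic are mine):

```python
from datetime import date

# The dates recommended in paragraph 42, as a simple mapping.
# The example dates are invented.
consultation_dates = {
    "dcterms:issued":    date(2009, 10, 1),   # publication date of the consultation
    "dcterms:available": date(2009, 10, 1),   # start date for receiving comments
    "dcterms:valid":     date(2009, 12, 24),  # closing date (the confusing one)
}

def dates_are_ordered(d):
    """Publication <= open-for-comments <= closing date."""
    return d["dcterms:issued"] <= d["dcterms:available"] <= d["dcterms:valid"]
```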

Oh dear, this must read like an awful lot of pedantic nitpicking on my part, but my intent is to try to ensure that widely used vocabularies like those provided by DCMI are used as consistently as possible. As I said at the start I'm very pleased to see this sort of very practical guidance appearing (and I apologise to Mark for not submitting my comments earlier!)

November 16, 2009

The future has arrived

Cetis09

About 99% of the way thru Bill Thompson's closing keynote to the CETIS 2009 Conference last week I tweeted:

great technology panorama painted by @billt in closing talk at #cetis09

And it was a great panorama - broad, interesting and entertainingly delivered. It was a good performance and I am hugely in awe of people who can give this kind of presentation. However, what the talk didn't do was move from the "this is where technology has come from, this is where it is now and this is where it is going" kind of stuff to the "and this is what it means for education in the future". Which was a shame because in questioning after his talk Thompson did make some suggestions about the future of print news media (not surprising for someone now at the BBC) and I wanted to hear similar views about the future of teaching, learning and research.

As Oleg Liber pointed out in his question after the talk, universities, and the whole education system around them, are lumbering beasts that will be very slow to change in the face of anything. On that basis, whilst the facts that (for example) we can now just about store a bit on an atom (meaning that we could potentially store a digital version of all human output on something the weight of a human body), that we can pretty much wire things directly into the human retina, and that Africa will one day overtake 'digital' Britain in the broadband stakes are interesting individual propositions in their own right, there comes a "so what?" moment where one is left wondering what it actually all means.

As an aside, and on a more personal note, I suggest that my daughter's experience of university (she started at Sheffield Hallam in September) is not actually going to be very different to my own, 30-odd years ago. Lectures don't seem to have changed much. Project work doesn't seem to have changed much. Going out drinking doesn't seem to have changed much. She did meet all her 'hall' flat-mates via Facebook before she arrived in Sheffield I suppose :-) - something I never had the opportunity to do (actually, I never even got a place in hall). There is a big difference in how it is all paid for of course but the interesting question is how different university will be for her children. If the truth is, "not much", then I'm not sure why we are all bothering.

At one point, just after the bit about storing a digital version of all human output I think, Thompson did throw out the question, "...and what does that mean for copyright law?". He didn't give us an answer. Well, I don't know either to be honest... though it doesn't change the fact that creative people need to be rewarded in some way for their endeavours I guess. But the real point here is that the panorama of technological change that Thompson painted for us, interesting as it was, begs some serious thinking about what the future holds.  Maybe Thompson was right to lay out the panorama and leave the serious thinking to us?

He was surprisingly positive about Linked Data, suggesting that the time is now right for this to have a significant impact.  I won't disagree because I've been making the same point myself in various fora, though I tend not to shout it too loudly because I know that the Semantic Web has a history of not quite making it.  Indeed, the two parallel sessions that I attended during the conference, University API and the Giant Global Graph both focused quite heavily on the kinds of resources that universities are sitting on (courses, people/expertise, research data, publications, physical facilities, events and so on) that might usefully be exposed to others in some kind of 'open' fashion.  And much of the debate, particularly in the second session (about which there are now some notes), was around whether Linked Data (i.e. RDF) is the best way to do this - a debate that we've also seen played out recently on the uk-government-data-developers Google Group.

The three primary issues seemed to be:

  • Why should we (universities) invest time and money exposing our data in the hope that people will do something useful/interesting/of value with it when we have many other competing demands on our limited resources?
  • Why should we take the trouble to expose RDF when it's arguably easier for both the owner and the consumer of the data to expose something simpler like a CSV file?
  • Why can't the same ends be achieved by offering one or more services (i.e. a set of one or more APIs) rather than the raw data itself?

In the ensuing debate about the why and the how, there was a strong undercurrent of, "two years ago SOA was all the rage, now Linked Data is all the rage... this is just a fashion thing and in two years time there'll be something else".  I'm not sure that we (or at least I) have a well honed argument against this view but, for me at least, it lies somewhere in the fit with resource-orientation, with the way the Web works, with REST, and with the Web Architecture.

On the issue of the length of time it is taking for the Semantic Web to have any kind of mainstream impact, Ian Davis has an interesting post, Make or Break for the Semantic Web?, arguing that this is not unusual for standards track work:

Technology, especially standards track work, takes years to cross the chasm from early adopters (the technology enthusiasts and visionaries) to the early majority (the pragmatists). And when I say years, I mean years. Take CSS for example. I’d characterise CSS as having crossed the chasm and it’s being used by the early majority and making inroads into the late majority. I don’t think anyone would seriously argue that CSS is not here to stay.

According to this semi-official history of CSS the first proposal was in 1994, about 13 years ago. The first version that was recognisably the CSS we use today was CSS1, issued by the W3C in December 1996. This was followed by CSS2 in 1998, the year that also saw the founding of the Web Standards Project. CSS 2.1 is still under development, along with portions of CSS3.

Paul Walk has also written an interesting post, Linked, Open, Semantic?, in which he argues that our discussions around the Semantic Web and Linked Data tend to mix up three memes (open data, linked data and semantics) in rather unhelpful ways. I tend to agree, though I worry that Paul's proposed distinction between Linked Data and the Semantic Web is actually rather fuzzier than we may like.

On balance, I feel a little uncomfortable that I am not able to offer a better argument against the kinds of anti-Linked Data views expressed above. I think I understand the issues (or at least some of them) pretty well but I don't have them to hand in a kind of this is why Linked Data is the right way forward 'elevator pitch'.

Something to work on I guess!

[Image: a slide from Bill Thompson's closing keynote to the CETIS 2009 Conference]

October 19, 2009

Helpful Dublin Core RDF usage patterns

In the beginning [*] there was the HTML meta element and we used to write things like:

<meta name="DC.Creator" content="Andy Powell">
<meta name="DC.Subject" content="something, something else, something else again">
<meta name="DC.Date.Available" scheme="W3CDTF" content="2009-10-19">
<meta name="DC.Rights" content="Open Database License (ODbL) v1.0">

Then came RDF and a variety of 'syntax' guidance from DCMI and we started writing:

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns:dcterms="http://purl.org/dc/terms/"
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="http://example.net/something">
    <dc:creator>Andy Powell</dc:creator>
    <dcterms:available>2009-10-19</dcterms:available>
    <dc:subject>something</dc:subject>
    <dc:subject>something else</dc:subject>
    <dc:subject>something else again</dc:subject>
    <dc:rights>Open Database License (ODbL) v1.0</dc:rights>
  </rdf:Description>
</rdf:RDF>

Then came the decision to add 15 new properties to the DC terms namespace which reflected the original 15 DC elements but which added a liberal smattering of domains and ranges.  So, now we write:

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
  xmlns:dcterms="http://purl.org/dc/terms/"
  xmlns:foaf="http://xmlns.com/foaf/0.1/"
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="http://example.net/something">
    <dcterms:creator>
      <dcterms:Agent>
        <rdf:value>Andy Powell</rdf:value>
        <foaf:name>Andy Powell</foaf:name>
      </dcterms:Agent>
    </dcterms:creator>
    <dcterms:available rdf:datatype="http://purl.org/dc/terms/W3CDTF">2009-10-19</dcterms:available>
    <dcterms:subject>
      <rdf:Description>
        <rdf:value>something</rdf:value>
      </rdf:Description>
    </dcterms:subject>
    <dcterms:subject>
      <rdf:Description>
        <rdf:value>something else</rdf:value>
      </rdf:Description>
    </dcterms:subject>
    <dcterms:subject>
      <rdf:Description>
        <rdf:value>something else again</rdf:value>
      </rdf:Description>
    </dcterms:subject>
    <dcterms:rights rdf:resource="http://opendatacommons.org/licenses/odbl/1.0/" />
  </rdf:Description>
</rdf:RDF>

Despite the added verbosity and rather heavy use of blank nodes in the latter, I think there are good reasons why moving towards this kind of DC usage pattern is a 'good thing'.  In particular, this form allows the same usage pattern to indicate a subject term by URI or literal (or both - see addendum below) meaning that software developers only need to code to a single pattern. It also allows for the use of multiple literals (e.g. in different languages) attached to a single value resource.
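
The "single pattern" point can be illustrated with a toy consumer. If each dcterms:subject value is treated as a node that may carry a URI, an rdf:value literal, or both, the consuming code never needs to branch on which form the publisher chose. A sketch only, not tied to any real RDF library (the third example URI is invented):

```python
# Each subject value is a node: an optional URI plus zero or more literals.
# URI-only, literal-only and mixed cases all fit the same shape.
subjects = [
    {"uri": None, "literals": ["something"]},                       # bnode + rdf:value
    {"uri": "http://id.loc.gov/authorities/sh85101653#concept",
     "literals": ["Physics"]},                                      # known URI + label
    {"uri": "http://example.org/concept/x", "literals": []},        # URI only (invented)
]

def subject_display(node):
    """One code path for every variant: prefer a literal label, fall back to the URI."""
    if node["literals"]:
        return node["literals"][0]
    return node["uri"]
```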

The trouble is, a lot of existing usage falls somewhere between the first two forms shown here.  I've seen examples of both coming up in discussions/blog posts about both open government data and open educational resources in recent days.

So here are some useful rules of thumb around DC RDF usage patterns:

  • DC properties never, ever, start with an upper-case letter (i.e. dcterms:Creator simply does not exist).
  • DC properties never, ever, contain a full-stop character (i.e. dcterms:date.available does not exist either).
  • If something can be named by its URI rather than a literal (e.g. the ODbL licence in the above examples) do so using rdf:resource.
  • Always check the range of properties before use.  If the range is anything other than a literal (as is the case with both dc:subject and dc:creator for example) and you don't know the URI of the value, use a blank or typed node to indicate the value and rdf:value to indicate the value string.
  • Do not provide lists of separate keywords as a single dc:subject value.  Repeat the property multiple times, as necessary.
  • Syntax encoding schemes, W3CDTF in this case, are indicated using rdf:datatype.

See the Expressing Dublin Core metadata using the Resource Description Framework (RDF) DCMI Recommendation for more examples and guidance.
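
The first two rules of thumb are mechanical enough to check automatically. A rough sketch of such a check over property QNames (the rules are from the list above; the function itself is illustrative only):

```python
def dc_property_problems(qname):
    """Return a list of rule-of-thumb violations for a DC property QName."""
    problems = []
    prefix, _, local = qname.partition(":")
    if local[:1].isupper():
        problems.append("property names never start with an upper-case letter")
    if "." in local:
        problems.append("property names never contain a full stop")
    return problems
```

So dc_property_problems("dcterms:Creator") and dc_property_problems("dcterms:date.available") each flag a violation, while dc_property_problems("dcterms:creator") comes back clean.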

[*] The beginning of Dublin Core metadata obviously! :-)

Addendum

As Bruce notes in the comments below, the dcterms:subject pattern that I describe above applies in those situations where you do not know the URI of the subject term. In cases where you do know the URI (as is the case with LCSH for example), the pattern becomes:

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
  xmlns:dcterms="http://purl.org/dc/terms/"
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="http://example.net/something">
    <dcterms:subject>
      <rdf:Description rdf:about="http://id.loc.gov/authorities/sh85101653#concept">
        <rdf:value>Physics</rdf:value>
      </rdf:Description>
    </dcterms:subject>
  </rdf:Description>
</rdf:RDF>

October 14, 2009

Open, social and linked - what do current Web trends tell us about the future of digital libraries?

About a month ago I travelled to Trento in Italy to speak at a Workshop on Advanced Technologies for Digital Libraries organised by the EU-funded CACOA project.

My talk was entitled "Open, social and linked - what do current Web trends tell us about the future of digital libraries?" and I've been holding off blogging about it or sharing my slides because I was hoping to create a slidecast of them. Well... I finally got round to it and here is the result:

Like any 'live' talk, there are bits where I don't get my point across quite as I would have liked but I've left things exactly as they came out when I recorded it. I particularly like my use of "these are all very bog standard... err... standards"! :-)

Towards the end, I refer to David White's 'visitors vs. residents' stuff, about which I note he has just published a video. Nice one.

Anyway... the talk captures a number of threads that I've been thinking and speaking about for the last while. I hope it is of interest.

September 22, 2009

VoCamp Bristol

At the end of the week before last, I spent a couple of days (well, a day and a half as I left early on Friday) at the VoCamp Bristol meeting, at ILRT, at the University of Bristol.

To quote the VoCamp wiki:

VoCamp is a series of informal events where people can spend some dedicated time creating lightweight vocabularies/ontologies for the Semantic Web/Web of Data. The emphasis of the events is not on creating the perfect ontology in a particular domain, but on creating vocabs that are good enough for people to start using for publishing data on the Web.

I admit that I went into the event slightly unprepared, as I didn't have any firm ideas about any specific vocabulary I wanted to work on, but happy to join in with anyone who was working on anything of interest. Some of the outputs of the various groups are listed on the wiki page.

As well as work on specific vocabularies, the opening discussions highlighted an interest in a small set of more general issues, which included the expression of "structural constraints" and "validation"; broader questions of collecting and interpreting vocabulary usage; representing RDF data using JSON; and the features available in OWL 2. Friday morning was set aside for those topics, which meant I had an opportunity to talk a little bit about the work being done within the Dublin Core Metadata Initiative on "Description Set Profiles", which I've mentioned in some recent posts here. I did hastily knock up a few slides, mainly as an aide memoire to make sure I mentioned various bits and pieces:

There was a useful discussion around various different approaches for representing such patterns of constraints at the level of the RDF graph, either based on query patterns, or on the use of OWL (with a "closed-world" assumption that the "world" in question is the graph at hand). Some of the new features in OWL 2, such as capabilities for expressing restrictions on datatypes seem to make it quite an attractive candidate for this sort of task.
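
As a crude illustration of the query-pattern approach, a "closed-world" constraint like "every resource has at most one title" can be checked directly against the graph at hand. This is a sketch only (invented "ex:" names, tuples standing in for triples); real DSP- or OWL-based approaches are far richer:

```python
from collections import Counter

def violates_max_one(triples, predicate):
    """Subjects that use `predicate` more than once in this graph (closed world)."""
    counts = Counter(s for (s, p, o) in triples if p == predicate)
    return [s for s, n in counts.items() if n > 1]

# An example graph containing one violation of the constraint.
graph = [
    ("ex:doc1", "dc:title", '"A title"'),
    ("ex:doc1", "dc:title", '"Another title"'),  # second title: violates max-one
    ("ex:doc2", "dc:title", '"Fine"'),
]
```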

I was asked about whether we had considered the use of OWL in the DCMI context. IIRC, we decided against it mostly because we wanted an approach that built explicitly on the description model of the DCMI Abstract Model (i.e. talked in terms of "descriptions" and "statements" and patterns of use of those particular constructs), though I think the "open-world" considerations were also an issue (See this piece for a discussion of some of the "gotchas" that can arise).

Having said that, it would seem a good idea to explore to what extent the constraint types permitted by the DSP model might be mapped into other form(s) of expressing constraints which might be adopted.

All in all, it was a very enjoyable couple of days: a fairly low-key, thoughtful, gentle sort of gathering - no "pitches", no prizes, no "dragons" in their "dens", or other cod-"bizniz" memes :-) - just an opportunity for people to chat and work together on topics that interested them. Thank you to Tom & Damian & Libby for doing the organisation (and introducing me to a very nice Chinese restaurant in Bristol on the Thursday night!)

September 16, 2009

Edinburgh publish guidance on research data management

The University of Edinburgh has published some local guidance about the way that research data should be managed, Research data management guidance, covering How to manage research data and Data sharing and preservation, as well as detailing local training, support and advice options.

One assumes that this kind of thing will become much more common at universities over the next few years.

Having had a very quick look, it feels like the material is more descriptive than prescriptive - which isn't meant as a negative comment, it just reflects the current state of play. The section on Data documentation & metadata for example, gives advice as simple as:

Have you created a "readme.txt" file to describe the contents of files in a folder? Such a simple act can be invaluable at a later date.

but also provides a link to the UK Data Archive's guidance on Data Documentation and Metadata, which at first sight appears hugely complex. I'm not sure what your average researcher will make of it?

(In passing, I note that the UKDA seem to be promoting the use of the Data Documentation Initiative standard at what they call the 'catalogue' level, a standard that I've not come across before but one that appears to be rooted firmly outside the world of linked data, which is a shame.)

Similarly, the section on Methods for data sharing lists a wide range of possible options (from "posting on a University website" thru to "depositing in a data repository") without being particularly prescriptive about which is better and why.

(As a second aside, I am continually amazed by this firm distinction in the repository world between 'posting on the website' and 'depositing in a repository' - from the perspective of the researcher, both can, and should, achieve the same aims, i.e. improved management, more chance of persistence and better exposure.)

As we have found with repositories of research publications, it seems to me that research data repositories (the Edinburgh DataShare in this case) need to hide much of this kind of complexity, and do most of the necessary legwork, in order to turn what appears to be a simple and obvious 'content management' workflow (from the point of view of the individual researcher) into a well managed, openly shared, long term resource for the community.

July 29, 2009

Enhanced PURL server available

A while ago, I posted about plans to revamp the PURL server software to (amongst other things) introduce support for a range of HTTP response codes. This enables the use of PURLs to identify a wider range of resources than "Web documents", using the interaction patterns specified by current W3C guidance: the TAG's httpRange-14 resolution and the Cool URIs for the Semantic Web document.

Lorcan Dempsey posted on Twitter yesterday to indicate that OCLC have deployed the new software, developed by Zepheira, on the OCLC purl.org service. Although I've just had time for a quick poke around, and I need to read the documentation more carefully to understand all the new features, it looks like it does the job quite nicely.

This should mean that existing PURL owners who have used PURLs to identify things other than "Web documents" - like DCMI, who use PURLs like http://purl.org/dc/terms/title to identify their "metadata terms", "conceptual" resources - should be able to adjust the appropriate entries on the purl.org server so that interactions follow those guidelines. It also offers a new option for those who wish to set up such redirects but perhaps don't have suitable levels of access to configure their own HTTP server to perform those redirects.
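
In practice the httpRange-14 pattern boils down to: a PURL naming a non-document resource answers with 303 See Other plus a Location pointing at a document that describes it, while a PURL for an ordinary Web document can redirect straight to it. A toy resolver sketch (the 303/302 statuses follow the TAG resolution; the table entries and target URLs are invented for illustration):

```python
# Minimal PURL-style resolver: maps a PURL path to (status, location).
# "concept" entries identify non-document resources and so get a 303
# redirect to a *description* of the thing, not the thing itself.
PURLS = {
    "/dc/terms/title": {"kind": "concept",
                        "describes": "http://example.org/dcmi-terms-description"},
    "/docs/report":    {"kind": "document",
                        "target": "http://example.org/report.html"},
}

def resolve(path):
    entry = PURLS.get(path)
    if entry is None:
        return (404, None)
    if entry["kind"] == "concept":
        return (303, entry["describes"])  # See Other: here is a description
    return (302, entry["target"])         # ordinary document redirect
```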

July 21, 2009

Linked data vs. Web of data vs. ...

On Friday I asked what I thought would be a pretty straight-forward question on Twitter:

is there an agreed name for an approach that adopts the 4 principles of #linkeddata minus the phrase, "using the standards (RDF, SPARQL)" ??

Turns out not to be so straight-forward, at least in the eyes of some of my Twitter followers. For example, Paul Miller responded with:

@andypowe11 well, personally, I'd argue that Linked Data does NOT require that phrase. But I know others disagree... ;-)

and

@andypowe11 I'd argue that the important bit is "provide useful information." ;-)

Paul has since written up his views more thoughtfully in his blog, Does Linked Data need RDF?, a post that has generated some interesting responses.

I have to say I disagree with Paul on this, not in the sense that I disagree with his focus on "provide useful information", but in the sense that I think it's too late to re-appropriate the "Linked Data" label to mean anything other than "use http URIs and the RDF model".

To back this up I'd go straight to the horse's mouth, Tim Berners-Lee, who gave us his personal view way back in 2006 with his 'design issues' document on Linked Data. This gave us the 4 key principles of Linked Data that are still widely quoted today:

  1. Use URIs as names for things.
  2. Use HTTP URIs so that people can look up those names.
  3. When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL).
  4. Include links to other URIs. so that they can discover more things.

Whilst I admit that there is some wriggle room in the interpretation of the 3rd point - does his use of "RDF, SPARQL" suggest these as possible standards or is the implication intended to be much stronger? - more recent documents indicate that the RDF model is mandated. For example, in Putting Government Data online Tim Berners-Lee says (referring to Linked Data):

The essential message is that whatever data format people want the data in, and whatever format they give it to you in, you use the RDF model as the interconnection bus. That's because RDF connects better than any other model.

So, for me, Linked Data implies use of the RDF model - full stop. If you put data on the Web in other forms, using RSS 2.0 for example, then you are not doing Linked Data, you're doing something else! (Addendum: I note that Ian Davis makes this point rather better in The Linked Data Brand).

Which brings me back to my original question - "what do you call a Linked Data-like approach that doesn't use RDF?" - because, in some circumstances, adhering to a slightly modified form of the 4 principles, namely:

  1. use URIs as names for things
  2. use HTTP URIs so that people can look up those names
  3. when someone looks up a URI, provide useful information
  4. include links to other URIs, so that they can discover more things

might well be a perfectly reasonable and useful thing to do. As purists, we can argue about whether it is as good as 'real' Linked Data but sometimes you've just got to get on and do whatever you can.
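
Concretely, data following the modified principles but not the RDF model might look like a plain JSON document whose values are dereferenceable HTTP URIs. The example below is entirely made up, just to fix ideas: things are named by HTTP URIs (principles 1 and 2) and the record links out to more URIs (principle 4), but there is no RDF anywhere:

```python
import json

# A non-RDF record that nonetheless follows the modified principles.
# All URIs are invented example.org identifiers.
record = json.loads("""
{
  "id": "http://example.org/id/course/cs101",
  "title": "Introduction to Computing",
  "taughtBy": "http://example.org/id/person/jbloggs",
  "partOf": "http://example.org/id/programme/compsci"
}
""")

def outbound_links(doc):
    """URIs a client could follow to discover more things (principle 4)."""
    return [v for k, v in doc.items()
            if k != "id" and isinstance(v, str) and v.startswith("http://")]
```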

A couple of people suggested that the phrase 'Web of data' might capture what I want. Possibly... though looking at Tom Coates' Native to a Web of Data presentation it's clear that his 10 principles go further than the 4 above.  Maybe that doesn't matter? Others suggested "hypermedia" or "RESTful information systems" or "RESTful HTTP" none of which strikes me as quite right.

I therefore remain somewhat confused. I quite like Bill de hÓra's post on "links in content", Snowflake APIs, but, again, I'm not sure it gets us closer to an agreed label?

In a comment on a post by Michael Hausenblas, What else?, Dan Brickley says:

I have no problem whatsoever with non-RDF forms of data in “the data Web”. This is natural, normal and healthy. Statistical information, geographic information, data-annotated SVG images, audio samples, JSON feeds, Atom, whatever.

We don’t need all this to be in RDF. Often it’ll be nice to have extracts and summaries in RDF, and we can get that via GRDDL or other methods. And we’ll also have metadata about that data, again in RDF; using SKOS for indicating subject areas, FOAF++ for provenance, etc.

The non-RDF bits of the data Web are – roughly – going to be the leaves on the tree. The bit that links it all together will be, as you say, the typed links, loose structuring and so on that come with RDF. This is also roughly analogous to the HTML Web: you find JPEGs, WAVs, flash files and so on linked in from the HTML Web, but the thing that hangs it all together isn’t flash or audio files, it’s the linky extensible format: HTML. For data, we’ll see more RDF than HTML (or RDFa bridging the two). But we needn’t panic if people put non-RDF data up online…. it’s still better than nothing. And as the LOD scene has shown, it can often easily be processed and republished by others. People worry too much! :)

Count me in as a worrier then!

I ask because, as a not-for-profit provider of hosting and Web development solutions to the UK public sector, Eduserv needs to start thinking about the implications of Tim Berners-Lee's appointment as an advisor to the UK government on 'open data' issues on the kinds of solutions we provide.  Clearly, Linked Data is going to feature heavily in this space but I fully expect that lots of stuff will also happen outside the RDF fold.  It's important for us to understand this landscape and the impact it might have on future services.
