March 28, 2011

Waiter, my resource discovery glass is half-empty

Old joke...

A snail goes into a pub and says to the barman, "I've just been mugged by two tortoises". The barman looks a bit shocked and says, "Oh no, that's terrible. What happened?".

The snail responds, "I dunno, it all happened so fast".

Sorry!

I had a bit of a glass half-empty moment last week, listening to the two presentations in the afternoon session of the ESRC Resource Discovery workshop, the first by Joy Palmer about the MIMAS-led Resource Discovery Task Force Management Framework and the second by Lucy Bell about the UKDA resource discovery project. Not that there was anything wrong with either presentation. But it struck me that they both used phrases that felt very familiar in the context of resource discovery in the cultural heritage and education space over the last 10 years or so (probably longer) - "content locked in sectoral silos", "the need to work across multiple websites, each with their own search idiosyncrasies", "the need to learn and understand multiple vocabularies", and so on.

In a moment of panic I said words to the effect of, "We're all doomed. Nothing has changed in the last 10 years. We're going round in circles here". Clearly rubbish... and, looking at the two presentations now, it's not clear why I reached that particular conclusion anyway. I asked the room why this time round would be different, compared with previous work on initiatives like the UK JISC Information Environment, and got various responses about community readiness, political will, better stakeholder engagement and whatnot. I mean, for sure, lots of things have changed in the last 10 years - I'm not even sure the alphabet contained the three letters A, P and I back then and the whole environment is clearly very different - but it is also true that some aspects of the fundamental problem remain largely unchanged. Yes, there are a lot more cultural heritage, scientific and educational resources out there (being made available from within those sectors) but it's not always clear to what extent that stuff is better joined up, more open and more discoverable than it was at the turn of the century.

There is a glass half-full view of the resource discovery world, and I try to hold onto it most of the time, but it certainly helps to drink from the Google water fountain! Hence the need for initiatives like the UK Resource Discovery Task Force to emphasise the 'build better websites' approach. We're talking about cultural change here, and cultural change takes time. Or rather, the perceived rate of cultural change tends to be relative to the beholder.

March 25, 2011

RDTF metadata guidelines - next steps

A few weeks ago I blogged about the work that Pete and I have been doing on metadata guidelines as part of the JISC/RLUK Resource Discovery Task Force, RDTF metadata guidelines - Limp Data or Linked Data?.

In discussion with the JISC we have agreed to complete our current work in this area by:

  • delivering a single summary document of the consultation process around the current draft guidelines, incorporating the original document and all the comments made using the JISCpress site during the comment period; and
  • suggesting some recommendations about any resulting changes that we would like to see made to the draft guidelines.

For the former, a summary view of the consultation is now available. It's not 100% perfect (because the links between the comments and individual paragraphs are not retained in the summary) but I think it is good enough to offer a useful overview of the draft and the comments in a single piece of text. Furthermore, the production of this summary was automated (by post-processing the export 'dump' from Wordpress), so the good news is that a similar view can be obtained for any future (or indeed past) JISCpress consultations.

For the latter, this blog post forms our recommendations.

As noted previously, there were 196 comments during the comment period (which is not bad!), many of which were quite detailed in terms of particular data models, formats and so on. On the basis that we do not know quite what form any guidelines might take from here on (that is now the responsibility of the RDTF team at MIMAS I think), it doesn't seem sensible to dig into the details too much. Rather, we will make some comments on the overall shape of the document and suggest some areas where we think it might be useful for JISC and RLUK to undertake additional work.

You may recall that our original draft proposed three approaches to exposing metadata, which we referred to as the community formats approach, the RDF data approach and the Linked Data approach. In light of comments (particularly those from Owen Stephens and Paul Walk) we have been putting some thought into how the shape of the whole document might be better conceptualised. The result is the following four-quadrant model:

[Figure: RDTF four-quadrant model]
Like any simple conceptualisation, there is some fuzziness in this but we think it's a useful way of thinking about the space.

Traditionally (in the world of libraries, museums and archives at least), most sharing of metadata has happened in the bottom-left quadrant - exchanging bulk files of MARC records for example. And, to an extent, this approach continues now, even outside those sectors. Look at the large amount of 'open data' that is shared as CSV files on sites like data.gov.uk for example. Broadly speaking, this is what we referred to as the community formats approach (though I think our inclusion of the OAI-PMH in that area probably muddied the waters a little - see below).

One can argue that moving left to right across the quadrants offers semantically richer metadata in a 'small pieces, loosely joined' kind of way (though this quickly becomes a religious argument with no obvious point of closure! :-) ) and that moving bottom to top offers the ability to work with individual item descriptions rather than whole collections of them - and that, in particular, it allows for the assignment of 'http' URIs to those descriptions and the dereferencing of those URIs to serve them.

Our three approaches covered the bottom-left, bottom-right and top-right quadrants. The web, at least in the sense of serving HTML pages about things of interest in libraries, museums and archives, sits in the top-left quadrant (though any trend towards embedded RDFa in HTML pages moves us firmly towards the top-right).

Interestingly, especially in light of the RDTF mantra to "build better websites", our guidelines managed to miss that quadrant. In their comments, Owen and Paul argued that moving from bottom to top is more important than moving left to right - and, on balance, we tend to agree.

So, what does this mean in terms of our recommendations?

We think that the guidelines need to cover all four quadrants and that, in particular, much greater emphasis needs to be placed on the top-left quadrant. Any guidance needs to be explicit that the 'http' URIs assigned to descriptions served on the web are not URIs for the things being described; that, typically, multiple descriptions may be served for the things being described (an HTML page and an XML document for example, each of which will have separate URIs) and that mechanisms such as '<link rel="alternate" ...>' can be used to tie them together; and that Google sitemaps (on the left) and semantic sitemaps (on the right) can be used to guide robots to the descriptions (either individually or in collections).
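As a rough illustration of the "multiple descriptions per thing" point, here is a minimal Python sketch, not a recommended implementation. It assumes one thing with two descriptions (an HTML page and an XML document), renders the link element that the HTML page could carry to point at the XML alternative, and builds a basic sitemap listing both description URIs. All of the example.org URIs and both function names are made up for the purpose of the example.

```python
from xml.etree import ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# Hypothetical URIs: one for the thing itself, one per description of it.
THING_URI = "http://example.org/id/item/42"
HTML_DESCRIPTION = "http://example.org/doc/item/42.html"
XML_DESCRIPTION = "http://example.org/doc/item/42.xml"


def alternate_link(href, media_type):
    """Render a <link rel="alternate"> element pointing at another description."""
    link = ET.Element("link", {"rel": "alternate", "type": media_type, "href": href})
    return ET.tostring(link, encoding="unicode")


def sitemap(description_uris):
    """Build a minimal sitemap that lists description URIs (not thing URIs)."""
    urlset = ET.Element("urlset", {"xmlns": SITEMAP_NS})
    for uri in description_uris:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = uri
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    # The HTML page would carry this element in its <head>.
    print(alternate_link(XML_DESCRIPTION, "application/xml"))
    # Robots are guided to the descriptions; the thing URI is deliberately absent.
    print(sitemap([HTML_DESCRIPTION, XML_DESCRIPTION]))
```

Note that the thing URI never appears in the sitemap: the sitemap points robots at the descriptions, which is the distinction the guidance needs to make explicit.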

Which leaves the issue of the OAI-PMH. In a sense, this sits alongside the top-left quadrant - which is why, I think, it didn't fit particularly well with our previous three approaches. If you think about a typical repository for example, it is making descriptions of the content it holds available as HTML 'splash' pages (sometimes with embedded links to descriptions in other formats). In that sense it is functioning in top-left, "page per thing", mode. What the OAI-PMH does is to give you a protocol mechanism for getting at those descriptions in a way that is useful for harvesting.

Several people noted that Atom and RSS might be used as an alternative to both sitemaps and the OAI-PMH, and we agree - though it may be that some additional work is needed to specify the exact mechanisms for doing so.

There were some comments on our suggestion to follow the UK government guidelines on assigning URIs. On reflection, we think it would make more sense to recommend only the W3C guidelines on Cool URIs for the Semantic Web, particularly on the separation of things from the descriptions of things. We also suggest that it may be sensible to fund (or find) more work in this area, making specific recommendations around persistent URIs (for both things and their descriptions).

Finally, there were a lot of comments on the draft guidelines about our suggested models and formats - notably on FRBR, with many commenters suggesting that this was premature given significant discussion around FRBR elsewhere. We think it would make sense to separate out any guidance on conceptual models and associated vocabularies, probably (again) as a separate piece of work.

To summarise then, we suggest:

  • that the four-quadrant model above is used to frame the guidelines - we think all four quadrants are useful, and that there should probably be some guidance on each area;
  • that specific guidance be developed for serving an HTML page description per 'thing' of interest (possibly with associated, and linked, alternative formats such as XML);
  • that guidance be developed (or found) about how to sensibly assign persistent 'http' URIs to everything of interest (including both things and descriptions of things);
  • that the definition of 'open' needs more work (particularly in the context of whether commercial use is allowed) but that this needs to be sensitive to not stirring up IPR-worries in those domains where they are less of a concern currently;
  • that mechanisms for making statements of provenance, licensing and versioning be developed where RDF triples are being made available (possibly in collaboration with Europeana work); and
  • that a fuller list of relevant models that might be adopted, the relationships between them, and any vocabularies commonly associated with them be maintained separately from the guidelines themselves (I'm trying desperately not to use the 'registry' word here!).

March 21, 2011

Virtualisation and the cloud - the Eduserv Symposium 2011

In my last post I mentioned that there was a 'Cloud solutions: risk or reward?' session at the recent JISC Conference in Liverpool. You can watch the three presentations that were made as part of that session by visiting the conference website: Paul Watson (Professor of Computer Science, University of Newcastle) giving a nice overview of work they have been doing to allow non-technical people to use cloud infrastructure more easily; Phil Richards (University of Loughborough) talking about the recent work that Loughborough have been doing with Logicalis; and Henry Hughes (Strategic Programmes Manager, JANET(UK)) talking about the new JANET cloud brokerage service.

Cloud infrastructure is clearly one of the big topics for academia in the UK this year, not least because of the recent UMF funding announcement from HEFCE/JISC (of which the JANET brokerage service (above) is a part). As a result, this particular JISC Conference session came hot on the heels of various other 'cloud' university events including one organised by UCISA that I reported on recently. What struck me while watching it was that we have rapidly reached the point where people are up to speed with the general principles of cloud infrastructure. We don't need too many more 'What is the cloud?' type sessions. What we do need are more sessions that get into the detail of cloud infrastructure, how it might be delivered and consumed in the context of academia, what business models are going to be sustainable, and so on.

This was quite a sobering thought for me personally because I'm currently in the closing stages of organising this year's annual Eduserv Symposium, an event that will focus on - yes, you guessed it - the provision of cloud infrastructure. That said, I think we have a pretty good line-up of speakers - see the symposium website for details - including an opening keynote from Simon Wardley (previously of Canonical and now at Leading Edge Forum) and talks by Chris Cobb (Pro Vice Chancellor, Roehampton University), Phil Richards (Director of IT, Loughborough University) and Kenji Takeda (Senior Lecturer, University of Southampton). I'm also pleased to say that our closing keynote will be given by Armando Fox, Adjunct Associate Professor, Electrical Engineering and Computer Science, UC Berkeley, who was one of the authors of the influential position paper, Above the Clouds: A Berkeley View of Cloud Computing [PDF].

As with last year, we'll start the afternoon session with a set of short 'lightning talks', this time covering what JANET, the Digital Curation Centre (DCC) and ourselves are doing as part of the UMF programme (given by Dan Perry, Kevin Ashley and Matt Johnson).

We have a stated set of aims for the symposium, namely that it will allow people to:

  • hear about the latest developments in the University Modernisation Fund (UMF) shared services in cloud computing infrastructure programme;
  • understand the strategic role of virtualisation and the cloud in the delivery of shared IT services;
  • find out about current and future directions in the provision of cloud solutions for compute and storage, both within academia and beyond;
  • cover the issues and challenges associated with these approaches and their impact on efficiency and cost effectiveness;
  • listen to practical experiences from institutions already working in the area; and
  • network with peers who have a shared interest in these issues.

I'm really hopeful that the symposium will help us begin to move this debate forward, to inform Eduserv's thinking as we begin to roll out cloud services, and to help shape the wider UMF programme. I appreciate that it is difficult to get concrete stuff out of a day like this but (as always) I'm really looking forward to it and think we have the makings of a great day.

The event is free to delegates and we have plenty of room. If you are interested in this area and want to get a good handle on what is going on, please sign up via the website. Oh, and we have a drinks reception afterwards which always helps with the networking!

Addendum: I'm very pleased to report that Terry Harmer (Co-Principal Investigator, Belfast eScience Centre) has also agreed to speak.

UMF pilot cloud infrastructure - size matters?

It's been a while since the HEFCE announcement about UMF and there was quite a bit of discussion about UMF, virtualisation and the cloud at the recent JISC conference in Liverpool (at least from what I could see on the live video stream). It therefore seems appropriate to mention our role in this activity.

By way of background, the University Modernisation Fund (UMF) is a HEFCE initiative that aims to help universities and colleges in the UK deliver better efficiency and value for money through the development of shared services. Managed by the JISC, the programme has two core elements:

  • investment of up to £10 million in cloud computing, shared IT infrastructure, support to deliver virtual servers, storage and data management applications;
  • investment of up to £2.5 million to establish cloud computing and shared services in central administration functions to support learning, teaching, and research.

As part of our involvement in this activity, Eduserv is building a generalised virtualisation and cloud platform to serve up compute and storage resources as IaaS.  Both compute and storage resources will be offered at different tiers to enable delivery of a wide range of applications. At this stage, we expect the platform to offer the following services (though exact details are still under discussion with both JANET and the JISC):

  • VMware-based virtual machines;
  • physical blade servers;
  • block-level SAN disk storage;
  • file/object-level archive disk storage.

Whilst the platform is designated a pilot service, it will be delivered to production-quality standards, both to build consumer confidence in service availability and to let us understand and mitigate any transitional or operational concerns as quickly as possible during the pilot. The platform will be designed to offer virtualisation and cloud infrastructure to any projects funded through the UMF programme (at no cost) and to the wider UK HE community. Pricing and billing models for the latter are still to be determined: Eduserv will be developing models, again in discussion with JANET, that are both sympathetic to the needs of the academic community and able to support a sustainable service in the future.

We are in the process of designing this infrastructure in such a way as to provide the following tactical benefits to HE institutions and supporting organisations:

  • a fully-configurable virtualised environment that allows for the configuration of customer-specific infrastructure in segregated environments, offering high levels of security and performance;
  • a resilient network, capable of delivering wire-speed 10 Gigabit Ethernet connectivity from the physical or virtual server through to the JANET backbone, which enables institutions to make use of UMF services as though they were located locally on-premise;
  • a highly-scalable compute infrastructure that can simultaneously accommodate a number of different initiatives, from UMF-funded pilot SaaS services through to institution-specific virtualisation and cloud provisioning;
  • a multi-tier storage architecture providing a range of data services, from raw data processing through to research data management and longer-term storage.

From our perspective, the intention is to investigate and support the following strategic benefits across the HE community:

  • the potential to significantly reduce the amount of time and effort spent by HEIs in developing plans and associated business cases for institutional data centre infrastructure, leading to a reduction in capital expenditure on HEI-specific data centre construction, refit and on-going operations;
  • the provision of a focal point for new shared service development, offering on-demand development and test environments as well as high-quality production infrastructure capable of delivering enterprise-level SLAs;
  • a long-term, sustainable service blueprint that delivers IaaS services at pricing competitive with existing commercial providers, but with the benefit of direct JANET connectivity and HE-suited pricing models;
  • a service platform that offers IPv6 capability, assisting institutions in the transition from the currently depleted IPv4 address space.

During the morning cloud session at the JISC Conference, there were a couple of comments relating to "industrial scale" clouds, the implication being both that the education community can't build such clouds itself and that massive size matters in order to realise sufficient economies of scale to be worthwhile.

As I tweeted at the time, I don't believe that to be the case. Or, rather, I don't know if that is the case - one of the things we need to make sure comes out of the next 12 months or so of activity within UMF is a much better understanding of what the education community is capable of building itself, whether cloud infrastructure services within that community are likely to be sustainable, and what cost savings are likely to be made.

It seems to me that there is a scale that sits somewhere between "a single institution" and "industrial scale" (which I take to mean Amazon, Microsoft, etc.), a scale that the education community is well able to deliver, that is sufficiently far along the scale/cost curve for significant savings to be made.

[Figure: cost versus scale curve]
The further one can move along the scale axis, the better - clearly. As in most things, size matters! But it is also the case that there are diminishing returns here I suspect. It remains to be seen how far educational providers can move along the scale, either individually or in collaboration, and whether the resulting infrastructure can be delivered in an attractive and sustainable way.

If you are interested in this kind of stuff, our annual free Eduserv Symposium (May 12th in London) will be focusing on virtualisation and the cloud in general, and UMF in particular - more on this shortly.

March 10, 2011

Term-based thesauri and SKOS (Part 4): Change over time (ii)

This is the fourth in a series of posts (previously: part 1, part 2, part 3) on making a thesaurus available as linked data using the SKOS and SKOS-XL RDF vocabularies. In the previous post, I examined some of the ways the thesaurus can change over time, and problems that arose with my proposed mapping to RDF. Here I'll outline one potential solution to those problems.

The last three cases I described in the previous post, where an existing preferred term loses that status and is "relegated" to a non-preferred term, all present a problem for my suggested simple mapping, because the URI for a concept disappears from the generated RDF graph - and this creates a conflict with the principles of URI stability and reliability I advocated at the start of that post.

My first thoughts on a solution circled around generating concept URIs, not just for the preferred term, but also for all the non-preferred terms, and using owl:sameAs (or skos:exactMatch?) to indicate that the concept URIs derived from the terms associated with a single preferred term were synonyms, i.e. each of them identified the same concept. That way the change from preferred term to non-preferred term would not result in the loss of a concept URI. But the proliferation of URIs here feels fundamentally flawed - the problem is not one that is solved by having multiple URIs for a single concept; the issue is the persistence of a single URI. Introducing the multiple URIs also seems like a recipe for a lot of practical difficulties in managing the impact of changes on external applications, particularly if URIs which were once synonyms cease to be so.

After some searching, I found a couple of useful pages on the W3C wiki: some notes on versioning (which as far as I know didn't make it into the final SKOS specifications) and particularly this page on "Concept Evolution" in SKOS. The latter is rather more a collection of pointers than the concrete set of examples and guidelines I was hoping for, but one of those pointers is to a thread on the W3C public-esw-thes mailing list, starting with this message from Rob Tice, which I think describes (in his point 2) exactly the situation I'm dealing with in the problem cases in the previous post:

How should we identify and manage change between revisions of concept schemes as this 'seems' to result in imprecision. e.g. a concept 'a' is currently in thes 'A' and only has a preferred label. A new revision of thes 'A' is published and what was concept 'a' is now a non preferred concept and thus becomes simply a non preferred label for a new concept 'b'.

It seems to me that this operation loses some of the semantic meaning of the change as all references to the concept id of 'concept a' would be lost as it now is only a non preferred label of a different concept with a different id (concept 'b').

The suggested approach emerging from that discussion has two elements:

  1. A notion that a concept can be marked as "deprecated" (using e.g. a "status" property with a value of "deprecated" or a "deprecated" property with Boolean (yes/no) values) or as being "valid" or "applicable" only for a specified bounded period of time (see the messages from Johan De Smedt and from Margarita Sini)
  2. Such a "deprecated" concept can be the subject of a "replaced by" relationship linking it to the "preferred term" concept (see the message from Michael Panzer)

The application of these two elements in combination is illustrated in this example by Joachim Neubert (again, I think, addressing the same scenario).

I wasn't aware of the owl:deprecated property before, but as far as I can tell, it would be appropriate for this case.

Joachim's message highlights the question of what to do about skos:prefLabel/skosxl:prefLabel or skos:altLabel/skosxl:altLabel properties for the deprecated concept. In the term-based thesaurus, the term has become a non-preferred term for another term: in the SKOS model, the term is now the alternate label for a different concept, and the preferred label for no concept. So on that basis, I'm inclined to follow Joachim's suggestion that the deprecated concept should be the subject of neither skos:prefLabel/skosxl:prefLabel nor skos:altLabel/skosxl:altLabel properties, though it could, as Joachim's example shows, retain an rdfs:label property. And similarly it is no longer the subject or object of semantic relationships.

I did wonder about the option of introducing a set of properties, parallel to the SKOS ones, to indicate those former relationships, e.g. ex:hadPrefLabel, ex:hadAltLabel, ex:hadRelated, ex:hadBroader, ex:hadNarrower, essentially as documentation. But I'm really not sure how useful this is: the semantic relationships in which those other target concepts are involved may themselves change. And I suppose in principle (though it seems unlikely in practice) a single concept may itself go through several status changes (e.g. from active to deprecated to active to deprecated) and accrue various different "former" relationships in the course of that. If this level of information is required, then I think it probably has to be provided using some other approach - like the use of a set of date-stamped graphs/documents that reflect the state of a concept at a point in time.

So applying Joachim's approach to Case 8 from the examples in the previous post, where the current preferred term "Political violence" is to become a non-preferred term for "Collective violence", we end up with the concept con:C2 as a "deprecated" concept with a Dublin Core dcterms:isReplacedBy relationship to concept con:C6 (and the inverse from con:C6 to con:C2):

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix con: <http://example.org/id/concept/polthes/> .
@prefix term: <http://example.org/id/term/polthes/> .

con:C2 a skos:Concept;
       rdfs:label "Political violence"@en;
       owl:deprecated "true"^^xsd:boolean;
       dcterms:isReplacedBy con:C6 .
       
term:T2 a skosxl:Label;
        skosxl:literalForm "Political violence"@en.

con:C6 a skos:Concept;
       skos:prefLabel "Collective violence"@en;
       skosxl:prefLabel term:T6;
       skos:altLabel "Political violence"@en;
       skosxl:altLabel term:T2; 
       dcterms:replaces con:C2 .       

term:T6 a skosxl:Label;
        skosxl:literalForm "Collective violence"@en.

Using this approach then, the full output graph for Case 8 would be as follows (the highlighting indicates the difference between this graph and that for Case 8 in the previous post):

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix con: <http://example.org/id/concept/polthes/> .
@prefix term: <http://example.org/id/term/polthes/> .

term:T1 a skosxl:Label;
        skosxl:literalForm "Civil violence"@en.
        
con:C6 a skos:Concept;
       skos:prefLabel "Collective violence"@en;
       skosxl:prefLabel term:T6;
       skos:altLabel "Political violence"@en;
       skosxl:altLabel term:T2;
       skos:altLabel "Civil violence"@en;
       skosxl:altLabel term:T1;
       skos:altLabel "Violent protest"@en;
       skosxl:altLabel term:T5;
       skos:broader con:C4;
       skos:narrower con:C3; 
       dcterms:replaces con:C2 .       

term:T6 a skosxl:Label;
        skosxl:literalForm "Collective violence"@en.

con:C2 a skos:Concept;
       rdfs:label "Political violence"@en;
       owl:deprecated "true"^^xsd:boolean;
       dcterms:isReplacedBy con:C6 .
       
term:T2 a skosxl:Label;
        skosxl:literalForm "Political violence"@en.

con:C3 a skos:Concept;
       skos:prefLabel "Terrorism"@en;
       skosxl:prefLabel term:T3;
       skos:broader con:C6 .

term:T3 a skosxl:Label;
        skosxl:literalForm "Terrorism"@en.
       
con:C4 a skos:Concept;
       skos:prefLabel "Violence"@en;
       skosxl:prefLabel term:T4;
       skos:narrower con:C6 .

term:T4 a skosxl:Label;
        skosxl:literalForm "Violence"@en.

term:T5 a skosxl:Label;
        skosxl:literalForm "Violent protest"@en.

Now our graph retains the URI con:C2 and provides a description of that resource as a "deprecated concept".
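To show what this buys an external application that still holds the old URI, here is a small Python sketch of one way a consumer might follow the "replaced by" link. It is an illustration under stated assumptions rather than part of the transformation itself: the triples are held as plain tuples using the prefixed names from the example above, only the handful of statements needed are included, and the function name current_concept is hypothetical.

```python
# A few of the Case 8 statements, written as (subject, predicate, object)
# tuples with prefixed names standing in for full URIs.
TRIPLES = {
    ("con:C2", "rdfs:label", "Political violence"),
    ("con:C2", "owl:deprecated", "true"),
    ("con:C2", "dcterms:isReplacedBy", "con:C6"),
    ("con:C6", "skos:prefLabel", "Collective violence"),
    ("con:C6", "skos:altLabel", "Political violence"),
    ("con:C6", "dcterms:replaces", "con:C2"),
}


def current_concept(uri, triples):
    """Follow dcterms:isReplacedBy from a deprecated concept to an active one.

    Returns the URI unchanged if the concept is not deprecated. The 'seen'
    set guards against a cycle of replacements.
    """
    seen = set()
    while (uri, "owl:deprecated", "true") in triples and uri not in seen:
        seen.add(uri)
        replacements = sorted(
            o for s, p, o in triples if s == uri and p == "dcterms:isReplacedBy"
        )
        if not replacements:
            break  # deprecated, but no replacement has been stated
        uri = replacements[0]
    return uri


if __name__ == "__main__":
    print(current_concept("con:C2", TRIPLES))  # the deprecated concept
    print(current_concept("con:C6", TRIPLES))  # an active concept
```

An application that indexed resources against con:C2 can therefore still dereference that URI, learn that it is deprecated, and move on to con:C6.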

And for Case 9 (again the highlighting indicates the difference from the initial graph for Case 9):

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix con: <http://example.org/id/concept/polthes/> .
@prefix term: <http://example.org/id/term/polthes/> .

term:T1 a skosxl:Label;
        skosxl:literalForm "Civil violence"@en.

con:C1 a skos:Concept;
       skos:prefLabel "Civil violence"@en;
       skosxl:prefLabel term:T1;
       skos:altLabel "Political violence"@en;
       skosxl:altLabel term:T2;
       skos:altLabel "Violent protest"@en;
       skosxl:altLabel term:T5;
       skos:broader con:C4;
       skos:narrower con:C3; 
       dcterms:replaces con:C2 .       
        
con:C6 a skos:Concept;
       skos:prefLabel "Collective violence"@en;
       skosxl:prefLabel term:T6.

term:T6 a skosxl:Label;
        skosxl:literalForm "Collective violence"@en.

con:C2 a skos:Concept;
       rdfs:label "Political violence"@en;
       owl:deprecated "true"^^xsd:boolean;
       dcterms:isReplacedBy con:C1 .

term:T2 a skosxl:Label;
        skosxl:literalForm "Political violence"@en.

con:C3 a skos:Concept;
       skos:prefLabel "Terrorism"@en;
       skosxl:prefLabel term:T3;
       skos:broader con:C1 .

term:T3 a skosxl:Label;
        skosxl:literalForm "Terrorism"@en.
       
con:C4 a skos:Concept;
       skos:prefLabel "Violence"@en;
       skosxl:prefLabel term:T4;
       skos:narrower con:C1 .

term:T4 a skosxl:Label;
        skosxl:literalForm "Violence"@en.

term:T5 a skosxl:Label;
        skosxl:literalForm "Violent protest"@en.

In the (unlikely?) event that the (previously preferred) non-preferred term is once again restored to the state of preferred term, then the concept con:C2 loses its deprecated status and the dcterms:isReplacedBy relationship, and acquires skos:prefLabel/skos:altLabel properties as normal.

Generating these graphs does, however, imply a change to the process of generating the RDF representation. As I noted at the start of the previous post, my first cut at this was based on being able to process a snapshot of the thesaurus "stand-alone" without knowledge of previous versions. But the capacity to detect deprecated concepts depends on knowledge of the current state of the thesaurus, i.e. when the transformation process encounters a non-preferred term x, it needs to behave differently depending on whether:

  1. concept con:Cx exists in the current thesaurus dataset (as either an "active" or "deprecated" concept), in which case a "deprecated concept" con:Cx should be output, as well as term:Tx (as alternate label for some other concept, con:Cy); or
  2. concept con:Cx does not exist in the current thesaurus dataset, in which case only term:Tx (as alternate label for a concept con:Cy) is required

I think that test has to be made against the current RDF thesaurus dataset rather than using the previous XML snapshot in time, as the "deprecation" may have taken place several snapshots ago.
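The test described above could be sketched roughly as follows. This is only an illustration in Python, not the actual transformation code: the function name outputs_for_non_preferred and the modelling of the current dataset as a plain set of concept URIs are assumptions made for the example, and in practice the check would presumably be a query against the RDF dataset.

```python
CON = "http://example.org/id/concept/polthes/"
TERM = "http://example.org/id/term/polthes/"


def outputs_for_non_preferred(term_no, existing_concepts):
    """Decide what to emit for non-preferred term number `term_no`.

    `existing_concepts` is a hypothetical stand-in for the current RDF
    thesaurus dataset: the set of concept URIs it already contains,
    whether "active" or "deprecated".
    """
    label_uri = f"{TERM}T{term_no}"
    concept_uri = f"{CON}C{term_no}"
    if concept_uri in existing_concepts:
        # First branch: the term was once preferred, so con:Cx is kept
        # as a deprecated concept alongside the label term:Tx.
        return {"label": label_uri, "deprecated_concept": concept_uri}
    # Second branch: the term has only ever been non-preferred.
    return {"label": label_uri, "deprecated_concept": None}


current = {CON + "C2", CON + "C3", CON + "C4", CON + "C6"}
relegated = outputs_for_non_preferred(2, current)  # con:C2 already exists
static = outputs_for_non_preferred(5, current)     # con:C5 never existed
```

Here term 2 yields both a label and a deprecated concept, while term 5 yields only a label.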

I have to admit this does make the transformation process rather more complicated than I had hoped. The only alternative would be if it were somehow possible to distinguish the "deprecation" case from the "static" non-preferred term case from the input data alone, but as far as I know this isn't possible.

Summary

The previous post highlighted that for one particular category of change, where an existing preferred term is "relegated" to the status of a non-preferred term, the results of the suggested simple mapping into SKOS had problematic consequences.

Based on some investigation of how others approach similar scenarios (and here I should note I'm very grateful to the contributors to the wiki page on concept evolution and to those discussions linked from it, as I was struggling to see clearly how to deal with these scenarios), I've sketched above an approach to representing a concept which has been "deprecated", or is no longer applicable, and is replaced by another concept. I'm sure it isn't the only way of addressing the problem, but it seems a reasonable one to try.

I think this creates new challenges for implementing this approach in the transformation process and I need to work on that to test it, but I think it is achievable. But I would also be very grateful for any comments, particularly if there are gaping holes in this which I haven't spotted!

Term-based thesauri and SKOS (Part 3): Change over time (i)

This is the third in a series of posts (previously: part 1, part 2) on making a thesaurus available as linked data using the SKOS and SKOS-XL RDF vocabularies. In this post, I'll examine some of the ways the thesaurus can change over time, and how such changes are reflected when applying the mapping I described earlier.

A note on "workflow"

In the case I'm working on, the term-based thesaurus is managed in a purpose-built application, from which a snapshot is exported (as an XML document) at regular intervals. These XML documents are the inputs to a transformation process which generates an SKOS/SKOS-XL RDF version, to be exposed as linked data.

Currently at least, each "run" of that transformation operates on a single snapshot of the thesaurus "stand-alone", i.e. the transform process has no "knowledge" of the previous snapshot, and the expectation is that the output generated from processing will replace the output of the previous run (either in full, or through a process of establishing the differences and then removing some triples and adding others). This "stand-alone" approach may be something I have to revisit.
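The "establishing the differences" option can be illustrated with plain set operations. The following is a minimal sketch, assuming each graph is held as a set of (subject, predicate, object) tuples of strings; the function name diff_graphs is hypothetical, and a real implementation would more likely use an RDF library.

```python
def diff_graphs(old, new):
    """Return (to_remove, to_add) for moving from `old` to `new`.

    Each graph is modelled as a set of (subject, predicate, object)
    tuples. Plain set difference is only adequate here because the
    data is assumed to contain no blank nodes.
    """
    return old - new, new - old


old = {
    ("con:C6", "rdf:type", "skos:Concept"),
    ("con:C6", "skos:prefLabel", '"Collective violence"@en'),
}
new = old | {
    ("con:C6", "skos:broader", "con:C4"),
    ("con:C4", "skos:narrower", "con:C6"),
}

to_remove, to_add = diff_graphs(old, new)
```

In this example nothing needs removing and two triples need adding.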

The mapping

To summarise the transformation described in the previous post, a single preferred term and its set of zero or more non-preferred terms are treated as labels for a single concept. For each such set:

  • a single SKOS concept is created with a URI based on the term number of the preferred term
  • the concept is related to the literal form of the preferred term by an skos:prefLabel property
  • an SKOS-XL label is created with a URI based on the term number of the preferred term
  • the label is related to the literal form of the preferred term by an skosxl:literalForm property
  • the concept is related to the label by an skosxl:prefLabel property
  • the "hierarchical" (broader term, narrower term) and "associative" (related term) relationships between preferred terms are represented as "semantic" relationships between concepts
  • And for each non-preferred term in the set
    • the concept is related to the literal form of the non-preferred term by an skos:altLabel property
    • an SKOS-XL label is created with a URI based on the term number of the non-preferred term
    • the label is related to the literal form of the non-preferred term by an skosxl:literalForm property
    • the concept is related to the label by an skosxl:altLabel property
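The label-related steps of the mapping above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the real transform: the function names are hypothetical, triples are plain tuples of prefixed names, every literal is assumed to be English, and the hierarchical and associative relationships are left out.

```python
def map_concept(preferred, non_preferred=()):
    """Map one preferred term and its non-preferred terms to triples.

    Each term is a (term_number, literal_form) pair; the result is a
    list of (subject, predicate, object) tuples using prefixed names.
    """
    number, literal = preferred
    concept = f"con:C{number}"  # concept URI based on the preferred term number
    triples = [(concept, "rdf:type", "skos:Concept")]
    triples += _label_triples(concept, "prefLabel", number, literal)
    for alt_number, alt_literal in non_preferred:
        triples += _label_triples(concept, "altLabel", alt_number, alt_literal)
    return triples


def _label_triples(concept, kind, number, literal):
    """Triples tying a concept to one label, in both SKOS and SKOS-XL form."""
    label = f"term:T{number}"  # label URI based on the term's own number
    lit = f'"{literal}"@en'    # assumes English-language labels
    return [
        (concept, f"skos:{kind}", lit),
        (label, "rdf:type", "skosxl:Label"),
        (label, "skosxl:literalForm", lit),
        (concept, f"skosxl:{kind}", label),
    ]


triples = map_concept((2, "Political violence"),
                      [(1, "Civil violence"), (5, "Violent protest")])
```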

In the discussion below, I'll take the following "snapshot" of a notional thesaurus - it's another version of the example used in the previous posts, extended with an additional preferred term - as a starting point:

Civil violence
USE Political violence
TNR 1

Collective violence
TNR 6

Political violence
UF Civil violence
UF Violent protest
BT Violence
NT Terrorism
TNR 2

Terrorism
BT Political violence
TNR 3

Violence
NT Political violence
TNR 4

Violent protest
USE Political violence
TNR 5

Using the mapping above, it is represented as follows in RDF using SKOS/SKOS-XL:

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix con: <http://example.org/id/concept/polthes/> .
@prefix term: <http://example.org/id/term/polthes/> .

term:T1 a skosxl:Label;
        skosxl:literalForm "Civil violence"@en.
        
con:C6 a skos:Concept;
       skos:prefLabel "Collective violence"@en;
       skosxl:prefLabel term:T6.

term:T6 a skosxl:Label;
        skosxl:literalForm "Collective violence"@en.

con:C2 a skos:Concept;
       skos:prefLabel "Political violence"@en;
       skosxl:prefLabel term:T2;
       skos:altLabel "Civil violence"@en;
       skosxl:altLabel term:T1;
       skos:altLabel "Violent protest"@en;
       skosxl:altLabel term:T5;
       skos:broader con:C4;
       skos:narrower con:C3 .

term:T2 a skosxl:Label;
        skosxl:literalForm "Political violence"@en.

con:C3 a skos:Concept;
       skos:prefLabel "Terrorism"@en;
       skosxl:prefLabel term:T3;
       skos:broader con:C2 .

term:T3 a skosxl:Label;
        skosxl:literalForm "Terrorism"@en.
       
con:C4 a skos:Concept;
       skos:prefLabel "Violence"@en;
       skosxl:prefLabel term:T4;
       skos:narrower con:C2 .

term:T4 a skosxl:Label;
        skosxl:literalForm "Violence"@en.

term:T5 a skosxl:Label;
        skosxl:literalForm "Violent protest"@en.

"Versioning" and change over time

Once our resource URIs are generated and published, they will be used/cited by other agencies in their data - in other linked data datasets, in other thesauri, or in simple Web documents which reference terms or concepts using those URIs. From the linked data perspective, it is important that once generated and published the resource URIs, which will be http: URIs, remain stable and reliable. I'm using the terms "stable" and "reliable" as they are used by Henry Thompson and Jonathan Rees in their note Guidelines for Web-based naming, which I've found very helpful in breaking down the various aspects of what we tend to call "persistence". And for "stability", I'm thinking particularly of what they call "resource stability". So

  • once a URI is created, we should continue to use that URI to denote/identify the same resource
  • it should continue to be possible to obtain some information "about" the identified resource using the HTTP protocol - though that information obtained may change over time

For our particular case, the requirement is only that the "current version" of the thesaurus is available at any point in time, i.e. for each concept and for each term/label, at any point in time, it is necessary to serve only a description of the current state of that resource.

So, in my previous post, I mentioned that the Cabinet Office guidelines Designing URI Sets for the UK Public Sector allow for the case of creating a set of "date-stamped" document URIs, to provide variant descriptions of a resource at different points in time. I don't think that is required for this case, so for each term and concept, we'll have a URI for that "thing", a "Document URI" for a "generic document" (current) description of that thing, and "Representation URIs" for each "specific document" in a particular format.

The formats provided will include a human-readable HTML version, an RDF/XML version and possibly other RDF formats. Over time, additional formats can be added as required through the addition of new "Representation URIs".
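As a rough illustration of how the three kinds of URI relate, here is a small Python sketch. The /id/ and /doc/ path segments follow the example URIs used in this series; forming a Representation URI by appending a file extension is purely an assumption for the example, as are the function names.

```python
def document_uri(thing_uri):
    """Derive the generic Document URI from the URI of the thing itself.

    Swaps the first /id/ path segment for /doc/.
    """
    return thing_uri.replace("/id/", "/doc/", 1)


def representation_uri(thing_uri, extension):
    """Derive a format-specific Representation URI.

    Appending a file extension is an assumed convention for this sketch.
    """
    return f"{document_uri(thing_uri)}.{extension}"


concept = "http://example.org/id/concept/polthes/C4"
doc = document_uri(concept)
html = representation_uri(concept, "html")
rdf = representation_uri(concept, "rdf")
```

Adding a new format later would then only mean introducing one more extension, without touching the URI of the thing or of its generic document.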

My primary focus here is the changes to the thesaurus content. Over time, various changes are possible. New terms may be added, and the relationships between terms may change. Terms are not deleted from the thesaurus, however.

The most common type of change is the "promotion" of an existing non-preferred term to the status of a preferred term, but all of the following types of change can occur, even if some are infrequent:

  1. Addition of new semantic relationships between existing preferred terms
  2. Removal of existing semantic relationships between existing preferred terms
  3. Addition of new preferred terms
  4. Addition of new non-preferred terms (for existing preferred terms)
  5. An existing non-preferred term becomes a new preferred term
  6. An existing non-preferred term becomes a non-preferred term for a different existing preferred term
  7. An existing non-preferred term becomes a non-preferred term for a newly-added preferred term
  8. An existing preferred term becomes a non-preferred term for another existing preferred term
  9. An existing preferred term becomes a non-preferred term for a term which is currently a non-preferred term for it (and vice versa)
  10. An existing preferred term becomes a non-preferred term for a newly added preferred term

Below, I'll try to walk through an example of each of those changes in turn, starting from the example thesaurus above, showing the results using the mapping suggested above, and examining any issues which arise.
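The term-level changes in that list (types 3 to 10) boil down to a handful of status transitions for an individual term. The sketch below is one way to express that in Python; the function name classify and the tuple representation of a term's status are assumptions for illustration only, and it does not cover changes to semantic relationships (types 1 and 2).

```python
def classify(before, after):
    """Classify how one term's status changed between two snapshots.

    `before` is None if the term did not exist in the earlier snapshot.
    Otherwise each status is a pair (is_preferred, use_target), where
    use_target is the term number a non-preferred term points to via
    USE, or None for a preferred term. `after` is never None, because
    terms are not deleted from the thesaurus.
    """
    if before is None:
        return "added as preferred" if after[0] else "added as non-preferred"
    if before == after:
        return "unchanged"
    was_preferred, is_preferred = before[0], after[0]
    if is_preferred and not was_preferred:
        return "promoted"
    if was_preferred and not is_preferred:
        return "relegated"
    # Non-preferred both times, but now pointing at a different term.
    return "reassigned"


promotion = classify((False, 2), (True, None))   # as in type 5
reassignment = classify((False, 2), (False, 6))  # as in types 6 and 7
relegation = classify((True, None), (False, 6))  # as in types 8, 9 and 10
```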

Case 1: Addition of new semantic relationship

The addition of new broader term (BT), narrower term (NT) or related term (RT) relationships is straightforward, as it involves only the creation of additional assertions of relationships between concepts, using the skos:broader, skos:narrower or skos:related properties, not the creation of new resources.

So if the example above is extended to add a BT relation between the "Collective violence" (term no 6) and "Violence" (term no 4) terms (and the inverse NT relation):

Civil violence
USE Political violence
TNR 1

Collective violence
BT Violence
TNR 6

Political violence
UF Civil violence
UF Violent protest
BT Violence
NT Terrorism
TNR 2

Terrorism
BT Political violence
TNR 3

Violence
NT Political violence
NT Collective violence
TNR 4

Violent protest
USE Political violence
TNR 5

resulting in the RDF graph

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix con: <http://example.org/id/concept/polthes/> .
@prefix term: <http://example.org/id/term/polthes/> .

term:T1 a skosxl:Label;
        skosxl:literalForm "Civil violence"@en.
        
con:C6 a skos:Concept;
       skos:prefLabel "Collective violence"@en;
       skosxl:prefLabel term:T6;
       skos:broader con:C4 .

term:T6 a skosxl:Label;
        skosxl:literalForm "Collective violence"@en.

con:C2 a skos:Concept;
       skos:prefLabel "Political violence"@en;
       skosxl:prefLabel term:T2;
       skos:altLabel "Civil violence"@en;
       skosxl:altLabel term:T1;
       skos:altLabel "Violent protest"@en;
       skosxl:altLabel term:T5;
       skos:broader con:C4;
       skos:narrower con:C3 .

term:T2 a skosxl:Label;
        skosxl:literalForm "Political violence"@en.

con:C3 a skos:Concept;
       skos:prefLabel "Terrorism"@en;
       skosxl:prefLabel term:T3;
       skos:broader con:C2 .

term:T3 a skosxl:Label;
        skosxl:literalForm "Terrorism"@en.
       
con:C4 a skos:Concept;
       skos:prefLabel "Violence"@en;
       skosxl:prefLabel term:T4;
       skos:narrower con:C2 ;
       skos:narrower con:C6 .

term:T4 a skosxl:Label;
        skosxl:literalForm "Violence"@en.

term:T5 a skosxl:Label;
        skosxl:literalForm "Violent protest"@en.

i.e. two new triples are added to the RDF graph

con:C6 skos:broader con:C4 .
con:C4 skos:narrower con:C6 .

The addition of the triples means that, from a linked data perspective, the graphs served as descriptions of the resources con:C6 and con:C4 change. They each include one additional triple for the concise bounded description case; two triples for the symmetric bounded description case (see the previous post for the discussion of different forms of bounded description). So the contents of the representations of documents http://example.org/doc/concept/polthes/C4 and http://example.org/doc/concept/polthes/C6 change - but no new resources are generated, and no new URIs required.
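The difference between the two forms of bounded description can be checked with a small Python sketch. It is a simplification under stated assumptions: graphs are sets of (subject, predicate, object) tuples, the function names are hypothetical, and because this data has no blank nodes the concise bounded description reduces to "all triples with the resource as subject".

```python
def cbd(graph, resource):
    """Concise bounded description: triples with `resource` as subject.

    A full CBD also follows blank nodes, but this data has none.
    """
    return {t for t in graph if t[0] == resource}


def scbd(graph, resource):
    """Symmetric variant: triples with `resource` as subject or object."""
    return {t for t in graph if resource in (t[0], t[2])}


before = {
    ("con:C6", "skos:prefLabel", '"Collective violence"@en'),
    ("con:C4", "skos:prefLabel", '"Violence"@en'),
}
after = before | {
    ("con:C6", "skos:broader", "con:C4"),
    ("con:C4", "skos:narrower", "con:C6"),
}

# How many triples each kind of description of con:C6 gains.
cbd_growth = len(cbd(after, "con:C6")) - len(cbd(before, "con:C6"))
scbd_growth = len(scbd(after, "con:C6")) - len(scbd(before, "con:C6"))
```

The description of con:C6 grows by one triple in the first case and by two in the second, and the same holds for con:C4.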

Case 2: Removal of existing semantic relationship

The removal of existing broader term (BT), narrower term (NT) or related term (RT) relationships is similarly straightforward, as it involves only the deletion of assertions of relationships between concepts, using the skos:broader, skos:narrower or skos:related properties, without the removal of existing resources.

I won't bother writing out an example in full for this case, but imagine the case of the previous example reverting to its initial state.

Again, from a linked data perspective, the graphs served as descriptions of the resources con:C6 and con:C4 change, with each containing one triple less for the CBD case or two triples less for the SCBD case, but we still have the same set of term URIs and concept URIs.

Case 3: Addition of new preferred terms

The addition of a new preferred term is again a matter of extending the graph with new information, though in this case some new URIs are also introduced.

Suppose a new preferred term "Revolution" (term no 7) is added to our initial example:

Civil violence
USE Political violence
TNR 1

Collective violence
TNR 6

Political violence
UF Civil violence
UF Violent protest
BT Violence
NT Terrorism
TNR 2

Revolution
TNR 7

Terrorism
BT Political violence
TNR 3

Violence
NT Political violence
TNR 4

Violent protest
USE Political violence
TNR 5

resulting in the following graph:

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix con: <http://example.org/id/concept/polthes/> .
@prefix term: <http://example.org/id/term/polthes/> .

term:T1 a skosxl:Label;
        skosxl:literalForm "Civil violence"@en.
        
con:C6 a skos:Concept;
       skos:prefLabel "Collective violence"@en;
       skosxl:prefLabel term:T6.

term:T6 a skosxl:Label;
        skosxl:literalForm "Collective violence"@en.

con:C2 a skos:Concept;
       skos:prefLabel "Political violence"@en;
       skosxl:prefLabel term:T2;
       skos:altLabel "Civil violence"@en;
       skosxl:altLabel term:T1;
       skos:altLabel "Violent protest"@en;
       skosxl:altLabel term:T5;
       skos:broader con:C4;
       skos:narrower con:C3 .

term:T2 a skosxl:Label;
        skosxl:literalForm "Political violence"@en.

con:C7 a skos:Concept;
       skos:prefLabel "Revolution"@en;
       skosxl:prefLabel term:T7 .

term:T7 a skosxl:Label;
        skosxl:literalForm "Revolution"@en.
        
con:C3 a skos:Concept;
       skos:prefLabel "Terrorism"@en;
       skosxl:prefLabel term:T3;
       skos:broader con:C2 .

term:T3 a skosxl:Label;
        skosxl:literalForm "Terrorism"@en.
       
con:C4 a skos:Concept;
       skos:prefLabel "Violence"@en;
       skosxl:prefLabel term:T4;
       skos:narrower con:C2 .

term:T4 a skosxl:Label;
        skosxl:literalForm "Violence"@en.

term:T5 a skosxl:Label;
        skosxl:literalForm "Violent protest"@en.

The following triples are added:

con:C7 a skos:Concept;
       skos:prefLabel "Revolution"@en;
       skosxl:prefLabel term:T7.

term:T7 a skosxl:Label;
        skosxl:literalForm "Revolution"@en.

The RDF representation now includes an additional concept and label, each with a new URI. So now there are two new resources, with new URIs (con:C7 and term:T7), and a corresponding set of new Document URIs and Representation URIs for descriptions of those resources.

It is quite probable that the addition of a new preferred term is accompanied by the assertion of semantic relationships with other existing preferred terms. This is the equivalent of following this step, then a second step of the type shown in case 1.

Case 4: Addition of new non-preferred term (for existing preferred term)

The addition of a new non-preferred term is, again, a matter of adding new information, and new URIs.

Suppose a new term "Assault" (term no 8) is added as a new non-preferred term for "Violence" (term no 4):

Assault
USE Violence
TNR 8

Civil violence
USE Political violence
TNR 1

Collective violence
TNR 6

Political violence
UF Civil violence
UF Violent protest
BT Violence
NT Terrorism
TNR 2

Terrorism
BT Political violence
TNR 3

Violence
UF Assault
NT Political violence
TNR 4

Violent protest
USE Political violence
TNR 5

which results in the graph

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix con: <http://example.org/id/concept/polthes/> .
@prefix term: <http://example.org/id/term/polthes/> .

term:T8 a skosxl:Label;
        skosxl:literalForm "Assault"@en.

term:T1 a skosxl:Label;
        skosxl:literalForm "Civil violence"@en.
        
con:C6 a skos:Concept;
       skos:prefLabel "Collective violence"@en;
       skosxl:prefLabel term:T6.

term:T6 a skosxl:Label;
        skosxl:literalForm "Collective violence"@en.

con:C2 a skos:Concept;
       skos:prefLabel "Political violence"@en;
       skosxl:prefLabel term:T2;
       skos:altLabel "Civil violence"@en;
       skosxl:altLabel term:T1;
       skos:altLabel "Violent protest"@en;
       skosxl:altLabel term:T5;
       skos:broader con:C4;
       skos:narrower con:C3 .

term:T2 a skosxl:Label;
        skosxl:literalForm "Political violence"@en.

con:C3 a skos:Concept;
       skos:prefLabel "Terrorism"@en;
       skosxl:prefLabel term:T3;
       skos:broader con:C2 .

term:T3 a skosxl:Label;
        skosxl:literalForm "Terrorism"@en.
       
con:C4 a skos:Concept;
       skos:prefLabel "Violence"@en;
       skosxl:prefLabel term:T4;
       skos:altLabel "Assault"@en;
       skosxl:altLabel term:T8;
       skos:narrower con:C2 .

term:T4 a skosxl:Label;
        skosxl:literalForm "Violence"@en.

term:T5 a skosxl:Label;
        skosxl:literalForm "Violent protest"@en.

i.e. the following triples are added:

term:T8 a skosxl:Label;
        skosxl:literalForm "Assault"@en.

con:C4 skos:altLabel "Assault"@en;
       skosxl:altLabel term:T8 .

So from a linked data perspective, there is a new resource with a new URI (term:T8) (and its own new description with a new Document URI), and the existing URI con:C4 is the subject of two new triples, an skos:altLabel for the literal, and an skosxl:altLabel link to the new label, so the graph served as description of that existing resource changes to include additional triples.

Case 5: Existing non-preferred term becomes new preferred term

Suppose the existing term "Civil violence", initially a non-preferred term for "Political violence", is "promoted" and made a preferred term in its own right:

Civil violence
BT Violence
TNR 1

Collective violence
TNR 6

Political violence
UF Violent protest
BT Violence
NT Terrorism
TNR 2

Terrorism
BT Political violence
TNR 3

Violence
NT Civil violence
NT Political violence
TNR 4

Violent protest
USE Political violence
TNR 5

resulting in the following graph

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix con: <http://example.org/id/concept/polthes/> .
@prefix term: <http://example.org/id/term/polthes/> .

con:C1 a skos:Concept;
       skos:prefLabel "Civil violence"@en;
       skosxl:prefLabel term:T1;
       skos:broader con:C4 .

term:T1 a skosxl:Label;
        skosxl:literalForm "Civil violence"@en.
        
con:C6 a skos:Concept;
       skos:prefLabel "Collective violence"@en;
       skosxl:prefLabel term:T6.

term:T6 a skosxl:Label;
        skosxl:literalForm "Collective violence"@en.

con:C2 a skos:Concept;
       skos:prefLabel "Political violence"@en;
       skosxl:prefLabel term:T2;
       skos:altLabel "Violent protest"@en;
       skosxl:altLabel term:T5;
       skos:broader con:C4;
       skos:narrower con:C3 .

term:T2 a skosxl:Label;
        skosxl:literalForm "Political violence"@en.

con:C3 a skos:Concept;
       skos:prefLabel "Terrorism"@en;
       skosxl:prefLabel term:T3;
       skos:broader con:C2 .

term:T3 a skosxl:Label;
        skosxl:literalForm "Terrorism"@en.
       
con:C4 a skos:Concept;
       skos:prefLabel "Violence"@en;
       skosxl:prefLabel term:T4;
       skos:narrower con:C2;
       skos:narrower con:C1 .

term:T4 a skosxl:Label;
        skosxl:literalForm "Violence"@en.

term:T5 a skosxl:Label;
        skosxl:literalForm "Violent protest"@en.

For this case, the following new triples are added

con:C1 a skos:Concept;
       skos:prefLabel "Civil violence"@en;
       skosxl:prefLabel term:T1;
       skos:broader con:C4 .

con:C4 skos:narrower con:C1 .

and also the following existing triples are removed

con:C2 skos:altLabel "Civil violence"@en;
       skosxl:altLabel term:T1 .

So from a linked data perspective, there is a new resource with a new URI (con:C1) (and its own new description with a new Document URI), and the graphs served as descriptions of the existing resources con:C2 and con:C4 both change: the former loses the skos:altLabel and skosxl:altLabel triples and the latter includes a new skos:narrower triple. If symmetric bounded descriptions are used, the description of term:T1 changes too.

Case 6: Existing non-preferred term becomes non-preferred term for a different existing preferred term

Suppose we decide that "Civil violence", initially a non-preferred term for "Political violence", is to become a non-preferred term for "Collective violence".

Civil violence
USE Collective violence
TNR 1

Collective violence
UF Civil violence
TNR 6

Political violence
UF Violent protest
BT Violence
NT Terrorism
TNR 2

Terrorism
BT Political violence
TNR 3

Violence
NT Political violence
TNR 4

Violent protest
USE Political violence
TNR 5

This generates the following graph:

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix con: <http://example.org/id/concept/polthes/> .
@prefix term: <http://example.org/id/term/polthes/> .

term:T1 a skosxl:Label;
        skosxl:literalForm "Civil violence"@en.
        
con:C6 a skos:Concept;
       skos:prefLabel "Collective violence"@en;
       skosxl:prefLabel term:T6;
       skos:altLabel "Civil violence"@en;
       skosxl:altLabel term:T1.

term:T6 a skosxl:Label;
        skosxl:literalForm "Collective violence"@en.

con:C2 a skos:Concept;
       skos:prefLabel "Political violence"@en;
       skosxl:prefLabel term:T2;
       skos:altLabel "Violent protest"@en;
       skosxl:altLabel term:T5;
       skos:broader con:C4;
       skos:narrower con:C3 .

term:T2 a skosxl:Label;
        skosxl:literalForm "Political violence"@en.

con:C3 a skos:Concept;
       skos:prefLabel "Terrorism"@en;
       skosxl:prefLabel term:T3;
       skos:broader con:C2 .

term:T3 a skosxl:Label;
        skosxl:literalForm "Terrorism"@en.
       
con:C4 a skos:Concept;
       skos:prefLabel "Violence"@en;
       skosxl:prefLabel term:T4;
       skos:narrower con:C2 .

term:T4 a skosxl:Label;
        skosxl:literalForm "Violence"@en.

term:T5 a skosxl:Label;
        skosxl:literalForm "Violent protest"@en.

For this case, the following new triples are added

con:C6 skos:altLabel "Civil violence"@en;
       skosxl:altLabel term:T1.

and also the following existing triples are removed

con:C2 skos:altLabel "Civil violence"@en;
       skosxl:altLabel term:T1 .

The graphs served as descriptions of the existing resources con:C2 and con:C6 both change: the former loses the skos:altLabel and skosxl:altLabel triples and the latter gains skos:altLabel and skosxl:altLabel triples. If symmetric bounded descriptions are used then the description of term:T1 also changes.

Case 7: Existing non-preferred term becomes non-preferred term for a newly added preferred term

I think this case is just a combination of Case 3 (addition of new preferred term) and Case 6 (existing non-preferred term becomes non-preferred term for a different existing preferred term) in sequence. We've seen above that these changes can be made without problems, so the "composite" case should be OK too, and I won't bother working through a full example here.

Case 8: An existing preferred term becomes a non-preferred term for another existing preferred term

Suppose the current preferred term "Political violence" is to be "relegated" to become a non-preferred term for "Collective violence", with the latter becoming the participant in hierarchical relations previously involving the former. (I appreciate that these two terms probably don't constitute a great example, but let’s suppose it works, for the sake of the discussion!)

Civil violence
USE Collective violence
TNR 1

Collective violence
UF Civil violence
UF Political violence
UF Violent protest
BT Violence
NT Terrorism
TNR 6

Political violence
USE Collective violence
TNR 2

Terrorism
BT Collective violence
TNR 3

Violence
NT Collective violence
TNR 4

Violent protest
USE Collective violence
TNR 5

This maps to the rather substantially changed RDF graph

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix con: <http://example.org/id/concept/polthes/> .
@prefix term: <http://example.org/id/term/polthes/> .

term:T1 a skosxl:Label;
        skosxl:literalForm "Civil violence"@en.
        
con:C6 a skos:Concept;
       skos:prefLabel "Collective violence"@en;
       skosxl:prefLabel term:T6;
       skos:altLabel "Political violence"@en;
       skosxl:altLabel term:T2;
       skos:altLabel "Civil violence"@en;
       skosxl:altLabel term:T1;
       skos:altLabel "Violent protest"@en;
       skosxl:altLabel term:T5;
       skos:broader con:C4;
       skos:narrower con:C3 .

term:T6 a skosxl:Label;
        skosxl:literalForm "Collective violence"@en.

term:T2 a skosxl:Label;
        skosxl:literalForm "Political violence"@en.

con:C3 a skos:Concept;
       skos:prefLabel "Terrorism"@en;
       skosxl:prefLabel term:T3;
       skos:broader con:C6 .

term:T3 a skosxl:Label;
        skosxl:literalForm "Terrorism"@en.
       
con:C4 a skos:Concept;
       skos:prefLabel "Violence"@en;
       skosxl:prefLabel term:T4;
       skos:narrower con:C6 .

term:T4 a skosxl:Label;
        skosxl:literalForm "Violence"@en.

term:T5 a skosxl:Label;
        skosxl:literalForm "Violent protest"@en.

The following RDF triples have been added

con:C6 skos:altLabel "Political violence"@en;
       skosxl:altLabel term:T2;
       skos:altLabel "Civil violence"@en;
       skosxl:altLabel term:T1;
       skos:altLabel "Violent protest"@en;
       skosxl:altLabel term:T5;
       skos:broader con:C4;
       skos:narrower con:C3 .

con:C3 skos:broader con:C6 .

con:C4 skos:narrower con:C6 .

And the following RDF triples have been removed

con:C2 a skos:Concept;
       skos:prefLabel "Political violence"@en;
       skosxl:prefLabel term:T2;
       skos:altLabel "Civil violence"@en;
       skosxl:altLabel term:T1;
       skos:altLabel "Violent protest"@en;
       skosxl:altLabel term:T5;
       skos:broader con:C4;
       skos:narrower con:C3 .

con:C3 skos:broader con:C2 .

con:C4 skos:narrower con:C2 .

So the graphs served as descriptions of the concepts con:C3 and con:C4 change (with the removal of a triple and the addition of a new one); and that for concept con:C6 changes with the addition of several triples.

So far, so good.

However, the URI con:C2 has now completely disappeared from the graph. If this new graph simply replaces the previous graph, then there will be no description available for resource con:C2.
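A check for this situation could be run before one graph replaces another. The following Python sketch is only illustrative: it assumes graphs held as sets of (subject, predicate, object) tuples, the function name vanished_subjects is hypothetical, and the data is a cut-down fragment of the example.

```python
def vanished_subjects(old, new):
    """Return subjects described in `old` that have no triples in `new`."""
    old_subjects = {s for s, _, _ in old}
    new_subjects = {s for s, _, _ in new}
    return old_subjects - new_subjects


old = {
    ("con:C2", "skos:prefLabel", '"Political violence"@en'),
    ("con:C2", "skosxl:prefLabel", "term:T2"),
    ("con:C6", "skos:prefLabel", '"Collective violence"@en'),
    ("term:T2", "skosxl:literalForm", '"Political violence"@en'),
}
new = {
    ("con:C6", "skos:prefLabel", '"Collective violence"@en'),
    ("con:C6", "skos:altLabel", '"Political violence"@en'),
    ("con:C6", "skosxl:altLabel", "term:T2"),
    ("term:T2", "skosxl:literalForm", '"Political violence"@en'),
}

missing = vanished_subjects(old, new)
```

Here only con:C2 is reported: term:T2 survives as a label, but the concept it once labelled as preferred no longer has any description.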

Case 9: An existing preferred term becomes a non-preferred term for a term which is currently a non-preferred term for it (and vice versa)

Suppose that the current non-preferred term "Civil violence" is to become preferred to "Political violence", and the latter is to become a non-preferred term for the former - both "relegation" and "promotion" taking place together, if you like.

Civil violence
UF Political violence
UF Violent protest
BT Violence
NT Terrorism
TNR 1

Collective violence
TNR 6

Political violence
USE Civil violence
TNR 2

Terrorism
BT Civil violence
TNR 3

Violence
NT Civil violence
TNR 4

Violent protest
USE Civil violence
TNR 5

This maps to the RDF graph

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix con: <http://example.org/id/concept/polthes/> .
@prefix term: <http://example.org/id/term/polthes/> .

term:T1 a skosxl:Label;
        skosxl:literalForm "Civil violence"@en.

con:C1 a skos:Concept;
       skos:prefLabel "Civil violence"@en;
       skosxl:prefLabel term:T1;
       skos:altLabel "Political violence"@en;
       skosxl:altLabel term:T2;
       skos:altLabel "Violent protest"@en;
       skosxl:altLabel term:T5;
       skos:broader con:C4;
       skos:narrower con:C3 .
        
con:C6 a skos:Concept;
       skos:prefLabel "Collective violence"@en;
       skosxl:prefLabel term:T6.

term:T6 a skosxl:Label;
        skosxl:literalForm "Collective violence"@en.

term:T2 a skosxl:Label;
        skosxl:literalForm "Political violence"@en.

con:C3 a skos:Concept;
       skos:prefLabel "Terrorism"@en;
       skosxl:prefLabel term:T3;
       skos:broader con:C1 .

term:T3 a skosxl:Label;
        skosxl:literalForm "Terrorism"@en.
       
con:C4 a skos:Concept;
       skos:prefLabel "Violence"@en;
       skosxl:prefLabel term:T4;
       skos:narrower con:C1 .

term:T4 a skosxl:Label;
        skosxl:literalForm "Violence"@en.

term:T5 a skosxl:Label;
        skosxl:literalForm "Violent protest"@en.

The following RDF triples have been added

con:C1 a skos:Concept;
       skos:prefLabel "Civil violence"@en;
       skosxl:prefLabel term:T1;
       skos:altLabel "Political violence"@en;
       skosxl:altLabel term:T2;
       skos:altLabel "Violent protest"@en;
       skosxl:altLabel term:T5;
       skos:broader con:C4;
       skos:narrower con:C3 .
       
con:C3 skos:broader con:C1 .

con:C4 skos:narrower con:C1 .

And the following RDF triples have been removed

con:C2 a skos:Concept;
       skos:prefLabel "Political violence"@en;
       skosxl:prefLabel term:T2;
       skos:altLabel "Civil violence"@en;
       skosxl:altLabel term:T1;
       skos:altLabel "Violent protest"@en;
       skosxl:altLabel term:T5;
       skos:broader con:C4;
       skos:narrower con:C3 .

con:C3 skos:broader con:C2 .

con:C4 skos:narrower con:C2 .

The outcome here is similar to that of the previous case.

The graphs served as descriptions of the concepts con:C3 and con:C4 change (with the removal of a triple and the addition of a new one). A new concept con:C1 is created. But again the URI con:C2 has completely disappeared from the graph, with the same consequences that no description will be available.

Case 10: An existing preferred term becomes a non-preferred term for a newly added preferred term

I think this case is just a combination of Case 3 (addition of new preferred term) and Case 8 (existing preferred term becomes a non-preferred term for another existing preferred term) in sequence.

The same problem will arise with the URI of the existing concept disappearing from the new output graph.

Summary

I've walked through in detail the different types of changes which can occur to the content of the thesaurus. This highlighted that for one particular category of change, where an existing preferred term is "relegated" to the status of a non-preferred term, exemplified by my cases 8, 9 and 10 above, the results of the suggested simple mapping into SKOS had problematic consequences: the URI for a concept disappears from the generated RDF graph - and this creates a conflict with the principles of URI stability and reliability I advocated at the start of this post.

In the next post, I'll suggest one way of (I hope!) addressing this problem.

March 01, 2011

Term-based thesauri and SKOS (Part 2): Linked Data

In my previous post on this topic I outlined how I was approaching making a thesaurus available using the SKOS and SKOS-XL RDF vocabularies. In that post I focused on:

  • how the thesaurus content is modelled using a "concept-based" approach - what are the "things of interest", their attributes, and the relationships between them;
  • how those things (concepts, terms/labels) are named/identified using http URIs;
  • how those things can be "described" using the simple "triple" statement model of RDF, and using the SKOS and SKOS-XL RDF vocabularies;
  • an example of how an expression of the thesaurus using the term-based model is mapped or transformed into an SKOS RDF expression.

What I didn't really address in that post is how that resulting RDF data is made available and accessed on the Web - which is more specifically where the "Linked Data" principles articulated by Tim Berners-Lee come into play.

(A good deal of the content of this post is probably familiar stuff for those of you already working with Linked Data, but I thought it was worth documenting it both to fill out the picture of some of the "design choices" to be made in this particular example, and to provide some more background to others less familiar with the approaches.)

Linked Data, URIs, things, documents and HTTP

The use of http URIs as identifiers provides two features:

  • a global naming system, and a set of processes by which authority for assigning names can be delegated/distributed;
  • through the HTTP protocol, a well understood and widely deployed mechanism for providing access to information "about", or descriptions of, the things identified by those URIs (in our case, the concepts and terms/labels).

As a user/consumer of an http URI, given only the URI, I can "look up" that URI using the HTTP protocol, i.e. I can provide it to a tool (like a Web browser) and that tool can issue a request to obtain a description of the thing identified by the URI. And conversely as the owner/provider of a URI, I can configure my server to respond to such requests with a document providing a description of the thing.

And the HTTP protocol incorporates features which we can use to "optimise" this process. So, for example, the "content negotiation" feature allows a client to specify a preference for the format in which it wishes to receive data, and allows a server to select - from amongst the several it may have available - the format which it determines is most appropriate for the client. In the terminology of the Web Architecture, the description can have multiple "representations", each of which can vary by format (or by other criteria). In the context of Linked Data, this technique is typically used to support the provision of document representations in formats suitable for a human reader (HTML, XHTML) and in one or more RDF syntaxes (usually including at least RDF/XML). (The emergence of the RDFa syntax, which enables the embedding of RDF data in HTML/XHTML documents, and the growing support for RDFa in RDF tools, offers the possibility, in principle at least, of a single format serving both purposes.)
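To make the server's side of content negotiation a little more concrete, here is a minimal sketch in Python of choosing a representation from the client's Accept header. It is deliberately simplified (it ignores wildcards such as */* and other negotiation dimensions like language), and the AVAILABLE mapping and the file names in it are placeholders I have assumed for illustration.

```python
# Assumed mapping from media type to a stored representation (placeholder names).
AVAILABLE = {"text/html": "doc.html", "application/rdf+xml": "doc.rdf"}


def parse_accept(header):
    """Parse an Accept header into (media_type, q) pairs, highest q first."""
    prefs = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0  # a missing q parameter means full preference
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        prefs.append((media_type, q))
    # sorted() is stable, so equal q-values keep the client's original order
    return sorted(prefs, key=lambda p: p[1], reverse=True)


def choose_representation(accept_header, default="text/html"):
    """Pick the first acceptable media type the server can actually supply."""
    for media_type, q in parse_accept(accept_header):
        if q > 0 and media_type in AVAILABLE:
            return AVAILABLE[media_type]
    return AVAILABLE[default]
```

A client sending "Accept: application/rdf+xml" would be given the RDF/XML representation, while a typical browser header that lists text/html first would be given the HTML one.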

The widespread use of the HTTP protocol and of tools that support it means that these techniques are widely available (in theory at least; experience suggests that in practice the ability, or authority, to set up things like HTTP redirects (more below) can create something of a barrier). It also means that the "Web of Linked Data" is part of the existing Web of documents that we are accustomed to navigating using a Web browser.

One of the principles underpinning RDF's use of URIs as names is that we should try to avoid ambiguity in our use of those names, i.e. we should use different URIs for different things, and avoid using the same URI as a name for two different things. One of the issues I've slightly glossed over in the last few paragraphs is the distinction between a "thing" and a document describing that thing as two different resources. After all, if I provide a page describing the Mona Lisa, both the page and the Mona Lisa have creators, creation dates, terms of use, but they have different creators, creation dates and terms of use. And if I want to provide such information in RDF, then I need to take care to avoid confusing the two objects - by using two different URIs, one for my document and one for the painting, and citing those URIs in my RDF statements accordingly.

However, as I emphasise above, we also want to be in a position where, given only a "thing URI", I can obtain a document describing that thing: I shouldn't need to know in advance a second URI, the URI of a document about that thing.

The W3C Note Cool URIs for the Semantic Web describes some possible approaches to addressing this issue, broadly using two different techniques:

  • the use of URIs containing "fragment identifiers" ("hash URIs") (i.e. URIs of the form http://example.org/doc/123#thing). In this case, the "fragment identifier" part of the URI is always "trimmed" from the URI when the client makes a request to the server, and this allows the use of the URI with fragment identifier as "thing URI", leaving the trimmed URI without fragment id as a document URI.
  • the use of a convention of HTTP "redirects". In this case, when a server receives a request for a URI which it "knows" is a "thing URI" rather than a document URI, it returns a response which provides a second URI as a source of information about the thing, and the client then sends a second request for that second URI. Formally, the initial response uses the HTTP "303 See Other" status code, which sometimes leads to these being called colloquially "303 URIs", even though there's nothing special about the URIs themselves.
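The two techniques can be sketched in a few lines of Python. The first function shows what a client effectively does with a hash URI before making its request; the second is a toy stand-in for a server applying the 303 convention. The rule that paths starting with /id/ are "thing URIs" redirected to a matching /doc/ path is purely an assumption for this sketch, and the function names are my own.

```python
from urllib.parse import urldefrag


def document_uri_for_hash_uri(thing_uri):
    """Strip the fragment identifier, as a client does before sending a request."""
    return urldefrag(thing_uri).url


def respond(request_path):
    """Toy server logic returning (status code, headers) for a request path.

    Assumption for illustration: the server "knows" that paths under /id/
    name things, and that each has a describing document under /doc/.
    """
    if request_path.startswith("/id/"):
        location = "/doc/" + request_path[len("/id/"):]
        return 303, {"Location": location}  # 303 See Other
    return 200, {}
```

In the redirect case the client learns the document URI only from the Location header of the response; it does not need to guess it from the shape of the "thing URI".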

I'm conscious that I'm skipping over some of the details here; for a more detailed description, particularly of the "flow" of the interactions involved, and some consideration of the pros and cons of the two approaches, see Cool URIs for the Semantic Web.

URI Sets for the UK Public Sector

The Cool URIs note focuses mainly on the patterns of "interaction" for handling the two approaches to moving from "thing URI" to document URI. Its examples include example URIs, but the exact form of those URIs is intended to be illustrative rather than prescriptive. I think it's important to note that in the redirect case, it is the server's notification to the client of the second URI that provides the client with that information. There is no technical requirement for a structural similarity in the forms of the "thing URI" and the document URI, and consumers of the URIs should rely on the information provided to them by the server rather than making assumptions about URI structure.

Having said that, the use of a shared, consistent set of URI patterns within a community can provide some useful "cues" to human readers of those URIs. It can also simplify the work of data providers - for example by facilitating the use of similar HTTP server configurations or the reuse of scripts/tools for serving "Linked Data" documents. With this (and other factors such as URI stability) in mind, the UK Cabinet Office has provided a set of guidelines, Designing URI Sets for the UK Public Sector which build on the W3C Cool URIs note, but offer more specific guidance, particularly on the design of URIs.

For the purposes of this discussion, of particular interest is the document's specification (in the "Definitions, frameworks and principles" section) of several distinct "types of URI", or perhaps more accurately, URIs for different categories of resource, and (in the "The path structure for URIs" section) of suggested structural patterns for each:

  • Identifier URIs (what I have been calling above "thing URIs") name "real-world things" and should use patterns like:
    • http://{domain}/{concept}/{reference}#id or
    • http://{domain}/id/{concept}/{reference}
    where:
    • {concept} is "a word or string to capture the essence of the real-world 'Thing' that the set names e.g. school". (This is roughly what I think of as the name of a "resource type" - note this is a more generic use of the word "concept" than that of the SKOS concept.)
    • {reference} is "a string that is used by the set publisher to identify an individual instance of concept".
    The document allows for the use of a hierarchy of concept-reference pairs in a single URI if appropriate, so for a specific class within a specific school, the path might be /id/school/123/class/5
  • Document URIs name the documents that provide information about, descriptions of, "real-world things". The suggested pattern is
    • http://{domain}/doc/{concept}/{reference}
    These documents are, I think, what Berners-Lee calls Generic Resources. For each such document, multiple representations may be available, each in a different format, and each of those multiple "more specific" concrete formats may be available as a separate resource in its own right (see "Representation URIs" below). If descriptions vary over time, and those variants are to be exposed, then a series of "date-stamped" URIs can be used, with the pattern
    • http://{domain}/doc/{concept}/{reference}/{yyyy-mm-dd}
  • Representation URIs name a document in a specific format. The suggested pattern is
    • http://{domain}/doc/{concept}/{reference}/{doc.file-extension}
    This can also be applied to a date-stamped version:
    • http://{domain}/doc/{concept}/{reference}/{yyyy-mm-dd}/{doc.file-extension}

The guidelines also distinguish a category of "Ontology URIs" which use the pattern http://{domain}/def/{concept}. I had interpreted "Ontology URIs" as applying to the identification of classes and properties, and I was treating the terms/concepts of a thesaurus as "conceptual things" which would fall under the /id/ case. But I do notice that in an example in which she refers to these guidelines, Jeni Tennison uses the /def/ pattern for an SKOS example. I don't think it's really much of an issue - and pretty much all the other points I discuss apply anyway - but any advice on this point would be appreciated.

So, applying these general rules for the thesaurus case, where, as I discussed in the previous post, the primary types of thing of interest in our SKOS-modelled thesaurus are "concepts" and "terms":

  • Term URI Pattern: http://example.org/id/term/T{termid}
  • Concept URI Pattern: http://example.org/id/concept/C{termid}

However, if we bear in mind that within the URI space of the example.org domain, we're likely to want to represent, and coin URIs for the components of, multiple thesauri, and our "termid" references (drawn from the term numbers in the input) are unique only within the scope of a single thesaurus, then we should include some sort of thesaurus-specific component in the path to "qualify" those term numbers. Let's use the token "polthes" for this example:

  • Term URI Pattern: http://example.org/id/term/{schemename}/T{termid}
    Example: http://example.org/id/term/polthes/T2
  • Concept URI Pattern: http://example.org/id/concept/{schemename}/C{termid}
    Example: http://example.org/id/concept/polthes/C2

We should also include a URI for the thesaurus as a whole. The SKOS model provides a generic class of "concept scheme" to cover aggregations of concepts:

  • Concept Scheme URI Pattern: http://example.org/id/concept-scheme/{schemename}
    Example: http://example.org/id/concept-scheme/polthes

where each concept and term in the thesaurus is linked to this concept scheme by a triple using the skos:inScheme property. (I omitted this from the example in the previous post so that it was easier to focus on the concept-term and concept-concept relationships, and to try to keep the already rather complex diagrams slightly readable!)

Aside: An alternative for the concept and term URI patterns would be to use the "hierarchical concept-reference" approach and use patterns like:

  • Term URI Pattern: http://example.org/id/concept-scheme/{schemename}/term/T{termid}
    Example: http://example.org/id/concept-scheme/polthes/term/T2
  • Concept URI Pattern: http://example.org/id/concept-scheme/{schemename}/concept/C{termid}
    Example: http://example.org/id/concept-scheme/polthes/concept/C2

My only slight misgiving about this approach is that (bearing in mind that strictly speaking the URIs should be treated as opaque and such information obtained from the descriptions provided by the server) in the (non-hierarchical) form I suggested initially, the string indicating the resource type ("concept", "term") is fairly clear to the human reader from its position following the "/id/" component in the URI (e.g. http://example.org/id/concept/polthes/C2). But with the hierarchical form, it perhaps becomes slightly less clear (e.g. http://example.org/id/concept-scheme/polthes/concept/C2). But that is a minor gripe, and really the hierarchical form would serve just as well. For the remainder of this document, in the examples, I'll continue with the initial non-hierarchical pattern I suggested above, but it may be something to revisit if the hierarchical form is more in line with the intent - and current usage - of the guidelines. (So again, comments are welcome on this point.)

For each of these "Identifier URIs", there should be a corresponding "Document URI" naming a document describing the thing, and following the /doc/ pattern:

  • Description of Concept Scheme: http://example.org/doc/concept-scheme/polthes
  • Description of Term: http://example.org/doc/term/polthes/T{termid}
  • Description of Concept: http://example.org/doc/concept/polthes/C{termid}

And for each format in which the description is available, a corresponding "Representation URI":

  • Description of Concept Scheme (HTML): http://example.org/doc/concept-scheme/polthes/doc.html
  • Description of Concept Scheme (RDF/XML): http://example.org/doc/concept-scheme/polthes/doc.rdf
  • Description of Concept (HTML): http://example.org/doc/concept/polthes/C{termid}/doc.html
  • Description of Concept (RDF/XML): http://example.org/doc/concept/polthes/C{termid}/doc.rdf
  • Description of Term (HTML): http://example.org/doc/term/polthes/T{termid}/doc.html
  • Description of Term (RDF/XML): http://example.org/doc/term/polthes/T{termid}/doc.rdf
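Pulling the patterns together, the following Python sketch builds Identifier, Document and Representation URIs from their components. The function names are hypothetical helpers of my own, and the code simply follows the /id/, /doc/ and doc.{file-extension} patterns quoted from the guidelines above; it is not taken from any existing tool.

```python
BASE = "http://example.org"  # the example domain used throughout this post


def _build(prefix, resource_type, scheme, reference=None):
    """Join the path components, omitting the reference for scheme-level URIs."""
    segments = [BASE, prefix, resource_type, scheme]
    if reference:
        segments.append(reference)
    return "/".join(segments)


def identifier_uri(resource_type, scheme, reference=None):
    """URI naming the thing itself (concept, term or concept scheme)."""
    return _build("id", resource_type, scheme, reference)


def document_uri(resource_type, scheme, reference=None):
    """URI naming the generic document that describes the thing."""
    return _build("doc", resource_type, scheme, reference)


def representation_uri(resource_type, scheme, reference, extension):
    """URI naming the description in one specific format, e.g. html or rdf."""
    return document_uri(resource_type, scheme, reference) + "/doc." + extension
```

For example, identifier_uri("concept", "polthes", "C2") gives http://example.org/id/concept/polthes/C2, and representation_uri("concept-scheme", "polthes", None, "rdf") gives http://example.org/doc/concept-scheme/polthes/doc.rdf.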

Descriptions and "boundedness"

The three documents I've mentioned so far (Berners-Lee's Linked Data Design Issues note, the W3C Cool URIs document, and the Cabinet Office URI patterns document) don't have a great deal to say on the topic of the content of the document which is returned as a description of a "thing". This is discussed briefly in the "Linked Data Tutorial" document by Chris Bizer, Richard Cyganiak and Tom Heath, How to Publish Linked Data on the Web.

In principle at least, it is quite possible to provide a single document which describes several resources. This approach has been quite common in association with the use of "hash URIs" in a pattern where a number of "thing URIs" differ only by fragment identifier, and share the same "non-fragment" part (http://example.org/school#1, http://example.org/school#2, ... http://example.org/school#99), and a number of common ontologies make use of this sort of approach. One consequence is that a client interested only in a single resource always retrieves the full set of descriptions. If my thesaurus really did consist only of the half-dozen concepts and terms I described in the example in my previous post, retrieving a document describing them all would probably not be a problem, but for the "real world" case where there are several thousand terms involved, it would represent a significant overhead if every request results in the transfer of several megabytes of data.

Generally, the approach taken is for the data provider to generate some set of "useful information" "about" the requested resource - though saying that rather begs the question of what constitutes "useful" (and whether there is a single answer to that question that is applicable across different datasets dealing with different resource types). Typically the generation of a description is based on some set of rules which, for a specified node in the dataset RDF graph (a specified "thing URI"), selects a "nearby" subgraph of the graph, representing a "bounded description" made up of triples/statements "about" the thing itself and maybe also "about" closely related resources.

Various algorithms for generating such descriptions are possible and I don't intend to attempt any sort of rigorous analysis or comparison of them here - for further discussion see e.g. Patrick Stickler's CBD - Concise Bounded Description or Bounded Descriptions in RDF from the Talis Platform wiki. But there is one aspect which I think it is worth mentioning in the context of the thesaurus example. One of the key differences between the algorithms used to generate descriptions is how they treat the "directionality" of arcs in the RDF graph, i.e. whether they base the description only on arcs "outbound from" the selected node (RDF triples with that URI as subject), or whether they take into account both arcs "outbound from" and "inbound to" the node (triples with the URI as either subject or object).

That probably sounds like a very abstract point, and the significance is perhaps best illustrated through a concrete example. Let's take the graph for the example from my previous post (tweaked to use the slightly amended URI patterns above - I've continued to leave out the concept scheme links to keep things simple) and suppose this is the dataset to which I'm applying the rules.

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix con: <http://example.org/id/concept/polthes/> .
@prefix term: <http://example.org/id/term/polthes/> .

term:T1 a skosxl:Label;
        skosxl:literalForm "Civil violence"@en.

con:C2 a skos:Concept;
       skos:prefLabel "Political violence"@en;
       skosxl:prefLabel term:T2;
       skos:altLabel "Civil violence"@en;
       skosxl:altLabel term:T1;
       skos:altLabel "Violent protest"@en;
       skosxl:altLabel term:T5;
       skos:broader con:C4;
       skos:narrower con:C3 .

term:T2 a skosxl:Label;
        skosxl:literalForm "Political violence"@en.

con:C3 a skos:Concept;
       skos:prefLabel "Terrorism"@en;
       skosxl:prefLabel term:T3;
       skos:broader con:C2 .

term:T3 a skosxl:Label;
        skosxl:literalForm "Terrorism"@en.
       
con:C4 a skos:Concept;
       skos:prefLabel "Violence"@en;
       skosxl:prefLabel term:T4;
       skos:narrower con:C2 .

term:T4 a skosxl:Label;
        skosxl:literalForm "Violence"@en.

term:T5 a skosxl:Label;
        skosxl:literalForm "Violent protest"@en.

And in graphical form (as before with the rdf:type triples omitted):

Fig1

(In the figures below, I've tried to represent the idea that a subgraph is being selected by "fading out" the parts which aren't selected, and leaving the selected part fully visible. I hope the images are sufficiently clear for this to be effective!)

Let's first take the approach known as the "concise bounded description (CBD)" - formally defined here, but essentially based on "outbound" links. For the concept C2 (http://example.org/id/concept/polthes/C2), the CBD would consist of the following subgraph (i.e. the document http://example.org/doc/concept/polthes/C2 would contain this data):

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix con: <http://example.org/id/concept/polthes/> .
@prefix term: <http://example.org/id/term/polthes/> .

con:C2 a skos:Concept;
       skos:prefLabel "Political violence"@en;
       skosxl:prefLabel term:T2;
       skos:altLabel "Civil violence"@en;
       skosxl:altLabel term:T1;
       skos:altLabel "Violent protest"@en;
       skosxl:altLabel term:T5;
       skos:broader con:C4;
       skos:narrower con:C3 .
Fig2

For the term T2 (http://example.org/id/term/polthes/T2), corresponding to the "preferred label" (i.e. the document http://example.org/doc/term/polthes/T2 would contain):

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix con: <http://example.org/id/concept/polthes/> .
@prefix term: <http://example.org/id/term/polthes/> .

term:T2 a skosxl:Label;
        skosxl:literalForm "Political violence"@en.
Fig3

For the term T5 (http://example.org/id/term/polthes/T5), corresponding to the "alternate label" (i.e. the document http://example.org/doc/term/polthes/T5 would contain):

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix con: <http://example.org/id/concept/polthes/> .
@prefix term: <http://example.org/id/term/polthes/> .

term:T5 a skosxl:Label;
        skosxl:literalForm "Violent protest"@en.
Fig4

Note that for the two terms, the "concise bounded description" is quite minimal (though remember I've simplified it a bit): in particular, it does not include the relationship between the term and the concept. This is because using the SKOS-XL vocabulary that relationship is expressed as a triple in which the concept URI is the subject and the term URI is the object - an "inbound arc" to the term URI in the graph - which the CBD approach does not take into account when constructing the description of the term.

But the fact that the relationship is represented only in this way - a link from concept to term, without an inverse term to concept link - is arguably slightly arbitrary.

An alternative approach, the "symmetric bounded description", seeks to address this sort of issue by taking into account both "outbound" and "inbound" arcs. For the same three cases, it produces the following results:

Concept C2:

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix con: <http://example.org/id/concept/polthes/> .
@prefix term: <http://example.org/id/term/polthes/> .

con:C2 a skos:Concept;
       skos:prefLabel "Political violence"@en;
       skosxl:prefLabel term:T2;
       skos:altLabel "Civil violence"@en;
       skosxl:altLabel term:T1;
       skos:altLabel "Violent protest"@en;
       skosxl:altLabel term:T5;
       skos:broader con:C4;
       skos:narrower con:C3 .

con:C3 skos:broader con:C2 .

con:C4 skos:narrower con:C2 .
Fig5

Term T2:

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix con: <http://example.org/id/concept/polthes/> .
@prefix term: <http://example.org/id/term/polthes/> .

term:T2 a skosxl:Label;
        skosxl:literalForm "Political violence"@en.

con:C2 skosxl:prefLabel term:T2 .
Fig6

Term T5:

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix con: <http://example.org/id/concept/polthes/> .
@prefix term: <http://example.org/id/term/polthes/> .

term:T5 a skosxl:Label;
        skosxl:literalForm "Violent protest"@en.

con:C2 skosxl:altLabel term:T5 .
Fig7

For the concept case, the difference is relatively minor (for the skos:broader and skos:narrower relationships, the inverse triples are now also included). But for the term cases, the relationship between concept and term is now included.

So (rather long-windedly, I fear!), I'm trying to illustrate that it's worth thinking a little bit about the content of descriptions and how they work as "stand-alone" documents (albeit linked to others). And for this dataset, I think there's an argument that generating "symmetric" descriptions that include inbound links as well as outbound ones probably results in more "useful information" for the consumer of the data.
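For anyone who prefers code to diagrams, the difference between the two approaches can be sketched in Python over a graph held as a set of (subject, predicate, object) tuples. This is a deliberate simplification: the formal CBD definition also deals with blank nodes and reification, which don't arise in this example, and the function names are my own. The triples are a subset of the example graph.

```python
# A subset of the example graph, as (subject, predicate, object) tuples.
TRIPLES = {
    ("con:C2", "skosxl:prefLabel", "term:T2"),
    ("con:C2", "skosxl:altLabel", "term:T5"),
    ("con:C2", "skos:broader", "con:C4"),
    ("con:C2", "skos:narrower", "con:C3"),
    ("con:C3", "skos:broader", "con:C2"),
    ("con:C4", "skos:narrower", "con:C2"),
    ("term:T2", "skosxl:literalForm", '"Political violence"@en'),
    ("term:T5", "skosxl:literalForm", '"Violent protest"@en'),
}


def outbound_description(graph, node):
    """CBD-like description: only triples with the node as subject."""
    return {t for t in graph if t[0] == node}


def symmetric_description(graph, node):
    """Symmetric description: triples with the node as subject or object."""
    return {t for t in graph if t[0] == node or t[2] == node}


t5_outbound = outbound_description(TRIPLES, "term:T5")
t5_symmetric = symmetric_description(TRIPLES, "term:T5")
```

Here t5_outbound holds only the skosxl:literalForm triple, while t5_symmetric also holds the skosxl:altLabel triple linking con:C2 to term:T5, which is the link from term back to concept that a consumer of the term description is likely to want.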

(Again, I'm simplifying things slightly here to illustrate the point: I've omitted type information and the links to indicate concept scheme membership. Typically the descriptions might (depending on the algorithm) include labels for related resources mentioned, rather than just the URIs, and would include some metadata about the document - its publisher, last modification date, licence/rights information, a link to the dataset of which it is a member, and so on.)

Summary

What I've tried to do in this post is expand on some of the "linked data"-specific aspects of the project, and to examine some of the design choices to be made in applying some of those general rules to this particular case, shaped both by external factors (like the Cabinet Office guidelines on URIs) and by characteristics of the data itself (like the directionality of links made using SKOS-XL). In the next post, I'll move on, as promised, to the questions of how the data changes over time, and any implications of that.
