« Metadata guidelines for the UK RDTF - please comment | Main | SALDA »

February 11, 2011

Term-based thesauri and SKOS (Part 1)

I'm currently doing a piece of work on representing a thesaurus as linked data. I'm working on the basis that the output will make use of the SKOS model/RDF vocabulary. Some of the questions I'm pondering are probably SKOS Frequently Asked Questions, but I thought it was worth working through my proposed solution and some of the questions I'm pondering here, partly just to document my own thought processes and partly in the hope that SKOS implementers with more experience than me might provide some feedback or pointers.

SKOS adopts a "concept-based" approach (i.e. the primary focus is on the description of "concepts" and the relationships between them); the source thesaurus uses a "term-based" approach based on the ISO 2788 standard. I found the following sources provided helpful summaries of the differences between these two approaches:

Briefly (and simplifying slightly for the purposes of this discussion - I've omitted discussion of documentary attributes like "scope notes"), in the term-based model (and here I'm dealing only with the case of a simple monolingual thesaurus):

  • the only entities considered are "terms" (words or sequences of words)
  • terms can be semantically equivalent, each expressing a single concept, in which case they are distinguished as "preferred terms"/descriptors or "non-preferred terms"/non-descriptors, using USE (non-preferred to preferred) and UF (use for) (preferred to non-preferred) relations. Each non-preferred term is related to a single preferred term; a preferred term may have many non-preferred terms
  • a preferred term can be related by ("hierarchical") BT (broader term) and NT (narrower term) relations to indicate that it is more specific in meaning, or more general in meaning, than another term
  • a preferred term can also be related by an ("associative") RT (related term) relation to a second term, where there is some other relationship which may be useful for retrieval (e.g. overlapping meanings)

In the SKOS (concept-based) model:

  • the primary entities considered are "concepts", "units of thought"
  • a concept is associated with labels, of which at most one (in the monolingual case) is the preferred label, the others alternative labels or hidden labels
  • a concept can be related by "has broader", "has narrower" and "has related" relations to other concepts

The concept-based model, then, is explicit that the thesaurus "contains" two distinct types of thing: concepts and labels. In the base SKOS model, the labels are modelled as RDF literals, so the expression of relationships between labels is not supported. SKOS-XL provides an extension to SKOS which models labels as RDF resources in their own right, which can be identified, described and linked.

To represent a term-based thesaurus in SKOS, then, it is necessary to make a mapping from the term-based model with its term-term relationships to the concept-based model with its concept-label and concept-concept relationships. (An alternative approach would be to represent the thesaurus in RDF using a model/vocabulary which expresses directly the "native" "term-based" model of the source thesaurus. I'm working on the basis that using the SKOS/concept-based approach will facilitate interoperability with other thesauri/vocabularies published on the Web.

Core Mapping

The key principle underlying the mapping is the notion that, in the term-based approach, a preferred term and its multiple related non-preferred terms are acting as labels for a single concept.

Each set of terms related by USE/UF relations in the term-based approach, then, is mapped to a single SKOS concept, with the single "preferred term" becoming the "preferred label" for that concept and the (possibly multiple) "non-preferred terms" each becoming "alternative labels" for that same concept.

Consider the following example, where the term "Political violence" is a preferred term for "Civil violence" and "Violent unrest", and has a broader term "Violence" and narrower term "Terrorism":


Civil violence
USE Political violence

Political violence
UF Civil violence
UF Violent protest
BT Violence
NT Terrrorism

Terrorism
BT Political violence

Violence
NT Political violence

Violent protest
USE Political violence

(Leaving aside for a moment the question of how the URIs might be generated) an SKOS representation takes the form:


@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix con: <http://example.org/id/concept/> .

con:C2 a skos:Concept;
       skos:prefLabel "Political violence"@en;
       skos:altLabel "Civil violence"@en;
       skos:altLabel "Violent protest"@en;
       skos:broader con:C4;
       skos:narrower con:C3 .

con:C3 a skos:Concept;
       skos:prefLabel "Terrorism"@en;
       skos:broader con:C2 .

con:C4 a skos:Concept;
       skos:prefLabel "Violence"@en;
       skos:narrower con:C2 .

A graphical representation perhaps makes the links clearer (to keep things simple, I've omitted the arcs and nodes corresponding to the rdf:type triples here):

Fig1

One of the consequences of this approach is that some of the "direct" relationships between terms are reflected in SKOS as "indirect" relationships. In the term-based model, a non-preferred term is linked to a preferred term by a simple "USE" relationship. In the SKOS example, to find the "preferred term" for the term "Violent protest", one finds the node for the concept of to which it is linked via the skos:altLabel property, and then locates the literal which is linked to that concept via an skos:prefLabel property.

SKOS-XL and terms as Labels

As mentioned above, the SKOS specification defines an optional extension to SKOS which supports the representation of labels as resources in their own right, typically alongside the simple literal representation.

Using this extension, then the example above might be expressed as:


@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix con: <http://example.org/id/concept/> .
@prefix term: <http://example.org/id/term/> .

term:T1 a skosxl:Label;
        skosxl:literalForm "Civil violence"@en.

con:C2 a skos:Concept;
       skos:prefLabel "Political violence"@en;
       skosxl:prefLabel term:T2;
       skos:altLabel "Civil violence"@en;
       skosxl:altLabel term:T1;
       skos:altLabel "Violent protest"@en;
       skosxl:altLabel term:T5;
       skos:broader con:C4;
       skos:narrower con:C3 .

term:T2 a skosxl:Label;
        skosxl:literalForm "Political violence"@en.

con:C3 a skos:Concept;
       skos:prefLabel "Terrorism"@en;
       skosxl:prefLabel term:T3;
       skos:broader con:C2 .

term:T3 a skosxl:Label;
        skosxl:literalForm "Terrorism"@en.
       
con:C4 a skos:Concept;
       skos:prefLabel "Violence"@en;
       skosxl:prefLabel term:C4;
       skos:narrower con:C2 .

term:T4 a skosxl:Label;
        skosxl:literalForm "Violence"@en.

term:T5 a skosxl:Label;
        skosxl:literalForm "Violent protest"@en.

And in graphical form (as above with the rdf:type triples omitted):

Fig2

That may be rather a lot to take in, so below I've shown a subgraph, including only the labels related to a single concept:

Fig3

Using the SKOS-XL extension in this way does introduce some additional complexity. On the other hand:

  1. it introduces a set of resources (of type skosxl:Label) that have a one-to-one correspondence to the terms in the source thesaurus, so perhaps makes the mapping between the two models more explicit and easier for a human reader to understand.
  2. it makes the labels themselves URI-identifiable resources which can be referenced in this data and in data created by others. So, it becomes possible to make assertions about relationships between the labels, or between labels and other things.

Coincidentally, as I was writing this post, I note that Bob duCharme has posted on the use of SKOS-XL for providing metadata "about" the individual labels, distinct from the concepts. So I might add a triple to indicate, say, the date on which a particular label was created. I don't think there is an immediate requirement for that in the case I'm working on, but there may be in the future.

Another possible use case is the ability for other parties to make links specifically to a label, rather than to the concept which it labels. A document could be "tagged", for example, by association with a particular label from a particular thesaurus, rather than just with the plain literal.

Relationships between Labels

The SKOS-XL vocabulary defines only a single generic relationship between labels (the skosxl:labelRelation property) with the expectation that implementers define subproperties to handle more specific relationship types.

The example given of such a relationship in the SKOS spec is that of a case where one label is an acronym for another label.

One thing I wondered about was introducing properties here to reflect the USE/UF relations in the term-based model e.g. in the above example:


@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix con: <http://example.org/id/concept/> .
@prefix term: <http://example.org/id/term/> .
@prefix ex: <http://example.org/prop/> .

term:T1 a skosxl:Label;
        skosxl:literalForm "Civil violence"@en;
        ex:use term:T2 .

term:T2 a skosxl:Label;
        skosxl:literalForm "Political violence"@en;
        ex:useFor term:T1 ;
        ex:useFor term:T5 .
       
term:T5 a skosxl:Label;
        skosxl:literalForm "Violent protest"@en;
        ex:use term:T2 .

This wouldn't really add any new information; rather it just provides a "shortcut" for the information that is already present in the form of the "indirect" preferred label/alternative label relationships, as noted above.

I hesitate slightly about adding this data, partly on the grounds that it seems slightly redundant, but also on the grounds that this seems like a rather different "category" of relationship from the "is acronym for" relationship. This point is illustrated in figure 4 in the SWAD-E Review of Thesaurus Work document, which divides the concept-based model into three layers:

  • the (upper in the diagram) "conceptual" layer, with the broader/narrower/related relations between concepts
  • the (lower) "lexical" layer, with terms/labels and the relations between them
  • and a middle "indication"/"implication" layer for the preferred/non-preferred relations between concepts and terms/labels

In that diagram, the example lexical relationships exist independently of the preferred/non-preferred relationships e.g. "AIDS" is an abbreviation for "Acquired Immunodeficiency Syndrome", regardless of which is considered the preferred term in the thesaurus. With the use/useFor relationships here, this would not be the case; they would indeed vary with the preferred/non-preferred relationships.

So, having thought initially it might be useful to include these relationships, I'm becoming more rather less sure.

And without them, I'm not sure whether, for this case, the use of the SKOS-XL would be justified - though it may well be, for the other purposes I mentioned above. So, any thoughts on practice in this area would be welcome.

URI Patterns

Following the linked data principles, I want to assign http URIs for all the resources, so that descriptions of those resources can be provided using HTTP. If we go with the SKOS-XL approach, that means assigning http URIs for both concepts and labels.

The input data includes, for each term within the thesaurus, an identifier code, unique within the thesaurus, and which remains stable over time, i.e. as changes are made to the thesaurus a single term remains associated with the same code. (I'll explore the question of how the thesaurus changes over time in a follow-up post.) So in fact the example above looks something like:


Civil violence
USE Political violence
TNR 1

Political violence
UF Civil violence
UF Violent protest
BT Violence
NT Terrrorism
TNR 2

Terrorism
BT Political violence
TNR 3

Violence
NT Political violence
TNR 4

Violent protest
USE Political violence
TNR 5

The proposal is to use that identifier as the basis of URIs, for both labels/terms and concepts:

  • Term URI Pattern: http://example.org/id/term/T{termid}
  • Concept URI Pattern: http://example.org/id/concept/C{termid}

The generation of label/term URIs, then, is straightforward, as each term (with identifier code) in the input data maps to a single label/term in the SKOS RDF data. The concept case is a little more complicated. As I noted above, the mapping between the term-based model and the concept-based model means that there is a many-to-one relationship between terms and concepts: several terms are related to a single concept, one as preferred label, others as alternative labels. In my first cut at this at least (more on this in a follow-up post), I've generated the URI of the concept based on the identifier code associated with the preferred/descriptor term.

So in my example above, the three terms "Civil violence, "Political violence", and "Violent protest" are mapped to labels for a single concept, and the URI of that concept is constructed from the identifier code of the preferred/descriptor term ("Political violence").

Summary

I've tried to cover the basic approach I'm taking to generating the SKOS/SKOS-XL RDF data from the term-based thesaurus input. In this post, I've focused only on the thesaurus as a static entity. In a follow-up post, I'll look in some detail at the changes which can occur over time, and examine any implications for the choices made here, particularly for our choices of URI patterns.

TrackBack

TrackBack URL for this entry:
http://www.typepad.com/services/trackback/6a00d8345203ba69e2014e5f27082f970c

Listed below are links to weblogs that reference Term-based thesauri and SKOS (Part 1):

» Term-based thesauri and SKOS (Part 2): Linked Data from eFoundations
In my previous post on this topic I outlined how I was approaching making a thesaurus available using the SKOS and SKOS-XL RDF vocabularies. In that post I focused on: how the thesaurus content is modelled using a "concept-based" approach... [Read More]

» Term-based thesauri and SKOS (Part 3): Change over time (i) from eFoundations
This is the third in a series of posts (previously: part 1, part 2) on making a thesaurus available as linked data using the SKOS and SKOS-XL RDF vocabularies. In this post, I'll examine some of the ways the thesaurus... [Read More]

» Term-based thesauri and SKOS (Part 4): Change over time (ii) from eFoundations
This is the fourth in a series of posts (previously: part 1, part 2, part 3) on making a thesaurus available as linked data using the SKOS and SKOS-XL RDF vocabularies. In the previous post, I examined some of the... [Read More]

Comments

The comments to this entry are closed.

About

Search

Loading
eFoundations is powered by TypePad