
August 18, 2008

The importance of non-literals to linked data

I'm less involved in Dublin Core metadata discussions than I used to be, but a brief exchange on one of the DC lists caught my eye and reminded me how confusing the concepts underpinning metadata on the Web can be. The exchange started with an apparently simple question:

I'm currently involved in the selection of standard fields for a metadata project and we have some fields that we are calling Dublin Core fields (Subject and Relation fields), but we are including free text or uncontrolled terms. I notice that the DC Subject and Relation fields are "intended to be used with a non-literal value." I'm not sure what this means. Is there anyone that can explain in simple terms? I've looked at the DCMI Abstract Model and I'm still not sure what they mean by "non-literal" value. Also, can you say you are using Qualified Dublin Core for some fields and Simple Dublin Core for other fields in an application profile?

To which the initial response included:

I suspect the members of the small coterie that could explain this are all on vacation at this time. I am not in that group, but I will attempt an explanation anyway ;-)

...

Leaving aside the DCAM (which is often puzzling), it seems to me that you need a way to indicate 1) whether or not the values in the subject field are controlled and 2) if they are controlled, what list they come from.

...

Well, OK... point taken!  Let me attempt a response... though I note that, in support of the comments above, it will probably be neither simple nor clear!

I'll start by quoting the DCMI Abstract Model:

The abstract model of the resources described by descriptions is as follows:

  • Each described resource is described using one or more property-value pairs.
  • Each property-value pair is made up of one property and one value.
  • Each value is a resource - the physical, digital or conceptual entity or literal that is associated with a property when a property-value pair is used to describe a resource. Therefore, each value is either a literal value or a non-literal value:
    • A literal value is a value which is a literal.
    • A non-literal value is a value which is a physical, digital or conceptual entity.
  • A literal is an entity which uses a Unicode string as a lexical form, together with an optional language tag or datatype, to denote a resource (i.e. "literal" as defined by RDF).
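
For readers who think more easily in data structures, the bullets above can be sketched roughly as a handful of Python classes. This is only an illustration, not anything defined by the DCAM: the class and field names are my own, the example.org URIs are placeholders, and the rule that a literal takes a language tag or a datatype but not both is my reading of the wording quoted above.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass(frozen=True)
class Literal:
    """A Unicode lexical form plus an optional language tag or datatype."""
    lexical_form: str
    language: Optional[str] = None
    datatype: Optional[str] = None

    def __post_init__(self):
        # Assumption: "language tag or datatype" is read as "not both at once".
        if self.language is not None and self.datatype is not None:
            raise ValueError("a literal takes a language tag or a datatype, not both")


@dataclass(frozen=True)
class NonLiteral:
    """A physical, digital or conceptual entity, optionally named by a URI."""
    uri: Optional[str] = None


@dataclass(frozen=True)
class Statement:
    """One property-value pair: exactly one property and one value."""
    property_uri: str
    value: Union[Literal, NonLiteral]


@dataclass
class Description:
    """A described resource together with its property-value pairs."""
    resource_uri: str
    statements: List[Statement]

    def __post_init__(self):
        # "One or more property-value pairs": an empty description is rejected.
        if not self.statements:
            raise ValueError("a description needs at least one property-value pair")


DCTERMS = "http://purl.org/dc/terms/"

# The example.org URIs are placeholders, not real identifiers.
description = Description(
    resource_uri="http://example.org/doc/1",
    statements=[
        Statement(DCTERMS + "title", Literal("An example document", language="en")),
        Statement(DCTERMS + "subject", NonLiteral("http://example.org/topic/1")),
    ],
)
```

The title here is a literal value and the subject is a non-literal value; both are simply "values" as far as the property-value pair is concerned.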

Resource Description Framework (RDF): Concepts and Abstract Syntax (the foundation stone of the Semantic Web) says this about literals:

Literals are used to identify values such as numbers and dates by means of a lexical representation. Anything represented by a literal could also be represented by a URI, but it is often more convenient or intuitive to use literals.

So, in Dublin Core metadata, each value is either "a literal" (a literal value) or "a physical, digital or conceptual entity" (a non-literal value) and the choice of which to use is largely one of convenience.

In the case of dc:subject, the value is the "topic" of the resource being described.  The topic might be a concept ("physics"), a place ("Bath, UK") or a person ("Albert Einstein") or something else.  While that topic could be treated as a literal value (and superficially at least, doing so may appear to be more convenient) good practice suggests that it is better if it is treated as a non-literal value, i.e. as a physical, digital or conceptual entity.  Why?  Because if the topic is treated as a non-literal value then it can be assigned a URI and can become the subject of other descriptions.  If the topic is treated as a literal value then it becomes a descriptive cul-de-sac - no further description of the topic is possible.

A literal may be the object of an RDF statement, but not the subject or the predicate.
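
That constraint is easy to show in a few lines of Python. What follows is a minimal sketch with made-up names (`URIRef`, `Literal`, `make_triple`) and a placeholder example.org namespace; it is not the API of any RDF library, and it leaves out blank nodes, which RDF also permits as subjects.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class URIRef:
    uri: str


@dataclass(frozen=True)
class Literal:
    lexical_form: str


Triple = Tuple[URIRef, URIRef, Union[URIRef, Literal]]


def make_triple(subject, predicate, obj) -> Triple:
    """Build a triple, enforcing where literals may and may not appear."""
    if not isinstance(subject, URIRef):
        raise TypeError("subject must be a URI reference, not a literal")
    if not isinstance(predicate, URIRef):
        raise TypeError("predicate must be a URI reference")
    if not isinstance(obj, (URIRef, Literal)):
        raise TypeError("object must be a URI reference or a literal")
    return (subject, predicate, obj)


SUBJECT = URIRef("http://purl.org/dc/terms/subject")
LABEL = URIRef("http://www.w3.org/2000/01/rdf-schema#label")

# Placeholder URIs for a document and for the concept "physics".
doc = URIRef("http://example.org/doc/1")
physics = URIRef("http://example.org/topic/physics")

# A literal is fine as an object, but nothing more can be said about it.
dead_end = make_triple(doc, SUBJECT, Literal("physics"))

# A URI is also fine as an object, and it can go on to be a subject itself.
hook = make_triple(doc, SUBJECT, physics)
further = make_triple(physics, LABEL, Literal("Physics"))
```

Calling `make_triple(Literal("physics"), LABEL, Literal("Physics"))` raises a `TypeError`, which is the cul-de-sac in code form: there is no way to start a statement from the string.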

In short, by treating values as non-literal resources and assigning URIs to them we give ourselves (and others) the hooks on which to hang further descriptions.  This is a very fundamental part of the way in which the Semantic Web (and indeed the Web) works.

Unfortunately, in my experience at least, people find it difficult to grasp the importance of this point, particularly if they come to Dublin Core from the "old world" of library cataloguing, attribute/value pairs and text-string values.  For them, values have always been strings of text written onto cards, or the electronic equivalent of cards.  Doing things that way has always been good enough.  Why should things have to change?  Answer: the Web changes everything - even library cataloguing... eventually!

By way of an example...  let's consider the case of a book, Shakespeare: The World as a Stage by Bill Bryson which I happened to read on my summer holiday a couple of weeks ago.  The dc:subject of this book is William Shakespeare, a person.

Now, as I indicated above, we could treat this as a literal value, the string "William Shakespeare" for example (though a string taken from a well recognised name-authority file would be better). But in doing so we don't provide any hooks that allow other people to say, "hang on, I know something useful about William Shakespeare and I'm going to provide a description of him so that it can be automatically integrated into your metadata if you want". By treating William Shakespeare as a resource (as a non-literal value) and by assigning him a URI, we give people that hook. We give them an unambiguous way of saying, "here is a description of the person that you are saying is the subject of that book by Bill Bryson". Indeed, it allows us to go further... it allows us to say things like, "that person who you say is the dc:subject of that book by Bill Bryson is also the dc:creator of these plays". We can build a massive global graph of data about stuff, all linked together unambiguously through their 'http' URIs in much the same way that Web pages are currently linked together with their 'http' URIs. This is known as linked data.
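
To make the Shakespeare example a bit more concrete, here is a rough Python sketch of two independently published sets of statements being merged purely because they use the same URI. All of the example.org URIs are invented for illustration (nobody has minted them), the helper names are hypothetical, and a real system would use a proper RDF toolkit rather than sets of tuples.

```python
from dataclasses import dataclass

DCTERMS = "http://purl.org/dc/terms/"
FOAF = "http://xmlns.com/foaf/0.1/"

# Hypothetical URIs, invented for this sketch.
SHAKESPEARE = "http://example.org/person/william-shakespeare"
BRYSON_BOOK = "http://example.org/book/shakespeare-the-world-as-a-stage"
HAMLET = "http://example.org/work/hamlet"


@dataclass(frozen=True)
class Literal:
    text: str


# Statements published by whoever catalogued the Bryson book.
library_graph = {
    (BRYSON_BOOK, DCTERMS + "title", Literal("Shakespeare: The World as a Stage")),
    (BRYSON_BOOK, DCTERMS + "subject", SHAKESPEARE),
}

# Statements published by someone else who knows about Shakespeare.
other_graph = {
    (SHAKESPEARE, FOAF + "name", Literal("William Shakespeare")),
    (HAMLET, DCTERMS + "creator", SHAKESPEARE),
}

# Merging is just set union: the shared URI does all the linking.
merged = library_graph | other_graph


def about(graph, uri):
    """Resources whose subject is the thing identified by `uri`."""
    return sorted(s for s, p, o in graph if p == DCTERMS + "subject" and o == uri)


def created_by(graph, uri):
    """Resources whose creator is the thing identified by `uri`."""
    return sorted(s for s, p, o in graph if p == DCTERMS + "creator" and o == uri)


books_about_him = about(merged, SHAKESPEARE)
works_by_him = created_by(merged, SHAKESPEARE)
```

Had the library used the string "William Shakespeare" as the subject instead, the union would still succeed but the two sets of statements would have nothing in common to join on.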

Now, of course, that leaves us with a very fundamental problem. Who the hell is going to mint 'http' URIs for people like William Shakespeare, for concepts like physics and for places like Bath, UK?

This is not an easy question to answer and there are arguments that we should all just go ahead and start doing so, leaving it to someone else to say, "hang on, this is actually the same as that".  But I would argue that libraries and related organisations are well placed to mint persistent and reusable 'http' URIs for much of this stuff (and indeed some of them are now beginning to do so) provided that they drag themselves out of the old world of text strings on cards and into the new world of the Web and the URI.  Just look at the list of example data sets on the Wikipedia page for linked data - can you spot the missing contributing organisations? :-(


Comments

Actually, there *is* one library working on exposing their entire authority DB as Linked Data.

See the presentation:
http://library.wur.nl/WebQuery/file/formulier/profielelaglt_i00117690_001.ppt

I believe there's an upcoming paper from them at DC2008 as well.

Yup, and the Library of Congress is experimenting as well.

http://lcsh.info

I'm going to be presenting about it at DC2008 as well. Thanks for writing this up Andy.

Good exposition. Towards an answer... I believe that more attention needs to be given to easy and standard ways of any minter of URIs to state the relationships between their URI and other ones. SKOS sort-of approaches the issue ("exactMatch", "broadMatch", "narrowMatch") but doesn't allow for a fundamental difference in kinds of similarity/identity that are available for different kinds of thing. Embodied concrete things/events allow true identity; concepts (IMO) do not. For works / expressions etc. we need to take FRBR into account.
