
September 26, 2007

dctagging revisited

In response to the short presentation on encoding DC metadata as "structured tags" or "triple tags" that I gave to the meeting of the DCMI Social Tagging community at the DC-2007 conference, Ganesh Yanamandra from the National Library Board of Singapore posted a comment to my post of a few months ago on the topic, and I thought it was worth replying more fully.

Ganesh points out that at least some of the metadata that can be expressed using DCMI-owned terms is already captured "natively" in the various flavours of feed XML format. This can be done either using "core" components of the format (e.g. the RSS 2.0 <title>, <description>, <author>, etc. elements or the Atom <atom:title>, <atom:summary>, etc. elements) or using the extension features available: there are RSS 1.0 "modules" for "Dublin Core" and "Qualified Dublin Core" (I'll leave to one side the debate about what constitutes "Qualified Dublin Core"...). Yes, this is indeed true, and I didn't mean to suggest that the "structured tagging" approach should replace that; I should have been clearer on that point. Where metadata providers have control over the structure, the markup, of their feed, I agree that the most effective way to expose metadata is to do so explicitly using the "built-in" capabilities of the feed format.

What I was really suggesting with dctagging was that where, for whatever reason, the metadata provider is not in a position to vary the markup of the feed, but is limited to varying only the tags associated with an item (e.g. they are a user of del.icio.us or some other service that supports tagging), a "structured tags" approach offers a mechanism for "tunneling" (not sure that's really the right word) that metadata in the form of the tag content, rather than as markup. I did admit in Singapore that this was a bit of a "hack"... ;-)

On the subject of an algorithm for "extracting"/"expanding" structured tags, I was hoping I might be able to do this using the CONSTRUCT feature of the SPARQL RDF query language, but I can't see a way of using SPARQL to generate literals in the output graph which are "substrings" of those in the input graph (e.g. from an input literal "dctagged dc:creator=Berners-LeeTim" I'd like to generate an output literal "Berners-LeeTim").
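For illustration, the substring step on its own is straightforward in a general-purpose language. The following is a minimal Python sketch and not the approach taken in this post; the function name tag_value is hypothetical, and it assumes that tags within the literal are separated by whitespace and that a structured tag has the form term=value, as in the example above.

```python
def tag_value(literal, term):
    """Return the value of the first structured tag `term=value` in a
    whitespace-separated tag literal, or None if there is no such tag."""
    for token in literal.split():
        name, sep, value = token.partition("=")
        # sep is empty when the token contains no "=", i.e. it is a plain tag
        if sep and name == term:
            return value
    return None


print(tag_value("dctagged dc:creator=Berners-LeeTim", "dc:creator"))
```

Plain tags such as "dctagged" contain no "=" and are simply skipped.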

So I've fallen back on my old favourite multi-purpose screwdriver-cum-hammer, XSLT, to process an XML serialisation. expanddctag.xsl is a (pretty rough and ready) XSLT 1.0 transform which takes as input an RSS feed that has been translated into valid RSS 1.0 RDF/XML using Dave Beckett's Triplr service, and outputs an RDF/XML document representing an extended RDF graph, which includes the additional triples represented by the "structured tags".

So for example:

A few points:

  • The transform generates triples only for the prefixes "dc" and "dcterms" because that was the scope I had in mind for this convention! (You could extend the table near the start of the stylesheet so that it maps other prefixes too, though if you wanted a completely open-ended set of prefixes/"root URIs", then I'm inclined to say it would be better to shift to a convention which associates prefixes and "root URIs" in the data itself, e.g. in the way Flickr machine tags do.)
  • No attempt is made to check the content of the tags and whether they map to the URIs of existing DCMI-owned properties: e.g. if a tag contains dc:audience=xyz, then a triple will be generated with the predicate http://purl.org/dc/elements/1.1/audience, even though DCMI defines no such property in that namespace.
  • As I note in my presentation, the convention is limited to generating what the DCMI Abstract Model calls literal value surrogates, i.e. in RDF terms, literal objects are generated. No attempt is made to check that the predicates generated are intended by DCMI for use with literal objects (and indeed at the time of writing DCMI has not yet made such information available in its "term descriptions").
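The behaviour described in these points can be sketched in a few lines of Python. This is an illustration only, not a port of expanddctag.xsl: the names PREFIXES and expand_tags are hypothetical, triples are modelled as plain tuples, and the "dcterms" root URI (http://purl.org/dc/terms/) is assumed to be the one the stylesheet's table uses.

```python
# Hypothetical prefix table: only "dc" and "dcterms" are recognised.
# The dcterms root URI is an assumption about the stylesheet's table.
PREFIXES = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}


def expand_tags(subject_uri, tags):
    """Return (subject, predicate, literal) tuples for each structured tag
    of the form prefix:name=value whose prefix is in PREFIXES."""
    triples = []
    for tag in tags:
        term, eq, value = tag.partition("=")
        prefix, colon, local = term.partition(":")
        if not (eq and colon and local and value):
            continue  # a plain tag, not a structured one
        root = PREFIXES.get(prefix)
        if root is None:
            continue  # prefix outside the table: no triple generated
        # No check that root + local is a property DCMI actually defines,
        # and the object is always a literal.
        triples.append((subject_uri, root + local, value))
    return triples


item = "http://www.w3.org/Provider/Style/URI"
tags = "URI dc:creator=Berners-LeeTim dctagged identifiers web".split()
for triple in expand_tags(item, tags):
    print(triple)
```

Extending the convention to other prefixes would, in this sketch, mean adding entries to PREFIXES, which corresponds to extending the table in the stylesheet.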

And I make no promises about the longer term persistence of that incognitum.net URI for the transform! It'll dereference to that XSLT doc for, well, for a while. But if you want to do anything with that transform then I suggest that you take a copy of it and give it another URI.

At the risk of embarking on something of a digression, playing around with this reminded me that I noticed some time ago that del.icio.us' use of the dc:creator element in its RSS 1.0 feeds seems slightly odd. My del.icio.us feed contains XML fragments like:

<item rdf:about="http://www.w3.org/Provider/Style/URI">
  <title>Hypertext Style: Cool URIs don't change.</title>
  <dc:creator>PeteJ</dc:creator>
  <dc:subject>URI dc:creator=Berners-LeeTim dctagged identifiers web</dc:subject>
  <taxo:topics>
    <rdf:Bag>
      <rdf:li resource="http://del.icio.us/tag/URI" />
      <rdf:li resource="http://del.icio.us/tag/identifiers" />
      <rdf:li resource="http://del.icio.us/tag/web" />
      <rdf:li resource="http://del.icio.us/tag/dctagged" />
      <rdf:li resource="http://del.icio.us/tag/dc:creator%3DBerners-LeeTim" />
    </rdf:Bag>
  </taxo:topics>
</item>

This "says" that the RSS item denoted by the URI http://www.w3.org/Provider/Style/URI was created by "PeteJ". Let's put to one side for the purposes of this discussion whether the literal "PeteJ" is capable of creating anything ;-) Further, if Andy has bookmarked the same resource in his del.icio.us collection, then del.icio.us generates a triple "saying" that the "Cool URIs" document was created by "andypowell". And so on for anyone else who bookmarks that resource.

Of course I'm not a creator of that "Cool URIs" document. Yes, I created my feed, the collection of resources, and I created the bookmark, the description of the document, if you like, but I didn't create the bookmarked document itself.

The contradiction becomes perhaps even more apparent when I apply the transform to "expand" the structured tag in my feed: the output from the transform now has two dc:creator elements, each representing a triple with the predicate http://purl.org/dc/elements/1.1/creator, one with the object "PeteJ" and one with the object "Berners-LeeTim".

It seems to me that in the way it generates its RSS feeds, del.icio.us conflates two resources that are quite distinct: the bookmark/description and the bookmarked document are two quite separate resources, created by two different people at different points in time, and if we need to "say" things about both of those resources then we need to refer to them using two distinct URIs.
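One way to picture the distinction is as two small sets of triples. The sketch below is illustrative only: plain Python tuples stand in for an RDF graph, the helper name creators is hypothetical, and the bookmark URI under example.org is invented for the example rather than being a URI that del.icio.us provides.

```python
DC_CREATOR = "http://purl.org/dc/elements/1.1/creator"
DOCUMENT = "http://www.w3.org/Provider/Style/URI"
# Hypothetical URI for the bookmark itself; invented for this example.
BOOKMARK = "http://example.org/bookmarks/PeteJ/cool-uris"

# One URI for both resources: two creators attach to the document.
conflated = {
    (DOCUMENT, DC_CREATOR, "PeteJ"),
    (DOCUMENT, DC_CREATOR, "Berners-LeeTim"),
}

# Distinct URIs: each creator attaches to the resource they created.
separated = {
    (BOOKMARK, DC_CREATOR, "PeteJ"),
    (DOCUMENT, DC_CREATOR, "Berners-LeeTim"),
}


def creators(triples, subject):
    """Return the sorted dc:creator literals asserted about `subject`."""
    return sorted(o for s, p, o in triples if s == subject and p == DC_CREATOR)


print(creators(conflated, DOCUMENT))
print(creators(separated, DOCUMENT))
print(creators(separated, BOOKMARK))
```

A fuller model would also need a triple relating the bookmark to the document it describes; that is left out here because the choice of property is a separate question.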





How about adding a GRDDL link from the RSS document to your XSLT?

Thanks Pete for the wonderful explanation.
