February 26, 2010

The 2nd Linked Data London Meetup & trying to bridge a gap

On Wednesday I attended the 2nd London Linked Data Meetup, organised by Georgi Kobilarov and Silver Oliver and co-located with the JISC Dev8D 2010 event at ULU.

The morning session featured a series of presentations:

  • Tom Heath from Talis started the day with Linked Data: Implications and Applications. Tom introduced the RDF model, and planted the idea that the traditional "document" metaphor (and associated notions like the "desktop" and the "folder") was inappropriate and unnecessarily limiting in the context of Linked Data's Web of Things. Tom really just scratched the surface of this topic, I think, with a few examples of the sort of operations we might want to perform, and there was probably scope for a whole day of exploring it.
  • Tom Scott from the BBC on the Wildlife Finder, the ontology behind it, and some of the issues around generating and exposing the data. I had heard Tom speak before, about the BBC Programmes and Music sites, and again this time I found myself admiring the way he covered potentially quite complex issues very clearly and concisely. The BBC examples provide great illustrations of how linked data is not (or at least should not be) something "apart from" a "Web site", but rather is an integral part of it: they are realisations of the "your Web site is your API" maxim. The BBC's use of Wikipedia as a data source also led into some interesting discussion of trust and provenance, and dealing with the risk of, say, an editor of a Wikipedia page creating malicious content which was then surfaced on the BBC page. At the time of the presentation, the wildlife data was still delivered only in HTML, but Tom announced yesterday that the RDF data was now being exposed, in a similar style to that of the Programmes and Music sites.
  • John Sheridan and Jeni Tennison described their work on initiatives to open up UK government data. This was really something of a whirlwind (or maybe given the presenters' choice of Wild West metaphors, that should be a "twister") tour through a rapidly evolving landscape of current work, but I was impressed by the way they emphasised the practical and pragmatic nature of their approaches, from guidance on URI design through work on provenance, to the current work on a "Linked Data API" (on which more below).
  • Lin Clark of DERI gave a quick summary of support for RDFa in Drupal 7. It was necessarily a very rapid overview, but it was enough to make me make a mental note to try to find the time to explore Drupal 7 in more detail.
  • Georgi Kobilarov and Silver Oliver presented Uberblic, which provides a single integrated point of access to a set of data sources. One of the very cool features of Uberblic is that updates to the sources (e.g. a Wikipedia edit) are reflected in the aggregator in real time.

The morning closed with a panel session chaired by Paul Miller, involving Jeni Tennison, Tom Scott, Ian Davis (Talis) and Timo Hannay (Nature Publishing) which picked up many of the threads from the preceding sessions. My notes (and memories!) from this session seem a bit thin (in my defence, it was just before lunch and we'd covered a lot of ground...), but I do recall discussion of the trade-offs between URI readability and opacity, and the impact on managing persistence, which I think spilled out into quite a lot of discussion on Twitter. IIRC, this session also produced my favourite quote of the day, from Tom Scott, which was something along the lines of, "The idea that doing linked data is really hard is a myth".

Perhaps the most interesting (and timely/topical) session of the day was the workshop at the end of the afternoon by Jeni Tennison, Leigh Dodds and Dave Reynolds, in which they introduced a proposal for what they call a "Linked Data API".

This defines a configurable "middleware" layer that sits in front of a SPARQL endpoint to support the provision of RESTful access to the data, including not only the provision of descriptions of individual identified resources, but also selection and filtering based on simple URI patterns rather than on SPARQL, and the delivery of multiple output formats (including a serialisation of RDF in JSON - and the ability to generate HTML or XHTML). (It only addresses read access, not updates.)
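
To make this concrete, here is a purely illustrative sketch - the URI pattern, parameter name and vocabulary below are invented for this example, not taken from the proposal itself. A client might request a filtered list of resources with a simple URL:

  GET http://example.org/api/schools?type=Primary&_format=json

and the API layer might translate that request into a SPARQL query against the underlying endpoint along these lines, before serialising the results in the requested format:

  PREFIX ex: <http://example.org/terms/>
  DESCRIBE ?school
  WHERE {
    ?school a ex:School ;
            ex:schoolType "Primary" .
  }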

This initiative emerged at least in part out of responses to the data.gov.uk work, and comments on the UK Government Data Developers Google Group and elsewhere by developers unfamiliar with RDF and related technologies. It seeks to address the problem that providing query access only through SPARQL requires application developers to engage directly with the SPARQL query language, the RDF model and the possibly unfamiliar formats returned by SPARQL endpoints. At the same time, this approach seeks to retain the "essence" of the RDF model in the data - and to provide clients with access to the underlying queries if required: it complements the SPARQL approach rather than replacing it.

The configurability offers a considerable degree of flexibility in the interface that can be provided - without the requirement to create new application code. Leigh made the important point that the API layer might be provided by the publisher of the SPARQL endpoint, or it might be provided by a third party, acting as an intermediary/proxy to a remote SPARQL endpoint.

IIRC, mentions were made of work in progress on implementations in Ruby, PHP and Java(?).

As a non-developer myself, I hope I haven't misrepresented any of the technical details in my attempt to summarise this. There was a lot of interest in this session at the meeting, and it seems to me this is potentially an important contribution to bridging the gap between the world of Linked Data and SPARQL on the one hand and Web developers on the other, both in terms of lowering immediate barriers to access and in terms of introducing SPARQL more gradually. There is now a Google Group for discussion of the API.

All in all it was an exciting if somewhat exhausting day. The sessions I attended were all pretty much full to capacity and generated a lot of discussion, and it generally felt like there is a great deal of excitement and optimism about what is becoming possible. The tools and infrastructure around linked data are still evolving, certainly, but I was particularly struck - through initiatives like the API project above - by the sense of willingness to respond to comments and criticisms and to try to "build bridges", and to do so in very real, practical ways.

February 19, 2010

In the clouds

So, the Repositories and the Cloud meeting, jointly organised by ourselves and the JISC, takes place on Tuesday next week and I promised to write up my thoughts in advance.  Trouble is... I'm not sure I actually have any thoughts :-(

Let's start from the very beginning (it's a very good place to start)...

The general theory behind cloud solutions - in this case we are talking primarily about cloud storage solutions but I guess this applies more generally - is that you outsource parts of your business to someone else because:

  • they can do it better than you can,
  • they can do it more cheaply than you can,
  • they can do it in a more environmentally-friendly way than you can, or
  • you simply no longer wish to do it yourself for other reasons.

Seems simple enough and I guess that all of these apply to the issues at hand for the meeting next week, i.e. what use is there for utility cloud storage solutions for the data currently sitting in institutional repositories (and physically stored on disks inside the walls of the institution concerned).

Against that, there are a set of arguments or issues that militate against a cloud solution, such as:

  • security
  • data protection
  • sustainability
  • resilience
  • privacy
  • loss of local technical knowledge
  • ...

...you know the arguments.  Ultimately institutions are going to end up asking themselves questions like, "how important is this data to us?", "are we willing to hand it over to one or more cloud providers for long term storage?", "can we afford to continue to store this stuff for ourselves?", "what is our exit strategy in the future?", and so on.

Wrapped up in this will be issues about the specialness of the kind of stuff one typically finds in institutional repositories - either because of volume of data (large research data-sets for example), or because stuff is seen as being especially important for various reasons (it's part of the scholarly record for example).

None of which is particularly helpful in terms of where the meeting will take us!  I certainly don't expect any actual answers to come out of it, but I am expecting a good set of discussions about current capabilities (what the current tools can do), about policy issues, and about where we are likely to go in the future.

One of the significant benefits the current interest in cloud solutions brings is the abstraction of the storage layer from the repository services.  Even if I never actually make use of Amazon S3, I might still get significant benefit from the cloud storage mindset because my internal repository 'storage' layer is separated from the rest of the software.  That means that I can do things like sharing data across multiple internal stores, sharing data across multiple external stores, or some combination of both, much more easily.  It also potentially opens up the market to competing products.

So, I think this space has wider implications than a simple, "should I use cloud storage?" approach might imply.

From an Eduserv point of view, both as a provider of not-for-profit services to the public, health and education sectors and as an organisation with a brand spanking new data centre, I don't think there's any secret in the fact that we want to understand whether there is anything useful we can bring to this space - as a provider of cloud storage solutions that are significantly closer to the community than the utility providers are, for example.  That's not to say that we have such an offer currently - but it is the kind of thing we are interested in thinking about.

I don't particularly buy into the view that the cloud is nothing new.  Amazon S3 and its ilk didn't exist 10 years ago and there's a reason for that.  As markets and technology have matured new things have become possible.  But that, on its own, isn't a reason to play in the cloud space. So, I suppose that the real question for the meeting next week is, "when, if ever, is the right time to move to cloud storage solutions for repository content... and why?" - both from a practical and a policy viewpoint.

I don't know the answers to those questions but I'm looking forward to finding out more about it next week.

February 15, 2010

VCard in RDF

Via a post to the W3C Semantic Web Interest Group mailing list from Renato Iannella, I noticed that the W3C has published an updated specification for expressing vCard data in RDF, Representing vCard Objects in RDF.

A comment from the W3C Team provides some of the background to the document, and explains that it represents a consolidation of an earlier (2001) formulation of vCard in RDF by Renato published by the W3C and a subsequent ontology created by Norman Walsh, Brian Suda and Harry Halpin, the latter also used, at least in part, by Yahoo SearchMonkey (see Local, Event, Product, Discussion).

The new document also provides an answer to a question which I've been unsure about for years: whether the class of vCards was disjoint with the class of "agents" (e.g. persons or organisations). Or, to put it another way, I wasn't sure whether the properties of a vCard could be used to describe persons or organisations, e.g. "this document was created by the agent with the vCard name 'John Smith'":

@prefix dc: <http://purl.org/dc/terms/> .
@prefix v: <http://www.w3.org/2006/vcard/ns#> .
 <> dc:creator
  [
    v:fn "John Smith"
  ] .

(This, I think, is the pattern recommended by Yahoo SearchMonkey: see, for example, the extended "product" example where the manufacturer of a product is an instance of both the v:VCard class and the commerce:Business class.)

The alternative would be that those properties had to be used to describe a vCard-as-"document" (or as something other than an agent, at least), which was in turn related to a person or organisation, e.g. "this document was created by the agent who 'has a vCard' with the vCard name 'John Smith'" (I invented an example "hasVCard" property here):

@prefix dc: <http://purl.org/dc/terms/> .
@prefix v: <http://www.w3.org/2006/vcard/ns#> .
@prefix ex: <http://example.org/terms/> .
 <> dc:creator
  [
    ex:hasVCard
     [
       v:fn "John Smith"
     ]
 ] .

To keep the examples brief I used blank nodes in the above examples, but URIs might have been used to refer to the vCard and agent resources too.

In its description of the VCard class, the new document says: Resources that are vCards and the URIs that denote these vCards can also be the same URIs that denote people/orgs. That phrasing seems a bit awkward, but the intent seems pretty clear: the classes are not disjoint, a single resource can be both a vCard and a person, and the properties of a vCard can be applied to a person. So I can use the pattern in my first example above without creating any sort of contradiction, and the second pattern is also permitted.

One consequence of this is that consumers of data need to allow for both patterns - in the general case, at least, though it may be that they have additional information that within a particular dataset only a single pattern is in use.
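
By way of illustration, the non-disjoint reading means that a graph like the following (a minimal sketch using the FOAF vocabulary; the subject URI is invented) is perfectly legitimate, with a single resource typed as both a person and a vCard:

@prefix v: <http://www.w3.org/2006/vcard/ns#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
<http://example.org/people/john>
  a foaf:Person , v:VCard ;
  v:fn "John Smith" .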

In the second example above, I used an example.org URI for the property that relates an agent to a vCard. The discussion on the list highlights that there are a couple of contenders for properties to meet this requirement: ov:businessCard from Michael Hausenblas and hcterms:hasCard from Toby Inkster. A proposal to add such a property to the FOAF vocabulary has been made on the FOAF community wiki.

February 11, 2010

Google and usability

I made a somewhat negative (and admittedly completely OTT) comment about Google and usability on Twitter earlier on today, stemming in part from the realisation that I find it increasingly hard to get excited by, or even remotely interested in, Google's latest shiny new offering, whether it's Wave or Buzz or whatever comes next:

i suspect it is because with everything google does other than search, their usability sucks big time

I got some positive and negative responses to my statement but the conversation got sidetracked by the issue of Google's handling of multiple accounts which, while part of the problem, isn't really what I was getting at.  That, coupled with the fact that Twitter isn't exactly the best place to have a discussion, meant I didn't really get my point across.

(As an aside, the majority of my usage of Google uses my Google Apps account to access my personal email.  Unfortunately, Google Apps accounts tend not to get access to newer features like Wave and Buzz, at least not when they first come out, which pretty much forces people to maintain at least one other Google account if they want to experiment with those things.  I find that frustrating and not a little annoying - something I've labelled "Google identity disorder": the state of having multiple Google accounts and never being logged into the right one at the right time - and it is undoubtedly part of the reason that I find it hard to get excited by the newer stuff.)

Like pretty much everyone, I use Google search every single day, probably every single hour of my waking life. Google search is the benchmark of functionality and usability in Internet search - it's what every other search engine compares itself to, and has been pretty much since it was first released. That's a pretty amazing track record!

What's been the basis of that track record?  It is simplicity, at least as far as the user is concerned, that has kept it in pole position.  Google search does one thing, really, really well.

But it is a track record that I don't think Google have come anywhere close to matching with their other offerings - with the possible exception of Google Maps - at least among the things I've tried.  Gmail is pretty good I guess, but even after using it continuously for several years I still find things that catch me out (perhaps that's just me?) and I'm still not totally convinced that it falls into the class of being a really good email client for Joe Normal - my mum or dad for example.  (Hmmm... actually my mum and dad are definitely not Joe Normal, but you get the idea!)

I find Google Docs relatively uncompelling and Google Wave just strikes me as a noisy, cluttered mess, with no thought having been given to usability at all.

All in all, it seems to me that Google need to re-learn how to keep things simple, not just in terms of the user experience but also in terms of the proposition being put on the table.

That was all I meant.

Addendum 1 (there may be others): This should never have been a blog post.  It should never even have been a tweet!  People will inevitably keep saying things like, "yes, but Google X is pretty good".  And I'll have to say, "yes, you're right".  I just end up looking like an idiot I guess :-).  At the end of the day... Google have done some good stuff and they've done some bad stuff.  Big deal - hardly the stuff of a useful blog post.  Apologies for my wayward stream of consciousness!  I do think there is something wrapped up in my discomfort that I'd like to return to later but I'll either do that as a second addendum here later today, or as a new blog post in due course.

Addendum 2: On balance, I don't think this blog post warrants a second addendum :-)

Repositories and the Cloud - tell us your views

It's now a little over a week to go until the Repositories and the Cloud event (jointly organised by Eduserv and the JISC) takes place in London.  The event is sold out (sorry to those of you that haven't got a place) and we have a full morning of presentations from DuraSpace, Microsoft and EPrints and an afternoon of practical experience (Terry Harmer of the Belfast eScience Centre) and parallel discussion groups looking at both policy and technical issues.

To those of you that are coming, please remember that the afternoon sessions are for discussion.  We want you to get involved, to share your thoughts and to challenge the views of other people at the event (in the nicest way possible of course).  We'd love to know what you think about repositories and the cloud (note that, by that phrase, I mean the use of utility cloud providers as back-end storage for repository-like services).  Please share your thoughts below, or blog them using the event tag - 'repcloud' - or just bring them with you on the day!

I will share my thoughts separately here next week but let me be honest... I don't actually know what I think about the relationship between repositories and the cloud.  I'm coming to the meeting with an open mind.  As a community, we now have some experience of the policy and technical issues in our use of the cloud for things like undergraduate email but I expect the policy issues and technical requirements around repositories to be significantly different.  On that basis, I am really looking forward to the meeting.

The chairs of the two afternoon sessions, Paul Miller (paul.miller (at) cloudofdata.com) who is leading the policy session and Brad McLean (bmclean (at) fedora-commons.org) who is leading the technical session, would also like to hear your views on what you hope their sessions will cover.  If you have ideas please get in touch, either thru the comments form below, via Twitter (using the '#repcloud' hashtag) or by emailing them directly.

Thanks.

February 09, 2010

Virtual World Watch survey call for information

John Kirriemuir has issued a request for updated information for his eighth Virtual World Watch "snapshot" survey of the use of virtual worlds in UK Higher and Further Education.

Previous survey reports can be found on the VWW site.

For further information about the sort of information John is after, see his post. He would like responses by the end of February 2010.

Our period of funding for this work is approaching its end, so this will be the last survey funded under the Eduserv Research Programme. John is planning to continue some Virtual World Watch activity, at least through 2010, as he indicates in this presentation which he gave to the recent "Where next for Virtual Worlds?" (wn4vw) meeting in London.

The slides from the other presentations from the wn4vw meeting (including a video of the opening presentation by Ralph Schroeder) are also available here, and you can find an archive of tagged Twitter posts from the day here.

I enjoyed the meeting (even if I'm not sure we really arrived at many concrete answers to the question of "where next?"), but it also felt quite sad. It marked the end of the projects Eduserv funded in 2007 on the use of virtual worlds in education. That grants call was the first one I was involved with after joining Eduserv in 2006, and although it was an area that was completely new to me, the response we got, both in terms of the number of proposals and their quality, seemed very exciting. And I still look back on the 2007 Symposium as one of the most successful (if rather nerve-wracking at the time!) events I've been involved in. As things worked out, I wasn't able to follow the progress of the projects as closely as I'd have liked, but the recent meeting reminded me again of the strong sense of community that seems to have built up amongst researchers, learning technologists and educators working in this area, which seems to have outlived particular projects and programmes. Of course we only funded a handful of projects, and other funding agencies helped develop that community too (I'm thinking particularly of JISC with its Open Habitat project, and the EU MUVEnation project), but it's something I'm pleased we were able to contribute to in a small way.

February 03, 2010

More famous than Simon Cowell

I wrote a blog post on my other, Blipfoto, blog this morning, More famous than Simon Cowell, looking at some of the issues around persistent identifiers from the perspective of a non-technical audience. (You'll have to read the post to understand the title).

I used the identifier painted on the side of a railway bridge just outside Bath as my starting point.

It's certainly not an earth-shattering post, but it was quite interesting (for me) to approach things from a slightly different perspective:

What makes the bridge identifier persistent? It's essentially a social construct. It's not a technical thing (primarily). It's not the paint the number is written in, or the bricks of the bridge itself, or the computer system at head office that maps the number to a map reference. These things help... but it's mainly people that make it persistent.

I wrote the piece because the JISC have organised a meeting, taking place in London today, to consider their future requirements around persistent identifiers. For various reasons I was not able to attend - a situation that I'm pretty ambivalent about to be honest - I've sat thru a lot of identifier meetings in the past :-).

Regular readers will know that I've blown hot and cold (well, mainly cold!) about the DOI - an identifier that I'm sure will feature heavily in today's meeting. Just to be clear... I am not against what the DOI is trying to achieve, nor am I in any way negative about the kinds of services, particularly CrossRef, that have been able to grow up around it. Indeed, while I was at UKOLN I committed us to joining CrossRef and thus assigning DOIs to all UKOLN publications. (I have no idea if they are still members).  I also recognise that the DOI is not going to go away any time soon.

I am very critical of some of the technical decisions that the DOI people have made - particularly their decision to encourage multiple ways of encoding the DOI as a URI and the fact that the primary way (the 'doi' URI scheme) did not use an 'http' URI. Whilst persistence is largely a social issue rather than a technological one, I do think that badly used technology can get in the way of both persistence and utility. I also firmly believe in the statement that I have made several times previously... that "the only good long term identifier is a good short term identifier".  The DOI, in both its 'doi' URI and plain-old string of characters forms, is not a good short term identifier.
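
To illustrate: a single DOI can be written as the bare string 10.1000/182 (the DOI of the DOI Handbook itself), as doi:10.1000/182, as info:doi/10.1000/182, or as the http URI http://dx.doi.org/10.1000/182 - and only the last of these can be pasted into a browser, linked to from a Web page, or used directly in Linked Data.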

My advice to the JISC? Start from the principles of Linked Data, which very clearly state that 'http' URIs should be used. Doing so sidesteps many of the cyclic discussions that otherwise occur around the benefits of URNs and other URI schemes and allows people to focus on the question of, "how do we make http URIs work as well and as persistently as possible?" rather than always starting from, "http URIs are broken, what should we build instead?".

February 02, 2010

Data.gov.uk, Creative Commons and the public domain

In a blog post at Creative Commons, UK moves towards opening government data, Jane Park notes that the UK Government have taken a significant step towards the use of Creative Commons licences by making the terms and conditions for the data.gov.uk website compatible with CC-BY 3.0:

In a step towards openness, the UK has opened up its data to be interoperable with the Attribution Only license (CC BY). The National Archives, a department responsible for “setting standards and supporting innovation in information and records management across the UK,” has realigned the terms and conditions of data.gov.uk to accommodate this shift. Data.gov.uk is “an online point of access for government-held non-personal data.” All content on the site is now available for reuse under CC BY. This step expresses the UK’s commitment to opening its data, as they work towards a Creative Commons model that is more open than their former Click-Use Licenses.

This feels like a very significant move - and one that I hadn't fully appreciated in the recent buzz around data.gov.uk.

Jane Park ends her piece by suggesting that "the UK as well as other governments move in the future towards even fuller openness and the preferred standard for open data via CC Zero". Indeed, I'm left wondering about the current move towards CC-BY in relation to the work undertaken a while back by Talis to develop the Open Data Commons Public Domain Dedication and Licence.

As Ian Davis of Talis says in Linked Data and the Public Domain:

In general factual data does not convey any copyrights, but it may be subject to other rights such as trade mark or, in many jurisdictions, database right. Because factual data is not usually subject to copyright, the standard Creative Commons licenses are not applicable: you can’t grant the exclusive right to copy the facts if that right isn’t yours to give. It also means you cannot add conditions such as share-alike.

He suggests instead that waivers (of which CC Zero and the Public Domain Dedication and License (PDDL) are examples) are a better approach:

Waivers, on the other hand, are a voluntary relinquishment of a right. If you waive your exclusive copyright over a work then you are explictly allowing other people to copy it and you will have no claim over their use of it in that way. It gives users of your work huge freedom and confidence that they will not be persued for license fees in the future.

Ian Davis' post gives detailed technical information about how such waivers can be used.

Second Life, scalability and data centres

Interesting article about the scalability issues around Second Life, What Second Life can teach your datacenter about scaling Web apps. (Note: this is not about the 3-D virtual world aspects of Second Life but about how the infrastructure to support it is delivered.)

Plenty of pixels have been spilled on the subject of where you should be headed: to single out one resource at random, Microsoft presented a good paper ("On Designing and Deploying Internet-Scale Services" [PDF]) with no less than 71 distinct recommendations. Most of them are good ("Use production data to find problems"); few are cheap ("Document all conceivable component failure modes and combinations thereof"). Some of the paper's key overarching principles: make sure all your code assumes that any component can be in any failure state at any time, version all interfaces such that they can safely communicate with newer and older modules, practice a high degree of automated fault recovery, auto-provision all resources. This is wonderful advice for very large projects, but herein lies a trap for smaller ones: the belief that you can "do it right the first time." (Or, in the young-but-growing scenario, "do it right the second time.") This is unlikely to be true in the real world, so successful scaling depends on adapting your technology as the system grows.

February 01, 2010

HTML5, document metadata and Dublin Core

I recently received a query about the encoding of Dublin Core metadata in HTML5, the revision of the HTML language being developed jointly by the W3C HTML Working Group and the Web Hypertext Application Technology Working Group (WHATWG). It has also been the topic of some recent discussion on the dc-general mailing list. While I've been aware of some of the discussions around metadata features in HTML5, until now I haven't looked in much detail at the current drafts.

There are various versions of the specification(s), all drafts under development and still changing (at times, quite quickly):

  • The WHATWG has a working draft titled HTML5 (including next generation additions still in development). This document is constantly changing; the content at the time I'm writing is dated 30 January 2010, but will no doubt have changed by the time you read this. Of this spec, the WHATWG says: This draft is a superset of the HTML5 work that is being published at the W3C: everything that is in HTML5 is also in the WHATWG HTML spec. Some new experimental features are being added to the WHATWG HTML draft, to continue developing extensions to the language while the W3C work through their milestones for HTML5. In other words, the WHATWG HTML specification is the next generation of the language, while HTML5 is a more limited subset with a narrower scope.
  • The W3C has a "latest public version" of HTML 5: A vocabulary and associated APIs for HTML and XHTML, currently the version dated 25 August 2009. (The content of that "date-stamped" version should continue to be available.)
  • The W3C always has a "latest editor's draft" of that document, which at the time of writing is dated 30 January 2010, but also continues to change at frequent intervals. Note that, compared to the previous "latest public version", this draft incorporates some element of restructuring of the content, with some content separated out into "modules".

I can't emphasise too strongly that HTML5 is still a draft and liable to change; as the spec itself says in no uncertain terms: Implementors should be aware that this specification is not stable. Implementors who are not taking part in the discussions are likely to find the specification changing out from under them in incompatible ways.

For the purposes of this discussion I've looked primarily at the third document above, the W3C latest editor's draft. This post is really an attempt to raise some initial questions (and probably to expose my own confusion) rather than to provide any definitive answers. It is based on my (incomplete and very probably imperfect) reading of the drafts as they stand at this point in time - and it represents a personal view only, not a DCMI view.

1. Dublin Core metadata in HTML4 and XHTML

(This section covers DCMI's current recommendations for embedding metadata in X/HTML, so feel free to skip it if you are already familiar with this.)

To date, DCMI's specifications for embedding metadata in X/HTML documents have concerned themselves with representing metadata "about" the document as a whole, "document metadata", if you like. And in HTML4/XHTML, the principal source of document metadata is the head element (HTML4, 7.4). Within the head element:

  • the meta element (HTML4, 7.4.4.2) provides for the representation of "property name" (the value of the @name attribute)/"property value" (the value of the @content attribute) pairs which apply to the document. It also offers the ability to supplement the value with the name of a "scheme" (the value of the @scheme attribute) which is used "to interpret the property's value".
  • the link element (HTML4, 12.3) provides a means of representing a relationship between the document and another resource. It also - in attributes like @hreflang and @title - supports the provision of some metadata "about" that second resource.

(I should note here that the above text uses the terminology of the HTML4 specification, not of RDF or the DCMI Abstract Model (DCAM).)

The current DCMI recommendation for embedding document metadata in X/HTML is Expressing Dublin Core metadata using HTML/XHTML meta and link elements - which from here on I'll just refer to as "DC-HTML". Although the current recommendation is dated 2008, that version is only a minor "modernisation" of conventions that DCMI has recommended since the late 1990s. The specification describes a convention for representing what the DCAM calls a description (of the document) - a set of RDF triples - using the HTML meta and link elements and their attributes (and conversely, for interpreting a sequence of HTML meta and link elements as a set of RDF triples/DCAM description set). Contrary to some misconceptions, the convention is not limited to the use of DCMI-owned "terms"; indeed it does not assume the use of any DCMI-owned terms at all.

Consider the example of the following two RDF triples:

@prefix dc: <http://purl.org/dc/terms/> .
@prefix ex: <http://example.org/terms/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
<> dc:modified "2007-07-22"^^xsd:date ;
   ex:commentsOn <http://example.org/docs/123> .

Aside: from the perspective of the DCMI Abstract Model, these would be equivalent to the following description set, expressed using the DC-Text syntax, but for the rest of this post, to keep things simple, I'll just refer to the RDF triples.

@prefix dc: <http://purl.org/dc/terms/> .
@prefix ex: <http://example.org/terms/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
DescriptionSet (
  Description (
    Statement (
      PropertyURI ( dc:modified )
      LiteralValueString ( "2007-07-22"
        SyntaxEncodingSchemeURI ( xsd:date )
      )
    )
    Statement (
      PropertyURI ( ex:commentsOn )
      ValueURI ( <http://example.org/docs/123> )
      )
    )
  )
)

Following the conventions of DC-HTML, those triples are represented in XHTML as:

Example 1: DC-HTML profile in XHTML

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head profile="http://dublincore.org/documents/2008/08/04/dc-html/">
    <title>Document 001</title>
    <link rel="schema.DC"
          href="http://purl.org/dc/terms/" />
    <link rel="schema.EX"
          href="http://example.org/terms/" />
    <link rel="schema.XSD"
          href="http://www.w3.org/2001/XMLSchema#" />
    <meta name="DC.modified"
          scheme="XSD.date" content="2007-07-22" />
    <link rel="EX.commentsOn"
          href="http://example.org/docs/123" />
  </head>
</html>

A few points to highlight:

  • The example is provided in XHTML but essentially the same syntax would be used in HTML4.
  • The triple with literal object is represented using a meta element.
  • The triple with the URI as object is represented using a link element.
  • The predicate (property URI) may be any URI; the DC-HTML convention is not limited to DCMI-owned URIs, i.e. DC-HTML seeks to support the sort of URI-based vocabulary extensibility provided by RDF. There is no "registry" of a bounded set of terms to be used in metadata represented using DC-HTML; or, rather, "the Web is the registry". All an implementer requires to introduce a new property is the authority to assign a URI in some URI space they own (or in which they have been delegated rights to assign URIs).
  • A convention for representing property URIs and datatype URIs as "prefixed names" is used, and in this example three other link elements (with @rel="schema.{prefix}") are introduced to act as "namespace declarations" to support the convention. When a document using DC-HTML is processed, no RDF triples are generated for those link elements (Aside: I have occasionally wondered whether this is abusing the rel attribute, which is intended to capture the type of relationship between the document and the target resource, i.e. it is using a mechanism which does carry semantics for an essentially syntactic end (the abbreviation of URIs). But I'll suspend those misgivings for now...).
  • The prefixes used in these "prefixed names" are arbitrary, and DC-HTML does not specify the use/interpretation of a fixed set of @name or @rel attribute values. In the example above, I chose to associate the "DC" prefix with the "namespace URI" http://purl.org/dc/terms/, though "traditionally" it has been more commonly associated with the "namespace URI" http://purl.org/dc/elements/1.1/. Another document creator might associate the same prefix with a quite different URI again.
  • The DC-HTML profile generates triples only for those meta and link elements where the values of the @name and @rel attributes contain a prefixed name with a prefix for which there is a corresponding "namespace declaration".
  • The datatype of the typed literal is represented by the value of the meta/@scheme attribute.
  • There is no support for RDF blank nodes.

For the purposes of this discussion, perhaps the main point to make is that this use/interpretation of meta and link elements is specific to DC-HTML, not a general interpretation defined by the HTML4 specification. The mapping of prefixed names to URIs using link[@rel="schema.ABC"] "namespace declarations" is a DCMI convention, not part of X/HTML. And this is made possible through the use of a feature of HTML4 and XHTML called a "meta data profile": the document creator signals (by providing a specific URI as value of the head/@profile attribute) that they are applying the DC-HTML conventions and the presence of that attribute value licenses a consumer to apply that interpretation of the data in a document. And, further, under that profile, as I noted for the example of the "DC" prefix, there is no single "meaning" assigned to meta/@name or link/@rel values.

In the XHTML case, the profile-dependent interpretation is made accessible in machine-processable form through the use of GRDDL, more specifically of a GRDDL profile transformation. i.e. a GRDDL processor uses the profile URI to access an XHTML "profile document" which provides a pointer to an XSLT transform which, when applied to an XHTML document using the DC-HTML profile, generates an RDF/XML document representing the appropriate RDF triples.

It may also be worth noting at this point that the profile attribute actually supports not just a single URI as value but a space-separated list of URIs, i.e. within a single document, multiple profiles may be "applicable". And, potentially, those multiple profiles may specify different interpretations of a single @name or @rel value. I think the intent is that in that case all the interpretations should be applied - and in the case that multiple GRDDL profile transformations are provided, the output should be the result of merging the RDF graphs output from each individual transformation.
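
So, for example, a document might declare (the second profile URI here is invented for illustration):

<head profile="http://dublincore.org/documents/2008/08/04/dc-html/ http://example.org/profiles/other/">

and a GRDDL processor would be expected to merge the RDF graphs produced by the transformations associated with each of the two profiles.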

Now then, having laboured the point about the importance of the concept of the profile, I strongly suspect - though I don't have concrete evidence to support my suspicion - that it is not widely used by applications that provide and consume data using the other conventions described in the DC-HTML document.

It is certainly easy to find many providers of document metadata in X/HTML that follow some of the syntactic conventions of DC-HTML but do not include the @profile attribute. This is (currently, at least) the case even for many documents on DCMI's own site. And I suspect only a (small?) minority of applications consuming/processing DC metadata embedded in X/HTML documents do so by applying the DC-HTML GRDDL profile transform in this way. I suspect the majority of DC metadata embedded in X/HTML documents is processed without reference to the GRDDL transform, probably without using the @profile attribute value as a "trigger", possibly without generating RDF triples, and perhaps even without applying the "prefixed name"-to-URI mapping at all - i.e. these applications are "on level 1" in terms of the DC "interoperability levels" document. I suspect there are tools which use meta elements to generate simple property/(literal) value indexes, and do so on the basis of a fixed set of meta/@name values, i.e. they index on the basis that the expected values of the meta/@name attribute are "DC.title", "DC.date" (etc) and those tools would ignore values like "ABC.title", even if the "ABC" prefix was associated (via a link[@rel="schema.ABC"] "namespace declaration") with the URI http://purl.org/dc/elements/1.1/ (or http://purl.org/dc/terms/). But yes, I'm entering the realms of speculation here, and we really need some concrete evidence of how applications process such data.

2. RDFa in XHTML and HTML4

Since that DCMI document was published, the W3C has published the RDFa in XHTML specification, RDFa in XHTML: Syntax and Processing, as a W3C Recommendation. RDFa provides a syntax for embedding RDF triples in an XHTML document using attributes (a combination of pre-existing XHTML attributes and additional RDFa-specific attributes). Unlike the conventions defined by DC-HTML, RDFa supports the representation of any RDF triple, not only triples "about" the document (i.e. with the document URI as subject), and RDFa attributes can be used anywhere in an XHTML document.

Any "document metadata" that could be encoded using the DC-HTML profile could also be represented using RDFa. DCMI has not yet published any guidance on the use of RDFa - not because it doesn't consider RDFa important, I hasten to add, but only because of a lack of available effort. Having said that, (it seems to me) it isn't an area where DCMI would need a new "recommendation", but it may be useful to have some primer-/tutorial-style materials and examples highlighting the use of common constructs used in Dublin Core metadata.

I don't intend to provide a full summary of RDFa, but it is worth noting that, at the syntax level, RDFa introduces the use of a datatype called CURIE which supports the abbreviation of URI references as prefixed names. In XHTML, at least, the prefixes are associated with URIs via XML Namespace declarations. The use of CURIEs in RDFa might be seen as a more generalised, standardised approach to the problem that DC-HTML seeks to address through its own "prefixed name"/"namespace declaration" convention.

It is perhaps worth highlighting one aspect of the RDFa in XHTML processing model here. In RDFa the XHTML link/@rel attribute is treated as supporting both XHTML link types and CURIEs, and any value that matches an entry in the list of link types in the section The rel attribute MUST be treated as if it were a URI within the XHTML vocabulary, and all other values must be CURIEs. So, the XHTML link types are treated as "reserved keywords", if you like, and a @rel attribute value of "next" is mapped to an RDF predicate of http://www.w3.org/1999/xhtml/vocab#next. For the case of XHTML, those "reserved keywords" are defined as part of the XHTML+RDFa document. They are also listed in the "namespace document" http://www.w3.org/1999/xhtml/vocab, which itself is an XHTML+RDFa document (though, N.B., there are other terms "in that namespace" which are not intended for use as link/@rel values). For a @rel value that is neither a member of the list nor a valid CURIE (e.g. rel="foobar" or rel="DC.modified" or rel="schema.DC"), no RDF triple is generated by an RDFa processor. As a consequence, RDFa "co-exists" well with the DC-HTML profile, by which I mean that an RDFa processor should generate no unanticipated triples from DC-HTML data in an XHTML+RDFa document.

Using RDFa in XHTML, then, the two example triples above could be represented as follows:

Example 2: RDFa in XHTML

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN"
    "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/terms/"
      xmlns:ex="http://example.org/terms/"
      xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
      version="XHTML+RDFa 1.0">
  <head>
    <title>Document 001</title>
    <meta property="dc:modified"
          datatype="xsd:date" content="2007-07-22" />
    <link rel="ex:commentsOn"
          href="http://example.org/docs/123" />
  </head>
</html>

And of course document metadata could be embedded elsewhere in the XHTML+RDFa document, e.g. instead of using the meta and link elements, the data above could be represented in the body of the document:

Example 3: RDFa in XHTML (2)

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN"
    "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/terms/"
      xmlns:ex="http://example.org/terms/"
      xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
      version="XHTML+RDFa 1.0">
  <head>
    <title>Document 001</title>
  </head>
  <body>
    <p>
      Last modified on:
      <span property="dc:modified"
            datatype="xsd:date" content="2007-07-22">22 July 2007</span>
    </p>
    <p>
      Comments on:
      <a rel="ex:commentsOn"
          href="http://example.org/docs/123">Document 123</a>
    </p>
  </body>
</html>

These examples do not make use of a head/@profile attribute. According to Appendix C of the RDFa in XHTML specification, the use of @profile is optional: a @profile value of http://www.w3.org/1999/xhtml/vocab may be included to support a GRDDL-based transform, but it is not required by an RDFa processor. (Having said that, looking at the profile document http://www.w3.org/1999/xhtml/vocab, I can't see a reference to a GRDDL profile transformation in that document.)

The initial RDFa in XHTML specification covered the case of XHTML only. But RDFa is intended as an approach to be used with other markup languages too, and recently a working draft HTML+RDFa has been published. Again, this is a draft which is liable to change. This document describes how RDFa can be used in HTML5 (in both the XML and non-XML syntax), but the rules are also intended to be applicable to HTML4 documents interpreted through the HTML5 parsing rules. For the most part, it provides a set of minor changes to the syntax and processing rules specified in the RDFa in XHTML document.

I think (but I'm not sure!) the above example in HTML4 would look like the following, the only differences (for this example) being the change in the empty element syntax and the use of a different DTD for validation:

Example 4: RDFa in HTML4

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01+RDFa 1.0//EN"
    "http://www.w3.org/MarkUp/DTD/html401-rdfa-1.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/terms/"
      xmlns:ex="http://example.org/terms/"
      xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
      version="HTML+RDFa 1.0">
  <head>
    <title>Document 001</title>
    <meta property="dc:modified"
          datatype="xsd:date" content="2007-07-22" >
    <link rel="ex:commentsOn"
          href="http://example.org/docs/123" >
  </head>
</html>

3. HTML5

The document HTML5 differences from HTML4 offers a summary of the principal differences between HTML4 and HTML5. One general point to note here is that HTML5 is defined as an "abstract language" - it is defined in terms of the HTML Document Object Model - which can be serialised in a format which is compatible with HTML4 and also in an XML format. The "differences" document has little to say on issues specifically related to "document metadata", but it does highlight the removal from the language of some elements and attributes, a topic I'll return to below.

As I mentioned above, the current editor's draft version of HTML5 separates some content out into modules. In the current drafts, three items would seem to be of interest when considering conventions for representing metadata "about" a document:

  • the "Document metadata" section of the core HTML5 specification, which defines the meta and link elements;
  • the microdata specification, recently separated out of the core specification into a module of its own;
  • the draft HTML+RDFa specification, which describes the use of RDFa in HTML5.

I'll discuss each of these sources in turn (though I think there is some interdependency in the first two).

3.1. Document Metadata in HTML5

The "Document metadata" section defines the meta and link elements in HTML5. In terms of evaluating how the DC-HTML conventions might be used within HTML5, the following points seem significant:

  • For the @name attribute of the meta element, the spec defines some values, and it provides for a wiki-based registry of other values (HTML5ED, 4.2.5.2).
  • The @scheme attribute of the meta element has been made obsolete and "must not be used by authors".
  • In the property/value pairs represented by meta elements, the value must not be a URI.
  • For the @rel attribute of the link element, the spec defines some values - strictly speaking, tokens that can occur in the attribute's space-separated list of values - and it provides for a wiki-based registry of other values (HTML5ED, 5.12.3.19).
  • The @profile attribute of the head element has been made obsolete and "must not be used by authors".

On the validation of values for the meta/@name attribute, HTML5 says:

Conformance checkers must use the information given on the WHATWG Wiki MetaExtensions page to establish if a value is allowed or not: values defined in this specification or marked as "proposed" or "ratified" must be accepted, whereas values marked as "discontinued" or not listed in either this specification or on the aforementioned page must be rejected as invalid. Conformance checkers may cache this information (e.g. for performance reasons or to avoid the use of unreliable network connectivity).

When an author uses a new metadata name not defined by either this specification or the Wiki page, conformance checkers should offer to add the value to the Wiki, with the details described above, with the "proposed" status.

So I think this means that, in order to pass this conformance check as valid, all values of the meta/@name attribute must be registered. The registry currently contains an entry (with status "proposed") for all names beginning "DC.", though I'm not sure whether the registration process is really intended to support such "wildcard" entries. The entry does not indicate whether the intent is that the names correspond to properties of the Dublin Core Metadata Element Set (i.e. with URIs beginning http://purl.org/dc/elements/1.1/) or of the DC Terms collection (i.e. with URIs beginning http://purl.org/dc/terms/). Further, as noted above, the current DC-HTML spec does not prescribe a bounded set of @name values; rather it allows for an open-ended set of prefixed name values, not just names referring to the "terms" owned by DCMI. In HTML5, the expectation seems to be that all such values should be registered. So, for example, when DCMI worked with the Library of Congress to make available a set of RDF properties corresponding to the MARC Relator Codes, identified by LoC-owned URIs, I think the expectation would be that, for data using those properties to be encoded, a corresponding set of @name values would need to be registered. Similarly if an implementer group coins a new URI for a property they require, a new @name value would be required.

If the registration process for HTML5 turns out to be relatively "permissive" (which the text above suggests it may be), it may be that this is not an issue, but it does seem to create a new requirement not present in HTML4/XHTML. However, I note that the registration page currently includes a note that suggests a "high bar" for terms to be "Accepted": For the "Status" section to be changed to "Accepted", the proposed keyword must be defined by a W3C specification in the Candidate Recommendation or Recommendation state. If it fails to go through this process, it is "Unendorsed".

Having said that, the microdata specification refers to the possibility that @name values are URIs, and I think that the implication is that such URI values are exempt from the registration requirement (though this does not seem clear from the discussion of registration in the "core" HTML5 spec).

The meta/@scheme attribute, used in DC-HTML to represent datatype URIs for typed literals, is no longer permitted in HTML5. Section 10.2, which offers suggestions for alternatives for some of the features that have been made obsolete, suggests "Use only one scheme per field, or make the scheme declaration part of the value", which I think means either using a different meta/@name value for each potential scheme value (e.g. "date-W3CDTF", "date-someOtherDateFormat") or using some sort of structured string for the @content value with the scheme name embedded (e.g. "2007-07-22|http://purl.org/dc/terms/W3CDTF").
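
Spelled out as markup, those two workarounds might look something like the following - both meta/@name values here are invented for illustration, and neither is a registered name:

<meta name="DC.date-W3CDTF" content="2007-07-22">
<meta name="DC.date" content="2007-07-22|http://purl.org/dc/terms/W3CDTF">

Neither, of course, preserves the datatype URI in the machine-readable way that the DC-HTML @scheme attribute does.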

The section on the registration of meta/@name attribute values includes the paragraph:

Metadata names whose values are to be URLs must not be proposed or accepted. Links must be represented using the link element, not the meta element.

This constraint appears to rule out the use of meta/@name to represent the property in cases where (in RDF terms) the object is a literal URI. (This is different from the case where the object is an RDF URI reference, which in DC-HTML is covered by the use of the link element.) For example, the DCMI dc:identifier and dcterms:identifier properties may be used in this way to provide a URI which identifies the document - that may be the document URI itself, or it may be another URI which identifies the same document.
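
For example, under the current DC-HTML conventions a document might identify itself with something like this (a sketch following the DC-HTML pattern; the content URI is invented):

<meta name="DC.identifier" scheme="DCTERMS.URI"
      content="http://example.org/doc/001" />

and, if I read the HTML5 text correctly, a metadata name used in this way could not be proposed or accepted for registration.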

A similar issue to that above for the registration of meta/@name attribute values arises for the case of the link/@rel attribute, for which HTML5 says:

Conformance checkers must use the information given on the WHATWG Wiki RelExtensions page to establish if a value is allowed or not: values defined in this specification or marked as "proposed" or "ratified" must be accepted when used on the elements for which they apply as described in the "Effect on..." field, whereas values marked as "discontinued" or not listed in either this specification or on the aforementioned page must be rejected as invalid. Conformance checkers may cache this information (e.g. for performance reasons or to avoid the use of unreliable network connectivity).

When an author uses a new type not defined by either this specification or the Wiki page, conformance checkers should offer to add the value to the Wiki, with the details described above, with the "proposed" status.

AFAICT, the registry currently contains no entries related specifically to DC-HTML or the DCMI vocabularies.

As for the case of name, the microdata specification refers to the possibility that @rel values are URIs, and again I think the implication is that such URI values are exempt from the registration requirement (though, again, this does not seem clear from the discussion in the "core" HTML5 spec).

Finally, the head/@profile attribute is no longer available in HTML5, and Section 10.2 says:

When used for declaring which meta terms are used in the document, unnecessary; omit it altogether, and register the names.

When used for triggering specific user agent behaviors: use a link element instead.

I think DC-HTML's use of head/@profile places it into the second of these categories: the profile doesn't "declare" a bounded set of terms, but it specifies how a (potentially "open-ended") set of attribute values are to be interpreted/processed.

Furthermore, the draft HTML+RDFa document proposes the (optional) use of a link/@rel value of "profile", and there is a corresponding entry in the registry for @rel values. This seems to be a mechanism for (re-)introducing the HTML4 concept of the meta data profile, using a different syntactic form i.e. using link/@rel in place of the profile attribute. I'm not clear about the extent to which this has support within the wider HTML5 community. If it was adopted, I imagine the GRDDL specification would also evolve to use this mechanism, but that is guesswork on my part.
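
If that mechanism were adopted then - and this is a sketch based only on my reading of the draft, not on any published example - the DC-HTML profile might be signalled like this:

<link rel="profile"
      href="http://dublincore.org/documents/2008/08/04/dc-html/" />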

Julian Reschke summarises most of these issues related to DC-HTML in a recent message to the public-html mailing list here.

3.2. Microdata

Microdata is a new specification, specific to HTML5. The "latest editor's draft" version is described as "a module that forms part of the HTML5 series of specifications published at the W3C". The content was previously part of the "core" HTML5 specification, but the decision was taken recently to separate it out into its own document.

Microdata offers similar functionality to that offered by RDFa in that it allows for the embedding of data anywhere in an HTML5 document. Like RDFa, microdata is a generalised mechanism, not one tied to any particular set of terms, and also like RDFa, microdata introduces a new set of attributes, to be used in combination with existing HTML5 attributes. The syntactic conventions used in microdata are inspired principally by the conventions used in various microformats.

As for the case of RDFa, my purpose here is not to provide a full description of microdata, but to examine whether and how microdata can express the data that in HTML4/XHTML is expressed using the conventions of the DC-HTML profile.

The model underlying microdata is one of nested lists of name-value pairs:

The microdata model consists of groups of name-value pairs known as items.

Each group is known as an item. Each item can have an item type, a global identifier (if the item type supports global identifiers for its items), and a list of name-value pairs. Each name in the name-value pair is known as a property, and each property has one or more values. Each value is either a string or itself a group of name-value pairs (an item).
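A minimal sketch of that model in markup, using the microdata attributes (the item type and all the property names and values here are invented purely for illustration):

<div itemscope="" itemtype="http://example.org/types/Document">
  <span itemprop="title">Document 001</span>
  <!-- the value of the "author" property is itself an item -->
  <div itemprop="author" itemscope="">
    <span itemprop="name">Jane Doe</span>
  </div>
</div>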

The microdata model is independent of the RDF model, and is not designed to represent the full RDF model. In particular, microdata does not require the use of URIs as identifiers for properties, though it does allow for the use of URIs. Microdata does not offer - as many RDF syntaxes, including RDFa, do - a mechanism for abbreviating property URIs. But the microdata spec does include an algorithm that provides a (partial, I think?) mapping from microdata to a set of RDF triples.

Probably the main feature of RDF that has no correspondence in microdata is literal datatyping - see the discussion by Jeni Tennison here - though there is a distinct element/attribute for date-time values.

Given this constraint, I don't think it is possible to express the first triple of my example above using microdata. If the typed literal is replaced with a plain literal (i.e. the object is "2007-07-22", rather than "2007-07-22"^^xsd:date), then I think the two triples could be encoded (using the XML serialisation) as follows, i.e. the property URIs appear in full as attribute values:

Example 5: Microdata in HTML5

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE HTML>
<html>
  <head>
    <title>Document 001</title>
    <meta name="http://purl.org/dc/terms/modified"
          content="2007-07-22" />
    <link rel="http://example.org/terms/commentsOn"
          href="http://example.org/docs/123" />
  </head>
</html>

As for the case of RDFa, microdata supports the embedding of data in the body of the document, so the triples could (I think!) also be represented as:

Example 6: Microdata in HTML5 (2)

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE HTML>
<html>
  <head>
    <title>Document 001</title>
  </head>
  <body>
    <div itemscope="" itemid="http://example.org/doc/001">
      <p>
        Last modified on:
        <time itemprop="http://purl.org/dc/terms/modified"
              datetime="2007-07-22">22 July 2007</time>
      </p>
      <p>
        Comments on:
        <a itemprop="http://example.org/terms/commentsOn"
           href="http://example.org/docs/123">Document 123</a>
      </p>
    </div>
  </body>
</html>

My understanding is that the itemid attribute is necessary to set the subject of the triple to the URI of the document, but I could be wrong on this point.

Also I think it's worth noting that the microdata-to-RDF algorithm specifies an RDF interpretation for some "core" HTML5 elements and attributes. For example:

  • the head/title element is mapped to a triple with the predicate http://purl.org/dc/terms/title and the element content as literal object
  • meta elements with name and content attributes are mapped to triples where the predicate is either (if the name attribute value is a URI) the value of the name attribute, or the concatenation of the string "http://www.w3.org/1999/xhtml/vocab#" and the value of the name attribute. So, e.g., a name attribute value of "DC.modified" would generate a predicate http://www.w3.org/1999/xhtml/vocab#DC.modified.
  • A similar rule applies for the link element. So, e.g., a rel attribute value of "EX.commentsOn" would generate a predicate http://www.w3.org/1999/xhtml/vocab#EX.commentsOn and a rel attribute value of "schema.DC" would generate a predicate http://www.w3.org/1999/xhtml/vocab#schema.DC

As far as I can tell, these are rules to be applied to any HTML5 document - there is no "flag" to say that they apply to document A but not to document B - so they would need to be taken into consideration in any DCMI convention for using meta/@name and link/@rel attributes in HTML5. For example, given the following HTML5 document (and leaving aside for a moment the registration issue, and assuming that "DC.modified", "EX.commentsOn", "schema.DC" and "schema.EX" are registered values for @name and @rel)

Example 7: Microdata in HTML5 (3)

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE HTML>
<html>
  <head>
    <title>Document 001</title>
    <link rel="schema.DC"
          href="http://purl.org/dc/terms/" />
    <link rel="schema.EX"
          href="http://example.org/terms/" />
    <meta name="DC.modified"
          content="2007-07-22" />
    <link rel="EX.commentsOn"
          href="http://example.org/docs/123" />
  </head>
</html>

the microdata-to-RDF algorithm would generate the following five RDF triples:

@prefix dc: <http://purl.org/dc/terms/> .
@prefix xhv: <http://www.w3.org/1999/xhtml/vocab#> .
<> dc:title "Document 001" ;
   xhv:schema.DC <http://purl.org/dc/terms/> ;
   xhv:schema.EX <http://example.org/terms/> ;
   xhv:DC.modified "2007-07-22" ;
   xhv:EX.commentsOn <http://example.org/docs/123> .

It's probably worth emphasising that although the URIs generated here are not DCMI-owned URIs, it would be quite possible to assert an "equivalence" between a property with a URI beginning http://www.w3.org/1999/xhtml/vocab# and a corresponding DCMI-owned URI, which would imply a second triple using that DCMI-owned URI (e.g. <http://www.w3.org/1999/xhtml/vocab#DC.modified> owl:equivalentProperty <http://purl.org/dc/terms/modified>) - though, AFAICT, no such equivalence is suggested at the moment.
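Spelled out in Turtle (the equivalence assertion below is hypothetical - as I say, nobody currently makes it):

@prefix dc:  <http://purl.org/dc/terms/> .
@prefix xhv: <http://www.w3.org/1999/xhtml/vocab#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .

# hypothetical equivalence assertion
xhv:DC.modified owl:equivalentProperty dc:modified .

# given the triple generated by the microdata-to-RDF algorithm
<> xhv:DC.modified "2007-07-22" .

# an OWL-aware consumer could then infer
<> dc:modified "2007-07-22" .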

3.3. RDFa in HTML5

I noted above that, although the initial RDFa syntax specification had focused on the case of XHTML, a recent draft sought to extend this by describing the use of RDFa in HTML, including the case of HTML5.

As I already discussed, using RDFa, it is quite possible to represent any data that could be represented in HTML4/XHTML using the DC-HTML profile. So, using RDFa in HTML5, my two example triples could be represented (using the XML serialisation) as:

Example 8: RDFa in HTML5

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE HTML>
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/terms/"
      xmlns:ex="http://example.org/terms/"
      xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
      version="XHTML+RDFa 1.0">
  <head>
    <title>Document 001</title>
    <meta property="dc:modified"
          datatype="xsd:date" content="2007-07-22" />
    <link rel="ex:commentsOn"
          href="http://example.org/docs/123" />
  </head>
</html>

Note that the RDFa in HTML draft requires the use of the html/@version attribute which, in the current draft, is obsoleted by the "core" HTML5 specification. As noted above, RDFa in HTML also proposes the (optional) use of a link/@rel value of "profile".

In the initial discussion of RDFa above, I noted the existence of a list of "reserved keyword" values for the link/@rel attribute in XHTML, to which an RDFa processor applies a URI prefix to generate RDF predicate URIs. In HTML5, that list of reserved values is defined not by the HTML5 specification but by the WHATWG registry of @rel values. So there may be cases where a value used in a link/@rel attribute in an HTML4/XHTML document does not result in an RDFa processor generating a triple (because that value is not included in the HTML4/XHTML reserved list), but the same value in a link/@rel attribute in an HTML5 document does cause an RDFa processor to generate a triple (because that value is included in the HTML5 @rel registry). My understanding is that the RDFa/CURIE processing model is designed to cope with such host-language-specific variations, but it is something document creators will need to be aware of.

4. Some concluding thoughts

DCMI's specifications for embedding metadata in X/HTML have focused on "document metadata", data describing the document. The current DCMI Recommendation for encoding DC metadata in HTML was created in 2008, and is based on the DCMI Abstract Model and on RDF. The syntactic conventions are largely those developed by DCMI in the late 1990s. The current document was developed with reference to HTML4 and XHTML only, and it does not take into consideration the changes introduced by HTML5. The conventions described are not limited to the use of a fixed set of DCMI-owned properties, but support the representation of document metadata using any RDF properties.

Looking at the HTML5 drafts raises various issues:

  1. HTML5 removes some of the syntactic components used by the DC-HTML profile in HTML4/XHTML, namely the scheme and profile attributes.
  2. HTML5 introduces a requirement for the registration of meta/@name and link/@rel attribute values; the current DC-HTML specification makes the assumption that an "open-ended" set of values is available for those attributes.
  3. The status of the concept of the meta data profile in HTML5 seems unclear. On the one hand, the profile attribute has been removed, but the proposed registration of a link/@rel value suggests that the profile approach is still available in HTML5.
  4. The microdata specification provides "global" RDF interpretations for meta and link elements in HTML5.
  5. Microdata offers a mechanism for embedding data in HTML5 documents, and that mechanism can be used for embedding some RDF data in HTML5 documents. Microdata has some limitations (the absence of support for typed literals), but microdata could be used to express a large subset of the data that (in HTML4/XHTML) is expressed using the DC-HTML profile. The microdata specification is still a working draft and liable to change.
  6. RDFa also offers a mechanism for embedding RDF data in HTML5 documents. RDFa is designed to support the RDF model, and RDFa could be used in HTML5 to express the same data that (in HTML4/XHTML) is expressed using the DC-HTML profile. The specification for using RDFa in HTML5 is still a working draft and liable to change.

There seems to be a possible tension between HTML5's requirement for the registration of meta/@name and link/@rel values and the assumption in DC-HTML that an "open-ended" set of values is available. Also, the microdata specification's mapping of registered meta/@name and link/@rel values to URIs by simple concatenation differs from DC-HTML's use of a prefix-to-URI mapping. However, as I suggested above, it seems quite probable that at least some applications using Dublin Core metadata in HTML do operate on the basis of a small set of invariant meta/@name and link/@rel values corresponding to (a subset of) the DCMI-owned properties - i.e. they use a subset of the DC-HTML conventions to represent document metadata conforming to what DCMI would call a single "description set profile". With the addition of assertions of equivalence between properties (see above), it would then be possible to represent data conforming to the version of "Simple Dublin Core" that I described a little while ago - i.e. using only the fifteen properties of the Dublin Core Metadata Element Set with plain literal values - using HTML5 meta/@name (with a registered set of 15 values) and meta/@content attributes.
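A sketch of what that might look like, assuming - hypothetically - that the fifteen "DC.*" names were registered as meta/@name values:

<!DOCTYPE HTML>
<html>
  <head>
    <title>Document 001</title>
    <!-- assumes "DC.creator", "DC.date" etc. were registered @name values -->
    <meta name="DC.creator" content="Jane Doe" />
    <meta name="DC.date" content="2007-07-22" />
    <meta name="DC.subject" content="document metadata" />
  </head>
</html>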

Both microdata and RDFa, on the other hand, are extensions to HTML5 that are designed to provide generalised, "vocabulary-agnostic" conventions for embedding data in HTML5 documents. Using both microdata and RDFa, data may include, but is not limited to, "document metadata". RDFa is designed to represent RDF data; and microdata can be used to represent some RDF data (it lacks support for typed literals). RDFa includes abbreviation mechanisms for URIs that are broadly similar to those used in the DC-HTML profile (in the sense that they both use a sort of "prefixed name" to abbreviate URIs); microdata does not provide such a mechanism and (I think) would require the use of URIs in full.

Both microdata and RDFa address the problem that DCMI seeks to address via the DC-HTML profile, in the context of a more generalised mechanism for embedding data in HTML5 documents. Both microdata and RDFa could be used in HTML5 to represent document metadata that in HTML4/XHTML is represented using the DC-HTML profile (partially, for the case of microdata because of the absence of datatyping support). Currently, the documents describing RDFa in HTML5 and microdata are both still in draft and the subjects of vigorous debate, and it remains to be seen how/whether they progress through W3C process, and how implementers respond. But it seems to me the use of either would offer an opportunity for DCMI to move away from the maintenance of a DCMI-defined convention and to align with more generalised approaches.
