February 01, 2010

HTML5, document metadata and Dublin Core

I recently received a query about the encoding of Dublin Core metadata in HTML5, the revision of the HTML language being developed jointly by the W3C HTML Working Group and the Web Hypertext Application Technology Working Group (WHATWG). It has also been the topic of some recent discussion on the dc-general mailing list. While I've been aware of some of the discussions around metadata features in HTML5, until now I haven't looked in much detail at the current drafts.

There are various versions of the specification(s), all drafts under development and still changing (at times, quite quickly):

  • The WHATWG has a working draft titled HTML5 (including next generation additions still in development). This document is constantly changing; the content at the time I'm writing is dated 30 January 2010, but will no doubt have changed by the time you read this. Of this spec, the WHATWG says: "This draft is a superset of the HTML5 work that is being published at the W3C: everything that is in HTML5 is also in the WHATWG HTML spec. Some new experimental features are being added to the WHATWG HTML draft, to continue developing extensions to the language while the W3C work through their milestones for HTML5. In other words, the WHATWG HTML specification is the next generation of the language, while HTML5 is a more limited subset with a narrower scope."
  • The W3C has a "latest public version" of HTML 5: A vocabulary and associated APIs for HTML and XHTML, currently the version dated 25 August 2009. (The content of that "date-stamped" version should continue to be available.)
  • The W3C always has a "latest editor's draft" of that document, which at the time of writing is dated 30 January 2010, but also continues to change at frequent intervals. Note that, compared to the previous "latest public version", this draft incorporates some element of restructuring of the content, with some content separated out into "modules".

I can't emphasise too strongly that HTML5 is still a draft and liable to change; as the spec itself says in no uncertain terms: "Implementors should be aware that this specification is not stable. Implementors who are not taking part in the discussions are likely to find the specification changing out from under them in incompatible ways."

For the purposes of this discussion I've looked primarily at the third document above, the W3C latest editor's draft. This post is really an attempt to raise some initial questions (and probably to expose my own confusion) rather than to provide any definitive answers. It is based on my (incomplete and very probably imperfect) reading of the drafts as they stand at this point in time - and it represents a personal view only, not a DCMI view.

1. Dublin Core metadata in HTML4 and XHTML

(This section covers DCMI's current recommendations for embedding metadata in X/HTML, so feel free to skip it if you are already familiar with this.)

To date, DCMI's specifications for embedding metadata in X/HTML documents have concerned themselves with representing metadata "about" the document as a whole, "document metadata", if you like. And in HTML4/XHTML, the principal source of document metadata is the head element (HTML4, 7.4). Within the head element:

  • the meta element (HTML4, 7.4.4.2) provides for the representation of "property name" (the value of the @name attribute)/"property value" (the value of the @content attribute) pairs which apply to the document. It also offers the ability to supplement the value with the name of a "scheme" (the value of the @scheme attribute) which is used "to interpret the property's value".
  • the link element (HTML4, 12.3) provides a means of representing a relationship between the document and another resource. It also supports, through attributes like @hreflang and @title, the provision of some metadata "about" that second resource.

(I should note here that the above text uses the terminology of the HTML4 specification, not of RDF or the DCMI Abstract Model (DCAM).)

The current DCMI recommendation for embedding document metadata in X/HTML is Expressing Dublin Core metadata using HTML/XHTML meta and link elements, which from here on I'll just refer to as "DC-HTML". Although the current recommendation is dated 2008, that version is only a minor "modernisation" of conventions that DCMI has recommended since the late 1990s. The specification describes a convention for representing what the DCAM calls a description (of the document) - a set of RDF triples - using the HTML meta and link elements and their attributes (and conversely, for interpreting a sequence of HTML meta and link elements as a set of RDF triples/DCAM description set). Contrary to some misconceptions, the convention is not limited to the use of DCMI-owned "terms"; indeed it does not assume the use of any DCMI-owned terms at all.

Consider the example of the following two RDF triples:

@prefix dc: <http://purl.org/dc/terms/> .
@prefix ex: <http://example.org/terms/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
<> dc:modified "2007-07-22"^^xsd:date ;
   ex:commentsOn <http://example.org/docs/123> .

Aside: from the perspective of the DCMI Abstract Model, these would be equivalent to the following description set, expressed using the DC-Text syntax, but for the rest of this post, to keep things simple, I'll just refer to the RDF triples.

@prefix dc: <http://purl.org/dc/terms/> .
@prefix ex: <http://example.org/terms/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
DescriptionSet (
  Description (
    Statement (
      PropertyURI ( dc:modified )
      LiteralValueString ( "2007-07-22"
        SyntaxEncodingSchemeURI ( xsd:date )
      )
    )
    Statement (
      PropertyURI ( ex:commentsOn )
      ValueURI ( <http://example.org/docs/123> )
    )
  )
)

Following the conventions of DC-HTML, those triples are represented in XHTML as:

Example 1: DC-HTML profile in XHTML

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head profile="http://dublincore.org/documents/2008/08/04/dc-html/">
    <title>Document 001</title>
    <link rel="schema.DC"
          href="http://purl.org/dc/terms/" />
    <link rel="schema.EX"
          href="http://example.org/terms/" />
    <link rel="schema.XSD"
          href="http://www.w3.org/2001/XMLSchema#" />
    <meta name="DC.modified"
          scheme="XSD.date" content="2007-07-22" />
    <link rel="EX.commentsOn"
          href="http://example.org/docs/123" />
  </head>
</html>

A few points to highlight:

  • The example is provided in XHTML but essentially the same syntax would be used in HTML4.
  • The triple with literal object is represented using a meta element.
  • The triple with the URI as object is represented using a link element.
  • The predicate (property URI) may be any URI; the DC-HTML convention is not limited to DCMI-owned URIs, i.e. DC-HTML seeks to support the sort of URI-based vocabulary extensibility provided by RDF. There is no "registry" of a bounded set of terms to be used in metadata represented using DC-HTML; or, rather, "the Web is the registry". All an implementer requires to introduce a new property is the authority to assign a URI in some URI space they own (or in which they have been delegated rights to assign URIs).
  • A convention for representing property URIs and datatype URIs as "prefixed names" is used, and in this example three other link elements (with @rel="schema.{prefix}") are introduced to act as "namespace declarations" to support the convention. When a document using DC-HTML is processed, no RDF triples are generated for those link elements. (Aside: I have occasionally wondered whether this is abusing the rel attribute, which is intended to capture the type of relationship between the document and the target resource, i.e. it is using a mechanism which does carry semantics for an essentially syntactic end (the abbreviation of URIs). But I'll suspend those misgivings for now...)
  • The prefixes used in these "prefixed names" are arbitrary, and DC-HTML does not specify the use/interpretation of a fixed set of @name or @rel attribute values. In the example above, I chose to associate the "DC" prefix with the "namespace URI" http://purl.org/dc/terms/, though "traditionally" it has been more commonly associated with the "namespace URI" http://purl.org/dc/elements/1.1/. Another document creator might associate the same prefix with a quite different URI again.
  • The DC-HTML profile generates triples only for those meta and link elements where the values of the @name and @rel attributes contain a prefixed name with a prefix for which there is a corresponding "namespace declaration".
  • The datatype of the typed literal is represented by the value of the meta/@scheme attribute.
  • There is no support for RDF blank nodes.
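
To make the points above concrete, here is a minimal Python sketch of the DC-HTML interpretation as I have described it. It is an illustration only, not the GRDDL transform that DCMI provides: the class and function names are my own invention, prefixes are matched case-sensitively for simplicity, each @rel value is treated as a single token, and the head/@profile check is left out.

```python
from html.parser import HTMLParser


class DCHTMLCollector(HTMLParser):
    """Collect schema.* declarations and candidate meta/link statements."""

    def __init__(self):
        super().__init__()
        self.namespaces = {}  # prefix -> "namespace URI"
        self.candidates = []  # (prefixed name, value, kind, scheme)

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "link" and a.get("rel") and a.get("href"):
            if a["rel"].startswith("schema."):
                # A "namespace declaration": recorded, but yields no triple.
                self.namespaces[a["rel"][len("schema."):]] = a["href"]
            else:
                self.candidates.append((a["rel"], a["href"], "uri", None))
        elif tag == "meta" and a.get("name") and a.get("content") is not None:
            self.candidates.append(
                (a["name"], a["content"], "literal", a.get("scheme")))


def expand(prefixed_name, namespaces):
    """Map "PREFIX.local" to a URI, or None if the prefix is undeclared."""
    prefix, dot, local = prefixed_name.partition(".")
    if dot and prefix in namespaces:
        return namespaces[prefix] + local
    return None


def dc_html_triples(markup, doc_uri=""):
    """Return (subject, predicate, object) tuples for one document."""
    collector = DCHTMLCollector()
    collector.feed(markup)
    collector.close()
    triples = []
    for name, value, kind, scheme in collector.candidates:
        predicate = expand(name, collector.namespaces)
        if predicate is None:
            continue  # no matching declaration, so no triple
        datatype = expand(scheme, collector.namespaces) if scheme else None
        triples.append((doc_uri, predicate, (kind, value, datatype)))
    return triples
```

Declarations are resolved only after the whole document has been read, so a schema.* link may appear after the meta or link element that relies on it. A meta element such as name="keywords", which has no declared prefix, produces nothing.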

For the purposes of this discussion, perhaps the main point to make is that this use/interpretation of meta and link elements is specific to DC-HTML, not a general interpretation defined by the HTML4 specification. The mapping of prefixed names to URIs using link[@rel="schema.ABC"] "namespace declarations" is a DCMI convention, not part of X/HTML. And this is made possible through the use of a feature of HTML4 and XHTML called a "meta data profile": the document creator signals (by providing a specific URI as value of the head/@profile attribute) that they are applying the DC-HTML conventions and the presence of that attribute value licenses a consumer to apply that interpretation of the data in a document. And, further, under that profile, as I noted for the example of the "DC" prefix, there is no single "meaning" assigned to meta/@name or link/@rel values.

In the XHTML case, the profile-dependent interpretation is made accessible in machine-processable form through the use of GRDDL, more specifically of a GRDDL profile transformation. i.e. a GRDDL processor uses the profile URI to access an XHTML "profile document" which provides a pointer to an XSLT transform which, when applied to an XHTML document using the DC-HTML profile, generates an RDF/XML document representing the appropriate RDF triples.

It may also be worth noting at this point that the profile attribute actually supports not just a single URI as value but a space-separated list of URIs i.e. within a single document, multiple profiles may be "applicable". And, potentially, those multiple profiles may specify different interpretations of a single @name or @rel value. I think the intent is that in that case all the interpretations should be applied - and in the case that multiple GRDDL profile transformations are provided, the output should be the result of merging the RDF graphs output from each individual transformation.

Now then, having laboured the point about the importance of the concept of the profile, I strongly suspect - though I don't have concrete evidence to support my suspicion - that it is not widely used by applications that provide and consume data using the other conventions described in the DC-HTML document.

It is certainly easy to find many providers of document metadata in X/HTML that follow some of the syntactic conventions of DC-HTML but do not include the @profile attribute. This is (currently, at least) the case even for many documents on DCMI's own site. And I suspect only a (small?) minority of applications consuming/processing DC metadata embedded in X/HTML documents do so by applying the DC-HTML GRDDL profile transform in this way. I suspect the majority of DC metadata embedded in X/HTML documents is processed without reference to the GRDDL transform, probably without using the @profile attribute value as a "trigger", possibly without generating RDF triples, and perhaps even without applying the "prefixed name"-to-URI mapping at all - i.e. these applications are "on level 1" in terms of the DC "interoperability levels" document. I suspect there are tools which use meta elements to generate simple property/(literal) value indexes, and do so on the basis of a fixed set of meta/@name values, i.e. they index on the basis that the expected values of the meta/@name attribute are "DC.title", "DC.date" (etc) and those tools would ignore values like "ABC.title", even if the "ABC" prefix was associated (via a link[@rel="schema.ABC"] "namespace declaration") with the URI http://purl.org/dc/elements/1.1/ (or http://purl.org/dc/terms/). But yes, I'm entering the realms of speculation here, and we really need some concrete evidence of how applications process such data.

2. RDFa in XHTML and HTML4

Since that DCMI document was published, the W3C has published the RDFa in XHTML specification, RDFa in XHTML: Syntax and Processing, as a W3C Recommendation. RDFa provides a syntax for embedding RDF triples in an XHTML document using attributes (a combination of pre-existing XHTML attributes and additional RDFa-specific attributes). Unlike the conventions defined by DC-HTML, RDFa supports the representation of any RDF triple, not only triples "about" the document (i.e. with the document URI as subject), and RDFa attributes can be used anywhere in an XHTML document.

Any "document metadata" that could be encoded using the DC-HTML profile could also be represented using RDFa. DCMI has not yet published any guidance on the use of RDFa - not because it doesn't consider RDFa important, I hasten to add, but only because of a lack of available effort. Having said that, (it seems to me) it isn't an area where DCMI would need a new "recommendation", but it may be useful to have some primer-/tutorial-style materials and examples highlighting the use of common constructs used in Dublin Core metadata.

I don't intend to provide a full summary of RDFa, but it is worth noting that, at the syntax level, RDFa introduces the use of a datatype called CURIE which supports the abbreviation of URI references as prefixed names. In XHTML, at least, the prefixes are associated with URIs via XML Namespace declarations. The use of CURIEs in RDFa might be seen as a more generalised, standardised approach to the problem that DC-HTML seeks to address through its own "prefixed name"/"namespace declaration" convention.

It is perhaps worth highlighting one aspect of the RDFa in XHTML processing model here. In RDFa the XHTML link/@rel attribute is treated as supporting both XHTML link types and CURIEs, and any value that matches an entry in the list of link types in the section The rel attribute, MUST be treated as if it was a URI within the XHTML vocabulary, and all other values must be CURIEs. So, the XHTML link types are treated as "reserved keywords", if you like, and a @rel attribute value of "next" is mapped to an RDF predicate of http://www.w3.org/1999/xhtml/vocab#next. For the case of XHTML, those "reserved keywords" are defined as part of the XHTML+RDFa document. They are also listed in the "namespace document" http://www.w3.org/1999/xhtml/vocab, which itself is an XHTML+RDFa document (though, N.B., there are other terms "in that namespace" which are not intended for use as link/@rel values). For a @rel value that is neither a member of the list nor a valid CURIE (e.g. rel="foobar" or rel="DC.modified" or rel="schema.DC"), no RDF triple is generated by an RDFa processor. As a consequence, RDFa "co-exists" well with the DC-HTML profile, by which I mean that an RDFa processor should generate no unanticipated triples from DC-HTML data in an XHTML+RDFa document.
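
The @rel handling just described can be sketched in a few lines of Python. This is my own simplification rather than the processing model from the specification: the function name is made up, the set of reserved link types is a small illustrative subset of the list in the spec, and matching is exact and case-sensitive.

```python
XHV = "http://www.w3.org/1999/xhtml/vocab#"

# Illustrative subset only; the full list is given in the RDFa in XHTML
# specification, in the section on the rel attribute.
RESERVED_LINK_TYPES = {"alternate", "next", "prev", "stylesheet"}


def rel_to_predicate(rel_value, prefix_map):
    """Return the predicate URI for one @rel token, or None for no triple."""
    if rel_value in RESERVED_LINK_TYPES:
        # Reserved keywords are treated as terms in the XHTML vocabulary.
        return XHV + rel_value
    prefix, colon, reference = rel_value.partition(":")
    if colon and prefix in prefix_map:
        # A CURIE whose prefix has been declared (via xmlns:prefix in XHTML).
        return prefix_map[prefix] + reference
    # Neither a reserved keyword nor a usable CURIE: nothing is generated.
    return None


prefixes = {"ex": "http://example.org/terms/"}
for token in ("next", "ex:commentsOn", "DC.modified", "schema.DC", "foobar"):
    print(token, "->", rel_to_predicate(token, prefixes))
```

The last three tokens fall through to the final return, which is why DC-HTML style values sit harmlessly alongside RDFa in the same document.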

Using RDFa in XHTML, then, the two example triples above could be represented as follows:

Example 2: RDFa in XHTML

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN"
    "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/terms/"
      xmlns:ex="http://example.org/terms/"
      xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
      version="XHTML+RDFa 1.0">
  <head>
    <title>Document 001</title>
    <meta property="dc:modified"
          datatype="xsd:date" content="2007-07-22" />
    <link rel="ex:commentsOn"
          href="http://example.org/docs/123" />
  </head>
</html>

And of course document metadata could be embedded elsewhere in the XHTML+RDFa document, e.g. instead of using the meta and link elements, the data above could be represented in the body of the document:

Example 3: RDFa in XHTML (2)

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN"
    "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/terms/"
      xmlns:ex="http://example.org/terms/"
      xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
      version="XHTML+RDFa 1.0">
  <head>
    <title>Document 001</title>
  </head>
  <body>
    <p>
      Last modified on:
      <span property="dc:modified"
            datatype="xsd:date" content="2007-07-22">22 July 2007</span>
    </p>
    <p>
      Comments on:
      <a rel="ex:commentsOn"
          href="http://example.org/docs/123">Document 123</a>
    </p>
  </body>
</html>

These examples do not make use of a head/@profile attribute. According to Appendix C of the RDFa in XHTML specification, the use of @profile is optional: a @profile value of http://www.w3.org/1999/xhtml/vocab may be included to support a GRDDL-based transform, but it is not required by an RDFa processor. (Having said that, looking at the profile document http://www.w3.org/1999/xhtml/vocab, I can't see a reference to a GRDDL profile transformation in that document.)

The initial RDFa in XHTML specification covered the case of XHTML only. But RDFa is intended as an approach to be used with other markup languages too, and recently a working draft HTML+RDFa has been published. Again, this is a draft which is liable to change. This document describes how RDFa can be used in HTML5 (in both the XML and non-XML syntax), but the rules are also intended to be applicable to HTML4 documents interpreted through the HTML5 parsing rules. For the most part, it provides a set of minor changes to the syntax and processing rules specified in the RDFa in XHTML document.

I think (but I'm not sure!) the above example in HTML4 would look like the following, the only differences (for this example) being the change in the empty element syntax and the use of a different DTD for validation:

Example 4: RDFa in HTML4

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01+RDFa 1.0//EN"
    "http://www.w3.org/MarkUp/DTD/html401-rdfa-1.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/terms/"
      xmlns:ex="http://example.org/terms/"
      xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
      version="HTML+RDFa 1.0">
  <head>
    <title>Document 001</title>
    <meta property="dc:modified"
          datatype="xsd:date" content="2007-07-22" >
    <link rel="ex:commentsOn"
          href="http://example.org/docs/123" >
  </head>
</html>

3. HTML5

The document HTML5 differences from HTML4 offers a summary of the principal differences between HTML4 and HTML5. One general point to note here is that HTML5 is defined as an "abstract language" - it is defined in terms of the HTML Document Object Model - which can be serialised in a format which is compatible with HTML4 and also in an XML format. The "differences" document has little to say on issues specifically related to "document metadata", but it does highlight the removal from the language of some elements and attributes, a topic I'll return to below.

As I mentioned above, the current editor's draft version of HTML5 separates some content out into modules. In the current drafts, three items would seem to be of interest when considering conventions for representing metadata "about" a document:

I'll discuss each of these sources in turn (though I think there is some interdependency in the first two).

3.1. Document Metadata in HTML5

The "Document metadata" section defines the meta and link elements in HTML5. In terms of evaluating how the DC-HTML conventions might be used within HTML5, the following points seem significant:

  • For the @name attribute of the meta element, the spec defines some values, and it provides for a wiki-based registry of other values (HTML5ED, 4.2.5.2).
  • The @scheme attribute of the meta element has been made obsolete and "must not be used by authors".
  • In the property/value pairs represented by meta elements, the value must not be a URI.
  • For the @rel attribute of the link element, the spec defines some values (strictly speaking, tokens that can occur in the attribute value), and it provides for a wiki-based registry of other values (HTML5ED, 5.12.3.19).
  • The @profile attribute of the head element has been made obsolete and "must not be used by authors".

On the validation of values for the meta/@name attribute, HTML5 says:

Conformance checkers must use the information given on the WHATWG Wiki MetaExtensions page to establish if a value is allowed or not: values defined in this specification or marked as "proposed" or "ratified" must be accepted, whereas values marked as "discontinued" or not listed in either this specification or on the aforementioned page must be rejected as invalid. Conformance checkers may cache this information (e.g. for performance reasons or to avoid the use of unreliable network connectivity).

When an author uses a new metadata name not defined by either this specification or the Wiki page, conformance checkers should offer to add the value to the Wiki, with the details described above, with the "proposed" status.

So I think this means that, in order to pass this conformance check as valid, all values of the meta/@name attribute must be registered. The registry currently contains an entry (with status "proposed") for all names beginning "DC.", though I'm not sure whether the registration process is really intended to support such "wildcard" entries. The entry does not indicate whether the intent is that the names correspond to properties of the Dublin Core Metadata Element Set (i.e. with URIs beginning http://purl.org/dc/elements/1.1/) or of the DC Terms collection (i.e. with URIs beginning http://purl.org/dc/terms/). Further, as noted above, the current DC-HTML spec does not prescribe a bounded set of @name values; rather it allows for an open-ended set of prefixed name values, not just names referring to the "terms" owned by DCMI. In HTML5, the expectation seems to be that all such values should be registered. So, for example, when DCMI worked with the Library of Congress to make available a set of RDF properties corresponding to the MARC Relator Codes, identified by LoC-owned URIs, I think the expectation would be that, for data using those properties to be encoded, a corresponding set of @name values would need to be registered. Similarly if an implementer group coins a new URI for a property they require, a new @name value would be required.
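
As I read it, the conformance rule quoted above amounts to something like the following Python sketch. The function name is hypothetical, and both the set of spec-defined names and the registry contents are passed in as parameters because I am not reproducing the real lists here; the sample values in the usage lines are placeholders, not a copy of the wiki page.

```python
ACCEPTED_STATUSES = {"proposed", "ratified"}


def is_allowed_meta_name(name, spec_defined, registry):
    """Apply the quoted rule to a single meta/@name value.

    spec_defined: set of names defined in the HTML5 spec itself.
    registry: dict mapping registered names to their wiki status.
    """
    if name in spec_defined:
        return True
    # "discontinued" entries and unlisted names are both rejected;
    # registry.get() returns None for a name that is not listed.
    return registry.get(name) in ACCEPTED_STATUSES


# Placeholder data for illustration only.
sample_spec = {"description"}
sample_registry = {"DC.title": "proposed", "old-name": "discontinued"}
for candidate in ("description", "DC.title", "old-name", "ABC.title"):
    print(candidate, is_allowed_meta_name(candidate, sample_spec, sample_registry))
```

Under this reading a name like "ABC.title" is invalid however the "ABC" prefix is declared in the document, which is the crux of the mismatch with DC-HTML's open-ended prefixed names.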

If the registration process for HTML5 turns out to be relatively "permissive" (which the text above suggests it may be), it may be that this is not an issue, but it does seem to create a new requirement not present in HTML4/XHTML. However, I note that the registration page currently includes a note that suggests a "high bar" for terms to be "Accepted": For the "Status" section to be changed to "Accepted", the proposed keyword must be defined by a W3C specification in the Candidate Recommendation or Recommendation state. If it fails to go through this process, it is "Unendorsed".

Having said that, the microdata specification refers to the possibility that @name values are URIs, and I think that the implication is that such URI values are exempt from the registration requirement (though this does not seem clear from the discussion of registration in the "core" HTML5 spec).

The meta/@scheme attribute, used in DC-HTML to represent datatype URIs for typed literals, is no longer permitted in HTML5. Section 10.2, which offers suggestions for alternatives for some of the features that have been made obsolete, suggests: "Use only one scheme per field, or make the scheme declaration part of the value." I think this is suggesting either using a different meta/@name value for each potential scheme value (e.g. "date-W3CDTF", "date-someOtherDateFormat") or using some sort of structured string for the @content value with the scheme name embedded (e.g. "2007-07-22|http://purl.org/dc/terms/W3CDTF").
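
To show what the second option would ask of a consumer, here is a small Python sketch that splits such a structured @content string. The "|" separator and the function name are purely my own assumptions for illustration; nothing in the HTML5 draft defines a separator.

```python
def split_content(content, separator="|"):
    """Split "value|schemeURI" into (value, scheme); scheme may be None."""
    value, found, scheme = content.partition(separator)
    return (value, scheme if found else None)


print(split_content("2007-07-22|http://purl.org/dc/terms/W3CDTF"))
print(split_content("2007-07-22"))
```

The obvious weakness is that every producer and consumer has to agree on the separator out of band, and a literal value that happens to contain it would be misread.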

The section on the registration of meta/@name attribute values includes the paragraph:

Metadata names whose values are to be URLs must not be proposed or accepted. Links must be represented using the link element, not the meta element

This constraint appears to rule out the use of meta/@name to represent the property in cases where (in RDF terms) the object is a literal URI. (This is different from the case where the object is an RDF URI reference, which in DC-HTML is covered by the use of the link element.) For example, the DCMI dc:identifier and dcterms:identifier properties may be used in this way to provide a URI which identifies the document - that may be the document URI itself, or it may be another URI which identifies the same document.

A similar issue to that above for the registration of meta/@name attribute values arises for the case of the link/@rel attribute, for which HTML5 says:

Conformance checkers must use the information given on the WHATWG Wiki RelExtensions page to establish if a value is allowed or not: values defined in this specification or marked as "proposed" or "ratified" must be accepted when used on the elements for which they apply as described in the "Effect on..." field, whereas values marked as "discontinued" or not listed in either this specification or on the aforementioned page must be rejected as invalid. Conformance checkers may cache this information (e.g. for performance reasons or to avoid the use of unreliable network connectivity).

When an author uses a new type not defined by either this specification or the Wiki page, conformance checkers should offer to add the value to the Wiki, with the details described above, with the "proposed" status.

AFAICT, the registry currently contains no entries related specifically to DC-HTML or the DCMI vocabularies.

As for the case of name, the microdata specification refers to the possibility that @rel values are URIs, and again I think the implication is that such URI values are exempt from the registration requirement (though, again, this does not seem clear from the discussion in the "core" HTML5 spec).

Finally, the head/@profile attribute is no longer available in HTML5, and Section 10.2 says:

When used for declaring which meta terms are used in the document, unnecessary; omit it altogether, and register the names.

When used for triggering specific user agent behaviors: use a link element instead.

I think DC-HTML's use of head/@profile places it into the second of these categories: the profile doesn't "declare" a bounded set of terms, but it specifies how a (potentially "open-ended") set of attribute values are to be interpreted/processed.

Furthermore, the draft HTML+RDFa document proposes the (optional) use of a link/@rel value of "profile", and there is a corresponding entry in the registry for @rel values. This seems to be a mechanism for (re-)introducing the HTML4 concept of the meta data profile, using a different syntactic form i.e. using link/@rel in place of the profile attribute. I'm not clear about the extent to which this has support within the wider HTML5 community. If it was adopted, I imagine the GRDDL specification would also evolve to use this mechanism, but that is guesswork on my part.

Julian Reschke summarises most of these issues related to DC-HTML in a recent message to the public-html mailing list.

3.2. Microdata

Microdata is a new specification, specific to HTML5. The "latest editor's draft" version is described as "a module that forms part of the HTML5 series of specifications published at the W3C". The content was previously a part of the "core" HTML5 specification, but the decision was taken recently to separate it from the main spec.

Microdata offers similar functionality to that offered by RDFa in that it allows for the embedding of data anywhere in an HTML5 document. Like RDFa, microdata is a generalised mechanism, not one tied to any particular set of terms, and also like RDFa, microdata introduces a new set of attributes, to be used in combination with existing HTML5 attributes. The syntactic conventions used in microdata are inspired principally by the conventions used in various microformats.

As for the case of RDFa, my purpose here is not to provide a full description of microdata, but to examine whether and how microdata can express the data that in HTML4/XHTML is expressed using the conventions of the DC-HTML profile.

The model underlying microdata is one of nested lists of name-value pairs:

The microdata model consists of groups of name-value pairs known as items.

Each group is known as an item. Each item can have an item type, a global identifier (if the item type supports global identifiers for its items), and a list of name-value pairs. Each name in the name-value pair is known as a property, and each property has one or more values. Each value is either a string or itself a group of name-value pairs (an item).
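
The model in that quoted passage is simple enough to write down as a data structure. The following Python sketch is only my reading of the prose: the class and field names are invented, and it ignores how items are actually extracted from markup.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Union


@dataclass
class Item:
    """One microdata item: a group of name-value pairs."""
    item_type: Optional[str] = None  # optional item type
    item_id: Optional[str] = None    # optional global identifier
    # Each property name maps to one or more values; a value is either
    # a string or another (nested) Item.
    properties: Dict[str, List[Union[str, "Item"]]] = field(default_factory=dict)

    def add(self, name, value):
        """Append a value to the named property, creating it if needed."""
        self.properties.setdefault(name, []).append(value)


doc = Item(item_id="http://example.org/doc/001")
doc.add("http://purl.org/dc/terms/modified", "2007-07-22")
print(doc.properties)
```

Note that a value here is just a string or another item: there is no slot for a datatype or a language tag alongside the string.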

The microdata model is independent of the RDF model, and is not designed to represent the full RDF model. In particular, microdata does not require the use of URIs as identifiers for properties, though it does allow for the use of URIs. Microdata does not offer - as many RDF syntaxes, including RDFa, do - a mechanism for abbreviating property URIs. But the microdata spec does include an algorithm that provides a (partial, I think?) mapping from microdata to a set of RDF triples.

Probably the main feature of RDF that has no correspondence in microdata is literal datatyping - see the discussion by Jeni Tennison here - though there is a distinct element/attribute for date-time values.

Given this constraint, I don't think it is possible to express the first triple of my example above using microdata. If the typed literal is replaced with a plain literal (i.e. the object is "2007-07-22", rather than "2007-07-22"^^xsd:date), then I think the two triples could be encoded (using the XML serialisation) as follows, i.e. the property URIs appear in full as attribute values:

Example 5: Microdata in HTML5

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE HTML>
<html>
  <head>
    <title>Document 001</title>
    <meta name="http://purl.org/dc/terms/modified"
          content="2007-07-22" />
    <link rel="http://example.org/terms/commentsOn"
          href="http://example.org/docs/123" />
  </head>
</html>

As for the case of RDFa, microdata supports the embedding of data in the body of the document, so the triples could (I think!) also be represented as:

Example 6: Microdata in HTML5 (2)

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE HTML>
<html>
  <head>
    <title>Document 001</title>
  </head>
  <body>
    <div itemscope="" itemid="http://example.org/doc/001">
      <p>
        Last modified on:
        <time itemprop="http://purl.org/dc/terms/modified"
              datetime="2007-07-22">22 July 2007</time>
      </p>
      <p>
        Comments on:
        <a rel="http://example.org/terms/commentsOn"
           href="http://example.org/docs/123">Document 123</a>
      </p>
    </div>
  </body>
</html>

My understanding is that the itemid attribute is necessary to set the subject of the triple to the URI of the document, but I could be wrong on this point.

Also I think it's worth noting that the microdata-to-RDF algorithm specifies an RDF interpretation for some "core" HTML5 elements and attributes. For example:

  • the head/title element is mapped to a triple with the predicate http://purl.org/dc/terms/title and the element content as literal object
  • meta elements with name and content attributes are mapped to triples where the predicate is either (if the name attribute value is a URI) the value of the name attribute, or the concatenation of the string "http://www.w3.org/1999/xhtml/vocab#" and the value of the name attribute. So, e.g., a name attribute value of "DC.modified" would generate a predicate http://www.w3.org/1999/xhtml/vocab#DC.modified.
  • A similar rule applies for the link element. So, e.g., a rel attribute value of "EX.commentsOn" would generate a predicate http://www.w3.org/1999/xhtml/vocab#EX.commentsOn and a rel attribute value of "schema.DC" would generate a predicate http://www.w3.org/1999/xhtml/vocab#schema.DC
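
As a rough sketch of the meta/@name and link/@rel rule just described (an illustration of my reading of it, not the specification's own algorithm), the mapping could be written in Python as follows. The function name predicate_for is made up for this example, and the test for "is a URI" (the value parses with a scheme) is my assumption rather than the draft's wording.

```python
from urllib.parse import urlparse

# Prefix used for attribute values that are not themselves URIs
XHV = "http://www.w3.org/1999/xhtml/vocab#"


def predicate_for(value: str) -> str:
    """Map a meta/@name or link/@rel value to a predicate URI."""
    parsed = urlparse(value)
    # Assumption: a value counts as a URI if it has a scheme and something after it
    if parsed.scheme and (parsed.netloc or parsed.path):
        return value
    # Otherwise simply concatenate the vocabulary URI and the value
    return XHV + value


if __name__ == "__main__":
    for v in ("DC.modified", "schema.DC", "http://purl.org/dc/terms/modified"):
        print(v, "->", predicate_for(v))
```

Note that nothing here consults the schema.DC declaration: the "DC." prefix is just part of the string that gets appended.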

As far as I can tell, these are rules to be applied to any HTML5 document - there is no "flag" to say that they apply to document A but not to document B - so they would need to be taken into consideration in any DCMI convention for using meta/@name and link/@rel attributes in HTML5. For example, given the following HTML5 document (and leaving aside for a moment the registration issue, and assuming that "DC.modified", "EX.commentsOn", "schema.DC" and "schema.EX" are registered values for @name and @rel)

Example 7: Microdata in HTML5 (3)

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE HTML>
<html>
  <head>
    <title>Document 001</title>
    <link rel="schema.DC"
          href="http://purl.org/dc/terms/" />
    <link rel="schema.EX"
          href="http://example.org/terms/" />
    <meta name="DC.modified"
          content="2007-07-22" />
    <link rel="EX.commentsOn"
          href="http://example.org/docs/123" />
  </head>
</html>

the microdata-to-RDF algorithm would generate the following five RDF triples:

@prefix dc: <http://purl.org/dc/terms/> .
@prefix xhv: <http://www.w3.org/1999/xhtml/vocab#> .
<> dc:title "Document 001" ;
   xhv:schema.DC <http://purl.org/dc/terms/> ;
   xhv:schema.EX <http://example.org/terms/> ;
   xhv:DC.modified "2007-07-22" ;
   xhv:EX.commentsOn <http://example.org/docs/123> .
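
To make that extraction concrete, here is a minimal Python sketch that walks the head of a document like Example 7 and collects predicate/object pairs. It is only an approximation of the draft algorithm as I understand it: the class name HeadTriples is hypothetical, the URI test (looking for "://") is a simplification, and a space-separated @rel value is treated as a single value.

```python
from html.parser import HTMLParser

XHV = "http://www.w3.org/1999/xhtml/vocab#"
DC_TITLE = "http://purl.org/dc/terms/title"

# The head of the example document, abbreviated
EXAMPLE = """<html><head>
<title>Document 001</title>
<link rel="schema.DC" href="http://purl.org/dc/terms/" />
<link rel="schema.EX" href="http://example.org/terms/" />
<meta name="DC.modified" content="2007-07-22" />
<link rel="EX.commentsOn" href="http://example.org/docs/123" />
</head></html>"""


class HeadTriples(HTMLParser):
    """Collect (predicate, object) pairs from title, meta and link elements."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._in_title = False

    @staticmethod
    def _predicate(value):
        # Simplified assumption: anything containing "://" is already a URI
        return value if "://" in value else XHV + value

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and "name" in a and "content" in a:
            # Literal object: the content attribute value
            self.pairs.append((self._predicate(a["name"]), a["content"]))
        elif tag == "link" and "rel" in a and "href" in a:
            # URI object, written here in angle brackets
            self.pairs.append((self._predicate(a["rel"]), "<" + a["href"] + ">"))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.pairs.append((DC_TITLE, data))


if __name__ == "__main__":
    parser = HeadTriples()
    parser.feed(EXAMPLE)
    for predicate, obj in parser.pairs:
        print(predicate, obj)
```

The subject of every pair is the document itself, so it is left implicit in the sketch.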

It's probably worth emphasising that although the URIs generated here are not DCMI-owned URIs, it would be quite possible to assert an "equivalence" between a property with a URI beginning http://www.w3.org/1999/xhtml/vocab# and a corresponding DCMI-owned URI, which would imply a second triple using that DCMI-owned URI (e.g. <http://www.w3.org/1999/xhtml/vocab#DC.modified> owl:equivalentProperty <http://purl.org/dc/terms/modified>) - though, AFAICT, no such equivalence is suggested at the moment.
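
A small Python sketch of what such an equivalence would buy a consumer of the data: given a table of equivalent properties, each matching triple implies a second one. The EQUIVALENT table below is hypothetical (as noted, no such assertion exists at the moment), the function name is made up, and only one direction of the equivalence is applied to keep the example short.

```python
XHV = "http://www.w3.org/1999/xhtml/vocab#"
DCT = "http://purl.org/dc/terms/"

# Hypothetical owl:equivalentProperty assertions, keyed by the generated URI
EQUIVALENT = {XHV + "DC.modified": DCT + "modified"}


def with_equivalents(triples, equivalent=EQUIVALENT):
    """Return the input triples plus one implied triple per equivalent predicate."""
    result = list(triples)
    for subject, predicate, obj in triples:
        if predicate in equivalent:
            result.append((subject, equivalent[predicate], obj))
    return result


if __name__ == "__main__":
    doc = "http://example.org/doc/001"
    for triple in with_equivalents([(doc, XHV + "DC.modified", "2007-07-22")]):
        print(triple)
```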

3.3. RDFa in HTML5

I noted above that, although the initial RDFa syntax specification had focused on the case of XHTML, a recent draft sought to extend this by describing the use of RDFa in HTML, including the case of HTML5.

As I already discussed, using RDFa, it is quite possible to represent any data that could be represented in HTML4/XHTML using the DC-HTML profile. So, using RDFa in HTML5, my two example triples could be represented (using the XML serialisation) as:

Example 8: RDFa in HTML5

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE HTML>
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/terms/"
      xmlns:ex="http://example.org/terms/"
      xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
      version="XHTML+RDFa 1.0">
  <head>
    <title>Document 001</title>
    <meta property="dc:modified"
          datatype="xsd:date" content="2007-07-22" />
    <link rel="ex:commentsOn"
          href="http://example.org/docs/123" />
  </head>
</html>

Note that the RDFa in HTML draft requires the use of the html/@version attribute which, in the current draft, is obsoleted by the "core" HTML5 specification. As noted above, RDFa in HTML also proposes the (optional) use of a link/@rel value of "profile".

In the initial discussion of RDFa above, I noted the existence of a list of "reserved keyword" values for the link/@rel attribute in XHTML, to which an RDFa processor prefixes the URI to generate RDF predicate URIs. In HTML5, that list of reserved values is defined not by the HTML5 specification but by the WHATWG registry of @rel values. So there may be cases where a value used in a link/@rel attribute in an HTML4/XHTML document does not result in an RDFa processor generating a triple (because that value is not included in the HTML4/XHTML reserved list), but the same value in a link/@rel attribute in an HTML5 document does cause an RDFa processor to generate a triple (because that value is included in the HTML5 @rel registry). My understanding is that the RDFa/CURIE processing model is designed to cope with such host-language-specific variations, but it is something document creators will need to be aware of.
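
The host-language dependence can be pictured with a short Python sketch. The two keyword sets are placeholders invented for the example (they are not the real XHTML list or the real WHATWG registry), the function name is hypothetical, and CURIE handling is left out entirely.

```python
XHV = "http://www.w3.org/1999/xhtml/vocab#"

# Placeholder reserved-keyword lists, one per host language (NOT the real lists)
RESERVED = {
    "xhtml": {"next", "prev"},
    "html5": {"next", "prev", "example-registered"},
}


def rel_predicate(rel, host_language):
    """Return the predicate URI for a plain @rel keyword, or None if no triple results."""
    if rel in RESERVED[host_language]:
        return XHV + rel
    # Not reserved in this host language: the value is ignored
    return None


if __name__ == "__main__":
    for lang in ("xhtml", "html5"):
        print(lang, rel_predicate("example-registered", lang))
```

The same attribute value yields no triple under one list and a triple under the other, which is the situation document creators would need to watch for.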

4. Some concluding thoughts

DCMI's specifications for embedding metadata in X/HTML have focused on "document metadata", data describing the document. The current DCMI Recommendation for encoding DC metadata in HTML was created in 2008, and is based on the DCMI Abstract Model and on RDF. The syntactic conventions are largely those developed by DCMI in the late 1990s. The current document was developed with reference to HTML4 and XHTML only, and it does not take into consideration the changes introduced by HTML5. The conventions described are not limited to the use of a fixed set of DCMI-owned properties, but support the representation of document metadata using any RDF properties.

Looking at the HTML5 drafts raises various issues:

  1. HTML5 removes some of the syntactic components used by the DC-HTML profile in HTML4/XHTML, namely the scheme and profile attributes.
  2. HTML5 introduces a requirement for the registration of meta/@name and link/@rel attribute values; the current DC-HTML specification makes the assumption that an "open-ended" set of values is available for those attributes.
  3. The status of the concept of the meta data profile in HTML5 seems unclear. On the one hand, the profile attribute has been removed, but the proposed registration of a link/@rel value suggests that the profile approach is still available in HTML5.
  4. The microdata specification provides "global" RDF interpretations for meta and link elements in HTML5.
  5. Microdata offers a mechanism for embedding data in HTML5 documents, and that mechanism can be used for embedding some RDF data in HTML5 documents. Microdata has some limitations (the absence of support for typed literals), but microdata could be used to express a large subset of the data that (in HTML4/XHTML) is expressed using the DC-HTML profile. The microdata specification is still a working draft and liable to change.
  6. RDFa also offers a mechanism for embedding RDF data in HTML5 documents. RDFa is designed to support the RDF model, and RDFa could be used in HTML5 to express the same data that (in HTML4/XHTML) is expressed using the DC-HTML profile. The specification for using RDFa in HTML5 is still a working draft and liable to change.

There seems to be a possible tension between HTML5's requirement for the registration of meta/@name and link/@rel values and the assumption in DC-HTML that an "open-ended" set of values is available. Also, the microdata specification's mapping by simple concatenation of registered meta/@name and link/@rel values to URIs differs from DC-HTML's use of a prefix-to-URI mapping. However, as I suggested above, it seems quite probable that at least some applications using Dublin Core metadata in HTML do indeed operate on the basis of a small set of invariant meta/@name and link/@rel values corresponding to (a subset of the) DCMI-owned properties, i.e. they use a subset of the conventions of DC-HTML to represent document metadata using only a limited set of DCMI-owned properties - to represent data conforming to what DCMI would call a single "description set profile". With the addition of assertions of equivalence between properties (see above), then it would be possible to represent data conforming to the version of "Simple Dublin Core" that I described a little while ago - i.e. using only the properties of the Dublin Core Metadata Element Set with plain literal values - using HTML5 meta/@name (with a registered set of 15 values) and meta/@content attributes.

Both microdata and RDFa, on the other hand, are extensions to HTML5 that are designed to provide generalised, "vocabulary-agnostic" conventions for embedding data in HTML5 documents. With either microdata or RDFa, the data embedded may include, but is not limited to, "document metadata". RDFa is designed to represent RDF data; and microdata can be used to represent some RDF data (it lacks support for typed literals). RDFa includes abbreviation mechanisms for URIs that are broadly similar to those used in the DC-HTML profile (in the sense that they both use a sort of "prefixed name" to abbreviate URIs); microdata does not provide such a mechanism and (I think) would require the use of URIs in full.
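
To show how similar the two "prefixed name" mechanisms are, here is a minimal Python sketch of each expansion. The function names are made up, error handling for undeclared prefixes is omitted, and the prefix tables stand in for the xmlns:dc declaration (RDFa) and the link rel="schema.DC" declaration (DC-HTML) used in the examples.

```python
def expand_curie(curie, prefixes):
    """Expand an RDFa-style CURIE such as 'dc:modified'."""
    prefix, _, reference = curie.partition(":")
    return prefixes[prefix] + reference


def expand_dc_html(name, schemas):
    """Expand a DC-HTML-style name such as 'DC.modified'."""
    prefix, _, local_name = name.partition(".")
    return schemas[prefix] + local_name


if __name__ == "__main__":
    # Prefix declared with xmlns:dc in the RDFa case
    print(expand_curie("dc:modified", {"dc": "http://purl.org/dc/terms/"}))
    # Prefix declared with link rel="schema.DC" in the DC-HTML case
    print(expand_dc_html("DC.modified", {"DC": "http://purl.org/dc/terms/"}))
```

Both calls arrive at the same property URI; the difference lies only in the separator and in where the prefix is declared.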

Both microdata and RDFa address the problem that DCMI seeks to address via the DC-HTML profile, in the context of a more generalised mechanism for embedding data in HTML5 documents. Both microdata and RDFa could be used in HTML5 to represent document metadata that in HTML4/XHTML is represented using the DC-HTML profile (only partially in the case of microdata, because of the absence of datatyping support). Currently, the documents describing RDFa in HTML5 and microdata are both still in draft and the subjects of vigorous debate, and it remains to be seen how/whether they progress through the W3C process, and how implementers respond. But it seems to me the use of either would offer an opportunity for DCMI to move away from the maintenance of a DCMI-defined convention and to align with more generalised approaches.
