
September 01, 2009

Experiments with DSP and Schematron

There has been some discussion recently around DCMI's draft Description Set Profile specification, both on the dc-architecture Jiscmail list and, briefly, on Twitter.

From my perspective, the DSP specification is one of the most interesting recent technical developments made by DCMI. For me, it provides the long-needed piece of the jigsaw that enables us to construct a coherent picture of what a "DC application profile" is. What do these tabular lists of "terms", or combinations of terms, that have typically appeared in these documents people call "DC application profiles" actually "say"? What does "use dc:subject with vocabulary encoding schemes S and T" actually "mean"? How can we formalise this information?

To recap, the DSP specification takes the approach that what is at issue here is a set of "structural constraints" on the information structure that the DCMI Abstract Model calls a "description set". The DCAM itself defines the basic structure (a "description set" contains "descriptions"; a "description" contains "statements"; a "statement" contains a "property URI", an optional "value URI" and "vocabulary encoding scheme URI", and so on). But that's where the DCAM stops: it doesn't say anything about any particular set of property URIs or vocabulary encoding scheme URIs; it doesn't specify whether, in the particular set of description sets I'm creating, plain literals should be in English or Spanish. This is where the DSP spec comes in. The DSP model allows us to say, "I want to apply a more specific set of requirements: a description of a book must provide a title (i.e. must include a statement with property URI http://purl.org/dc/terms/title) and must include exactly two subject terms from LCSH (i.e. must include two statements with property URI http://purl.org/dc/terms/subject and vocabulary encoding scheme URI http://purl.org/dc/terms/LCSH)", or "a description of a person is optional, but if included it must provide a name (i.e. must include a statement with property URI http://xmlns.com/foaf/0.1/name)".
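To make the book example concrete, here is a minimal Python sketch of the idea, not of the DSP processing algorithm itself. The class names, the `min_occurs`/`max_occurs` fields and the `violations` helper are all hypothetical, and the matching is deliberately reduced to property URI plus an optional vocabulary encoding scheme URI.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

DCT = "http://purl.org/dc/terms/"


@dataclass
class Statement:
    property_uri: str
    value_uri: Optional[str] = None
    ves_uri: Optional[str] = None  # vocabulary encoding scheme URI


@dataclass
class Description:
    statements: List[Statement] = field(default_factory=list)


@dataclass
class StatementTemplate:
    # Hypothetical, simplified stand-in for a DSP Statement Template.
    property_uri: str
    min_occurs: int = 0
    max_occurs: Optional[int] = None  # None means "no upper bound"
    ves_uri: Optional[str] = None     # None means "any scheme, or none"


def count_matches(description: Description, template: StatementTemplate) -> int:
    """Count statements matching the template's property (and scheme, if set)."""
    return sum(
        1
        for s in description.statements
        if s.property_uri == template.property_uri
        and (template.ves_uri is None or s.ves_uri == template.ves_uri)
    )


def violations(description: Description,
               templates: List[StatementTemplate]) -> List[Tuple[str, int]]:
    """Return (property URI, actual count) for each occurrence constraint not met."""
    problems = []
    for t in templates:
        n = count_matches(description, t)
        too_few = n < t.min_occurs
        too_many = t.max_occurs is not None and n > t.max_occurs
        if too_few or too_many:
            problems.append((t.property_uri, n))
    return problems


# "A book must provide a title and exactly two subject terms from LCSH."
book_templates = [
    StatementTemplate(DCT + "title", min_occurs=1),
    StatementTemplate(DCT + "subject", min_occurs=2, max_occurs=2,
                      ves_uri=DCT + "LCSH"),
]

# An instance description with a title but only one LCSH subject.
book = Description([
    Statement(DCT + "title"),
    Statement(DCT + "subject", ves_uri=DCT + "LCSH"),
])

print(violations(book, book_templates))
```

Here the title requirement is met and the subject requirement is not, so only the subject template is reported.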

To express these constraints, the spec defines a model of "Description Templates", in turn containing sets of "Statement Templates". A set of such templates provides a set of "patterns", if you like, to which some set of actual "instance" description sets can "match" or "conform". The specification also defines both an XML syntax and an RDF vocabulary for representing such a set of constraints.

As an aside, it's also worth noting that a single description set may be matched against multiple profiles, depending on the context (or indeed against none: there is no absolute requirement that a description set matches any DSP at all). The same description set may be tested against a fairly permissive set of constraints in one context, and a "tighter" set of constraints in another: the same description set may match the former, and fail to match the latter. To paraphrase James Clark's comments on XML schema, "validity" should be treated not as a property of a description set but as a relationship between a description set and a description set profile.

The current draft is very much just that, a draft on which feedback is being gathered. Are the current constraints fully/clearly specified? Is the processing algorithm complete/unambiguous? Are the current constraint types the ones typically required? Are there other constraint types which would be useful? And it is almost certain that there will be changes made in a future version, but nevertheless, it seems to me it is a very solid first step, and it's very encouraging to see that implementers are starting to test out the current model in earnest.

One of the questions that I've been asked in discussions is that of how the DSP model relates to XML schema languages.

A description set might be represented in many different concrete formats, including XML formats. XML schema languages (and here I'm using that term in a generic sense to refer to the family of technologies, not specifically to W3C XML Schema, one particular XML schema language) allow you to express a set of structural constraints on an XML document.

An XML format which is designed to serialise the description set structure provides a mapping between the components in that structure and some set of components in an XML document (XML elements and attributes, their names and their content and values).

And so, for such an XML format, it should be possible to map a DSP - a set of structural constraints on a description set - into a corresponding set of constraints on an instance of that XML format. I say "should" because there are a number of factors to be taken into consideration here:

  • The current draft DSP model includes some constraints which are not strictly structural. For example, the model allows a Statement Template to include a "Sub-property Constraint" (6.4.2), which allows a DSP to "say" things like "This statement template applies to a statement referring to any subproperty of the property dc:contributor". A processor attempting to determine whether or not a particular statement referring to some property ex:property matches such a constraint requires information about that property external to the description set itself in order to know whether the DSP requirement is met or not.
  • Whether all the constraints can be reflected in an XML schema depends on the characteristics of the XML format and on the features of the XML schema language. Different XML schema languages have different capabilities when it comes to expressing structural constraints, and, for a single XML format, one schema language may be able to express constraints which another cannot. So when mapping the DSP constraints into an XML schema, one schema language may capture more of the constraints on the XML document than another, depending on the nature of the XML format.
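The first point can be illustrated with a small Python sketch. It assumes the processor has been handed a table of rdfs:subPropertyOf relationships from somewhere outside the description set; the `SUB_PROPERTY_OF` table, the example.org properties and the `is_subproperty` helper are all hypothetical.

```python
DCT_CONTRIBUTOR = "http://purl.org/dc/terms/contributor"

# Hypothetical schema information, obtained from outside the description set
# (for example by dereferencing the property URIs). Maps a property URI to the
# properties it is declared to be a direct subproperty of.
SUB_PROPERTY_OF = {
    "http://example.org/terms/illustrator": ["http://example.org/terms/artist"],
    "http://example.org/terms/artist": [DCT_CONTRIBUTOR],
}


def is_subproperty(prop, ancestor, sub_property_of):
    """True if prop equals ancestor or reaches it via subPropertyOf links."""
    seen = set()
    stack = [prop]
    while stack:
        current = stack.pop()
        if current == ancestor:  # rdfs:subPropertyOf is reflexive
            return True
        if current in seen:      # guard against cycles in the schema data
            continue
        seen.add(current)
        stack.extend(sub_property_of.get(current, ()))
    return False


print(is_subproperty("http://example.org/terms/illustrator",
                     DCT_CONTRIBUTOR, SUB_PROPERTY_OF))
print(is_subproperty("http://example.org/terms/publisher",
                     DCT_CONTRIBUTOR, SUB_PROPERTY_OF))
```

Without the lookup table the second call cannot be distinguished from the first, which is why a purely structural check over the XML instance cannot settle this kind of constraint.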

Anyway, to try to illustrate one possible application of the DSP model, I've spent some time recently playing around with XSLT and Schematron to try to create an XSLT transformation which:

  • takes as input an instance of the DSP-XML format described in the current draft (version of 2008-03-31) i.e. a representation of a DSP in XML; and
  • provides as output a Schematron schema containing a corresponding set of patterns expressing constraints on an instance of the XML format described in the proposed recommendation for the XML format known as DC-DS XML (version of 2008-09-01).

I should emphasise that I'm very much a newcomer to Schematron, my XSLT is a bit rusty, I haven't tested what I've done exhaustively, and I've worked on this on and off over a few days and haven't done a great deal to tidy up the results. So I'm sure there are more elegant and efficient ways of achieving this, but, FWIW, I've put where I've got to on a page on the DCMI Architecture Forum wiki.

The transform is dsp2sch-dcds.xsl

To illustrate its use, I created a short DSP-XML document and a few DC-DS XML instances.

bookdsp.xml is a DSP-XML representation of a short example DSP. It's loosely based on the book-person example that Tom Baker and Karen Coyle used in their recently published Guidelines for Dublin Core Application Profiles, but I've tweaked and extended it to include a broader range of constraints.

Running the transform against that DSP generates a Schematron schema: dsp-dcds.xml.

The page on the wiki lists a few example DC-DS XML instances, and the results of validating those instances against this Schematron schema. So for example, book4.xml is a DC-DS XML instance which conforms to the syntactic rules of the format, but fails to match some of the constraints of the Book DSP (the DSP allows the "book" description to have only two statements using the dc:creator property, and the example has three; and the DSP allows only two "person" descriptions, and the example has three). The result of validation using the Schematron schema is the document valbook4.xml. (The Schematron processor outputs an XML format called Schematron Validation Report Language (SVRL), which is a bit verbose, but fairly self-explanatory; it could be post-processed into a more human-readable format.)
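As a rough idea of what such post-processing might look like, the following Python sketch pulls the failed assertions out of an SVRL report and prints one line per failure. The `SAMPLE` report is invented for illustration and is not the content of valbook4.xml; `summarise_svrl` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

SVRL = "http://purl.oclc.org/dsdl/svrl"
NS = {"svrl": SVRL}

# Invented SVRL fragment, for illustration only.
SAMPLE = """<svrl:schematron-output xmlns:svrl="http://purl.oclc.org/dsdl/svrl">
  <svrl:failed-assert test="count(dcds:statement) &lt;= 2"
                      location="/dcds:descriptionSet/dcds:description[1]">
    <svrl:text>Too many creator statements</svrl:text>
  </svrl:failed-assert>
  <svrl:successful-report test="true()" location="/dcds:descriptionSet">
    <svrl:text>Informational message</svrl:text>
  </svrl:successful-report>
</svrl:schematron-output>"""


def summarise_svrl(svrl_text):
    """Return 'location: message' lines for each svrl:failed-assert element."""
    root = ET.fromstring(svrl_text)
    lines = []
    for failed in root.iter("{%s}failed-assert" % SVRL):
        text = failed.findtext("svrl:text", default="", namespaces=NS)
        message = " ".join(text.split())  # collapse whitespace in the message
        lines.append("%s: %s" % (failed.get("location", "?"), message))
    return lines


for line in summarise_svrl(SAMPLE):
    print(line)
```

The sketch deliberately skips svrl:successful-report elements, so purely informational messages are dropped and only actual mismatches are listed.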

The approach taken is, roughly, that the transform generates:

  • a Schematron pattern with a rule with context dcds:descriptionSet, which, for each Description Template, tests for the number of child dcds:description elements that satisfy that Description Template's Resource Class Membership Constraint (more on this below), using a corresponding XPath predicate. e.g. from the bookdsp example dcds:description[dcds:statement[@dcds:propertyURI='http://www.w3.org/1999/02/22-rdf-syntax-ns#type' and (@dcds:valueURI='http://purl.org/dc/dcmitype/Collection')]]
  • for each DSP Description Template, a Schematron pattern with a rule with context dcds:description[the resource class membership predicate above], which tests the Standalone Constraints, and then, for each Statement Template, tests for the number of child dcds:statement elements that satisfy the Statement Template's Property Constraint, using a corresponding XPath predicate. e.g. from the bookdsp example dcds:statement[@dcds:propertyURI='http://purl.org/dc/terms/title']
  • for each DSP Statement Template that specifies a Type Constraint, a Schematron pattern with a rule with context dcds:description[the resource class membership predicate above]/dcds:statement[the property predicate above], which tests for the various other (Literal or Non-Literal) constraints specified within the Statement Template.
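The core of this approach is building XPath expressions from the values in the templates. The transform itself does this in XSLT; the sketch below shows the same string-building in Python, purely for illustration. The function names and the `min_occurs`/`max_occurs` parameters are hypothetical, and URIs are assumed to contain no quote characters.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def class_membership_predicate(class_uris):
    """XPath selecting descriptions typed with any of the listed class URIs."""
    values = " or ".join("@dcds:valueURI='%s'" % uri for uri in class_uris)
    return (
        "dcds:description[dcds:statement["
        "@dcds:propertyURI='%s' and (%s)]]" % (RDF_TYPE, values)
    )


def property_predicate(property_uri):
    """XPath selecting statements that refer to the given property URI."""
    return "dcds:statement[@dcds:propertyURI='%s']" % property_uri


def occurrence_test(path, min_occurs, max_occurs=None):
    """XPath boolean test on the number of nodes selected by path."""
    test = "count(%s) >= %d" % (path, min_occurs)
    if max_occurs is not None:
        test += " and count(%s) <= %d" % (path, max_occurs)
    return test


# Selects the descriptions to which a Description Template applies.
print(class_membership_predicate(["http://purl.org/dc/dcmitype/Collection"]))

# A candidate value for the test attribute of a Schematron assert:
# "at most two creator statements".
print(occurrence_test(
    property_predicate("http://purl.org/dc/terms/creator"), 0, 2))
```

The first expression printed is the same as the class membership example quoted in the list above; the second is the kind of expression that could appear as the test of an assertion evaluated in the context of a matching dcds:description.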

A few thoughts and notes are in order.

  1. The transform is specific to the version of the DSP-XML format specified in the current draft, and to the current version of the DC-DS XML format. If either of these changes then the transform will require modification. Another transform could be written to generate patterns for another XML format, e.g. for RDF/XML (or maybe more easily, a defined "profile" of RDF/XML) or even for the use of the DC-HTML profile for embedding data in XHTML meta/link elements (subject to the limitations of that profile in terms of which aspects of the DCAM description model are supported).
  2. It assumes that the DSP XML instance is valid, and that the DC-DS XML instance is valid, in the sense that it conforms to the basic syntactic rules of that format. (I've got some additional general, DSP-independent Schematron patterns for DC-DS XML, which in theory could be "included" in the generated schema, but I haven't managed to get that to work correctly yet.)
  3. The output from the current version includes a lot of "informational" reporting ("this description contains three statements" etc), as well as actual error messages for mismatches with the DSP constraints. Mostly this was to help me debug the transform and get my head round how Schematron was working, but it makes the output rather verbose. I've left it in for now, but I might remove or reduce it in a subsequent version.
  4. What I've come up with currently implements only a subset of the model in the DSP draft. In particular, I've ignored constraints that go beyond the structural and require checking beyond the syntactic level (like the Sub-property Constraint I mentioned above). And for some other constraints, I've adopted a "strictly structural" interpretation: this is the case for the Description Template Resource Class Membership Constraint (5.5), which I interpreted as "the description should contain a statement referring to the rdf:type property with a value URI matching one of the listed URIs", and for the Statement Template Value Class Membership Constraint (6.6.2), which I interpreted as "there should be a description of the value containing a statement referring to the rdf:type property with a value URI matching one of the listed URIs". That is, I haven't allowed for the possibility that an application might derive information about the resource type from property semantics (e.g. from inferencing based on RDF schema range/domain).
  5. Finally, the handling of literals is somewhat simplistic. In particular, I haven't given any thought to the handling of XML literals, but even leaving that aside it probably needs some additional character escaping.
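On the escaping point in the last note: XPath 1.0 string literals have no escape mechanism, so a literal value containing both kinds of quote character has to be assembled with concat(). The sketch below shows one common way of doing that in Python; it is an illustration of the problem, not what the transform currently does, and `xpath_string_literal` is a hypothetical helper name.

```python
def xpath_string_literal(value):
    """Return an XPath 1.0 expression that evaluates to the given string."""
    if "'" not in value:
        return "'%s'" % value
    if '"' not in value:
        return '"%s"' % value
    # Both quote characters are present: split on the apostrophes and
    # rejoin the pieces with concat(), quoting each apostrophe with "...".
    pieces = []
    for i, part in enumerate(value.split("'")):
        if i > 0:
            pieces.append('"\'"')
        if part:
            pieces.append("'%s'" % part)
    return "concat(%s)" % ", ".join(pieces)


print(xpath_string_literal("Moby Dick"))
print(xpath_string_literal("The Handmaid's Tale"))
print(xpath_string_literal('He said "it\'s fine"'))
```

An expression built this way would still need ordinary XML attribute escaping (for `&`, `<` and the enclosing quote character) before being written into a Schematron test attribute.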

Anyway, I intend this not as any sort of "authorised" tool, nor as "the finished article", but as a fairly rough first stab at an example of the sort of XML-schema-based functionality that I think can be built starting from the DSP model, and as a contribution to the ongoing discussion of the current working draft.

