Why an abstract model for Dublin Core metadata?
In a comment on my earlier post about the DCMI Abstract Model, Jonathan Rochkind asked for some clarification on the motivation for developing the DCAM, and the problems it was designed to address.
(I should emphasise that this represents only my personal view, and I'm not speaking on behalf of the co-authors of the DCAM or of DCMI.)
In my previous post, I said that there were two primary aspects to the DCMI Abstract Model:
- it describes an abstract information structure, the description set, in terms of the components which make up that structure (description, resource URI, statement etc) and the relationships between those components (a description set contains one or more descriptions; a description contains one or more statements; and so on)
- it describes how an instance of that structure is to be interpreted, in terms of "what it says" about "things in the world" (each statement in a description "says" that the described resource is related in the way specified by (the resource identified by) the property URI to another resource; and so on)
As part of that second aspect, the DCAM also describes the types of "metadata terms" that are referenced in description sets (property, class, vocabulary encoding scheme, syntax encoding scheme) and the relationships that can exist between terms of those types (subproperty or element refinement, subclass).
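As a rough illustration, those two aspects can be sketched in Python. This is a deliberately simplified sketch, not the DCAM's own UML: the class names, the single `value` field (standing in for whatever the DCAM uses to represent a value) and the `interpret` function are assumptions made for the example, and the `example.org` URI is a placeholder.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Statement:
    property_uri: str  # identifies the property, i.e. the way the resources are related
    value: str         # simplification: a URI or string standing for the value


@dataclass
class Description:
    statements: List[Statement]
    resource_uri: Optional[str] = None  # the described resource, if it has a URI

    def __post_init__(self) -> None:
        # First aspect: a description contains one or more statements.
        if not self.statements:
            raise ValueError("a description contains one or more statements")


@dataclass
class DescriptionSet:
    descriptions: List[Description]

    def __post_init__(self) -> None:
        # First aspect: a description set contains one or more descriptions.
        if not self.descriptions:
            raise ValueError("a description set contains one or more descriptions")


def interpret(description_set: DescriptionSet) -> List[Tuple[Optional[str], str, str]]:
    """Second aspect: read each statement as (described resource, property, value)."""
    return [
        (d.resource_uri, s.property_uri, s.value)
        for d in description_set.descriptions
        for s in d.statements
    ]


example = DescriptionSet([
    Description(
        resource_uri="http://example.org/doc/1",
        statements=[
            Statement("http://purl.org/dc/elements/1.1/title", "An example document"),
        ],
    )
])
for triple in interpret(example):
    print(triple)
```

The point of the sketch is only that the structure (what contains what) and the interpretation (what each statement "says") are separate things, and neither depends on any particular syntax.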
I guess the first thing to say is that, before the development of the document we call the DCMI Abstract Model (which took place from about mid-2003, with the final document made a DCMI Recommendation in early 2005), there already was an "abstract model" for Dublin Core metadata. DCMI had already embraced the notion that what some piece of DC metadata "said" was distinct from the (multiple) ways in which it might be "encoded": "Dublin Core metadata" could be encoded in multiple digital formats, and, conversely, instances of two different formats could be interpreted as encodings of the same metadata. Underlying this was an assumption - perhaps not as fully or clearly stated as it might have been - that there was some "abstraction" of "Dublin Core metadata" which was independent of any of those "concrete" syntaxes. Of the pre-DCAM documents published by DCMI, the one which comes closest to capturing fully what that abstraction was is probably the Usage Board's DCMI Grammatical Principles. I think it's reasonable to argue that the DCAM, first and foremost, consolidates, rationalises and makes explicit information which already existed (and also provides a more formal representation of it, through the use of UML).
However, that view is a slightly rose-tinted one of a somewhat muddier reality. It is perhaps more accurate to say that there were several such descriptions of "what Dublin Core metadata was", and those descriptions were not always completely consistent with each other. They often differed at least in their use of terminology, if not in the concepts they described. Some were in documents published by DCMI (e.g. the DCMI Grammatical Principles and the descriptions of "abstract models" included in Guidelines for implementing Dublin Core in XML), others in documents published elsewhere (e.g. Tom Baker's "A Grammar of Dublin Core" in D-Lib). Some were published early in the development of DC, others were more recent. With the emergence of the W3C's Resource Description Framework (RDF), DCMI published documents describing the use of RDF for DC metadata, and the use of DC often featured in examples in documents about RDF published by other parties. The terminology and concepts of RDF entered the lexicon of (at least a subset of) the Dublin Core community. While this brought the considerable benefits of aligning Dublin Core with a more formally defined model (and enabling the use of software tools that implemented that model), it also raised new questions: was DCMI's notion of element really the same as RDF's concept of property? Was DCMI's notion of vocabulary encoding scheme the same as RDF Schema's notion of class?
The consequence of this was that while there was, broadly at least, a shared conceptualisation of what DC metadata was, the devil was in the detail: there were sometimes subtle but significant differences between those different descriptions of the DC "abstract model". Implementers ended up with slightly different notions of what DC metadata was, and those differences were sometimes reflected in the software applications that were developed (e.g. the developers of two different applications might take different approaches to implementing the concept of element refinement). If data was processed within a single application (or a set of applications based on the same conceptualisation), then no inconsistencies were visible; but if data was transferred to an application based on a different conceptualisation then the applications might behave differently. As Stu Weibel puts it bluntly, "we don’t even have broad interoperability across Dublin Core systems, let alone with other metadata communities".
On that last note, I think the importance of clarifying these questions really came home to me when I started engaging in conversations involving people coming from different metadata communities, examining questions of interoperability between systems based on different metadata specifications. Those different metadata specifications had their own "abstract models" (even if they weren't always clearly identified as such), their own specifications of information structures and how those structures are interpreted. Often, those different communities apply names to the concepts within their models which are similar to, or the same as, the names used by the DCMI community for a rather different concept. The term "element" was one such example. It quickly became clear that, in our desire to find commonality rather than difference, we risked falling victim to the tendency to see what in my French O-Level class we called "false friends", making assumptions that superficially similar names in different languages must refer to the same concept.
So, to return to Jonathan's question of what problems we had before we developed the DCAM, I'd suggest they include the following:
- While we ("the DC community") did have a broadly shared understanding of "what DC metadata was", there were some areas where understandings and perceptions differed, and in the worst cases those differences resulted in two DC metadata applications behaving in different ways.
- There were multiple encodings for DC metadata, some defined by DCMI, some defined by other parties. Sometimes it was difficult or impossible to establish whether an instance of one such encoding represented the same information as an instance of another, which severely limited interoperability between systems using them.
- There was some confusion between features associated with certain digital formats or syntaxes that were used to represent DC metadata (e.g. the use of XML Namespaces to qualify the names of XML elements and attributes) and features associated with the abstract information structure (e.g. the use of URIs to identify metadata terms and other resources)
- We had begun to develop the concept of the DC application profile (DCAP) as a specification of how DC metadata was constructed in the context of some domain or application, typically describing the set of terms used and providing guidance about how the terms were used in that context, but beyond that general notion, there were different perceptions and understandings of the nature of a DCAP, and particularly of the types of terms that a DCAP might refer to
- Closely related to the previous point, there was some confusion about whether and how terms should be "declared" for use in DC metadata
- There was some confusion about the relationship between Dublin Core and RDF
- Our capacity to engage in dialogue about interoperability between systems based on different metadata specifications suffered because of a lack of clarity about our own abstract model and those of the other communities, and from misunderstandings arising from our use of terminology
- More broadly, there were varying perceptions of "what DC metadata was" (a set of fifteen elements, "metadata's lowest common denominator", something that appears in the <meta> element of HTML pages, an XML format used by the OAI Protocol for Metadata Harvesting, and so on)
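To make the second and third points a little more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions rather than a description of any real DC encoding: the `decode` function, the toy name/value "encodings" and the `example.org` URI are all hypothetical, and only the `http://purl.org/dc/elements/1.1/` namespace URI is real. The idea is that once each encoding is mapped to the same abstract statements, the question "do these two instances say the same thing?" becomes a simple comparison, and syntax-level details such as the choice of prefix drop out.

```python
from typing import Dict, Iterable, Set, Tuple

Triple = Tuple[str, str, str]  # (described resource URI, property URI, value)


def decode(
    resource_uri: str,
    pairs: Iterable[Tuple[str, str]],
    bindings: Dict[str, str],
    separator: str,
) -> Set[Triple]:
    """Map prefixed names (a syntax feature) to property URIs (a model feature)."""
    statements: Set[Triple] = set()
    for name, value in pairs:
        prefix, local_name = name.split(separator, 1)
        statements.add((resource_uri, bindings[prefix] + local_name, value))
    return statements


DC = "http://purl.org/dc/elements/1.1/"
doc = "http://example.org/doc/1"  # placeholder URI for the described resource

# A toy "XML-like" encoding: names qualified with a namespace prefix.
xml_like = decode(
    doc, [("dc:title", "An example"), ("dc:creator", "A. Author")], {"dc": DC}, ":"
)
# A toy "HTML meta-like" encoding: dotted names plus a schema binding.
meta_like = decode(
    doc, [("DC.title", "An example"), ("DC.creator", "A. Author")], {"DC": DC}, "."
)
# The same toy XML-like encoding with a different prefix bound to the same URI.
other_prefix = decode(
    doc, [("x:title", "An example"), ("x:creator", "A. Author")], {"x": DC}, ":"
)

same_information = xml_like == meta_like == other_prefix
print(same_information)
```

Without an agreed abstract structure to map to, there is nothing for that final comparison to be made against, which is roughly the situation described in the list above.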
Has the publication of the DCAM solved all these problems? Well, no. Not yet anyway ;-) (And indeed in the meeting of the DCMI Architecture Working Group at DC-2006, we discussed the requirements to make some changes to the DCAM based on feedback received from various sources!) But having the DCAM as a formal specification has, I think, put us on a better footing to be able to start addressing them.
The DCAM gives us a single point of reference for talking about the nature of DC metadata. When we use a term like "element" in conversation, we can point to the DCAM as the source for what we mean by that term. Perhaps more fundamentally we have a description of what our digital formats are "saying". We have a clear description of what information structure we are seeking to represent when we are developing a format for the representation of DC metadata. And conversely, given a format which claims to be a format for representing DC metadata, we can analyse that format in terms of the DCAM and ask whether it serves the purpose of describing how to represent a description set.
We don't yet have a formal description of what a DC application profile is, but establishing that the information structure we are interested in is the description set and that a description set contains references to particular types of metadata terms provides the foundations for doing so. And conversely, given a document that claims to be a DCAP, we can ask whether it specifies how to create a DC description set, and whether the terms it refers to are terms of the type described by the DCAM. Similarly, we have a better understanding of the nature of the terms used in DC metadata, and on that basis we are in a better position to provide guidance to implementers on how to specify and describe any additional terms they may require.
The draft document Expressing DC metadata using the Resource Description Framework seeks to clarify the relationship between Dublin Core and RDF by describing how concepts defined by the DCAM correspond to concepts defined by RDF. I think the presence of the DCAM has facilitated our dialogues with other metadata communities: it enables us to explain our own approaches, but perhaps more importantly it helps us to reflect on aspects of those other communities' approaches that perhaps have not been explicit in the past. While in the short term, it may be that this highlights differences rather than similarities, that is a vital part of the process of working towards interoperability. Finally, as our paper at DC-2006, "Towards an interoperability framework for metadata standards" [Preprint], suggested, I think this process is helping us to develop a better understanding of the nature of metadata standards and the challenges of interoperability between standards.
This is helpful, thanks very much, but what would REALLY be helpful is a CONCRETE example of a problem. What is an actual historical instance where "subtle but significant differences between those different descriptions of the DC [implicit] 'abstract model'" actually caused interoperability problems? What's an actual specific historical example of "areas where understandings and perceptions differed, and in the worst cases those differences resulted in two DC metadata applications behaving in different ways"?
Because I admit, to this naive newcomer who is NOT part of the DC community---it seems like DC is 'simple enough' that everything should have 'just worked' without needing to formalize the DCAM. I would have thought---as the DC community DID think at first---that everything was more or less self-evident. But the DC community apparently learned that was NOT in fact the case---a particular specific concrete example would help me 'follow along at home', since I wasn't along for the original ride.
Another example. "Sometimes it was difficult or impossible to establish whether an instance of one such encoding represented the same information as an instance of another."---to me, this seems surprising, I wouldn't have expected that (as no doubt the DC community of five years ago wouldn't have either!). An actual example would help me. Etc.
All that said, back to the original idea that (at least for me) began this dialog: What exactly is the difference between the kind of model that DCAM is, and the kind of model that FRBR is? I must say, it's still not clear to me (but in this case, I think it's definitely because it is not in fact clear, not just me missing something!). In fact, reading your explanation here of why DCAM was needed---it is very much like arguments I have seen (and used) for why FRBR is needed. In a post on someone else's blog, they suggested that FRBR was about entities and relationships and attributes and DCAM isn't---but in fact, DCAM is also (are not 'descriptions', 'description sets' etc. entities?). But it's true that DCAM is in some senses more abstract than FRBR (although FRBR is certainly abstract too). It's confusing. They are clearly models of different things, they don't really overlap at all in their applications, they could easily both be used together, they solve entirely different application problems. FRBR and DCAM.
But they are both explicit formal models of what had previously been implicit (and therefore not always entirely consistent) shared mental models. They are both formalizations of abstract ideas of 'ontology' (what are the things we are concerned with, how do those things relate, what are the properties of those things).
Hmm.
Posted by: Jonathan Rochkind | November 28, 2006 at 07:28 PM
It also occurs to me that DCAM is _further along_ than FRBR. The specific problems that DCAM is meant to solve are problems that FRBR hasn't even encountered yet, because it's still in its infancy (even though it may have been invented BEFORE DC, it has a developmental disorder of some kind. Anyway.)
In addition to being models of different parts of the universe, the problems DCAM is meant to solve are largely ones that the FRBR community won't even run into until FRBR is actually widely implemented, which it is not yet. FRBR instead solves the kind of problems that DC itself was meant to solve even before you got to DCAM.
Hmm. I'm not sure if this makes any sense or not.
Posted by: Jonathan Rochkind | November 28, 2006 at 07:31 PM