June 19, 2009

Repositories and linked data

Last week there was a message from Steve Hitchcock on the UK jisc-repositories@jiscmail.ac.uk mailing list noting Tim Berners-Lee's comments that "giving people access to the data 'will be paradise'". In response, I made the following suggestion:

If you are going to mention TBL on this list then I guess that you really have to think about how well repositories play in a Web of linked data?

My thoughts... not very well currently!

Linked data has 4 principles:

  • Use URIs as names for things.
  • Use HTTP URIs so that people can look up those names.
  • When someone looks up a URI, provide useful information.
  • Include links to other URIs so that they can discover more things.

Of these, repositories probably do OK at 1 and 2 (though, as I’ve argued before, one might question the coolness of some of the http URIs in use and, I think, the use of cool URIs is implicit in 2).

3, at least according to TBL, really means “provide RDF” (or RDFa embedded into HTML I guess), something that I presume very few repositories do?

Given lack of 3, I guess that 4 is hard to achieve. Even if one was to ignore the lack of RDF or RDFa, the fact that content is typically served as PDF or MS formats probably means that links to other things are reasonably well hidden?

It’d be interesting (academically at least), and probably non-trivial, to think about what a linked data repository would look like? OAI-ORE is a helpful step in the right direction in this regard.
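To make principle 3 a little more concrete, here is a minimal sketch of one small piece of what a linked data repository might do: decide, from the HTTP Accept header, whether to serve RDF or the usual HTML page for an item. The function name and the list of available media types are assumptions for illustration, and the parsing is deliberately simplified (it ignores q-values and wildcards), so treat it as a sketch rather than a conformant content negotiation implementation.

```python
def choose_representation(accept_header):
    """Pick a media type for a repository item from an Accept header.

    Simplified on purpose: ignores q-values and wildcards, and honours
    the first listed media type that this (hypothetical) repository
    knows how to serve.
    """
    available = ["application/rdf+xml", "text/html"]
    for part in accept_header.split(","):
        # Drop any parameters such as ";q=0.9" and normalise case.
        media_type = part.split(";")[0].strip().lower()
        if media_type in available:
            return media_type
    # Fall back to HTML for browsers and unknown clients.
    return "text/html"


print(choose_representation("application/rdf+xml"))
print(choose_representation("text/html,application/xhtml+xml"))
```

A real deployment would more likely rely on the web server's or framework's own negotiation support than on hand-rolled parsing like this.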

In response, various people noted that there is work in this area: Mark Diggory on work at DSpace, Sally Rumsey (off-list) on the Oxford University Research Archive and parallel data repository (DataBank), and Les Carr on the new JISC dotAC Rapid Innovation project. And I'm sure there is other stuff as well.

In his response, Mark Diggory said:

So the question of "coolness" of URI tends to come in second to ease of implementation and separation of services (concerns) in a repository. Should "Coolness" really be that important? We are trying to work on this issue in DSpace 2.0 as well.

I don't get the comment about "separation of services". Coolness of URIs is about persistence. It's about our long term ability to retain the knowledge that a particular URI identifies a particular thing and to interact with the URI in order to obtain a representation of it. How coolness is implemented is not important, provided that it doesn't impact on our long term ability to meet those two aims.

Les Carr also noted the issues around a repository minting URIs "for things it has no authority over (e.g. people's identities) or no knowledge about (e.g. external authors' identities)" suggesting that the "approach of dotAC is to make the repository provide URIs for everything that we consider significant and to allow an external service to worry about mapping our URIs to "official" URIs from various "authorities"". An interesting area.

As I noted above, I think that the work on OAI-ORE is an important step in helping to bring repositories into the world of linked data. That said, there was some interesting discussion on Twitter during the recent OAI6 conference about the value of ORE's aggregation model, given that distinct domains will need to layer their own (different) domain models onto those aggregations in order to do anything useful. My personal take on this is that it probably is useful to have abstracted out the aggregation model, but that the hypothesis that primitive aggregation is useful, despite every domain needing its own richer data, is still to be tested and, indeed, that we need to see whether the way the ORE model gets applied in the field turns out to be sensible and useful.

May 22, 2009

Google Rich Snippets

As ever, I'm slow off the mark with this, but last week's big news within the Web metadata and Semantic Web communities was the announcement by Google of a feature they are calling Rich Snippets, which provides support for the parsing of structured data within HTML pages - based on a selection of microformats and on RDFa using a specified RDF vocabulary - and the surfacing of that data in Google search result sets. In the first instance, at least, only a selected set of sources are being indexed, with the hope of extending participation soon (see the discussion in the O'Reilly interview with Othar Hansson and Guha.)

A number of commentators, including Ian Davis, Tom Scott, and Jeni Tennison have pointed out that Google's support for RDFa, at least as currently described, is somewhat partial, and its reliance on a centralised Google-owned URIspace for terms is at odds with RDF's support for the distributed creation of vocabularies - and indeed in coining that Google vocabulary, Google appears to have ignored the existence of some already widely deployed vocabularies.

And of course, Yahoo was ahead of the game here with their (complete) support for RDFa in their SearchMonkey platform (which I mentioned here).

Nevertheless, it's hard to disagree with Timothy O'Brien's recognition of the huge power that Google wields in this space:

Google is certainly not the first search engine to support RDFa and Microformats, but it certainly has the most influence on the search market. With 72% of the search market, Google has the influence to make people pay attention to RDFa and Microformats.

Or, to put it another way, we may be approaching a period in which, to quote Dries Buytaert of the Drupal project, "structured data is the new search engine optimization" - with, I might add, both the pros and cons that come with that particular emphasis!

One of the challenges to an approach based on consuming structured data from the open Web is, of course, that of dealing with inaccuracies, spammers and "gamers" - see, for example, Cory Doctorow's "metacrap" piece, from back in 2001. But as Jeni Tennison notes towards the end of her post, having Google in a position where they have an interest in tackling this problem must be a good thing for the data web community more widely:

They will now have a stake in answering the difficult questions around trust, confidence, accuracy and time-sensitivity of semantic information.

Google's announcement is also one of the topics discussed in the newly released Semantic Web Gang podcast from Talis, and in that conversation - which is well worth a listen as it covers many of the issues I've mentioned here and more besides - Tom Tague from Thomson-Reuters highlights another potential outcome when he expresses optimism that the interest in embedded metadata generated by the Google initiative will also provide an impetus for the development of other tools to consume that data, such as browser plug-ins.

Thinking about activities that I have some involvement in, I think the use of RDFa in general is an area that should be on the radar of the "repositories" community in their efforts to improve access to the outputs of scholarly research.

It's also an area that I think the Dublin Core Metadata Initiative should be engaging with. Embedding metadata in HTML pages with the intent of facilitating the discovery of those pages using search engines was probably one of the primary motivating cases, at least in the early days of work on DC, though of course there has historically been little support from the global search engines for the approach, in large part because of the sort of problems identified by Doctorow. The current DCMI recommendation for doing this makes use of an HTML metadata profile (associated with a GRDDL namespace transformation). While on the one hand, RDFa is "just another syntax for RDF", it might be useful for DCMI to produce a short document illustrating the use of RDFa (and perhaps to consider the use of RDFa in its own documents). Of course, as far as the use of DCMI's own RDF vocabularies in data exposed to Google is concerned, it remains to be seen whether support for RDF vocabularies other than Google's own will be introduced. (Having said that, it's also worth noting that one of the strengths of RDFa is that the attribute-based syntax is fairly amenable to the use of multiple vocabularies in combination.)
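As an illustration of that last point about mixing vocabularies, the sketch below embeds Dublin Core and FOAF terms side by side in one fragment of RDFa and pulls out the property values. The sample markup, the example.org URI and the `PropertyCollector` class are all made up for illustration. The collector only gathers `property` attributes with their text content and expands prefixes declared via `xmlns:`, so it is nowhere near a conformant RDFa parser (it ignores subjects, `rel`, `typeof`, datatypes and so on).

```python
from html.parser import HTMLParser

# Hypothetical page fragment mixing DCMI Metadata Terms and FOAF.
SAMPLE = """
<div xmlns:dc="http://purl.org/dc/terms/"
     xmlns:foaf="http://xmlns.com/foaf/0.1/"
     about="http://example.org/eprint/1">
  <h1 property="dc:title">An example paper</h1>
  <div rel="dc:creator">
    <span typeof="foaf:Person" property="foaf:name">A. N. Author</span>
  </div>
</div>
"""


class PropertyCollector(HTMLParser):
    """Collect (property URI, literal text) pairs from RDFa-style markup."""

    def __init__(self):
        super().__init__()
        self.prefixes = {}   # prefix -> namespace URI, from xmlns:* attributes
        self.pending = None  # property attribute of the element just opened
        self.found = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        for name, value in attrs.items():
            if name.startswith("xmlns:"):
                self.prefixes[name[len("xmlns:"):]] = value
        self.pending = attrs.get("property")

    def handle_data(self, data):
        if self.pending and data.strip():
            prefix, _, local = self.pending.partition(":")
            namespace = self.prefixes.get(prefix, prefix + ":")
            self.found.append((namespace + local, data.strip()))
            self.pending = None

    def handle_endtag(self, tag):
        self.pending = None


collector = PropertyCollector()
collector.feed(SAMPLE)
for uri, value in collector.found:
    print(uri, "=", value)
```

The point is only that each attribute value carries its own prefix, so terms from different vocabularies can sit next to each other without any special arrangement.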

Finally, I think this is an area which Eduserv should be tracking carefully with regard to its relevance to the services it provides to the education sector and to the wider public sector in the UK: it seems notable that, as I mentioned a few weeks ago, some of the early deployments of RDFa have been within UK government services.

May 08, 2009

The Nature of OAI, identifiers and linked data

In a post on Nascent, Nature's blog on web technology and science, Tony Hammond writes that Nature now offer an OAI-PMH interface to articles from over 150 titles dating back to 1869.

Good stuff.

Records are available in two flavours - simple Dublin Core (as mandated by the protocol) and Prism Aggregator Message (PAM), a format that Nature also use to enhance their RSS feeds.  (Thanks to Scott Wilson and TicTocs for the Jopml listing).

Taking a quick look at their simple DC records (example) and their PAM records (example) I can't help but think that they've made a mistake in placing a doi: URI rather than an http: URI in the dc:identifier field.

Why does this matter?

Imagine you are a common-or-garden OAI aggregator.  You visit the Nature OAI-PMH interface and you request some records.  You don't understand the PAM format so you ask for simple DC.  So far, so good.  You harvest the requested records.  Wanting to present a clickable link to your end-users, you look to the dc:identifier field only to find a doi: URI:

doi:10.1038/nature01234

If you understand the doi: URI scheme you are fine because you'll know how to convert it to something useful:

http://dx.doi.org/10.1038/nature01234

But if not, you are scuppered!  You'll just have to present the doi: URI to the end-user and let them work it out for themselves :-(

Much better for Nature to put the http: URI form in dc:identifier.  That way, any software that doesn't understand DOIs can simply present the http: URI as a clickable link (just like any other URL).  Any software that does understand DOIs, and that desperately wants to work with the doi: URI form, can do the conversion for itself trivially.
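To show how small that conversion is, here is a minimal sketch that normalises any of the three forms discussed in this post to the http: URI form. The function name is hypothetical, the resolver prefix is the one used in the examples here, and the sketch assumes the DOI contains no characters that would need percent-encoding in a URI.

```python
DOI_RESOLVER = "http://dx.doi.org/"  # resolver prefix used in this post


def doi_to_http_uri(identifier):
    """Return the http: URI form of a DOI given as a bare DOI,
    a doi: URI, or an http: URI that already uses the resolver."""
    value = identifier.strip()
    if value.lower().startswith(DOI_RESOLVER):
        return value  # already in the http: form
    if value.lower().startswith("doi:"):
        value = value[len("doi:"):]
    if not value.startswith("10."):
        # DOI names start with the "10." directory indicator.
        raise ValueError("not a recognisable DOI: %r" % identifier)
    return DOI_RESOLVER + value


print(doi_to_http_uri("doi:10.1038/nature01234"))
print(doi_to_http_uri("10.1038/nature01234"))
```

Going the other way (stripping the resolver prefix to recover the doi: form) is just as short, which is the point: software that cares about DOIs loses nothing if only the http: form is published.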

Of course, Nature could simply repeat the dc:identifier field and offer both the http: URI form and the doi: URI form side-by-side.  Unfortunately, this would run counter to the W3C recommendation not to mint multiple URIs for the same resource (section 2.3.1 of the Architecture of the World Wide Web):

A URI owner SHOULD NOT associate arbitrarily different URIs with the same resource.

On balance I see no value (indeed, I see some harm) in surfacing the non-HTTP forms of DOI:

10.1038/nature01234

and

doi:10.1038/nature01234

both of which appear in the PAM record (somewhat redundantly?).

The http: URI form

http://dx.doi.org/10.1038/nature01234

is sufficient.  There is no technical reason why it should be perceived as a second-class form of the identifier (e.g. on persistence grounds).

I'm not suggesting that Nature gives up its use of DOIs - far from it.  Just that they present a single, useful and usable variant of each DOI, i.e. the http: URI form, whenever they surface them on the Web, rather than provide a mix of the three different forms currently in use.

This would be very much in line with recommended good practice for linked data:

  • Use URIs as names for things.
  • Use HTTP URIs so that people can look up those names.
  • When someone looks up a URI, provide useful information.
  • Include links to other URIs so that they can discover more things.

April 24, 2009

More RDFa in UK government

It's quite exciting to see various initiatives within UK government starting to make use of Semantic Web technologies, and particularly of RDFa. At the recent OKCon conference, I heard Jeni Tennison talk about her work on using RDFa in the London Gazette. Yesterday, Mark Birbeck published a post outlining some of his work with the Central Office of Information.

The example Mark focuses on is that of a job vacancy, where RDFa is used to provide descriptions of various related resources: the vacancy, the job for which the vacancy is available, a person to contact, and so on. Mark provides an example of a little display app built on the Yahoo SearchMonkey platform which processes this data.

As a footnote (a somewhat lengthy one, now that I've written it!), I'd just draw attention to Mark's description of developing what he calls an RDF "argot" for constructing such descriptions:

The first vocabularies -- or argots -- that I defined were for job vacancies, but in order to make the terminology usable in other situations, I broke out argots for replying to the vacancy, the specification of contact details, location information, and so on.

An argot doesn't necessarily involve the creation of new terms, and in fact most of the argots use terms from Dublin Core, FOAF and vCard. So although new terms have been created if they are needed, the main idea behind an argot is to collect together terms from various vocabularies that suit a particular purpose.

I was struck by some of the parallels between this and DCMI's descriptions of developing what it calls a "DC application profile" - with the caveat that DCMI typically talks in terms of the DCMI Abstract Model rather than directly of the RDF model. For example, the Singapore Framework notes:

In a Dublin Core Application Profile, the terms referenced are, as one would expect, terms of the type described by the DCMI Abstract Model, i.e. a DCAP describes, for some class of metadata descriptions, which properties are referenced in statements and how the use of those properties may be constrained by, for example, specifying the use of vocabulary encoding schemes and syntax encoding schemes. The DC notion of the application profile imposes no limitations on whether those properties or encoding schemes are defined and managed by DCMI or by some other agency

And in the draft Guidelines for Dublin Core Application Profiles:

the entities in the domain model -- whether Book and Author, Manifestation and Copy, or just a generic Resource -- are types of things to be described in our metadata. The next step is to choose properties for describing these things. For example, a book has a title and author, and a person has a name; title, author, and name are properties.

The next step, then, is to scan available RDF vocabularies to see whether the properties needed already exist. DCMI Metadata Terms is a good source of properties for describing intellectual resources like documents and web pages; the "Friend of a Friend" vocabulary has useful properties for describing people. If the properties one needs are not already available, it is possible to declare one's own

And indeed the Job Vacancy argot which Mark points to would, I think, probably be fairly recognisable to those familiar with the DCAP notion: compare, for example, with the case of the Scholarly Works Application Profile. The differences are that (I think) an "argot" focuses on the description of a single resource type, and I don't think it goes as far as a formal description of structural constraints in quite the same way DCMI's Description Set Profile model does.

March 20, 2009

Unlocking Audio

I spent the first couple of days this week at the British Library in London, attending the Unlocking Audio 2 conference.  I was there primarily to give an invited talk on the second day.

You might notice that I didn't have a great deal to say about audio, other than to note that what strikes me as interesting about the newer ways in which I listen to music online (specifically Blip.fm and Spotify) is that they are both highly social (almost playful) in their approach and that they are very much of the Web (as opposed to just being 'on' the Web).

What do I mean by that last phrase?  Essentially, it's about an attitude.  It's about seeing being mashed as a virtue.  It's about an expectation that your content, URLs and APIs will be picked up by other people and re-used in ways you could never have foreseen.  Or, as Charles Leadbeater put it on the first day of the conference, it's about "being an ingredient".

I went on to talk about the JISC Information Environment (which is surprisingly(?) not that far off its 10th birthday if you count from the initiation of the DNER), using it as an example of digital library thinking more generally and suggesting where I think we have parted company with the mainstream Web (in a generally "not good" way).  I noted that while digital library folks can discuss identifiers forever (if you let them!) we generally don't think a great deal about identity.  And even where we do think about it, the approach is primarily one of, "who are you and what are you allowed to access?", whereas on the social Web identity is at least as much about, "this is me, this is who I know, and this is what I have contributed". 

I think that is a very significant difference - it's a fundamentally different world-view - and it underpins one critical aspect of the difference between, say, Shibboleth and OpenID.  In digital libraries we haven't tended to focus on the social activity that needs to grow around our content and (as I've said in the past) our institutional approach to repositories is a classic example of how this causes 'social networking' issues with our solutions.

I stole a lot of the ideas for this talk, not least Lorcan Dempsey's use of concentration and diffusion.  As an aside... on the first day of the conference, Charles Leadbeater introduced a beach analogy for the 'media' industries, suggesting that in the past the beach was full of a small number of large boulders and that everything had to happen through those.  What the social Web has done is to make the beach into a place where we can all throw our pebbles.  I quite like this analogy.  My one concern is that many of us do our pebble throwing in the context of large, highly concentrated services like Flickr, YouTube, Google and so on.  There are still boulders - just different ones?  Anyway... I ended with Dave White's notions of visitors vs. residents, suggesting that in the cultural heritage sector we have traditionally focused on building services for visitors but that we need to focus more on residents from now on.  I admit that I don't quite know what this means in practice... but it certainly feels to me like the right direction of travel.

I concluded by offering my thoughts on how I would approach something like the JISC IE if I was asked to do so again now.  My gut feeling is that I would try to stay much more mainstream and focus firmly on the basics, by which I mean adopting the principles of linked data (about which there is now a TED talk by Tim Berners-Lee), cool URIs and REST and focusing much more firmly on the social aspects of the environment (OpenID, OAuth, and so on).

Prior to giving my talk I attended a session about iTunesU and how it is being implemented at the University of Oxford.  I confess a strong dislike of iTunes (and iTunesU by implication) and it worries me that so many UK universities are seeing it as an appropriate way forward.  Yes, it has a lot of concentration (and the benefits that come from that) but its diffusion capabilities are very limited (i.e. it's a very closed system), resulting in the need to build parallel Web interfaces to the same content.  That feels very messy to me.  That said, it was an interesting session with more potential for debate than time allowed.  If nothing else, the adoption of systems about which people can get religious serves to get people talking/arguing.

Overall then, I thought it was an interesting conference.  I suspect that my contribution wasn't liked by everyone there - but I hope it added usefully to the debate.  My live-blogging notes from the two days are here and here.

February 19, 2009

What is ORE for, really?

Pete has rather nicely answered the question, "What is ORE, really?".  In response, I'm tempted to ask a slightly different question, "What is ORE for, really?".  In the ORE User Guide - Primer we find a 'Motivating Example' section which lays out some hard-to-reject statements about the importance of aggregations but which doesn't give us many verbs - it doesn't tell us what it is we can expect to be able to do to those aggregations, nor why we might want to.  The previous introductory section does propose three sample uses:

Because aggregations are not well-defined on the Web, we are limited in what we can do with them, especially in terms of the services or automated processes that make the Web useful. People who wish to save or print a multiple page document must manually click through each page and invoke the appropriate browser command. Programs that transfer multiple page documents among information systems must rely on the API's of the individual system architectures and their definition of document boundaries. Search engines must use heuristics to group individual Web pages into logical documents so that search results have the proper granularity.

On the face of it these are perfectly valid functional requirements but I think the underlying point that Pete makes in his post is that ORE, on its own, doesn't meet them.  The necessary knowledge that allows one bit of software to say, "ah, these are the pages of a document and I need to print them in this order" or "these are the boundaries of a document" or "it makes sense to group these individual Web pages in this way" based on the data it gets from another bit of software is not captured by ORE.  Life is not as simple as saying "here is an aggregation" because the aggregation might not be a set of printable pages from a document, or a set of Web pages, or a coherent set of anything else for that matter and there is very little in ORE that tells you anything about the relationship(s) between the things in the aggregation or their relationship to the outside world.  And if ORE doesn't meet its own functional requirements particularly well, it is even further from the kind of functional requirements we envisaged in the work on SWAP.  Requirements like, "show me the latest freely available version of this research paper".

Now, I accept that ORE does provide a way of layering that additional information (which might be in the form of SWAP for example) over the top of the aggregation.  On that basis the pertinent questions, or so it seems to me, are "given that we probably need that extra level of information to do anything useful with the aggregation, is the information about the aggregation useful on its own?" and "does SWAP capture the right level of detail and is it realistic to expect real-world systems to handle this level of complexity?".

I think the jury is out on both.  (Note: I am certainly not arguing that SWAP is better than ORE - they are sufficiently different for that to be a pointless statement anyway and the bottom line is that I'm not completely sure that I'm convinced by either if I'm absolutely honest.)  I would say that in the world of learning objects there is quite a long history of treating things as reasonably unrefined aggregations (usually referred to as 'content packages') and that in that space the usefulness of that approach has been fairly minimal.

February 17, 2009

What is ORE, really? (A personal viewpoint)

This is another post that I've had sitting around in draft for ages, but which some recent discussion has prompted me to dig out and try to finish. Chris Keene commented on my post of some time ago about the publication of OAI ORE specs, asking for some clarification on what it is that OAI ORE provides, "what ORE is", I suppose, and I promised I'd take a stab at answering. I guess I should emphasise that this is my personal view only, but here's my attempt at a response to Chris' questions.

is it a protocol like OAI-PMH or a file standard? I read a primer (somewhat quickly) and it seems to be almost a XML file specification to be read over HTTP, which describes a resource such as a repository? is that right?

I think it's helpful - and maybe why I think it's important will become clearer by the end of this post - to distinguish between the parts of the ORE specifications which are specific to ORE and the parts which provide guidance on how to apply principles and conventions which have been defined outside of the ORE context, are not dependent on the use of the ORE-specific parts of ORE, and are more general in their application. (The distinction I'm making here doesn't quite match the separation ORE itself makes between "Specifications" and "User Guides".)

Some parts of the ORE specifications are "ORE-specific", they define or describe things that aren't defined or described elsewhere. Those things are:

  1. A simple data model for the things ORE calls a Resource Map, an Aggregation and a Proxy. This is defined by the Abstract Data Model document. Here the term "data model" is used in the sense of a "model of (some part of) the world", a "domain model", if you like - though in the ORE case, it is intended to be quite a generally applicable one.
  2. An RDF vocabulary used, in association with terms from some existing RDF vocabularies, for representing instances of that model. This is defined in human-readable form by the Vocabulary document, and in machine-processable form by the RDF/XML "namespace document" http://www.openarchives.org/ore/terms/.
  3. A variant of what I might call - following the terminology used by Alistair Miles - a "Graph Profile", a specification of some structural constraints on an RDF graph which should be met if that graph is to serve as an ORE Resource Map, a set of "triple patterns", if you like, for the triples that make up an ORE Resource Map. This is defined in Section 6 of the Abstract Data Model document.
  4. A set of conventions for representing an ORE Resource Map as an Atom Entry Document, using the Atom Syndication Format. This is defined by the ORE document Resource Map Implementation in Atom
  5. A set of conventions for disclosing and discovering ORE Resource Maps, defined by the document Resource Map Discovery. Some of these are applications of existing conventions, but as there are some ORE-specific aspects (e.g. the definition of http://www.openarchives.org/ore/html/ as an HTML profile specifying the use of "resourcemap" as an X/HTML link type), I'm including it in this list.

Those are the things I tend to focus on when I try to answer the question "What is ORE, really?"
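As a rough illustration of items 1 to 3 in that list, the sketch below writes a tiny Resource Map as plain (subject, predicate, object) tuples using the `describes` and `aggregates` terms from the ORE namespace, and checks one structural constraint: that the Resource Map describes exactly one Aggregation. The example.org URIs and function names are hypothetical, and the check covers only that single constraint, not the full set of requirements in the Abstract Data Model (which also covers metadata about the Resource Map itself).

```python
ORE = "http://www.openarchives.org/ore/terms/"

# Hypothetical URIs, for illustration only.
REM = "http://example.org/rem/1"
AGG = "http://example.org/aggregation/1"

triples = [
    (REM, ORE + "describes", AGG),
    (AGG, ORE + "aggregates", "http://example.org/paper.pdf"),
    (AGG, ORE + "aggregates", "http://example.org/data.csv"),
]


def described_aggregations(graph, rem_uri):
    """Aggregations that the given Resource Map says it describes."""
    return {o for s, p, o in graph if s == rem_uri and p == ORE + "describes"}


def aggregated_resources(graph, agg_uri):
    """Resources that the given Aggregation aggregates, sorted by URI."""
    return sorted(o for s, p, o in graph
                  if s == agg_uri and p == ORE + "aggregates")


def looks_like_resource_map(graph, rem_uri):
    """True if rem_uri describes exactly one Aggregation in this graph."""
    return len(described_aggregations(graph, rem_uri)) == 1


print(looks_like_resource_map(triples, REM))
print(aggregated_resources(triples, AGG))
```

In practice one would use an RDF library and one of the defined serialisations rather than bare tuples. The tuples are only there to show how little the core model says beyond "this describes that, and that aggregates these".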

In addition to those ORE-specific elements, the ORE specifications also provide guidelines for how to make use of various other existing specifications and conventions when deploying the ORE model:

  1. The two documents, Resource Map Implementation in RDF/XML and Resource Map Implementation in RDFa describe how to use those two existing syntaxes, defined by W3C Recommendations, to represent Resource Maps
  2. The document HTTP Implementation describes how to apply the principles and patterns defined by the W3C TAG's httpRange-14 resolution and the Cool URIs for the Semantic Web document

For the most part, these documents don't really provide new information, at least in the same way those noted above do: instead, they indicate how to apply some existing, more general specifications when making use of the ORE-specific specifications listed above.

That's not to say they aren't useful guidelines: they are, not least because they "contextualise" the general information provided by the more general specifications, and provide ORE-specific examples of their use. The ORE HTTP Implementation document selects from the patterns of the Cool URIs document and provides illustrations of their use for the URIs of Aggregations and Resource Maps.
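One of those patterns, the 303 redirect, can be sketched as follows: a GET on the URI of an Aggregation (a non-document resource) is answered with "303 See Other" pointing at a Resource Map that describes it, and the Resource Map URI itself answers with 200. The paths, the function name and the empty RDF/XML body are assumptions for illustration. This is a sketch of the general Cool URIs pattern, not a transcription of the ORE HTTP Implementation document.

```python
# Hypothetical paths, for illustration only.
AGGREGATION_PATH = "/aggregation/1"
RESOURCE_MAP_PATH = "/rem/1.rdf"

# Placeholder body: an empty RDF/XML document.
EMPTY_RDF = b'<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"/>'


def respond(path):
    """Return (status, headers, body) for a GET on the given path."""
    if path == AGGREGATION_PATH:
        # The Aggregation is not a document, so redirect to a
        # description of it rather than answering 200 directly.
        return 303, {"Location": RESOURCE_MAP_PATH}, b""
    if path == RESOURCE_MAP_PATH:
        return 200, {"Content-Type": "application/rdf+xml"}, EMPTY_RDF
    return 404, {}, b""


status, headers, _ = respond(AGGREGATION_PATH)
print(status, headers["Location"])
```

Nothing in this handler is specific to aggregations: the same two branches would serve for a person, a FRBR Work or any other resource that needs describing.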

My main point here is that I think it's important - particularly for audiences who are perhaps encountering some of these more general principles and conventions for the first time in the specific context of ORE - to "decouple" these two aspects, and to make clear that the use of these principles and conventions is not dependent on the ORE-specific parts, and they can - and indeed should - be applied in other contexts too. More on that later.

To answer Chris' specific questions above: no, ORE isn't a protocol; no, it isn't (what I think of as) a "file standard", though it describes the use of some existing formats; and while ORE does deal with the description of things, the things it deals with are what it calls "aggregations", not "repositories", at least as that term is typically used in the OAI context, to refer to a system that supports some functions. The concept of a repository doesn't feature in ORE.

And I'm not sure how it fits in with OAI-PMH does it replace, or improve, or cater for different needs (they both seem to cater for getting an item from one system to another).

I think ORE is largely orthogonal to OAI-PMH. ORE was not designed to "replace" or "improve" OAI-PMH. ORE can be used independently of OAI-PMH, or, as I think the Discovery document illustrates, it can be used in the context of OAI-PMH, i.e. you could expose ORE Resource Maps as metadata records over OAI-PMH.

Having said that, I do think the approaches underpinning ORE provide at least some hints of how the sort of functionality which is currently provided by OAI-PMH in an RPC-like idiom, where a client "harvester" sends protocol-specific requests to a "repository", might be offered using a more "resource-oriented" approach. Here, I'm not using the term "resource-oriented" to highlight a distinction between "resource" and "metadata", but rather to emphasise the notion of treating all the "items of interest" to the application as "resources" in the sense that the Web Architecture uses that term, assigning them URIs, and supporting interaction using the uniform interface defined by the HTTP protocol. And those "items of interest" can include resources which are descriptions of other resources, and resources which are collections of resources - collections based on various criteria. Anyway, it isn't my intention here to embark on specifying an alternative approach to OAI-PMH. :-)
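For readers who haven't looked at OAI-PMH closely, the "RPC-like idiom" amounts to sending a verb and its arguments as query parameters to a single base URL. The sketch below builds such a request URL. The base URL and helper name are hypothetical, while `ListRecords`, `metadataPrefix` and `oai_dc` are standard OAI-PMH names.

```python
from urllib.parse import urlencode


def oai_pmh_request(base_url, verb, **arguments):
    """Build an OAI-PMH request URL: one endpoint, with the operation
    named by the 'verb' parameter and refined by further arguments."""
    params = {"verb": verb}
    params.update(arguments)
    return base_url + "?" + urlencode(params)


# Hypothetical repository endpoint.
print(oai_pmh_request("http://example.org/oai", "ListRecords",
                      metadataPrefix="oai_dc"))
```

In a resource-oriented design the harvester would not need to know any verbs. It would simply GET the URI of, say, a collection resource and follow the links it finds there.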

Chris also asked:

And what about things like SWAP and SWORD?

Let's take the case of SWORD first, as it's the one I know less about! :-) I'm not a SWORD/Atompub expert at all but I think ORE is independent of SWORD, but designed to be usable in the context of SWORD, i.e. in principle at least, an ORE Resource Map could form the subject of a SWORD "deposit". Richard Jones ponders three variant approaches, and there is some discussion on the OAI ORE Google Group.

The case of the Scholarly Works Application Profile (SWAP) raises some issues which I think illustrate some of the points I was making above about the wider applicability of some of the conventions used within ORE.

First, I think there are differences in "scope and purpose". SWAP focuses very specifically on the "eprint" and on supporting a more or less well-defined set of operations, particularly operations related to "versioning" and the various types of relationships between resources which one encounters when dealing with those issues; ORE focuses on a rather simpler, more generic concept of "aggregation" and membership of a set. Having said that, the ORE model can also be applied to the case of the eprint, and indeed some of the examples in the specifications and in supporting presentations use examples of applying ORE to eprint resources.

Second, again as noted above, ORE makes use of some general principles and patterns for exposing resources and resource descriptions on the Web. But those principles and patterns are equally applicable in the context of data models other than ORE; what ORE calls a "Resource Map" is a specialised case of an RDF graph, and the HTTP patterns for providing access to a Resource Map are applications of patterns which can be - and are - applied to provide access to data describing resources of any type - including resources of the type described by SWAP. It isn't necessary to make use of the ORE concept of the Aggregation to use those patterns.

Now then, it is true that the SWAP documentation does not make reference to these patterns, but that is probably because of two considerations. First, at the time of its development, the primary context of use considered was that of exposing data over OAI-PMH. Second, although the httpRange-14 resolution had been agreed, it hadn't been as widely disseminated/popularised  as it has been subsequently, particularly in the form of the Linked Data tutorial and the Cool URIs document. But as I discussed in a recent post, those same principles and patterns used in ORE can be applied to the FRBR case - and if SWAP was being developed now, I'm sure reference to those approaches would be included. (Well, they would if I had any input to the process!)

Third, picking up on my attempt above to identify what I think are the "core" characteristics of ORE, ORE and SWAP are based on two different "models of the world", both of which can be applied to the case of the eprint. From the perspective of the ORE model, the eprint is viewed as an aggregation made up of a number of component/member resources; with SWAP, the perspective is that of the FRBR model - a Work realised in one or more Expressions, each embodied in one or more Manifestations, each exemplified by one or more Items (possibly with relationships between this Work and other Works, between Expressions of the same or different Works, between Works, Expressions etc and Agents, and so on).

In the FRBR case, although, as in the ORE case, there are multiple related resources involved, there isn't necessarily a notion of "aggregation" involved: a FRBR Work (or indeed any of the FRBR Group 1 entities) may be a composite/aggregate resource, but it isn't necessarily the case. There is nothing in FRBR that treats, say, the set of all the Items which exemplify the Manifestations of the Expressions of a single Work as a single aggregate entity - but FRBR does allow for the expression of whole/part relationships between instances of the various Group 1 entities.
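As a rough illustration of the FRBR perspective, the Group 1 entities can be sketched as a chain of related objects. The class and attribute names below are my own simplification (FRBR defines many more attributes and relationships), and the sample data is invented.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Item:
    label: str


@dataclass
class Manifestation:
    form: str
    items: List[Item] = field(default_factory=list)


@dataclass
class Expression:
    language: str
    manifestations: List[Manifestation] = field(default_factory=list)


@dataclass
class Work:
    title: str
    expressions: List[Expression] = field(default_factory=list)


def all_items(work: Work) -> List[Item]:
    """Walk Work -> Expression -> Manifestation -> Item.
    FRBR does not treat this list as a single aggregate entity;
    it is computed here only by following the relationships."""
    return [
        item
        for expression in work.expressions
        for manifestation in expression.manifestations
        for item in manifestation.items
    ]


# Invented sample data: one Work, one Expression, two Manifestations, three Items.
eprint = Work(
    title="An example eprint",
    expressions=[
        Expression(
            language="en",
            manifestations=[
                Manifestation("PDF", [Item("copy in repository A"), Item("copy in repository B")]),
                Manifestation("HTML", [Item("copy on journal site")]),
            ],
        )
    ],
)
```

The point of the sketch is that the set of Items only exists as the result of a traversal; nothing in the model names it as an aggregate in the way an ORE Aggregation would.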

So, I think it is important to remember that the choice to use either ORE or SWAP to model an eprint is just that: a modeling choice, one which enables certain functionality on the basis of the data created. Depending on what we want to achieve with the data, different choices may be appropriate.

So to return to Chris' question, it seems to me the core difference between ORE and SWAP is that they offer different models which can be applied to the "eprint". And here, I think I'm revisiting the point that, quite some time ago now, Andy made in terms of contrasting what he called "compound objects" and "complex objects". I must admit I didn't and don't like the term "complex object" - if I describe a set and its members, I understand that the set is the "compound object", but if I describe a document and its three authors, or a FRBR Work, its Expressions, their Manifestations, their Items, and a number of related Agents, which one of them is the "complex object"? - but the point remains a good one: many of the functions we wish to perform rely on our capacity to represent relationships other than relationships of "aggregation" or "composition".

Of course, the ORE concept of the Resource Map does allow for the expression of any other types of relationship, in addition to the required ore:aggregates relationship (and I think using ORE and FRBR together would require some careful analysis, given the nature of whole/part relationships in FRBR); but one can also construct descriptions expressing other types of relationship, and make those descriptions available using the community-agreed conventions of the Cool URIs document, without using ORE.
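A small sketch of what that looks like as data: a Resource Map is a set of RDF triples in which the ore:describes and ore:aggregates relationships sit alongside any others one cares to assert. The ORE and DC Terms namespaces are the published ones; every example.org URI, and the choice of dcterms:creator as the "other" relationship, is an assumption for illustration.

```python
ORE = "http://www.openarchives.org/ore/terms/"
DCTERMS = "http://purl.org/dc/terms/"

# Hypothetical URIs, invented for this sketch.
REM = "http://repository.example.org/rem/1234"
AGGREGATION = "http://repository.example.org/aggregation/1234"
PARTS = [
    "http://repository.example.org/1234/paper.pdf",
    "http://repository.example.org/1234/paper.html",
]
AUTHOR = "http://repository.example.org/person/42"


def resource_map_triples():
    """Return (subject, predicate, object) triples for a tiny Resource Map."""
    triples = [(REM, ORE + "describes", AGGREGATION)]
    # The relationship ORE requires: the Aggregation aggregates its members.
    triples += [(AGGREGATION, ORE + "aggregates", part) for part in PARTS]
    # Any other relationship can sit alongside it in the same graph.
    triples.append((AGGREGATION, DCTERMS + "creator", AUTHOR))
    return triples


def to_ntriples(triples):
    """Serialise URI-only triples in N-Triples syntax."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


print(to_ntriples(resource_map_triples()))
```

Dropping the two ore:aggregates triples would leave a perfectly good RDF description that could still be served using the Cool URIs patterns, which is the point being made above.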

So, that turned into another rather rambling post, and I'm not sure how much it helps, but that's my take on "what ORE is".

February 11, 2009

Repository usability - take 2

OK... following my 'rant' yesterday about repository user-interface design generally (and, I suppose, the Edinburgh Research Archive in particular), Chris Rusbridge suggested I take a similar look at an ePrints.org-based repository and pointed to a research paper by Les Carr in the University of Southampton School of Electronics and Computer Science repository by way of example.  I'm happy to do so though I'm going to try and limit myself to a 10 minute survey of the kind I did yesterday.

The paper in question was originally published in The Computer Journal (Oxford University Press) and is available from http://comjnl.oxfordjournals.org/cgi/content/abstract/50/6/703 though I don't have the necessary access rights to see the PDF that OUP make available.  (In passing, it's good to see that OUP have little or no clue about Cool URIs, resorting instead to the totally useless (in Web terms at least) DOI as text string, "doi:10.1093/comjnl/bxm067" as their means of identification :-( ).

The jump-off page for the article in the repository is at http://eprints.ecs.soton.ac.uk/14352/, a URL that, while it isn't too bad, could probably be better.  How about replacing 'eprints.ecs' by 'research', for example, to guard against changes in repository content (things other than eprints) and organisational structure (the day Computer Science becomes a separate school)?

The jump-off page itself is significantly better in usability terms than the one I looked at yesterday.  The page <title> is set correctly for a start.  Hurrah!  Further, the link to the PDF of the paper is near the top of the page and a mouse-over pop-up shows clearly what you are going to get when you follow the link.  I've heard people bemoaning the use of pop-ups like this in usability terms in the past but I have to say, in this case, I think it works quite well.  On the downside, the link text is just 'PDF' which is less informative than it should be.

Following the abstract a short list of information about the paper is presented.  Author names are linked (good) though for some reason keywords are not (bad).  I have no idea what a 'Performance indicator' is in this context, even less so the value "EZ~05~05~11".  Similarly I don't see what use the ID Code is and I don't know if Last Modified refers to the paper or the information about the paper.  On that basis, I would suggest some mouse-over help text to explain these terms to end-users like myself.

The 'Look up in Google Scholar' link fails to deliver any useful results, though I'm not sure if that is a fault on the part of Google Scholar or the repository?  In any case, a bit of Ajax that indicated how many results that link was going to return would be nice (note: I have no idea off the top of my head if it is possible to do that or not).

Each of the references towards the bottom of the page has a 'SEEK' button next to it (why uppercase?).  As with my comments yesterday, this is a button that acts like a link (from my perspective as the end-user) so it is not clear to me why it has been implemented in the way it has (though I'm guessing that it is to do with limitations in the way Paracite (the target of the link) has been implemented).  My gut feeling is that there is something unRESTful in the way this is working, though I could be wrong.  In any case, it seems to be using an HTTP POST request where an HTTP GET would be more appropriate?

There is no shortage of embedded metadata in the page, at least in terms of volume, though it is interesting that <meta name="DC.subject" ... > is provided whereas the far more useful <meta name="keywords" ... > is not.

The page also contains a large number of <link rel="alternate" ... > tags in the page header - matching the wide range of metadata formats available for manual export from the page (are end-users really interested in all this stuff?) - so many in fact, that I question how useful these could possibly be in any real-world machine-to-machine scenario.

Overall then, I think this is a pretty good HTML page in usability terms.  I don't know how far this is an "out of the box" ePrints.org installation or how much it has been customised but I suggest that it is something that other repository managers could usefully take a look at.

Usability and SEO don't centre around individual pages of course, so the kind of analysis that I've done here needs to be much broader in its reach, considering how the repository functions as a whole site and, ultimately, how the network of institutional repositories and related services (since that seems to be the architectural approach we have settled on) function in usability terms.

Once again, my fundamental point here is not about individual repositories.  My point is that I don't see the issues around "eprint repositories as a part of the Web" featuring high up the agenda of our discussions as a community (and I suggest the same is true of  learning object repositories), in part because we have allowed ourselves to get sidetracked by discussion of community-specific 'interoperability' solutions that we then tend to treat as some kind of magic bullet, rolling them out whenever someone questions one approach or another.

Even where usability and SEO are on the agenda (as appears to be the case here) it's not enough that individual repositories think about the issues, even if some or most make good decisions, because most end-users (i.e. researchers) need to work across multiple repositories (typically globally) and therefore we need the usability of the system as a whole to function correctly.  We therefore need to think about these issues as a community.

February 10, 2009

Repository usability

In his response to my previous post, Freedom, Google-juice and institutional mandates, Chris Rusbridge used one of his Ariadne articles as an illustrative example.

By way of, err... reward, I want to take a quick look (in what I'm going to broadly call 'usability' terms) at the way in which that article is handled by the Edinburgh Research Archive (ERA).  Note that I'm treating the ERA as an example here - I don't consider it to be significantly different to other institutional repositories and, on that basis, I assume that most of what I am going to say will also apply to other repository implementations.

Much of this is basic Web 101 stuff...

The original Ariadne article is at http://www.ariadne.ac.uk/issue46/rusbridge/ - an HTML document containing embedded links to related material in the References section (internally linked from the relevant passage in the text).  The version deposited into ERA is a 9 page PDF snapshot of the original article.  I assume that PDF has been used for preservation reasons, though I'm not sure.  Hypertext links in the original HTML continue to work in the PDF version.

So far, so good.  I would naturally tend to assume that the HTML version is more machine-readable than the PDF version and on that basis is 'better', though I admit that I can't provide solid evidence to back up that statement.

The repository 'jump-off' page for the article is at http://www.era.lib.ed.ac.uk/handle/1842/1476 though the page itself tells us (in a human-readable way) that we should use http://hdl.handle.net/1842/1476 for citation purposes.

So we already have 4 URLs for this article and no explicit machine-readable knowledge that they all identify the same resource.  Further, the URLs that 15 years of using a Web browser lead me to use most naturally (those of the jump-off page, the original Ariadne article or the PDF file) are not the one that the page asks me to use for citation purposes.  So, in Web usability terms, I would most naturally bookmark (e.g. using del.icio.us) the wrong URL for this article and where different scholars choose to bookmark different URLs, services like del.icio.us are unlikely to be able to tell that they are referring to the same thing (recent experience of Google Scholar notwithstanding).

OK, so now let's look more closely at the jump-off page...

Firstly, what is the page title (as contained in the HTML <title> tag)?  Something useful like "Excuse Me... Some Digital Preservation Fallacies?".  No, it's "Edinburgh Research Archive : Item 1842/1476". Nice!? Again, if I bookmark this page in del.icio.us, that is the label that is going to appear next to the URL, unless I manually edit it.

Secondly, what other metadata and/or micro-formats are embedded into this page?  All that nice rich Dublin Core metadata that is buried away inside the repository?  Nah.  Nothing.  A big fat zilch.  Not even any <meta name="keywords" ...> stuff.  I mean, come on.  The information is there on the page right in front of me... it's just not been marked up using even the most basic of HTML tags.  Most university Web site managers would get shot for turning out this kind of rubbish HTML.

Note I'm not asking for embedded Dublin Core metadata here - I'm asking for useful information to be embedded in useful (and machine-readable) ways where there are widely accepted conventions for how to do that.
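By way of illustration, this is roughly all it would take to generate the basics from data the repository already holds. The function name and the keyword values are assumptions of mine (I'm not quoting the actual ERA record); the title and author are the ones discussed above.

```python
from html import escape


def head_markup(title: str, keywords: list, author: str) -> str:
    """Build a useful <title> plus basic <meta> tags from a repository record.
    This is a sketch of the idea, not how any particular repository does it."""
    keyword_text = escape(", ".join(keywords), quote=True)
    author_text = escape(author, quote=True)
    return "\n".join([
        f"<title>{escape(title)}</title>",
        f'<meta name="keywords" content="{keyword_text}">',
        f'<meta name="author" content="{author_text}">',
    ])


# The keywords here are invented for the example.
markup = head_markup(
    title="Excuse Me... Some Digital Preservation Fallacies?",
    keywords=["digital preservation", "repositories"],
    author="Chris Rusbridge",
)
print(markup)
```

Nothing clever, in other words: the information is already in the database, it just needs to be written into the page head as well as the page body.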

So, let's look at those human-readable keywords again.  Why aren't they hyperlinked to all other entries in ERA that use those keywords (in the way that Flickr and most other systems do with tags)?  Yes, the institutional repository architectural approach means that we'd only get to see other stuff in ERA, not all that useful I'll grant you, but it would be better than nothing.

Similarly, what about linking the author's name to all other entries by that author?  Ditto with the publisher's name.  Let's encourage a bit of browsing here shall we?  This is supposed to be about resource discovery after all!

So finally, let's look at the links on the page.  There at the bottom is a link labelled 'View/Open' which takes me to the PDF file - phew, the thing I'm actually looking for!  Not the most obvious spot on the page but I got there in the end.  Unfortunately, I assume that every single other item in ERA uses that exact same link text for the PDF (or other format) files.  Link text is supposed to indicate what is being linked to - it's a kind of metadata for goodness sake.

And then, right at the bottom of the page, there's a button marked "Show full item record".  I have no idea what that is but I'll click on it anyway.  Oh, it's what other services call "Additional information".  But why use an HTML form button to hide a plain old hypertext link?  Strange or what?

OK, I apologise... I've lapsed into sarcasm for effect.  But the fact remains that repository jump-off pages are potentially some of the most important Web pages exposed by universities (this is core research business after all) yet they are nearly always some of the worst examples of HTML to be found on the academic Web.  I can draw no other conclusion than that the Web is seen as tangential in this space.

I've taken 10 minutes to look at these pages... I don't doubt that there are issues that I've missed.  Clearly, if one took time to look around at different repositories one would find examples that were both better and worse (I'm honestly not picking on ERA here... it just happened to come to hand).  But in general, this stuff is atrocious - we can and should do better.

February 05, 2009

httpRange-14, Cool URIs & FRBR

The W3C Technical Architecture Group's resolution to what had become known as "the httpRange-14 question" introduced a distinction between the subset of resources for which representations may be served using the HTTP protocol - a subset which the Architecture of the World Wide Web refers to as "information" resources - and - by implication at least - a disjoint subset of resources which may be identified using the http URI scheme but which is not "representable" -  for which no representations are provided using the HTTP protocol - though they may be described, by "information resources" identified by their own distinct URIs.

A subsequent note by Leo Sauermann and Richard Cyganiak of the W3C Semantic Web Education and Outreach (SWEO) Interest Group, Cool URIs for the Semantic Web provides an extended discussion of the issue, together with a set of "patterns" for assigning http URIs and for the appropriate responses to HTTP requests using such URIs. This document uses the terms "Web documents" and "real-world objects" to refer to the two classes of resources, noting that the latter class includes "real-world objects like people and cars, and even abstract ideas and non-existing things like a mythical unicorn".

The question raised by this division is where the boundary between the two classes lies. From the viewpoint of the consumer/user of URIs, the point is somewhat moot: they simply need to follow the information provided, in the form of responses to HTTP requests by the owner of the URI (or possibly also from metadata provided by other parties). Information about the nature of the resource can be provided both by HTTP response codes and by explicit descriptions of the resource. Following the httpRange-14 guideline, if the HTTP response to a GET is 2xx, then the resource identified by the URI is an information resource. I think it's worth emphasising the point that this is the only response code which allows the user to make a "positive" inference about resource type; if the response is 303 See Other, that in itself says nothing about the type of the resource.
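That asymmetry is easy to lose sight of, so here is a deliberately tiny sketch of the inference a consumer is entitled to make from the status code alone. The function name and return values are my own; the rule is just the one stated above.

```python
from typing import Optional


def infer_from_status(status: int) -> Optional[str]:
    """Return what the HTTP status of a GET lets us conclude about the
    resource identified by the requested URI, following httpRange-14.

    Only a 2xx response supports a positive inference. Anything else,
    including 303 See Other, tells us nothing about the resource type."""
    if 200 <= status < 300:
        return "information resource"
    return None


for code in (200, 303, 404):
    print(code, infer_from_status(code))
```

Note that 303 returns None rather than "real-world object": the resource might still be an information resource whose owner simply chose to redirect.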

The URI owner, on the other hand, needs to make the choice, for each resource, whether to provide a representation or not, based on their understanding of the nature of the resources they are exposing on the Web. The Architecture of the World Wide Web document offers the following (somewhat "slippery", to me!) criterion for a resource being an "information resource": The distinguishing characteristic of these resources is that all of their essential characteristics can be conveyed in a message.

I've been trying to think through how this set of conventions should be applied to the case of the Functional Requirements for Bibliographic Records (FRBR) and more specifically to the "FRBR Group 1 Entities", i.e. instances of the classes of Work, Expression, Manifestation and Item which FRBR uses to model the universe of resources described by bibliographic records.

The work on the development of the Scholarly Works Application Profile (SWAP) focused primarily on deployment in the context of the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). OAI-PMH provides an RPC-like layer on top of HTTP, and SWAP focuses on how to deliver descriptions of the SWAP/FRBR entities using that RPC layer, rather than the question of how those entities could be represented as Web resources.

I'm starting from the FRBR model here; I'm asking the question, "If I'm exposing on the Web a set of resources based on the FRBR model, are there any general rules for which of these resources are 'representable'?". I'm not trying to address the broader question of whether/how the distinctions made in the Web Architecture reflect, or can be defined in terms of, the FRBR model.

Taking the "easy" cases first, FRBR defines a Work as follows:

The first entity defined in the model is work: a distinct intellectual or artistic creation.

A work is an abstract entity; there is no single material object one can point to as the work. We recognize the work through individual realizations or expressions of the work, but the work itself exists only in the commonality of content between and among the various expressions of the work.

- FRBR Section 3.2.1

It seems fairly clear from this description that a FRBR Work is a "conceptual resource", like an idea. In the terms of the "Cool URIs" document, it is a "real-world object", albeit an abstract one, and not a "Web document". And on this basis, while a FRBR Work may be identified by an http URI, an HTTP server should not return a representation and a 200 status code in response to a GET for that URI, though the server may provide access (using one of the patterns provided in the Cool URIs document) to a description of the Work, a "Web document" itself identified by a distinct URI.

A similar argument can, I think, be made for the case of the FRBR Expression:

An expression is the specific intellectual or artistic form that a work takes each time it is "realized." Expression encompasses, for example, the specific words, sentences, paragraphs, etc. that result from the realization of a work in the form of a text, or the particular sounds, phrasing, etc. resulting from the realization of a musical work. The boundaries of the entity expression are defined, however, so as to exclude aspects of physical form, such as typeface and page layout, that are not integral to the intellectual or artistic realization of the work as such.

- FRBR Section 3.2.2

Again we're dealing with an "abstraction", albeit a more "specific", less "generic" one than a Work. And on this basis, like the Work, it falls into the category of "real-world objects", and again, while an Expression may be identified by an http URI and an HTTP server may provide access to a description of an Expression, it should not provide a representation of an Expression.

In considering the other two FRBR Group 1 entity types, Manifestation and Item, it is perhaps easiest to consider the application of FRBR to physical resources and to digital resources separately.

Considering the physical world first, it is perhaps helpful to consider the Item first, as it seems to me it also sheds some light on the nature of the Manifestation. The FRBR definition of Item is very much grounded in the physical:

The entity defined as item is a concrete entity. It is in many instances a single physical object (e.g., a copy of a one-volume monograph, a single audio cassette, etc.). There are instances, however, where the entity defined as item comprises more than one physical object (e.g., a monograph issued as two separately bound volumes, a recording issued on three separate compact discs, etc.).

- FRBR Section 3.2.4

These Items are the "real world objects" which traditionally libraries have been concerned with managing (acquiring, storing, maintaining, providing access to, distributing, disposing of). From the perspective of httpRange-14 and the "Cool URIs" document, then, these "real-world objects" may be described by Web documents, but they are not themselves Web documents. So a physical Item may be identified by an http URI, and an HTTP server may provide access to a description of such an Item, but it can't provide a representation of it.

Now take the case of the Manifestation:

The third entity defined in the model is manifestation: the physical embodiment of an expression of a work.

The entity defined as manifestation encompasses a wide range of materials, including manuscripts, books, periodicals, maps, posters, sound recordings, films, video recordings, CD-ROMs, multimedia kits, etc. As an entity, manifestation represents all the physical objects that bear the same characteristics, in respect to both intellectual content and physical form.

- FRBR Section 3.2.3

Again a Manifestation is dealing with physical form, but furthermore, a Manifestation is still an abstraction: its role in the FRBR model is to capture characteristics that are true of a set of individual Items which "exemplify" that Manifestation (even in the case where a unique Item is the sole exemplar of a Manifestation). Seen in this light, then, I think a Manifestation also falls into the category of things which may be described by one or more Web documents, but is not itself a Web document.

In turning to the context of the digital world, I think it's worth highlighting that although the FRBR specification contains some references to "electronic resources", the coverage of digital resources in the text is very limited, and indeed the introduction acknowledges that "the dynamic nature of entities recorded in digital formats" is one of the areas that require further analysis.

It seems relatively straightforward to transfer the concepts of Work and Expression into the digital sphere, as they are independent of the form in which content is "embodied".

The question of what constitutes a FRBR Item in the digital domain is rather more difficult to pin down, particularly since the FRBR document itself focuses exclusively on the physical in its discussion of the Item. Ingbert Floyd and Allen Renear take on this challenge in their poster, "What Exactly is an Item in the Digital World?" (ASIST, 2007)

In the physical world, they argue, the thing which carries information is the same thing for which information managers typically describe characteristics such as provenance, condition, and access restrictions - the attributes of the Item in FRBR. In the digital world, this is no longer true: information is carried by the physical state of some component of a computer system, something the authors call an instance of "patterned matter and energy" (PME) - but information managers rarely concern themselves with managing such entities and recording their attributes. Entities such as a "file", however, are the focus of management and description - but a digital "file" isn't really the "concrete entity" that FRBR calls an Item. Two approaches to the Item are possible, then: the Item-as-PME approach, which "maintains that a fundamental aspect of being an item is being a concrete physical thing", or the Item-as-"file" approach which adopts the pragmatic position that "items are the things, whatever their nature (physical, abstract, or metaphorical), which play the role in bibliographic control that FRBR assigns to items".

The question I'm posing here is, I think, a different, and narrower, one from the broader one grappled with by Floyd and Renear: if we are treating a FRBR Item as a Web resource, for the case of an exemplar of a Manifestation in digital format, is that resource an "information resource", for which a representation can be served? From the Web Architecture perspective, it seems to me that it is the case that "all of their essential characteristics can be conveyed in a message". The Scholarly Works Application Profile takes this approach: the copy of a PDF document available from an institutional repository server, or the copy of an mp3 file constituting an episode of a podcast, is the FRBR Item. These, after all, are the things which "play the role in bibliographic control that FRBR assigns to items".

A further issue here is that FRBR lists "Access Address (URL)" as an attribute of a Manifestation, rather than of an Item, and I'm not sure whether this is compatible with the SWAP approach.

The concept of Manifestation in the digital case seems the most difficult to categorise. On the one hand, as noted above, FRBR states that a Manifestation is an abstraction corresponding to a set of objects with the same characteristics of both form and content. On the other hand, it seems to me that one could argue that for Manifestations in digital form, it is true that "all of their essential characteristics can be conveyed in a message", since the notion of Manifestation encapsulates that of specific intellectual content "embodied" in a specific form. For consistency with the physical case, I guess the former would be best, but I'm not sure on this point.

So those rather lengthy musings might suggest the following (tentative, I hasten to add... I'm mostly just trying to think through my rationale here) approach to identifying and serving representations/descriptions of the FRBR entities, at least using the approach that SWAP takes to the Item:

Entity Type: HTTP Behaviour

Work: Identify using http URI. Provide description of Work. Either use a "hash URI", or respond to GET with 303 See Other and http URI of description (generic or using content negotiation).

Expression: Identify using http URI. Provide description of Expression. Either use a "hash URI", or respond to GET with 303 See Other and http URI of description (generic or using content negotiation).

Manifestation (Physical): Identify using http URI. Provide description of Manifestation. Either use a "hash URI", or respond to GET with 303 See Other and http URI of description (generic or using content negotiation).

Manifestation (Digital): Identify using http URI. Provide description of Manifestation. Either use a "hash URI", or respond to GET with 303 See Other and http URI of description (generic or using content negotiation).

Item (Physical): Identify using http URI. Provide description of Item. Either use a "hash URI", or respond to GET with 303 See Other and http URI of description (generic or using content negotiation).

Item (Digital): Identify using http URI. Provide representation of Item. (Respond to GET with 200 and representation).
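The same tentative scheme can be sketched as a small decision function. This covers only the 303 variant (the "hash URI" alternative needs no server-side logic of this kind), and the function name, the Response type and the "/description" URI suffix are all assumptions of mine for illustration.

```python
from typing import NamedTuple, Optional

ENTITY_TYPES = {"Work", "Expression", "Manifestation", "Item"}


class Response(NamedTuple):
    status: int
    location: Optional[str]  # target of the redirect, if any


def plan_response(entity_type: str, digital: bool, uri: str) -> Response:
    """Decide how to answer a GET for a FRBR entity's URI.

    Only a digital Item is treated as an information resource and served
    directly; everything else redirects to a separate description document."""
    if entity_type not in ENTITY_TYPES:
        raise ValueError(f"unknown FRBR Group 1 entity type: {entity_type}")
    if entity_type == "Item" and digital:
        return Response(200, None)
    # Hypothetical convention: the description lives at <uri>/description.
    return Response(303, uri + "/description")


print(plan_response("Work", False, "http://repository.example.org/work/1"))
print(plan_response("Item", True, "http://repository.example.org/item/7"))
```

If one took the other view of the digital Manifestation discussed above, the only change would be to let that case return 200 as well.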

One final point.... The use of HTTP content negotiation on the Web introduces a dimension which I'm not sure sits very easily within the FRBR model. Using content negotiation, I may decide to expose a single resource on the Web, using a single URI, but configure my server so that, at any point in time, depending on a range of factors (the preferences of the client, the IP address of the client, etc.) it returns different representations of that resource - representations which may vary by (amongst other things) media type (format) or language. From the FRBR perspective, such variations would, I think, result in the creation of different Manifestations (for the media type case) or even different Expressions (for the language case). In the SWAP case, I think the implication is that Item representations should not vary, at least by media type or language.
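For anyone unfamiliar with the mechanics, here is a much simplified sketch of server-side negotiation on media type. It handles q-values but ignores wildcards such as */* and other parameters that a real implementation of the Accept header would need to deal with; the function names are my own.

```python
from typing import List, Optional, Tuple


def parse_accept(header: str) -> List[Tuple[str, float]]:
    """Parse an Accept header into (media type, q-value) pairs."""
    preferences = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0]
        if not media_type:
            continue
        q = 1.0  # default quality when no q parameter is given
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        preferences.append((media_type, q))
    return preferences


def choose_representation(accept: str, available: List[str]) -> Optional[str]:
    """Pick the available media type the client prefers most, or None."""
    best, best_q = None, 0.0
    for media_type, q in parse_accept(accept):
        if media_type in available and q > best_q:
            best, best_q = media_type, q
    return best


available = ["text/html", "application/pdf"]
print(choose_representation("application/pdf;q=0.9, text/html", available))
```

One URI, two possible responses: in FRBR terms the PDF and the HTML would be different Manifestations, yet on the Web they are representations of a single resource.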

January 30, 2009

Surveying with voiD

Michael Hausenblas yesterday announced the availability of version 1.0 of the voiD specification. voiD specifies an RDF-based approach to the description of RDF datasets that have been constructed following the principles of linked data.

Although the emphasis is very much on those characteristics specific to a void:Dataset - and particularly the nature of links between datasets - this sort of approach reminded me of that taken in the area of collection-level description, an area which Andy and I both contributed to in the past, leading to work within DCMI on the development of the Dublin Core Collections Application Profile - though of course that profile is much more generally scoped than voiD.
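To give a flavour of what such a description looks like, here is a minimal sketch that writes out a few lines of Turtle. I'm assuming the voiD namespace http://rdfs.org/ns/void# and the void:Dataset and void:sparqlEndpoint terms, with dcterms:title for the label; the dataset URI, title and endpoint are invented, and a real description would say a good deal more, particularly about links to other datasets.

```python
VOID_NS = "http://rdfs.org/ns/void#"
DCTERMS_NS = "http://purl.org/dc/terms/"


def void_description(dataset_uri: str, title: str, sparql_endpoint: str) -> str:
    """Return a tiny Turtle description of a dataset.
    Assumes the title contains no characters that need escaping in Turtle."""
    return "\n".join([
        f"@prefix void: <{VOID_NS}> .",
        f"@prefix dcterms: <{DCTERMS_NS}> .",
        "",
        f"<{dataset_uri}> a void:Dataset ;",
        f'    dcterms:title "{title}" ;',
        f"    void:sparqlEndpoint <{sparql_endpoint}> .",
    ])


# Hypothetical dataset, invented for this sketch.
turtle = void_description(
    "http://data.example.org/dataset/eprints",
    "Example eprints dataset",
    "http://data.example.org/sparql",
)
print(turtle)
```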

Michael describes the problem addressed by voiD in his article in a recent issue of Nodalities:

Now, the main challenge is: how can I, as someone who wants to build an application on top of linked data, find and select appropriate linked datasets? Note that there are two basic issues here: first, finding an appropriate dataset (discovery) then selecting one - that is, you have a bunch of possible candidates, which one is the ‘best suited’.

This reminded me of the much quoted (not least by me back when I was running round doing presentations as part of UKOLN's Collection Description Focus!) metaphor used by Michael Heaney in his An Analytical Model of Collections and their Catalogues, with reference to an academic researcher approaching the "landscape" of research collections:

The scholar surveying this landscape is looking for the high points. A high point represents an area where the potential for gleaning desired information by visiting that spot (physically or by remote means) is greater than that of other areas. To continue the analogy, the scholar is concerned at the initial survey to identify areas rather than specific features – to identify rainforest rather than to retrieve an analysis of the canopy fauna of the Amazon basin.

Judging by the response on the W3C public-lod mailing list, there's considerable interest in voiD in the linked data community, and I look forward to seeing what sort of new services emerge using it.

January 22, 2009

Why can't I find a library book in my search engine?

There's a story in today's Guardian, Why you can't find a library book in your search engine, (seen online but I assume that it is also in the paper version) covering the ongoing situation around the licensing of OCLC WorldCat catalog records.  Rob Styles provides some of the background to this, OCLC, Record Usage, Copyright, Contracts and the Law, though, as he notes, he works for Talis which is one of the commercial organisations that stands to benefit from a change in OCLC's approach.

I don't want to comment in too much detail on this story since I freely admit to not having properly done my homework, but I will note that my default position on this kind of issue is that we (yes, all of us) are better off in those cases where data is able to be made available on an 'open' rather than 'proprietary' basis and I think this view of the world definitely applies in this case.

The Guardian story is somewhat simplistic, IMHO, not on the question of 'open' vs. 'closed' but on how easy it would be for such data, assuming that it was to be made openly available, to get into search engines (by which I assume the article really means Google?) in a meaningful way.  Flooding the Web with multiple copies of metadata about multiple copies of books is non-trivial to get right (just think of the issues around sensibly assigning 'http' URIs to this kind of stuff for example) such that link counting, ranking of books vs. other Web resources, and providing access to appropriate copies can be done sensibly.  There has to be some point of 'concentration' (to use Lorcan Dempsey's term) around which such things can happen - whether that is provided by Google, Amazon, Open Library, OCLC, Talis, the Library of Congress or someone else.  Too many points of concentration and you have a problem... or so it seems to me.

January 14, 2009

Resource List Management on the Semantic Web

Via a post by Ivan Herman of the W3C, I came across a W3C case study titled A Linked Open Data Resource List Management Tool for Undergraduate Students, based on work done between Talis and the University of Plymouth.

Andy and I visited Talis, well, I was going to say a few months ago, but it was probably the middle of last year, and Rob Styles, Chris Clarke & other Talisians talked to us a little bit then about this work, but at that point I don't think they had a live system to show.

This looks pretty neat stuff. It's an RDF application, based on the Talis Platform. They make use of a number of existing ontologies (SIOC, BIBO) and have designed a simple ontology for the Reading Lists themselves and also one for the organisational structure of an academic institution, the AIISO ontology - which I imagine may be of interest to other projects working in this area.

Intelligent "bookmarking" tools for adding items to lists use a variety of techniques to extract metadata from Web pages (in a similar way to the Zotero citation manager tool); the metadata is exposed as RDFa in XHTML representations of the lists, which makes it available to systems like Yahoo's SearchMonkey; other RDF formats are available via content negotiation (following the Linked Data/Cool URIs for the Semantic Web principles); and a SPARQL endpoint for the dataset is available (though I'm not sure whether this is public). The system also allows students to provide annotations, which are also stored as RDF data, but in a separate data store from the "primary" reading list data, allowing different access controls.
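
To make the content negotiation part a little more concrete, here is a minimal sketch of how a server might choose between an RDFa-in-XHTML representation and other RDF formats based on the Accept header. It is not how the Talis system works (I don't know the details of that). The media types on offer, the file names and the function names are all assumptions made for illustration, and the matching is deliberately simplified (it ignores partial wildcards such as text/*).

```python
# Hypothetical mapping from media type to a stored serialisation of one list.
REPRESENTATIONS = {
    "application/xhtml+xml": "list.xhtml",  # XHTML with embedded RDFa
    "application/rdf+xml": "list.rdf",
    "text/turtle": "list.ttl",
}
DEFAULT_TYPE = "application/xhtml+xml"


def parse_accept(header):
    """Return (media_type, q) pairs from an Accept header, best first."""
    choices = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0  # a missing q parameter means full preference
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        choices.append((media_type, q))
    # sorted() is stable, so equal q-values keep their header order.
    return sorted(choices, key=lambda c: c[1], reverse=True)


def negotiate(header):
    """Pick the media type to serve for a given Accept header."""
    for media_type, q in parse_accept(header):
        if q <= 0:
            continue  # q=0 means "not acceptable"
        if media_type in REPRESENTATIONS:
            return media_type
        if media_type == "*/*":
            return DEFAULT_TYPE
    # Nothing matched: fall back to the human-readable page.
    return DEFAULT_TYPE


print(negotiate("text/html;q=0.9, text/turtle;q=0.8"))
```

A browser that asks for HTML gets the XHTML+RDFa page, while a Semantic Web client asking for application/rdf+xml or text/turtle gets the same data in that format, all from the same URI.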

December 22, 2008

Construction--Sand

How does that song go that we used to have to sing in Sunday School - "the wise man builds his house upon the rock" or something?

In Uncool URIs, Ed Summers reports that he has been asked to close down lcsh.info. I don't know much of the detail here but I strongly suggest that the work that Ed has been doing in this area has been both ground-breaking and important in terms of showing how to transition vocabularies from the old world to the new.

In thinking about the demise of this activity I'm torn between the short-sightedness of the Library of Congress in shutting this down without having a credible alternative in place and the obvious dangers of building and sharing this kind of infrastructural service without the full institutional backing of those who have the power to pull the rug from under it.

Shame...

December 18, 2008

JISC IE and e-Research Call briefing day

I attended the briefing day for the JISC's Information Environment and e-Research Call in London on Monday and my live-blogged notes are available on eFoundations LiveWire for anyone that is interested in my take on what was said.

Quite an interesting day overall but I was slightly surprised at the lack of name badges and a printed delegate list, especially given that this event brought together people from two previously separate areas of activity. Oh well, a delegate list is promised at some point.  I also sensed a certain lack of buzz around the event - I mean there's almost £11m being made available here, yet nobody seemed that excited about it, at least in comparison with the OER meeting held as part of the CETIS conference a few weeks back.  At that meeting there seemed to be a real sense that the money being made available was going to result in a real change of mindset within the community.  I accept that this is essentially second-phase money, building on top of what has gone before, but surely it should be generating a significant sense of momentum or something... shouldn't it?

A couple of people asked me why I was attending given that Eduserv isn't entitled to bid directly for this money and now that we're more commonly associated with giving grant money away rather than bidding for it ourselves.

The short answer is that this call is in an area that is of growing interest to Eduserv, not least because of the development effort we are putting into our new data centre capability.  It's also about us becoming better engaged with the community in this area.  So... what could we offer as part of a project team? Three things really: 

  • Firstly, we'd be very interested in talking to people about sustainable hosting models for services and content in the context of this call.
  • Secondly, software development effort, particularly around integration with Web 2.0 services.
  • Thirdly, significant expertise in both Semantic Web technologies (e.g. RDF, Dublin Core and ORE) and identity standards (e.g. Shibboleth and OpenID).

If you are interested in talking any of this thru further, please get in touch.

November 14, 2008

Two drafts from DCMI

Last week DCMI announced the publication of a couple of working drafts.

One is a slightly updated version of the Interoperability Levels for Dublin Core Metadata document that I've mentioned already in some previous posts.

The second is a document co-authored by Karen Coyle and Tom Baker, titled Guidelines for Dublin Core Application Profiles.

I think the work done in the area of DCAPs over the last couple of years, particularly the "Singapore Framework" and the Description Set Profile model, is very important for DCMI, as it (albeit somewhat belatedly!) seeks to clarify what the term "DC Application Profile" really means.

The new draft is intended as a "user guide" to complement the more formal documents: it seeks to explain more fully the nature of the components which make up a DCAP, and describe what is involved in creating those components.

As I think some of the current discussion of the draft on the dc-general Jiscmail list illustrates, like other such "primer" documents, it faces the tension between, on the one hand, trying to present some occasionally subtle and complex concepts to a (relatively) broad audience, who approach it from varying perspectives and degrees of experience, and, on the other, maintaining a level of precision and consistency with some of the more "specialised" sources which it references and builds upon. And I don't envy the authors the challenge of trying to maintain that balance!

I think one of the challenges DCMI faces is the preconception that developing a DC Application Profile is, or should be, "easy". In some cases, yes, it is relatively easy, but it really depends on the sort of functionality one is trying to support with the metadata: relatively simple levels of function can be provided using relatively simple metadata based on a relatively simple DCAP - though even in that case, I'd suggest that the process of arriving at "simplicity" isn't necessarily "easy".

But the DCAP concept supports arbitrary levels of complexity, and as one seeks to provide richer functionality, the requirement for "richer" metadata - more extensive descriptions of a wider range of resources and the relationships between them - typically increases too. As many have realised "Simple Dublin Core" can only get you so far, however much you might try to bend it and stretch it. In the general case, I don't think creating a Dublin Core Application Profile is an "easy" task at all. Or at least it's no more so than, say, designing a relational database schema is: it does require some specialised skills and a grounding in some concepts which may not be familiar to all. So the audience for this document is, I think, still a fairly specialised one. And that's OK.

Which is not to say that DCMI doesn't need to explain those concepts as clearly as possible. And I think the current draft is a very good step towards doing that.

November 07, 2008

Some (more) thoughts on repositories

I attended a meeting of the JISC Repositories and Preservation Advisory Group (RPAG) in London a couple of weeks ago.  Part of my reason for attending was to respond (semi-formally) to the proposals being put forward by Rachel Heery in her update to the original Repositories Roadmap that we jointly authored back in April 2006.

It would be unfair (and inappropriate) for me to share any of the detail in my comments since the update isn't yet public (and I suppose may never be made so).  So other than saying that I think that, generally speaking, the update is a step in the right direction, what I want to do here is rehearse the points I made which are applicable to the repositories landscape as I see it more generally.  To be honest, I only had 5 minutes in which to make my comments in the meeting, so there wasn't a lot of room for detail in any case!

Broadly speaking, I think three points are worth making.  (With the exception of the first, these will come as no surprise to regular readers of this blog.)

Metrics

There may well be some disagreement about this but it seems to me that the collection of material we are trying to put into institutional repositories of scholarly research publications is a reasonably well understood and measurable corpus.  It strikes me as odd therefore that the metrics we tend to use to measure progress in this space are very general and uninformative.  Numbers of institutions with a repository for example - or numbers of papers with full text.  We set targets for ourselves like, "a high percentage of newly published UK scholarly output [will be] made available on an open access basis" (a direct quote from the original roadmap).  We don't set targets like, "80% of newly published UK peer-reviewed research papers will be made available on an open access basis" - a more useful and concrete objective.

As a result, we have little or no real way of knowing if we are actually making significant progress towards our goals.  We get a vague feel for what is happening but it is difficult to determine if we are really succeeding.

Clearly, I am ignoring learning object repositories and repositories of research data here because those areas are significantly harder, probably impossible, to measure in percentage terms.  In passing, I suggest that the issues around learning object repositories, certainly the softer issues like what motivates people to deposit, are so totally different from those around research repositories that it makes no sense to consider them in the same space anyway.

Even if the total number of published UK peer-reviewed research papers is indeed hard to determine, it seems to me that we ought to be able to reach some kind of suitable agreement about how we would estimate it for the purposes of repository metrics.  Or we could base our measurements on some agreed sub-set of all scholarly output - the peer-reviewed research papers submitted to the current RAE (or forthcoming REF) for example.

A glass half empty view of the world says that by giving ourselves concrete objectives we are setting ourselves up for failure.  Maybe... though I prefer the glass half full view that we are setting ourselves up for success.  Whatever... failure isn't really failure - it's just a convenient way of partitioning off those activities that aren't worth pursuing (for whatever reason) so that other things can be focused on more fully.  Without concrete metrics it is much harder to make those kinds of decisions.

The other issue around metrics is that if the goal is open access (which I think it is), as opposed to full repositories (which are just a means to an end) then our metrics should be couched in terms of that goal.  (Note that, for me at least, open access implies both good management and long-term preservation and that repositories are only one way of achieving that).

The bottom-line question is, "what does success in the repository space actually look like?".  My worry is that we are scared of the answers.  Perhaps the real problem here is that 'failure' isn't an option?

Executive summary: our success metrics around research publications should be based on a percentage of the newly published peer-reviewed literature (or some suitable subset thereof) being made available on an open access basis (irrespective of how that is achieved).
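
The arithmetic behind that kind of metric is trivial once the denominator is agreed. As a rough sketch only: the sample records below are invented purely for illustration, the 80% target is the example figure used above, and the function and field names are my own assumptions.

```python
def open_access_rate(papers):
    """Percentage of papers that are openly available, by whatever route."""
    if not papers:
        raise ValueError("cannot compute a rate over an empty set of papers")
    open_count = sum(1 for paper in papers if paper["open_access"])
    return 100.0 * open_count / len(papers)


# Invented sample standing in for an agreed subset of newly published,
# peer-reviewed papers (e.g. those submitted to a research assessment).
SAMPLE = [
    {"title": "Paper A", "open_access": True},
    {"title": "Paper B", "open_access": False},
    {"title": "Paper C", "open_access": True},
    {"title": "Paper D", "open_access": True},
    {"title": "Paper E", "open_access": False},
]

TARGET_PERCENT = 80.0  # the example target from the text, not an official figure

rate = open_access_rate(SAMPLE)
target_met = rate >= TARGET_PERCENT
print(f"{rate:.1f}% open access; target met: {target_met}")
```

The hard part is agreeing what goes into the list of papers in the first place, not the calculation.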

Emphasis on individuals

Across the board we are seeing a growing emphasis on the individual, on user-centricity and on personalisation (in its widest sense).  Personal Learning Environments, Personal Research Environments and the suite of 'open stack' standards around OpenID are good examples of this trend.  Yet in the repository space we still tend to focus most on institutional wants and needs.  I've characterised this in the past in terms of us needing to acknowledge and play to the real-world social networks adopted by researchers.  As long as our emphasis remains on the institution we are unlikely to bring much change to individual research practice.

Executive summary: we need to put the needs of individuals before the needs of institutions in terms of how we think about reaching open access nirvana.

Fit with the Web

I've written and spoken a lot about this in the past and don't want to simply rehash old arguments.  That said, I think three things are worth emphasising:

Concentration

Global discipline-based repositories are more successful at attracting content than institutional repositories.  I can say that with only minimal fear of contradiction because our metrics are so poor - see above :-).  This is no surprise.  It's exactly what I'd expect to see.  Successful services on the Web tend to be globally concentrated (as that term is defined by Lorcan Dempsey) because social networks tend not to follow regional or organisational boundaries any more.

Executive summary: we need to work out how to take advantage of global concentration more fully in the repository space.

Web architecture

Take three guiding documents - the Web Architecture itself, REST, and the principles of linked data.  Apply liberally to the content you have at hand - repository content in our case.  Sit back and relax. 

Executive summary: we need to treat repositories more like Web sites and less like repositories.

Resource discovery

On the Web, the discovery of textual material is based on full-text indexing and link analysis.  In repositories, it is based on metadata and pre-Web forms of citation.  One approach works, the other doesn't.  (Hint: I no longer believe in metadata as it is currently used in repositories).  Why the difference?  Because repositories of research publications are library-centric and the library world is paper-centric - oh, and there's the minor issue of a few hundred years of inertia to overcome.  That's the only explanation I can give anyway.  (And yes, since you ask... I was part of the recent movement that got us into this mess!). 

Executive summary: we need to 1) make sure that repository content is exposed to mainstream Web search engines in Web-friendly formats and 2) make academic citation more Web-friendly so that people can discover repository content using everyday tools like Google.

Simple huh?!  No, thought not...

I realise that most of what I say above has been written (by me) on previous occasions in this blog.  I also strongly suspect that variants of this blog entry will continue to appear here for some time to come.

October 23, 2008

Thoughts on DC-2008

A somewhat belated report on my time at the DC-2008 conference of the Dublin Core Metadata Initiative in Berlin a couple of weeks ago.

I travelled ahead of the conference itself in order to attend the meeting of the Usage Board held over the weekend. I've attended a few UB meetings previously as a "guest", but this was the first one I'd attended as a member. I think it was a reasonably productive meeting: thanks to Tom Baker's ever-efficient chairing, we managed to get through the agenda and make a few decisions, even if at least one of them involved passing on the issue to someone else to deal with!

As I already mentioned, I gave a tutorial presentation on the Monday, focusing mainly on the DCMI Abstract Model, with a short section on syntaxes for representing DC metadata. I was horribly nervous about it, probably more so than for any other presentation I've done, in the last few years anyway, partly because of the amount of material I was trying to cover, and partly because some of the topics have, from past experience, proved to be quite difficult to explain - but I think it went OK in the end. It was the second of four tutorials, the others given by Jane Greenberg, Mikael Nilsson and Marcia Zeng. The conference has traditionally included a set of tutorials, and I think this was the second occasion on which they were all presented in sequence on the same day, rather than one per day at the start of the day. This arrangement, with its juxtaposition of content from several different presenters, and a higher probability that the same audience will sit through them all, did bring home to me the importance of ensuring that the presentations form a coherent "whole", that we have a shared foundation, and particularly that we use terminology consistently. I think we just about managed it this time, but the eagle-eyed observer may have spotted a few points where the messages were a bit mixed or where we used different terms for the same concept.

The conference proper featured the usual (for DCMI) combination of keynotes, papers and workshop sessions. The conference theme was "Metadata for Semantic and Social Applications" and there were several papers on topics related to the Semantic Web and "Linked Data", as well as some on "tagging", though, perhaps slightly disappointingly, few that I can recall on other dimensions of metadata use in "social software". In comparison with last year's conference where the "Singapore Framework" model for DC Application Profiles came to provide something of a recurring motif, it was less clear to me that there was a "dominant" theme at DC-2008. The paper on interoperability levels was referred to a few times, but if there was a running theme, I think it was probably a renewed emphasis on the Web. The paper presentation I probably enjoyed most was a paper by Stuart Sutton & Diny Golder, where Stuart described their experience of modelling educational achievement standards and exposing data based on that model on the Web. IIRC, Stuart concluded by saying something along the lines of "Probably the most important thing we did was to clarify the things we were interested in and assign them all dereferenceable URIs" - where, in this case, the "things of interest" included not just "documents" but "statements" within those documents. Such themes were echoed in the keynote by Paul Miller, with his emphasis on Berners-Lee's "Linked Data" principles and the emergence of the "Linked Open Data" community.

And I think that emphasis is important for DCMI. The recent focus within DCMI on conceptual frameworks for DC metadata - the DCAM, the Description Set Profile model and the Singapore Framework - has, I think, been necessary, and indeed those frameworks are designed to be grounded in the Web. But it's good to be reminded that our applications are operating within the context of the Web, and, to quote from a slide by another presenter, Ed Summers, who was in turn referencing Paul Graham, we need to ensure our implementations are "aligned with the grain of the Web" - and I'd probably add, for the DCMI case, with the "grain" of other related developments going on around us, such as the work on Linked Data.

And going back to the topic of the tutorials for a moment, I think it would probably be helpful if we can find some way of establishing some of these fundamental principles in that context too.

I admit I was slightly disappointed that, in at least some of the workshop sessions, we didn't really seem to get to the point of advancing the work of the group very much. But perhaps that is inevitable: I think there has always been a tension in these sessions between a desire on the one hand to provide enough background that they are open to newcomers and serve almost as a specialised tutorial, and on the other to focus in on specific issues and try to find resolutions to specific problems or at least plan out how to do so.

Away from the formal sessions, it was good to meet some people with whom I'd previously had exchanges only in weblog comments, by email or on Twitter, as well as to catch up with old friends.

And as I kinda expected, I enjoyed Berlin a lot: for a European capital city, it felt a very relaxed and welcoming place, as well as a lively and interesting one, and I hope I'll be able to visit again.

October 21, 2008

ORE 1.0 published

I'm pleased to note that, at the end of last week, Carl Lagoze and Herbert Van de Sompel announced the publication of version 1.0 of the OAI ORE specifications. I was travelling for most of the week, and had very little time to keep up with email, so the last minute dotting of i's and crossing of t's fell to the other editors and I'm grateful for their efforts in pulling things together.

(Of course, we're already noticing various minor things which need correcting!)

I think the main changes from the previous (0.9) release are:

As it happened, I was talking about ORE in a presentation last week (more on that in a follow-up post) and I expressed the opinion then that, leaving aside for a moment the core ORE model of Aggregations and Aggregated Resources, I think one of the significant contributions of ORE may turn out to be its emphasis on what I think of as a "resource-centric" approach and (at least some of) the conventions of the Semantic Web and "Linked Data" communities. In particular, I think this is a potentially important change for the "Open Archives"/"eprint repository" community, where to a large extent - not entirely, but to a large extent - repository developments on the Web have been conditioned by the more "service-oriented" framework of the OAI-PMH protocol and an emphasis on XML and XML Schema. It's also probably fair to say that I don't think the ORE project really started from this perspective, but rather things evolved and shifted - perhaps not always in a straight line! - in this direction as the work proceeded.

The ORE model itself is quite general in nature, and, as Herbert acknowledges in a presentation here (a nice set of slides which provides a good overview in itself, I think), it's not easy to predict how ORE might be applied: a number of experimental/test applications are noted in that presentation, but many others are possible. For my own part, I'm particularly interested in seeing how/whether ORE can be used in association with other models, like FRBR.

September 24, 2008

Tutorial at DC-2008

The slides from the tutorial I gave at the DC-2008 conference on Monday are now available on Slideshare and embedded below.

I think it went OK. Although I've done presentations like this a few times now, I still don't feel I've quite found the ideal way of presenting the material, and however hard I try to build things up gradually, I always hit a point where I introduce a lot of detail over the course of four or five slides.

September 17, 2008

DC-2008

Next week I'll be attending the DC-2008 conference of the Dublin Core Metadata Initiative in Berlin.

I'll be attending the meeting of the Usage Board over the weekend (cue my usual grumbles about weekend meetings: if I'm not paying full attention on Saturday afternoon, it's because I'm following the football scores on the BBC site, OK?); I'm giving a tutorial on Monday (I'm pretty nervous about that, but at least it'll be over with early in the week); and I think that's about it for things with my name on them, but I'll no doubt be chipping in in various working group meetings after that. It's difficult to predict the "burning issues", but I get the feeling that the (perennial?) tensions between quite "informal" approaches and approaches based on RDF or the DCMI Abstract Model will feature, and I'd like to think that the draft note on "levels of interoperability" that I mentioned a while ago - and that is still very much a work in progress, I hasten to add - will help shed some light on the underlying questions here.

I'm quite excited about visiting Berlin. I haven't been before, and it's a city I've wanted to see for ages. Like Manhattan or Paris, it's one of those places I half feel I know already from having seen it in films, but I know the reality of seeing a place for the first time is always quite different. I love German beer, and some of my favourite music in recent years seems to have come from Berlin. I had initially planned to travel overland, but the revisions to the Eurostar timetable following the recent fire have kinda scuppered that, so I caved in and got a flight this week :-( (But I'm still getting the train home!)

September 03, 2008

Proposed XML format for DC description sets

DCMI announced yesterday the availability for public comment of the document Expressing Dublin Core description sets using XML (DC-DS-XML). (Disclaimer: Andy and I are co-authors of the document!) This document, a DCMI Proposed Recommendation, describes an XML format called "DC-DS-XML" which supports the serialization of a "description set", the information structure defined by the DCMI Abstract Model. A note providing some further information on the background to the development of the specification, and its relationship to other specifications, was also published.

It's important to note that, in the terms of the document Interoperability levels for Dublin Core metadata which I mentioned a while ago, the DC-DS-XML format is intended to support what that document calls "level 3 interoperability", based on the creation/exchange of records structured as DC description sets. The DC-DS-XML format explicitly addresses a fairly minimal set of requirements, and does not seek to address the additional requirements of "level 4" in that document; in particular it does not concern itself with the implementation of the sorts of structural constraints which might be expressed in a Description Set Profile.

Also, the aim is not to promote DC-DS-XML as "the one and only" XML format for Dublin Core metadata - or even "the one and only" DCMI-owned XML format for Dublin Core metadata. The DCMI Architecture Forum continues to gather requirements for other formats, particularly requirements arising from the use of Description Set Profiles - i.e. from "level 4" in the Interoperability Levels document. The relationship between the checking of the structural constraints specified by a DSP and validation using XML schema technologies of various hues will be a factor to consider here. This is likely to be one of the topics for discussion at the f2f meeting of the DCMI Architecture Forum, to be held on Thursday 25 September at the DC-2008 conference in Berlin in a couple of weeks.

The DC-DS-XML format is based on a "TRiX-like" approach, by which I mean that it makes the structure of the description set explicit in the syntax in a similar fashion to the way the TRiX XML format makes explicit the structure of the RDF graph. Just as TRiX uses XML element names and XML attribute names corresponding to the names of the components of the RDF graph (<graph>, <triple>, <typedLiteral datatype="..."> etc), so DC-DS-XML uses XML element names and XML attribute names corresponding to the names of the components of the description set (<descriptionSet>, <description>, <statement>, <valueString sesURI="..."> etc). In DC-DS-XML, the various URIs in the description set model are represented as XML attribute values, and literals are represented as XML element content.
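
To give a feel for the shape of that approach, the sketch below builds a small XML tree in which URIs appear as attribute values and literals as element content. It is emphatically not the normative DC-DS-XML syntax: the element names (descriptionSet, description, statement, valueString) and the sesURI attribute are taken from the paragraph above, but the resourceURI and propertyURI attribute names, the omission of any XML namespace, the example.org URI and the build_description_set function are all assumptions made for illustration. Consult the specification itself for the real thing.

```python
import xml.etree.ElementTree as ET


def build_description_set(resource_uri, statements):
    """Build a one-description tree; statements are (property, text, ses) tuples."""
    root = ET.Element("descriptionSet")
    # URIs go into attribute values (attribute names here are assumed).
    description = ET.SubElement(root, "description", {"resourceURI": resource_uri})
    for property_uri, text, ses_uri in statements:
        statement = ET.SubElement(
            description, "statement", {"propertyURI": property_uri}
        )
        attributes = {"sesURI": ses_uri} if ses_uri else {}
        value_string = ET.SubElement(statement, "valueString", attributes)
        # Literals go into element content.
        value_string.text = text
    return root


root = build_description_set(
    "http://example.org/doc/1",  # hypothetical described resource
    [
        ("http://purl.org/dc/terms/title", "An example document", None),
        (
            "http://purl.org/dc/terms/date",
            "2008-09-03",
            "http://purl.org/dc/terms/W3CDTF",
        ),
    ],
)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

The point to notice is simply that the nesting of the XML mirrors the nesting of the description set model: a set contains descriptions, a description contains statements, and a statement carries its value.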

A GRDDL Namespace Transformation is provided, in the form of an XSLT stylesheet, following the mapping from a description set to an RDF graph described by the DCMI Recommendation Expressing DC metadata using RDF. This means that any instance of the DC-DS-XML format can be translated into an RDF/XML document, and a GRDDL-aware application can automatically extract an RDF graph corresponding to the description set encoded in a DC-DS-XML instance.

A W3C XML Schema for the DC-DS-XML format is provided. A (rather more drafty!) RELAX NG schema is also available.

Comments on the new document are welcome, and should be sent to the DC-Architecture Jiscmail mailing list.

August 18, 2008

The importance of non-literals to linked data

I'm less involved in Dublin Core metadata discussions than I used to be but a brief exchange on one of the DC lists caught my eye and reminded me how confusing the concepts underpinning metadata on the Web can be.  The exchange started with an apparently simple question:

I'm currently involved in the selection of standard fields for a metadata project and we have some fields that we are calling Dublin Core fields (Subject and Relation fields), but we are including free text or uncontrolled terms. I notice that the DC Subject and Relation fields are "intended to be used with a non-literal value." I'm not sure what this means. Is there anyone that can explain in simple terms? I've looked at the DCMI Abstract Model and I'm still not sure what they mean by "non-literal" value. Also, can you say you are using Qualified Dublin Core for some fields and Simple Dublin Core for other fields in an  application profile?

To which the initial response included:

I suspect the members of the small coterie that could explain this are all on vacation at this time. I am not in that group, but I will attempt an explanation anyway ;-)

...

Leaving aside the DCAM (which is often puzzling), it seems to me that you need a way to indicate 1) whether or not the values in the subject field are controlled and 2) if they are controlled, what list they come from.

...

Well, OK... point taken!  Let me attempt a response... though I note that, in support of the comments above, it will probably be neither simple nor clear!

I'll start by quoting the DCMI Abstract Model:

The abstract model of the resources described by descriptions is as follows:

  • Each described resource is described using one or more property-value pairs.
  • Each property-value pair is made up of one property and one value.
  • Each value is a resource - the physical, digital or conceptual entity or literal that is associated with a property when a property-value pair is used to describe a resource. Therefore, each value is either a literal value or a non-literal value:
    • A literal value is a value which is a literal.
    • A non-literal value is a value which is a physical, digital or conceptual entity.
  • A literal is an entity which uses a Unicode string as a lexical form, together with an optional language tag or datatype, to denote a resource (i.e. "literal" as defined by RDF).
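
For those who find data structures easier to read than prose, the part of the model quoted above can be sketched roughly as follows. This is only an illustration under my own assumptions: the class and field names are mine rather than the DCAM's, the example URIs under example.org are hypothetical, and I've read "optional language tag or datatype" as meaning that a literal has at most one of the two.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass(frozen=True)
class Literal:
    """A Unicode lexical form with an optional language tag or datatype."""
    lexical_form: str
    language: Optional[str] = None
    datatype: Optional[str] = None

    def __post_init__(self):
        if self.language and self.datatype:
            raise ValueError("a literal has a language tag or a datatype, not both")


@dataclass(frozen=True)
class NonLiteral:
    """A physical, digital or conceptual entity, possibly named by a URI."""
    uri: Optional[str] = None


Value = Union[Literal, NonLiteral]


@dataclass(frozen=True)
class PropertyValuePair:
    """Exactly one property and one value."""
    property_uri: str
    value: Value


@dataclass
class DescribedResource:
    """A resource described using one or more property-value pairs."""
    uri: Optional[str]
    pairs: List[PropertyValuePair] = field(default_factory=list)


def is_literal_value(value):
    return isinstance(value, Literal)


doc = DescribedResource(
    uri="http://example.org/doc/1",
    pairs=[
        PropertyValuePair(
            "http://purl.org/dc/terms/title", Literal("Relativity", language="en")
        ),
        PropertyValuePair(
            "http://purl.org/dc/terms/subject",
            NonLiteral("http://example.org/concept/physics"),
        ),
    ],
)
literal_flags = [is_literal_value(pair.value) for pair in doc.pairs]
print(literal_flags)
```

Every value is one or the other: the title here is a literal value, while the subject is a non-literal value that happens to have a URI.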

Resource Description Framework (RDF): Concepts and Abstract Syntax (the foundation stone of the Semantic Web), says this about literals:

Literals are used to identify values such as numbers and dates by means of a lexical representation. Anything represented by a literal could also be represented by a URI, but it is often more convenient or intuitive to use literals.

So, in Dublin Core metadata, each value is either "a literal" (a literal value) or "a physical, digital or conceptual entity" (a non-literal value) and the choice of which to use is largely one of convenience.

In the case of dc:subject, the value is the "topic" of the resource being described.  The topic might be a concept ("physics"), a place ("Bath, UK") or a person ("Albert Einstein") or something else.  While that topic could be treated as a literal value (and superficially at least, doing so may appear to be more convenient) good practice suggests that it is better if it is treated as a non-literal value, i.e. as a physical, digital or conceptual entity.  Why?  Because if the topic is treated as a non-literal value then it can be assigned a URI and can become the subject of other descriptions.  If the topic is treated as a literal value then it becomes a descriptive cul-de-sac - no further description of the topic is possible.

A literal may be the object of an RDF statement, but not the subject or the predicate.

In short, by treating values as non-literal resources and assigning URIs to them we give ourselves (and others) the hooks on which to hang further descriptions.  This is a very fundamental part of the way in which the Semantic Web (and indeed the Web) works.

Unfortunately, in my experience at least, people find it difficult to grasp the importance of this point, particularly if they come to Dublin Core from the "old world" of library cataloguing, attribute/value pairs and text-string values.  For them, values have always been strings of text written onto cards, or the electronic equivalent of cards.  Doing things that way has always been good enough.  Why should things have to change?  Answer: the Web changes everything - even library cataloguing... eventually!

By way of an example...  let's consider the case of a book, Shakespeare: The World as a Stage by Bill Bryson which I happened to read on my summer holiday a couple of weeks ago.  The dc:subject of this book is William Shakespeare, a person.

Now, as I indicated above, we could treat this as a literal value, the string "William Shakespeare" for example (though a string taken from a well recognised name-authority file would be better).  But in doing so we don't provide any hooks that allow other people to say, "hang on, I know something useful about William Shakespeare and I'm going to provide a description of him so that it can be automatically integrated into your metadata if you want".  By treating William Shakespeare as a resource (as a non-literal value) and by assigning him a URI, we give people that hook.  We give them an unambiguous way of saying, "here is a description of the person that you are saying is the subject of that book by Bill Bryson".  Indeed, it allows us to go further...  it allows us to say things like, "that person who you say is the dc:subject of that book by Bill Bryson is also the dc:creator of these plays".  We can build a massive global graph of data about stuff, all linked together unambiguously thru their 'http' URIs in much the same way that Web pages are currently linked together with their 'http' URIs.  This is known as linked data.
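To make the difference concrete, here is a minimal Turtle sketch of the two approaches. All of the ex: URIs are made up for illustration (nobody has minted them), and the use of FOAF to describe the person is my assumption rather than anything required by Dublin Core:

```turtle
@prefix dc:      <http://purl.org/dc/elements/1.1/> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix foaf:    <http://xmlns.com/foaf/0.1/> .
@prefix ex:      <http://example.org/examples/> .

# Approach 1: the subject as a literal value.
# The string is a dead end: a literal cannot be the subject of
# another statement, so nothing more can be said about it.
ex:bryson-book
  dcterms:title "Shakespeare: The World as a Stage" ;
  dc:subject "William Shakespeare" .

# Approach 2: the subject as a non-literal value with a
# (hypothetical) URI of its own.
ex:bryson-book
  dcterms:subject ex:william-shakespeare .

# Because the person now has a URI, anyone can describe him further...
ex:william-shakespeare
  a foaf:Person ;
  foaf:name "William Shakespeare" .

# ...and link other descriptions to the same resource.
ex:hamlet
  dcterms:title "Hamlet" ;
  dcterms:creator ex:william-shakespeare .
```

The last two descriptions could be published by completely different people on different servers. It is the shared URI that ties them together.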

Now, of course, that leaves us with a very fundamental problem.  Who the hell is going to mint 'http' URIs for people like William Shakespeare, for concepts like physics and for places like Bath, UK?

This is not an easy question to answer and there are arguments that we should all just go ahead and start doing so, leaving it to someone else to say, "hang on, this is actually the same as that".  But I would argue that libraries and related organisations are well placed to mint persistent and reusable 'http' URIs for much of this stuff (and indeed some of them are now beginning to do so) provided that they drag themselves out of the old world of text strings on cards and into the new world of the Web and the URI.  Just look at the list of example data sets on the Wikipedia page for linked data - can you spot the missing contributing organisations? :-(

August 12, 2008

Metadata and microformats

<shamelessPlug>I happened to notice earlier on (OK, I admit it... I was checking at the time) that my Does metadata matter? slidecast has been featured on the Slideshare homepage.</shamelessPlug>  In doing so, I also spotted a similarly featured presentation called What Brian Cant Never Taught You About Metadata by Drew McLellan - such a great title that I couldn't possibly ignore it!

Turns out to be quite good, though much of it is about microformats (which isn't hinted at by the title).

Don't know what microformats are?  Try watching Drew's 5 minute Clangers' Guide to Microformats.

August 05, 2008

ORE and Atom

At the end of last week, Herbert Van de Sompel posted an important proposal to the OAI ORE Google Group, suggesting significant changes in the way ORE Resource Maps are represented using Atom.

The proposal has two key components:

  • To express an ORE Aggregation at the level of an Atom Entry, rather than (as in the current draft) at the level of an Atom Feed
  • To convey ORE-specific relationship types using add-ons/extensions, rather than by making ORE-specific interpretations of pre-existing Atom relationship types

There are some details still to be worked out, particularly on the second point, and especially given that it is a significant change at quite a late stage in the development of the specifications, the project is looking for feedback on the proposal.

If possible, please respond to the OAI ORE Google Group, rather than by commenting here :-)

DC-HTML profile becomes DCMI Recommendation

DCMI announced today that the document Expressing Dublin Core metadata using HTML/XHTML meta and link elements has become a DCMI Recommendation. This document is an HTML meta data profile, in the sense that term is used in the HTML specification.

The new document updates DCMI's previous Recommendation for encoding DC metadata in HTML to provide explicitly:

This latter component is particularly exciting as it means that metadata encoded in XHTML headers using the conventions of this profile is automatically mapped to an RDF graph and becomes available to GRDDL-aware RDF applications, without any additional effort on the part of the document creator.

It's taken rather longer than I'd hoped to get this piece of the jigsaw into place, and it's a small piece in the scheme of things; but I do think it is an important one in (finally!) making the longest-established (I think?) convention for representing Dublin Core metadata a mechanism for contributing data to the Semantic Web. Anyway, I'm very pleased that it is finally out there.

It's probably worth highlighting the critical role of the HTML meta data profile feature in providing this functionality. The profile URI is the "hook" on which it hangs, if you like: for the data provider, the use of the URI of the specific profile in the value of the head/@profile attribute discloses the conventions they have used (e.g. the "schema.ABC" convention for abbreviating URIs, which is a profile-specific convention, not part of X/HTML); and for the consumer, it is the presence of that URI which licenses them to process/interpret the document in the specific way described by the profile.

August 04, 2008

Two new Linked Data sources

There were a couple of interesting announcements in the world of linked data last week.

The first came from Matthew Shorter of the BBC with the announcement of the public beta of their music artists pages. The site pulls in data from MusicBrainz, a community database of music metadata, and from Wikipedia (biographical information), and integrates it with the BBC's own data on when tracks by the artist have been played on BBC programmes. Using HTTP content negotiation you can get an RDF representation of the data. At the moment this doesn't include the play counts, but a message to the W3C public-lod mailing list indicates that the addition of this information is in the pipeline.

Hot on the heels of that came a second announcement on that same list, from Oktie Hassanzadeh & Mariano Consens of the University of Toronto, pointing to their Linked Movie Database project, which draws data from Freebase, Wikipedia and Geonames and provides (automatically generated, using a tool called ODDLinker) links to entries in several other datasets.

Now if someone extends the Southampton Pubs approach to Bristol and Bath, and then triplifies SoccerBase.com, we'll be well on the way to meeting many of my extra-curricular data needs :-)

July 18, 2008

Does metadata matter?

This is a 30 minute slidecast (using 130 slides), based on a seminar I gave to Eduserv staff yesterday lunchtime.  It tries to cover a broad sweep of history from library cataloguing, thru the Dublin Core, Web search engines, IEEE LOM, the Semantic Web, arXiv, institutional repositories and more.

It's not comprehensive - so it will probably be easy to pick holes in if you so choose - but how could it be in 30 minutes?!

The focus is ultimately on why Eduserv should be interested in 'metadata' (and surrounding areas), to a certain extent trying to justify why the Foundation continues to have a significant interest in this area.  To be honest, it's probably weakest in its conclusions about whether, or why, Eduserv should retain that interest in the context of the charitable services that we might offer to the higher education community.

Nonetheless, I hope it is of interest (and value) to people.  I'd be interested to know what you think.

As an aside, I found that the Slideshare slidecast editing facility was mostly pretty good (this is the first time I've used it), but that it seemed to struggle a little with the very large number of slides and the quickness of some of the transitions.

Vapour Linked Data Validator

Spotted via an announcement to the W3C Linked Open Data mailing list, Vapour is a validation/checking service for "linked data", produced by the research team at the CTIC Foundation (Center for the Development of Information and Communication Technologies)  in Asturias, Spain. Given the URI of a resource it tests whether interactions follow the conventions recommended by Tim Berners-Lee's Linked Data principles, the Best Practice Recipes for Publishing RDF Vocabularies from the W3C Semantic Web Deployment Working Group, and Cool URIs for the Semantic Web from the W3C Semantic Web Education and Outreach (SWEO) Interest Group. The tool is also available as open source software.

Vapour bundles together nicely a set of functions which until now probably required using a few different tools and then applying a certain degree of manual sifting and interpretation; the human-readable reports produced are very clear and nicely designed (e.g. the HTTP interactions are represented in graphics similar in style to those used in the Best Practice Recipes... document). A neat, useful package.

June 17, 2008

Google Tech Talk on RDFa

A nice overview of RDFa and its potential applications, mostly here looking at Javascript client-side stuff, by Mark Birbeck, one of the co-editors of the spec, given as a Google Tech Talk. Probably best to look at the high quality version on YouTube to see the code examples, which according to Mark's comment there are also available from the lib-xh Google Code repository.

June 16, 2008

Web 2.0 and repositories - have we got our repository architecture right?

For the record... this is the presentation I gave at the Talis Xiphos meeting last week, though to be honest, with around 1000 Slideshare views in the first couple of days (presumably thanks to a blog entry by Lorcan Dempsey and it being 'featured' by the Slideshare team) I guess that most people who want to see it will have done so already:

Some of my more recent presentations have followed the trend towards a more "picture-rich, text-poor" style of presentation slides.  For this presentation, I went back towards a more text-centric approach - largely because that makes the presentation much more useful to those people who only get to 'see' it on Slideshare and it leads to a more useful slideshow transcript (as generated automatically by Slideshare).

As always, I had good intentions around turning it into a slidecast but it hasn't happened yet, and may never happen to be honest.  If it does, you'll be the first to know ;-) ...

After I'd finished the talk on the day there was some time for Q&A.  Carsten Ulrich (one of the other speakers) asked the opening question, saying something along the lines of, "Thanks for the presentation - I didn't understand a word you were saying until slide 11".  Well, it got a good laugh :-).  But the point was a serious one... Carsten admitted that he had never really understood the point of services like arXiv until I said it was about "making content available on the Web".

OK, it's a sample of one... but this endorses the point I was making in the early part of the talk - that the language we use around repositories simply does not make sense to ordinary people and that we need to try harder to speak their language.

June 04, 2008

FRBR & "Time-Based Media", Part 4: Alternate Forms & Supplementary Materials

Another post continuing my ruminations on the use of the FRBR model for Time-Based Media. Here I'll examine three different cases which I loosely group together as dealing with "alternate forms" or supplementary resources (for/to the original video).

First, suppose I create a short "summary" or "trailer" video for my full tutorial. This might be something very short which acts as a "moving image thumbnail" e.g. for display to users browsing result sets, or it might be a more extended summary/overview. In both cases, the characteristic that distinguishes this case from my "clip" in the earlier example is that it is not a simple "part" of the whole video: rather, the content may be drawn from various parts of the whole, and it may include additional content not available in the tutorial itself. To keep things simple, let's assume I make my summary video available in only a single format from a single source.

As in the case of the clip, my summary video contains significantly different content from the original video, so from the FRBR viewpoint, we are dealing with a new Work (W21), realized in a single Expression (E21), embodied in a single Manifestation (M21), exemplified in a single Item. The relationships between these entities and the corresponding FRBR Group 1 Entities for the whole video reflect those for the case of the clip/segment (see Figure 2 in the clip/segment case), with the distinction that in this case the Work-Work and Expression-Expression relationships are of type hasSummary/isSummaryOf (rather than hasPart/isPartOf). Again for simplicity, I'm omitting the Items in the diagrams here:

Figure 1
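As a rough Turtle sketch of those relationships, using Ian Davis' and Richard Newman's FRBR RDFS vocabulary: I'm assuming here that its frbr:summarization/frbr:summarizationOf properties are the right match for hasSummary/isSummaryOf, and that the whole video is identified as ex:W01/ex:E01 as in the Part 3 post. The title of the summary is made up.

```turtle
@prefix frbr:    <http://purl.org/vocab/frbr/core#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix ex:      <http://example.org/examples/> .

# The whole video (identifiers assumed from the Part 3 example)
ex:W01
  a frbr:Work ;
  frbr:summarization ex:W21 ;
  frbr:realization ex:E01 .

ex:E01
  a frbr:Expression ;
  frbr:realizationOf ex:W01 ;
  frbr:summarization ex:E21 .

# The summary/trailer: a new Work, Expression and Manifestation
ex:W21
  a frbr:Work ;
  dcterms:title "Second Life Tutorial Summary" ;
  frbr:summarizationOf ex:W01 ;
  frbr:realization ex:E21 .

ex:E21
  a frbr:Expression ;
  frbr:realizationOf ex:W21 ;
  frbr:summarizationOf ex:E01 ;
  frbr:embodiment ex:M21 .

ex:M21
  a frbr:Manifestation ;
  frbr:embodimentOf ex:E21 .
```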

Second, consider the case of an audio-only version of my tutorial. This isn't simply a copy of the video soundtrack, but rather a version created specifically for audio, so it perhaps contains additional commentary not present in the video soundtrack, and omits some other content which relies heavily on the visual representation. Again, let's assume this is available in a single format from a single source. As in the first example, we have a new Work (W22), realized in a single Expression (E22), embodied in a single Manifestation (M22), exemplified in a single Item. So the relationships with the "original" entities form a similar "pattern" to the first case, but here the Work-Work and Expression-Expression relationships are of type hasAdaptation/isAdaptationOf:

Figure 2
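In Turtle, only the relationship triples differ from the summary case. This sketch assumes that frbr:adaption/frbr:adaptionOf (that is how the FRBR RDFS vocabulary spells them, as far as I recall) correspond to hasAdaptation/isAdaptationOf, and again borrows ex:W01/ex:E01 for the whole video from the Part 3 post:

```turtle
@prefix frbr: <http://purl.org/vocab/frbr/core#> .
@prefix ex:   <http://example.org/examples/> .

# Work-Work relationship
ex:W01 frbr:adaption ex:W22 .
ex:W22
  a frbr:Work ;
  frbr:adaptionOf ex:W01 ;
  frbr:realization ex:E22 .

# Expression-Expression relationship
ex:E01 frbr:adaption ex:E22 .
ex:E22
  a frbr:Expression ;
  frbr:realizationOf ex:W22 ;
  frbr:adaptionOf ex:E01 ;
  frbr:embodiment ex:M22 .
```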

Third, I may provide text transcripts for both of these cases, i.e. a transcript for each of the video and audio tutorials. For the transcript of the video, again, following FRBR and particularly Martha Yee's point that its visual nature is a key characteristic of the moving image work, the change from moving image to text represents the creation of a new Work (and Expression, Manifestation and Item):

Figure 3

For the transcript of the audio tutorial, my initial inclination was to mirror the video case and treat the transcript as a new Work:

Figure 4

But, OTOH, FRBR does provide some examples where musical scores and musical performances are treated as multiple Expressions of a single Work. Also I notice that in the Harry Potter and the Goblet of Fire example analysed by William Denton and also by Ian Davis, the text editions and the audiobook editions are modelled as Expressions of a single Work. I think the different treatment comes down to the fact that in the case of the moving image, the presence of the visual aspect means that there is a significant difference in the "intellectual content" of the moving image as compared to the text transcript, and so they are considered as two distinct Works, but in the audio case, the difference in content is much less significant: the audio tutorial is simply a "reading of" the text of the transcript. I'm not sure what (if any) relationship should exist between the two Expressions (because hasAdaptation/isAdaptationOf applies to Expressions of different Works):

Figure 5
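A Turtle sketch of this second reading, with the audio tutorial and its transcript as two Expressions of the one Work. The identifiers ex:E23 and ex:M23 for the transcript are hypothetical, and I've deliberately left out any Expression-Expression relationship since I'm not sure which one, if any, applies:

```turtle
@prefix frbr: <http://purl.org/vocab/frbr/core#> .
@prefix ex:   <http://example.org/examples/> .

# One Work, realized in two Expressions
ex:W22
  a frbr:Work ;
  frbr:realization ex:E22 ;   # the audio tutorial
  frbr:realization ex:E23 .   # the text transcript (hypothetical identifier)

ex:E22
  a frbr:Expression ;
  frbr:realizationOf ex:W22 ;
  frbr:embodiment ex:M22 .

ex:E23
  a frbr:Expression ;
  frbr:realizationOf ex:W22 ;
  frbr:embodiment ex:M23 .
```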

Hmmm. I think that is consistent with the examples I see elsewhere. Even so, the lack of "symmetry" between the moving image/transcript and audio/transcript cases does leave me a little uneasy.

ORE Implementer Community Wiki

A very quick addendum to my post yesterday about the Beta ORE specs: the ORE Technical Committee has set up a "community wiki" for the use of implementers examining and using these specs, to build up a collection of notes of experiences, reflections, "best practice", etc, and generally for sharing any other useful supplementary materials. The structure and content is fairly skeletal at the moment, but will be expanded over the coming days.

Thanks to Rob Sanderson of the University of Liverpool for setting this up.

June 03, 2008

Beta Release of ORE Specifications and User Guides

You've probably seen this announcement on various mailing lists by now, but yesterday Carl Lagoze and Herbert Van de Sompel announced the publication of "beta" versions of the specifications being developed by the OAI ORE project:

Over the past eighteen months the Open Archives Initiative <http://www.openarchives.org/> (OAI), in a project called Object Reuse and Exchange <http://www.openarchives.org/ore/> (OAI-ORE), has gathered international experts from the publishing, web, library, and eScience community to develop standards for the identification and description of aggregations of online information resources.  These aggregations, sometimes called compound digital objects, may combine distributed resources with multiple media types including text, images, data, and video.  The goal of these standards is to expose the rich content in these aggregations to applications that support authoring, deposit, exchange, visualization, reuse, and preservation.  Although a motivating use case for the work is the changing nature of scholarship and scholarly communication, and the need for cyberinfrastructure to support that scholarship, the intent of the effort is to develop standards that generalize across all web-based information including the increasing popular social networks of "web 2.0".

I'm a member of the "editorial group" of the Technical Committee which worked on the documents. To be honest, I've struggled to find the time to make as much input as I'd have liked in the last couple of weeks, so I'm grateful to the other members of that group for getting content into shape for this release.

Is DCMI hiding its light under a bushel?

Good grief... Dublin Core gets some bad press at times - some of it justified, some of it not - and I have a tendency to blow hot and cold on the subject myself every so often but my blood near boils when I see people mis-representing Dublin Core as being just "a set of 15 basic fields" and then comparing it to other metadata standards that have, oh, let's say, 80 fields as though that makes them necessarily better and more expressive.

The recent Metadata for digital libraries: state of the art and future directions report published by JISC TechWatch is a case in point.  "State of the art and future directions"?  I'm sorry... I think I might have missed something?  The report doesn't even mention the Semantic Web or RDF - not even once.  So, if you want a report looking at the state of the art of METS and MODS in digital libraries this is the report for you - otherwise look elsewhere.  My suggestion for the TechWatch people is, "give your reports more appropriate titles - after all, it is probably the single most important metadata field!".  And for the rest of you... here's my Bluffer's Guide to the Dublin Core tip - if anyone starts using constructs like "creator.author" when they are talking about Dublin Core you can confidently tell them that they are about 5 years behind the curve.

Just for the record, Dublin Core hasn't been just "a set of 15 basic fields" since about 1995.  The current list of DCMI metadata terms stands at 50 or 60 I guess (not all of which are properties by the way) but the numbers are largely irrelevant.  What the Dublin Core provides is a set of flexible and extensible frameworks (primarily the DCMI Abstract Model but also the more recent and ongoing work looking at application profiles in the form of the Singapore Framework for Dublin Core Application Profiles) that are tightly bound to the core standards that make up the Semantic Web and that provide a toolkit for building metadata applications rich enough to meet any (yes I really do mean any) set of functional requirements whilst still remaining semantically interoperable with each other.

OK, rant over, and I apologise in part to the TechWatch report author.  As I say, if you want to know more about METS and MODS and how they fit with digital libraries then I'm pretty sure that the report is an excellent place to start.  More importantly, there are mitigating circumstances which make it understandable why people make the assumption that Dublin Core is just "a set of 15 basic fields".  DCMI has an identity crisis - it is torn between, on the one hand, promoting the highly extensible, flexible, semantically rich but conceptually challenging frameworks outlined above and, on the other, the simple, easy to understand but ultimately rather useless original 15 elements.  The only formal standards documents produced by the DCMI (ISO 15836 / NISO Z39.85 and RFC 2413) both focus solely on the original 15 elements, presumably leaving some people with the view that this is all that matters.

What I think has happened is that over the years the DCMI have tried, with some success, to associate the "Dublin Core" brand with only the 15 elements, using other terminology (usually with the prefix "DCMI") for everything else.  The result is something of a confusing mess, leaving the real value proposition offered by DCMI hidden under a bushel.  This is a shame IMHO.

To sum up... Dublin Core (at least in its widest interpretation) is definitely, 100%, absolutely, categorically not just 15 metadata elements but if you want to know what it is you'll have to look beyond the old standards documents and spend some time understanding the thinking that underpins the DCMI Abstract Model, the Singapore Framework for Dublin Core Application Profiles and various associated documents.  Only at that point will it be possible to have a sensible discussion about whether MODS and/or METS (or anything else for that matter) are "richer" than the Dublin Core or not.

To refer back to a comment by Irvin Flack on Pete's "Dublin Core layered model" post, Donkey may be right to suggest that Dublin Core stinks like an onion but if he is it isn't because it only has 15 metadata fields!

May 29, 2008

FRBR & "Time-Based Media", Part 3: Stills

In my previous post on using the FRBR model for "Time-Based Media", I outlined an example based on a video tutorial and a clip of that tutorial, the former made available in multiple formats, and both of them "versioned" over time.

Another similar, but slightly different, scenario which we want to represent is the case in which one or more still images are created from the content of a video. Suppose I create a sequence of images to use in some sort of summary page describing my video tutorial (or indeed to be used quite independently of the video itself), and I make these available in both JPEG and PNG formats. The extraction of a still image clearly involves the creation of a new FRBR Work, so each of my stills is a distinct Work (W11, W12, etc), realized in a single Expression (E11, E12, etc), each embodied in two Manifestations (M11, M12, M13, M14, etc), each exemplified in a single Item.

Again, I'm tempted to use a whole-part relationship to express relationships between these new Works and Expressions and my original "complete video" Work (and Expression). It does seem slightly odd to use the same relationship type to express both the relationship between a (moving image) clip and a video and the relationship between a still image and a video, but perhaps from the perspective of film and video as made up of a sequence of discrete events/frames, it can be justified. So in Figure 1 below, I end up with a similar set of relationships to those illustrated in Figure 2 of my previous post. Again, I'll leave out the representation of the Items for conciseness:

Figure 1

And using Ian Davis' and Richard Newman's FRBR RDFS vocabulary again, a Turtle representation would look like:

@prefix frbr: <http://purl.org/vocab/frbr/core#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix ex: <http://example.org/examples/> .

ex:W01
  a frbr:Work ;
  dcterms:title "Second Life Tutorial" ;
  frbr:part ex:W11 ;
  frbr:part ex:W12 ;
  frbr:realization ex:E01 .

ex:W11
  a frbr:Work ;
  dcterms:title "Second Life Tutorial Still 1" ;
  frbr:partOf ex:W01 ;
  frbr:realization ex:E11 .

ex:W12
  a frbr:Work ;
  dcterms:title "Second Life Tutorial Still 2" ;
  frbr:partOf ex:W01 ;
  frbr:realization ex:E12 .

ex:E01
  a frbr:Expression ;
  dcterms:title "Second Life Tutorial, Version 1.0" ;
  frbr:realizationOf ex:W01 ;
  frbr:part ex:E11 ;
  frbr:part ex:E12 ;
  frbr:embodiment ex:M01 ;
  frbr:embodiment ex:M02 ;
  frbr:embodiment ex:M03 .

ex:E11
  a frbr:Expression ;
  dcterms:title "Second Life Tutorial Still 1" ;
  frbr:realizationOf ex:W11 ;
  frbr:partOf ex:E01 ;
  frbr:embodiment ex:M11 ;
  frbr:embodiment ex:M12 .

ex:E12
  a frbr:Expression ;
  dcterms:title "Second Life Tutorial Still 2" ;
  frbr:realizationOf ex:W12 ;
  frbr:partOf ex:E01 ;
  frbr:embodiment ex:M13 ;
  frbr:embodiment ex:M14 .

The first issue raised by the stills example (and also by the previous example of clips/segments, although I didn't address it in my earlier post) is that of how to express the temporal "location" of the part (clip or still) within the whole i.e. to be able to say that my still is taken from a point T1 within the video, or that my clip is taken from the point within the whole video starting at a point in time T1 and continuing until time T2 (or starting at time T1 and having some specified duration). Given that duration is considered to be an attribute of the Expression, I'd expect this to be represented at the level of the Expression. As far as I can tell, FRBR itself doesn't provide attributes for capturing this level of detail. I can imagine three ways of addressing this:

  • adding an attribute at the Expression level which provided some sort of human-readable note providing the information. This would serve the purpose of presenting the information to a human reader, but it wouldn't be sufficient to support e.g. an application presenting my stills along a timeline or supporting searches for stills or clips extracted from within a specified period of the video
  • adding start-point-within-whole and end-point-within-whole attributes to the "part" Expression. This would enable such processing as suggested above, but would be sufficient only if the part participated in at most one part-whole relationship
  • modelling the part-whole relationship as a resource in its own right with start point and end point attributes
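To illustrate the difference between the second and third options, here is a Turtle sketch. None of the ex: properties or classes below exist in FRBR or in the FRBR RDFS vocabulary; they are names I've invented purely for illustration, as is the clip Expression ex:clipExpression. The times are offsets from the start of the whole video, written as xsd:duration values.

```turtle
@prefix frbr: <http://purl.org/vocab/frbr/core#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:   <http://example.org/examples/> .

# Option 2: attributes on the "part" Expression itself.
# Simple, but ambiguous if the clip is part of more than one whole.
ex:clipExpression
  a frbr:Expression ;
  frbr:partOf ex:E01 ;
  ex:startPointWithinWhole "PT1M30S"^^xsd:duration ;   # hypothetical property
  ex:endPointWithinWhole   "PT2M45S"^^xsd:duration .   # hypothetical property

# Option 3: the part-whole relationship as a resource in its own right.
# The start and end points belong to one specific relationship.
ex:rel1
  a ex:PartWholeRelationship ;                         # hypothetical class
  ex:whole ex:E01 ;
  ex:part  ex:clipExpression ;
  ex:startPoint "PT1M30S"^^xsd:duration ;
  ex:endPoint   "PT2M45S"^^xsd:duration .
```

Typed values like these are what would let an application place stills on a timeline or search within a specified period, which the human-readable note in the first option can't support.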

And the part-whole relationship may have a spatial aspect as well as a temporal aspect e.g. the case where a still shows only some specific spatial region of the whole screen image. Potentially, there's a great deal of complexity here - as is reflected in the capabilities within MPEG-7 to represent quite complex segmentation/decomposition relationships - but I suspect for the purposes of this exercise we'll be aiming to try to satisfy some of the simple cases required to support the discovery, selection and navigation of resources, probably using some variant of the second bullet above, while acknowledging that there is some complexity which isn't modelled within the DCAP.

Secondly, I probably want to capture the fact that my set of still images form a set or sequence, distinct from my video clip(s) which also form parts of the same whole. The best fit I can see for this in FRBR would seem to be to express hasSuccessor/isSuccessorTo relationships between the stills in the sequence (ex:E11 frbr:successor ex:E12 . ex:E12 frbr:successorOf ex:E11 . And so on.) Whether it's also useful/necessary to represent the sequence as a distinct work, I'm less sure. Probably not.

The final issue I wanted to note here is the assumption that both the video and the still images are described using the FRBR model, where whole-part relationships exist between instances of the same Group 1 Entity Type i.e. when a Work is the subject of a has-Part or is-Part-Of relationship, the assumption is that the object of that relationship is also a Work (and the same for Expressions, Manifestations and Items). As I mentioned in a comment on my earlier post, one of the "sibling" projects to the Time-Based Media project is developing a DCAP for describing still images, and that project has recommended the use of a model which is derived from the FRBR model but which substitutes a single entity type of Image for the Work-Expression pair used in FRBR. Mick Eadie describes the project's reasons for that choice in a recent Ariadne article:

In essence what is being done by FRBR is not the modelling of the simple image and its relationships, but rather an attempt to model the artistic / intellectual process and all resultant manifestations of it. We decided this was inappropriate for the IAP for a number of reasons. While possible, an application profile of this complexity would require detailed explanation that could be a barrier to take-up. Moreover, it strays from the core remit of the images IAP to facilitate a simple exchange of image data between repositories. While the FRBR approach attempts to build relationships between objects, e.g. slides, photographs, objects and digital surrogates, this facility already exists in, for example, the Visual Resources Association Core (VRA) schema. Our intention was not to reinvent or in any way replicate existing standards that are robust and heavyweight enough to deal with most image types. Rather our intention was to build a lightweight layer that could sit above these standards, and work with them, facilitating a simple image search across institutional repositories.

Using the IAP model to describe the still image, there are no distinct Works and Expressions, only Images, so it seems to me that integrating that data within a strictly FRBR-based view would require some mapping between the two models, and the separating out of attributes of the IAP Image entity which in the FRBR model apply to the Work from those which apply to the Expression.

May 27, 2008

A "layered" model for interoperability using Dublin Core metadata

Mikael Nilsson has circulated (to the DCMI Architecture forum mailing list) a short draft document titled Interoperability levels for Dublin Core metadata. The document is a result of both the discussions around the relationship of the DCMI Abstract Model and RDF which surfaced around a series of posts by Stu Weibel a few months ago, and also the longer running efforts within DCMI to reconcile the use of the term "Dublin Core metadata" to refer both to data created within these formal frameworks and to data created using less formal, more ad hoc approaches.

The document presents a "layered" approach, describing four distinct "interoperability levels", each building on the previous one, and attempting to specify clearly the assumptions and constraints which apply at each of those levels, and the expectations which a consumer can have for metadata provided "at" a specified level.

  • Level 1: "Informal interoperability", based essentially on the natural-language definitions of metadata terms;
  • Level 2: "Semantic interoperability", based on the RDF model;
  • Level 3: "DCAM-based syntactic interoperability", introducing the notions of descriptions and description sets, as defined by the DCMI Abstract Model;
  • Level 4: "Singapore Framework interoperability", in which an application is supported by the complete set of components specified by the Singapore Framework for Dublin Core Application Profiles

As Mikael notes in his message, this is an attempt to articulate some of the notions which have underpinned developments in DC metadata over the last few years. One of the difficulties we've had, I think, is that in some of our conversations within DCMI, parties in the discussions have been adopting viewpoints reflecting different "levels" in this model (particularly level 1 on the one hand and levels 2 or 3 on the other, I think) and perhaps "talking past each other" as a result. So any attempt to try to articulate these differences, and to make explicit our implicit assumptions, is to be welcomed, I think.

It should be emphasised that this is a very early working draft circulated for discussion, and there is no community consensus on these concepts. Comments on the draft should be sent to the DC-Architecture mailing list. (While I'm not going to close comments on this post, I'd strongly urge you to send comments to that list, so that discussion is visible to members of that forum.)

May 12, 2008

FRBR & "Time-Based Media", Part 2: Clips/Segments

Following on from my previous post about applying the Functional Requirements for Bibliographic Records (FRBR) model to the case of Time-Based Media, this post works through one example, reflecting the requirement to be able to disclose/discover relationships between "whole" videos and segments/clips of those "wholes". While the example I'm sketching here isn't based on my actual experience, I think it is a reasonably realistic one.

Suppose I develop a machinima-based tutorial video introducing some of the features of Second Life for use by undergraduate students new to the application. I might make my tutorial available for streaming using my institution's streaming server, both in Windows Media Video format and in QuickTime format. And I might make a QuickTime version available for download as an alternative to streaming. I might also make a second copy of that QuickTime file - exactly the same content, quality, size etc - available for download from my personal Web site.

From a FRBR viewpoint, I think this would be represented as a single FRBR Work (W01), realized in a single Expression (E01), embodied in three different Manifestations (streamed Windows Media Video (M01), streamed QuickTime (M02) and downloadable QuickTime (M03)), with the first two of these Manifestations each exemplified in a single Item, and the last exemplified in two Items. The relationships between these resources are indicated in Figure 1 below.

[Figure 1]

For the purposes of this discussion I'm focusing on the FRBR "Group 1" entities (Work, Expression, Manifestation and Item); a full FRBR modelling of the resource would also include various flavours of "responsibility" relationships with Group 2 entities (Persons, Corporate Bodies) and "subject" relationships with Group 3 entities.

Using the FRBR RDFS vocabulary developed by Ian Davis and Richard Newman (I'm using this vocabulary rather than the set of terms defined as part of the Scholarly Works Application Profile project because SWAP defined terms for only a small subset of the FRBR relationship types, and here I need to make use of a wider range of those relationship types), and the Turtle RDF syntax, I'd represent this as something like the following (I'm deliberately focusing here on only the FRBR "Group 1" Work-Expression-Manifestation-Item relationships, and I'm also including the inverse relationships just to be explicit):

@prefix frbr: <http://purl.org/vocab/frbr/core#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix ex: <http://example.org/examples/> .

ex:W01
  a frbr:Work ;
  dcterms:title "Second Life Tutorial" ;
  frbr:realization ex:E01 .

ex:E01
  a frbr:Expression ;
  dcterms:title "Second Life Tutorial, Version 1.0" ;
  frbr:realizationOf ex:W01 ;
  frbr:embodiment ex:M01 ;
  frbr:embodiment ex:M02 ;
  frbr:embodiment ex:M03 .
 
ex:M01
  a frbr:Manifestation ;
  frbr:embodimentOf ex:E01 ;
  frbr:exemplar ex:I01 .

ex:M02
  a frbr:Manifestation ;
  frbr:embodimentOf ex:E01 ;
  frbr:exemplar ex:I02 .

ex:M03
  a frbr:Manifestation ;
  frbr:embodimentOf ex:E01 ;
  frbr:exemplar ex:I03 ;
  frbr:exemplar ex:I04 .
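
As a rough illustration of how a consumer might use these relationships, the following Python sketch holds the "downward" triples from the Turtle above as plain tuples and walks from the Work to its Items. The prefixed names are kept as simple strings and the helper functions are my own, not part of the FRBR vocabulary or of any RDF library.

```python
# Triples mirroring the Turtle above; prefixed names are kept as plain strings.
TRIPLES = {
    ("ex:W01", "frbr:realization", "ex:E01"),
    ("ex:E01", "frbr:embodiment", "ex:M01"),
    ("ex:E01", "frbr:embodiment", "ex:M02"),
    ("ex:E01", "frbr:embodiment", "ex:M03"),
    ("ex:M01", "frbr:exemplar", "ex:I01"),
    ("ex:M02", "frbr:exemplar", "ex:I02"),
    ("ex:M03", "frbr:exemplar", "ex:I03"),
    ("ex:M03", "frbr:exemplar", "ex:I04"),
}


def objects(subject, predicate, triples=TRIPLES):
    """Return the sorted objects of all triples with this subject and predicate."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)


def items_of_work(work, triples=TRIPLES):
    """Follow realization -> embodiment -> exemplar from a Work down to its Items."""
    items = []
    for expression in objects(work, "frbr:realization", triples):
        for manifestation in objects(expression, "frbr:embodiment", triples):
            items.extend(objects(manifestation, "frbr:exemplar", triples))
    return items


print(items_of_work("ex:W01"))  # ['ex:I01', 'ex:I02', 'ex:I03', 'ex:I04']
```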

So far so good. Having made this full tutorial video available, I then create a clip of the original, say, a segment focusing on the graphical preferences in the SL client which covers the topic in such a way as to be useful as a self-contained resource. And I choose to make that clip/segment available only as a download in QuickTime format, from my own server, not from the institutional server. So now we have a second FRBR Work (W02), as there is a significant difference in the content of the two videos; it is realized in a single Expression (E02), embodied in a single Manifestation (M04), exemplified in a single Item.

And I can express the fact that there is a relationship between this second Work (my clip on graphical preferences, W02) and the Work corresponding to the original tutorial (W01). I think (but I'm not 100% sure) it would be appropriate to use a whole-part relationship between the two Works here.

And I can also express a whole-part relationship between the two Expressions (E01, E02). This might seem redundant, given the relationship between the two Works, but I think it does add additional information, and hopefully the value of this will become clearer below. For simplicity I'm leaving out the representation of the Items in the diagrams from now on.

[Figure 2]

Or in Turtle:

@prefix frbr: <http://purl.org/vocab/frbr/core#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix ex: <http://example.org/examples/> .

ex:W01
  a frbr:Work ;
  dcterms:title "Second Life Tutorial" ;
  frbr:part ex:W02 ;
  frbr:realization ex:E01 .

ex:W02
  a frbr:Work ;
  dcterms:title "Second Life Tutorial Segment: Graphics Preferences" ;
  frbr:partOf ex:W01 ;
  frbr:realization ex:E02 .

ex:E01
  a frbr:Expression ;
  dcterms:title "Second Life Tutorial, Version 1.0" ;
  frbr:realizationOf ex:W01 ;
  frbr:part ex:E02 ;
  frbr:embodiment ex:M01 ;
  frbr:embodiment ex:M02 ;
  frbr:embodiment ex:M03 .

ex:E02
  a frbr:Expression ;
  dcterms:title "Second Life Tutorial Segment: Graphics Preferences, Version 1.0" ;
  frbr:realizationOf ex:W02 ;
  frbr:partOf ex:E01 ;
  frbr:embodiment ex:M04 .
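
Since I'm stating each relationship in both directions by hand, it may be worth sketching how the inverse statements could be generated mechanically instead. The Python below is only an illustration: the table of inverse pairs covers just the properties used so far, and the function name is mine.

```python
# Inverse property pairs used in the Turtle above (not the full FRBR vocabulary).
INVERSES = {
    "frbr:realization": "frbr:realizationOf",
    "frbr:embodiment": "frbr:embodimentOf",
    "frbr:part": "frbr:partOf",
}


def with_inverses(triples):
    """Return the triples plus the inverse statement for each known property."""
    inverse_of = dict(INVERSES)
    inverse_of.update({v: k for k, v in INVERSES.items()})
    result = set(triples)
    for s, p, o in triples:
        if p in inverse_of:
            result.add((o, inverse_of[p], s))
    return result


# Only one direction of each relationship is stated here.
stated = {
    ("ex:W01", "frbr:part", "ex:W02"),
    ("ex:E01", "frbr:part", "ex:E02"),
    ("ex:W02", "frbr:realization", "ex:E02"),
}
full = with_inverses(stated)
```

After this, `full` holds the three stated triples plus their three inverses, for example `("ex:E02", "frbr:partOf", "ex:E01")`.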

Now twelve months on, there is some change to Second Life functionality and I produce a slightly revised, extended version of my tutorial to take this into account. As much of the content remains the same, I think these should probably be modelled as two Expressions (E01, E03) of the same Work (W01), with a hasRevision/isRevisionOf relationship between them. Assuming I make the new version available in the same range of forms as the original, then the relationships between the two versions appear as follows, in Figure 3:

[Figure 3]

Or in Turtle:

@prefix frbr: <http://purl.org/vocab/frbr/core#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix ex: <http://example.org/examples/> .

ex:W01
  a frbr:Work ;
  dcterms:title "Second Life Tutorial" ;
  frbr:realization ex:E01 ;
  frbr:realization ex:E03 .

ex:E01
  a frbr:Expression ;
  dcterms:title "Second Life Tutorial, Version 1.0" ;
  frbr:realizationOf ex:W01 ;
  frbr:revision ex:E03 ;
  frbr:embodiment ex:M01 ;
  frbr:embodiment ex:M02 ;
  frbr:embodiment ex:M03 .
 
ex:E03
  a frbr:Expression ;
  dcterms:title "Second Life Tutorial, Version 1.1" ;
  frbr:realizationOf ex:W01 ;
  frbr:revisionOf ex:E01 ;
  frbr:embodiment ex:M05 ;
  frbr:embodiment ex:M06 ;
  frbr:embodiment ex:M07 .

And I also create a new version of the clip on graphics preferences, a new Expression (E04) of my second Work (W02), and this new Expression is a participant in two Expression-Expression relationships:

  • it is a revision of the first Expression (E02) of that Work; and
  • it is a part of the Expression (E03) corresponding to the new version of the full tutorial

[Figure 4]

Or in Turtle:

@prefix frbr: <http://purl.org/vocab/frbr/core#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix ex: <http://example.org/examples/> .

ex:W01
  a frbr:Work ;
  dcterms:title "Second Life Tutorial" ;
  frbr:part ex:W02 ;
  frbr:realization ex:E01 ;
  frbr:realization ex:E03 .

ex:W02
  a frbr:Work ;
  dcterms:title "Second Life Tutorial Segment: Graphics Preferences" ;
  frbr:partOf ex:W01 ;
  frbr:realization ex:E02 ;
  frbr:realization ex:E04 .

ex:E01
  a frbr:Expression ;
  dcterms:title "Second Life Tutorial, Version 1.0" ;
  frbr:realizationOf ex:W01 ;
  frbr:part ex:E02 ;
  frbr:revision ex:E03 ;
  frbr:embodiment ex:M01 ;
  frbr:embodiment ex:M02 ;
  frbr:embodiment ex:M03 .
 
ex:E02
  a frbr:Expression ;
  dcterms:title "Second Life Tutorial Segment: Graphics Preferences, Version 1.0" ;
  frbr:realizationOf ex:W02 ;
  frbr:partOf ex:E01 ;
  frbr:revision ex:E04 ;
  frbr:embodiment ex:M04 .

ex:E03
  a frbr:Expression ;
  dcterms:title "Second Life Tutorial, Version 1.1" ;
  frbr:realizationOf ex:W01 ;
  frbr:part ex:E04 ;
  frbr:revisionOf ex:E01 ;
  frbr:embodiment ex:M05 ;
  frbr:embodiment ex:M06 ;
  frbr:embodiment ex:M07 .

ex:E04
  a frbr:Expression ;
  dcterms:title "Second Life Tutorial Segment: Graphics Preferences, Version 1.1" ;
  frbr:realizationOf ex:W02 ;
  frbr:partOf ex:E03 ;
  frbr:revisionOf ex:E02 ;
  frbr:embodiment ex:M08 .

At least, I think that's right :-) But I'd appreciate any feedback on whether that is an appropriate use of the FRBR hasPart/isPartOf relationships (or indeed on any other aspect of that example.)
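
This is where I think the Expression-level whole-part relationships earn their keep: the Work-level relationship alone can't tell me which version of the tutorial a given version of the clip belongs to. A minimal Python sketch, using only the revision and partOf triples from the Turtle above, and assuming (my simplification) that each Expression has at most one direct revision:

```python
# The revision and partOf triples from the Turtle above, as plain tuples.
TRIPLES = {
    ("ex:E01", "frbr:revision", "ex:E03"),
    ("ex:E02", "frbr:revision", "ex:E04"),
    ("ex:E02", "frbr:partOf", "ex:E01"),
    ("ex:E04", "frbr:partOf", "ex:E03"),
}


def one_object(subject, predicate, triples=TRIPLES):
    """Return one object for this subject and predicate, or None if there is none."""
    matches = sorted(o for s, p, o in triples if s == subject and p == predicate)
    return matches[0] if matches else None


def latest_revision(expression, triples=TRIPLES):
    """Follow frbr:revision links until an Expression has no further revision."""
    seen = {expression}
    while True:
        following = one_object(expression, "frbr:revision", triples)
        if following is None or following in seen:  # end of chain, or a loop
            return expression
        seen.add(following)
        expression = following


clip = latest_revision("ex:E02")
whole = one_object(clip, "frbr:partOf")
print(clip, whole)  # ex:E04 ex:E03
```

So starting from the first version of the clip, I can find its current version and then the version of the full tutorial of which that is a part.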

Finally, I guess there's an alternative scenario in which when I come to update my tutorial I remake it more or less from scratch and I consider it a distinct Work from the original.

[Figure 5]

And if I go on to create a new clip of my new tutorial, I can still indicate the relationships between the clips and their respective wholes as above, but I don't think there is any explicit relationship between the Works corresponding to the two tutorials, or between the Works corresponding to the two clips (though of course the two Works would have a "created-by" relationship with the same Person, and probably a set of "has-as-subject" relationships with a common set of Concepts).

[Figure 6]

May 11, 2008

FRBR & "Time-Based Media", Part 1

Under their Repositories and Preservation Programme, JISC is currently funding a number of short projects (see the overview of this activity by Rachel Bruce) to develop (or in some cases to explore the feasibility of developing) metadata application profiles for a range of different resource types. One of the considerations is the capability to search effectively across a merged dataset formed by aggregating metadata instances based on the different specifications, and - based largely on the experience of the Scholarly Works Application Profile - the projects are exploring an approach based on the Dublin Core Abstract Model, i.e. the development, more specifically, of Dublin Core Application Profiles.

Further, given the use in SWAP of an entity-relational model based on the Functional Requirements for Bibliographic Records (FRBR) model, at least some of the projects are also examining the use of FRBR as the basis of the models underpinning their profiles, in order that there is a common high-level model in use across the different datasets.

I am contributing to the project tasked with drafting a profile for the description of Time-Based Media, led by Gayle Calverley at the University of Manchester, and I'm starting to look at some of the issues involved in applying FRBR to this class of resource.

The immediate question, of course, is what we mean by "Time-Based Media"? And I think(!) the answer is something like "resources of which the content changes meaningfully with respect to time", or perhaps more simplistically, resources which have, or are experienced as having, a duration in time - so the primary focus is on moving images and audio.

I really wanted to use this post (and a few subsequent posts)

  • to work through some examples; and
  • to throw out some questions that I've been throwing around in the hope that some of the FRBRistas out there can set me straight; and
  • to highlight some of the complexity that I suspect is an inevitable consequence of applying the FRBR model to these resources

Despite the length of the FRBR report, and the range of examples provided, when I come to apply the FRBR model in some particular context, I often find myself with unanswered questions and looking for some more examples that I can relate directly to the case at hand. So I was pleased to find that Martha Yee's chapter "FRBR and Moving Image Materials: Content (Work and Expression) versus Carrier (Manifestation)" from the book Understanding FRBR: What It Is and How It Will Affect Our Retrieval Tools is available on the Web. Here I just note a few of the points highlighted by Martha which helped guide my thinking about applying the FRBR model to Time-Based Media:

  • "a filmed version of a work intended for performance... is a new work" (Yee, page 118) (See e.g. the Romeo and Juliet examples in FRBR 3.2.1)
  • "Change from any other GMD, for example, text, music, sound recording, electronic resource, that is not moving image, into the moving image GMD motion picture or videorecording creates a new work by FRBR definition.... The reverse holds true, as well. Change from a moving image GMD to a nonmoving image GMD necessarily involves the creation of a new work. The change from a moving image to a still image or to a sound recording, for example, is so fundamental that the result has to be considered a new work" (Yee, page 121)
  • "any change in the sound, text, music or image of a moving image work creates a new expression of that work" (Yee, page 119) (See e.g. the Jules et Jim examples in FRBR 3.2.1)
  • A change in playing time is an indicator of a change between one expression and another (Yee, page 122)
  • Aspect ratio (intended proportion of image width to height) should be considered an attribute of the work, not the expression (Yee, page 123)
  • Colour should be considered an attribute of the work, not the manifestation (Yee, page 123)

When I'm exploring this sort of problem, I usually find I need to work through a few concrete examples, and try out various options, which I'll do in a series of follow-up posts to this one.

April 15, 2008

IMLS Digital Collections & Content

Another somewhat belated post.... Andy and I both get occasional invitations to be members of advisory/steering groups for various programmes and projects operating in the areas in which we have an interest. I'm currently a member of the Advisory Group for the second phase of the Digital Collections and Content project which is funded by the Institute of Museum and Library Services and led by a team at the University of Illinois at Urbana-Champaign. Given the UK focus of the Foundation, it's probably slightly unusual for me to take on such a role for a US project, but it combines a number of our interests - repositories, resource discovery, metadata, the use of cultural heritage resources for learning and research, and I have also worked with some members of the project team in the past in the development of the Dublin Core Collections Application Profile.

The group met recently in Chicago, and although I wasn't able to attend the meeting in person, I managed to join in by phone for a couple of hours. One area in which the project seems to be doing some interesting work is in the relationships between collection-level description and item description, and in particular the use of algorithms/rules by which item-level metadata might be inferred from collection-level metadata.

The project is also exploring how collection-level metadata might be presented more effectively during searching, particularly to provide contextual information for individual items.

April 14, 2008

Open Repositories 2008

I spent a large part of the week before last (Tuesday, Wednesday & Friday) at the Open Repositories 2008 conference at the University of Southampton.

There were something around 400 delegates there, I think, which I guess is an indicator of the considerable current level of interest around the R-word. Interestingly, if I recall conference chair Les Carr's introductory summary of stats correctly, nearly a quarter of these had described themselves as "developers", so the repository sphere has become a locus for debate around technical issues, as well as the strategic, policy and organisational aspects. The JISC Common Repository Interfaces Group (CRIG) had a visible presence at the conference, thanks to the efforts of David Flanders and his comrades, centred largely around the "Repository Challenge" competition (won by Dave Tarrant, Ben O’Steen and Tim Brody with their "Mining with ORE" entry).

The higher than anticipated number of people did make for some rather crowded sessions at times. There was a long queue for registration, though that was compensated for by the fact that I came away from that process with exactly two small pieces of paper: a name badge inside an envelope on which were printed the login details for the wireless network. (With hindsight, I could probably have done with a one page schedule of what was on in which location - there probably was one which I missed picking up!) Conference bags (in a rather neat "vertical" style which my fashion-spotting companions reliably informed me was a "man bag") were available, but optional. (I was almost tempted, as I do sport such an accessory at weekends, and it was black rather than dayglo orange, but decided to resist on the grounds that there was a high probability of it ending up in the hotel wastepaper bin as I packed up to leave.) Nul points, however, to those advertisers who thought it was a good idea to litter every desktop surface in the crowded lecture theatre with their glossy propaganda, with the result that a good proportion of it ended up on the floor as (newly manbagged-up) delegates squeezed their way to their seats.

The opening keynote was by Peter Murray-Rust of the Unilever Centre for Molecular Informatics, University of Cambridge. With some technical glitches to contend with, which must have been quite daunting in the circumstances (Peter has posted a quick note on his view of the experience: "I have no idea what I said" :-)), Peter delivered a somewhat "non-linear" but always engaging and entertaining overview of the role of repositories for scientific data. He noted the very real problem that while ever increasing quantities of data are being generated, very little of it is being successfully captured, stored and made accessible to others. Peter emphasised that any attempt to capture this data effectively must fit in with the existing working practices of scientists, and must be perceived as supporting the primary aims of the scientist, rather than introducing new tasks which might be regarded as tangential to those aims. And the practices of those scientists may, in at least some areas of scientific research, be highly "locally focused", i.e. the scientists see their "allegiances" as primarily to a small team with whom data is shared, at least in the first instance, an approach categorised as "long tail science" (a term attributed to Peter's colleague Jim Downing). Peter supported his discussion with examples drawn from several different e-Chemistry projects and initiatives, including the impressive OSCAR-3 text mining software, which extracts descriptions of chemical compounds from documents.

Most of the remainder of the Tuesday and Wednesday I spent in paper sessions. The presentation I enjoyed most was probably the one by Jane Hunter from the University of Queensland on the work of the HarvANA project on a distributed approach to annotation and tagging of resources from the Picture Australia collection (in the first instance at least - at the end, Jane whipped through a series of examples of applying the same techniques to other resources). Jane covered a model for annotation and tagging based on the W3C Annotea model, a technical architecture for gathering and merging distributed annotations/taggings (using OAI-PMH to harvest from targets at quite short time intervals, though those intervals could be extended if preferred/required), browser-based plug-in tools to perform annotation/tagging, and also touched on the relationships between tagging and formally-defined ontologies. The HarvANA retrieval system currently uses an ontology to enhance tag-based retrieval ("ontology-based or ontology-directed folksonomy"), but the tags provided could also contribute to the development/refinement of that ontology ("folksonomy-directed ontology"). Although it was in many ways a repository-centric approach and Jane focused on the use of existing, long-established technologies, she also succeeded in placing repositories firmly in the context of the Web: as systems which enable us to expose collections of resources (and collections of descriptions of those resources), which then enter the Web of relationships with other resources managed and exposed by other systems - here, the collections of annotations exposed by the Annotea servers, but potentially other collections too.
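
For anyone unfamiliar with the harvesting side of this, here is a minimal Python sketch of how an incremental OAI-PMH ListRecords request might be constructed. To be clear, this is not HarvANA's code: the base URL is a made-up example, and I'm assuming the oai_dc metadata format (which every OAI-PMH repository is required to support) rather than whatever format the project actually harvests.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc",
                     from_date=None, resumption_token=None):
    """Build an OAI-PMH ListRecords request URL (the request is not sent)."""
    if resumption_token is not None:
        # resumptionToken is an exclusive argument: no other arguments
        # apart from the verb may accompany it.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if from_date is not None:
            # Only records created or changed since this date are requested.
            params["from"] = from_date
    return base_url + "?" + urlencode(params)


# http://example.org/oai is a hypothetical endpoint.
print(list_records_url("http://example.org/oai", from_date="2008-04-01"))
# http://example.org/oai?verb=ListRecords&metadataPrefix=oai_dc&from=2008-04-01
```

A harvester running at short intervals would simply move the from date forward after each successful harvest, following any resumption tokens the target returns along the way.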

At Wednesday lunch time, (once I managed to find the room!) I contributed to a short "birds of a feather" session co-ordinated by Rosemary Russell of UKOLN and Julie Allinson of the University of York on behalf of the Dublin Core Scholarly Communications Community. We focused mainly on the Scholarly Works Application Profile and its adoption of a FRBR-based model, and talked around the extension of that approach to other resource types which is under consideration in a number of sibling projects currently being funded by JISC. (Rather frustratingly for me, this meeting clashed with another BoF session on Linked Data which I would really have liked to attend!)

I should also mention the tremendously entertaining presentation by Johan Bollen of the Los Alamos National Laboratory on the research into usage metrics carried out by the MESUR project. Yes, I know, "tremendously entertaining" and "usage statistics" aren't the sort of phrases I expect to see used in close proximity either. Johan's base premise was, I think, that seeking to illustrate impact through blunt "popularity" measures was inadequate, and he drew a distinction between citation - the resources which people announce in public that they have read - and usage - the actual resources they have downloaded. Based on a huge dataset of usage statistics provided by a range of popular publishers and aggregators, he explored a variety of other metrics, comparing the (surprisingly similar) rankings of journals obtained via several of these metrics with the rankings provided by the citation-based Thomson impact factor. I'm not remotely qualified to comment on the appropriateness of Johan's choice of algorithms, but the fact that Johan kept a large audience engaged at the end of a very long day was a tribute to his skill as a presenter. (Though I'd still take issue with the Britney (popular but insubstantial?)/Big Star (low-selling but highly influential/lauded by the cognoscenti) opposition: nothing by Big Star can compare with the strutting majesty of "Toxic". No, not even "September Gurls".)

On the Friday, I attended the OAI ORE Information Day, but I'll make that the subject of a separate post.

All in all - give or take a few technical hiccups - it was a successful conference, I think (and thanks to Les and his team for their hard work) - perhaps more so in terms of the "networking" that took place around the formal sessions, and the general "buzz" there seemed to be around the place, than because of any ground-breaking presentations.

And yet, and yet... at the end of the week I did come away from some of the sessions with my niggling misgivings about the "repository-centric" nature of much of the activity I heard described slightly reinforced. Yes, I know: what did I expect to hear at a conference called "Open Repositories"?! :-) But I did feel an awful lot of the emphasis was on how "repository systems" communicate with each other (or how some other app communicates with one repository system and then with another repository system), e.g. how can I "get something out" of your repository system and "put it into" my repository system, and so on. It seems to me that - at the technical level at least - we need to focus less on seeing repository systems as "specific" and "different" from other Web applications, and focus more on commonalities. Rather than concentrating on repository interfaces we should ensure that repository systems implement the uniform interface defined by the RESTful use of the HTTP protocol. And then we can shift our focus to our data, and to

  • the models or ontologies (like FRBR and the CIDOC Conceptual Reference Model, or even basic one-object-is-made-available-in-multiple-formats models) which condition/determine the sets of resources we expose on the Web, and see the use of those models as choices we make rather than something "technologically determined" ("that's just what insert-name-of-repository-software-app-of-choice does");
  • the practical implementation of formalisms like RDF which underpin the structure of our representations describing instances of the entities defined by those models, through the adoption of conventions such as those advocated by the Linked Data community

In this world, the focus shifts to "Open (Managed) Collections" (or even "Open Linked Collections"), collections of documents, datasets, images, of whatever resources we choose to model and expose to the world. And as a consumer of those resources  I (and, perhaps more to the point, my client applications) really don't need to know whether the system that manages and exposes those collections is a "repository" or a "content management system" or something else (or if the provider changes that system from one day to the next): they apply the same principles to interactions with those resources as they do to any other set of resources on the Web.
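
To illustrate what I mean by applying "the same principles", here is a small Python sketch of the one request a client needs to be able to make. The URI is a made-up example and the choice of RDF/XML as the preferred format is my assumption; the point is only that nothing in the request depends on what kind of system sits behind the URI.

```python
from urllib.request import Request


def describe_request(uri, accept="application/rdf+xml"):
    """Build a plain HTTP GET asking for a machine-readable description.

    The request is only constructed here, not sent.
    """
    return Request(uri, headers={"Accept": accept}, method="GET")


# http://example.org/collections/42 is a hypothetical resource URI.
req = describe_request("http://example.org/collections/42")
print(req.get_method(), req.full_url, req.get_header("Accept"))
```

Whether that URI is served by a "repository", a content management system or a static file, the client's side of the interaction is exactly the same.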

March 14, 2008

Yahoo search & the Semantic Web

There was a good deal of excitement yesterday at an announcement on the Yahoo! Search weblog that they will be introducing support in the Yahoo Search Monkey platform for indexing some data made available on the Web using Semantic Web standards or using some microformats. In yesterday's post, "Dublin Core" is mentioned as one of the vocabularies which will be supported; it also refers to support for both the W3C's RDFa and Ian Davis' Embedded/Embeddable RDF. (Aside: I've been starting to explore RDFa recently and I'm quite excited about the potential, but that should be the topic of a separate post.)

A post by Micah Dubinko provides some further detail in an FAQ style.

It is worth bearing in mind the note of caution from Paul Miller that such an approach brings with it the challenges of dealing with malicious or mischievous attempts to spam rankings, and as I think Micah Dubinko's post makes clear, this is not going to be an aggregator of all the RDF data on the Web. But nevertheless it seems to represent a very significant development in terms of the use of metadata by a major Web search engine (after all the years I've spent having to break it to dismayed Dublin Core aficionados that the metadata from their HTML headers almost certainly wasn't going to be used by any of the global search engines, and unless they knew of an application that was going to index/harvest it, they might wish to consider whether the effort was worthwhile!) - and for the use of Semantic Web technologies in particular.

February 21, 2008

Linked Data (and repositories, again)

This is another one of those posts that started life in the form of various drafts which I didn't publish because I thought they weren't quite "finished", but then seemed to become slightly redundant because anything of interest had already been said by lots of other people who were rather more on the ball than I was. But as there seems to be a rapid growth of interest in this area at the moment, and as it ties in with some of the themes Andy highlights in his recent posts about his presentation at VALA 2008, I thought I'd make an effort to try to pull some of these fragments together.

If I'd got round to compiling my year-end Top 5 Technical Documents list for 2007 (whaddya mean, you don't have a year-end Top 5 Technical Documents list?), my number one would have been How to Publish Linked Data on the Web by Chris Bizer, Richard Cyganiak and Tom Heath.

In short, the document fleshes out the principles Tim Berners-Lee sketches in his Linked Data note - essentially the foundational principles for the Semantic Web. As Berners-Lee notes

The Semantic Web isn't just about putting data on the web. It is about making links, so that a person or machine can explore the web of data.  With linked data, when you have some of it, you can find other, related, data. (emphasis added)

And the key to realising this, argues Berners-Lee, lies in following four base rules:

  1. Use URIs as names for things.
  2. Use HTTP URIs so that people can look up those names.
  3. When someone looks up a URI, provide useful information.
  4. Include links to other URIs. so that they can discover more things.

Bizer, Cyganiak & Heath present linked data as a combination of key concepts from the Web Architecture on the one hand (including the TAG's resolution to the httpRange-14 issue) and the RDF data model on the other, and distill them into a form which is on the one hand clear and concise, and on the other backed up by effective, practical guidelines for their application. While many of those guidelines are available in some form elsewhere (e.g. in TAG findings or in notes such as Cool URIs...), it's extremely helpful to have these ideas collated and presented in a very practically focused style.
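
As an illustration of the kind of guideline I have in mind, here is a minimal Python sketch of the 303-redirect pattern which follows from the httpRange-14 resolution: a URI naming a "thing" is never answered directly, but redirects to a document about that thing. The /id/, /data/ and /page/ path layout is a hypothetical one of my own choosing, and the content negotiation is deliberately crude (a substring test on the Accept header, ignoring q-values).

```python
def look_up(path, accept):
    """Return (status, location) for a lookup of a hypothetical URI path."""
    if path.startswith("/id/"):
        # A URI for the thing itself: redirect to a document describing it.
        name = path[len("/id/"):]
        if "application/rdf+xml" in accept:
            return 303, "/data/" + name  # RDF description
        return 303, "/page/" + name      # HTML description
    if path.startswith("/data/") or path.startswith("/page/"):
        # A URI for a document: it can be served directly.
        return 200, None
    return 404, None


print(look_up("/id/alice", "application/rdf+xml"))  # (303, '/data/alice')
print(look_up("/id/alice", "text/html"))            # (303, '/page/alice')
```

The same name works for both people and machines; only the document each is sent to differs.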

As an aside, in the course of assembling those guidelines, they suggest that some of those principles might benefit from some qualification, in particular the use of URI aliases, which the Web Architecture document suggests are best avoided. For the authors,

URI aliases are common on the Web of Data, as it can not realistically be expected that all information providers agree on the same URIs to identify a non-information resources. URI aliases provide an important social function to the Web of Data as they are dereferenced to different descriptions of the same non-information resource and thus allow different views and opinions to be expressed. (emphasis added)

I'm prompted to mention Linked Data now in part by Andy's emphasis on Web Architecture and Semantic Web technologies, but also by a post by Mike Bergman a couple of weeks ago, reflecting on the growth in the quantity of data now available following the principles and conventions recommended by the Bizer, Cyganiak & Heath paper. In his post, Bergman includes a copy of a graphic from Richard Cyganiak providing a "birds-eye view" of the Linked Data landscape, and highlighting the principal sources by domain or provider.

"What's wrong with that picture?", as they say. I was struck (but not really surprised) by the absence - with the exception of the University of Southampton's Department of Electronics & Computer Science - of any of the data about researchers and their outputs that is being captured and exposed on the Web by the many "repository" systems of various hues within the UK education sector. While in at least some cases institutions (or trans-institutional communities) are having a modicum of success in capturing that data, it seems to me that the ways in which it is typically made available to other applications mean that it is less visible and less usable than it might be.

Or, to borrow an expression used by Paul Miller of Talis in a post on Nodalities, we need to think about how to make sure our repository systems are not simply "on the Web" but firmly "of the Web" - and the practices of the growing Linked Data community, it seems to me, provide a firm foundation for doing that.

February 13, 2008

Repositories thru the looking glass

I spent last week in Melbourne, Australia at the VALA 2008 Conference - my first trip over to Australia and one that I thoroughly enjoyed.  Many thanks to all those locals and non-locals that made me feel so welcome.

I was there, first and foremost, to deliver the opening keynote, using it as a useful opportunity to think and speak about repositories (useful to me at least - you'll have to ask others that were present as to whether it was useful for anyone else).

It strikes me that repositories are of interest not just to those librarians in the academic sector who have direct responsibility for the development and delivery of repository services.  Rather they represent a microcosm of the wider library landscape - a useful case study in the way the Web is evolving, particularly as manifest through Web 2.0 and social networking, and what impact those changes have on the future of libraries, their spaces and their services.

My keynote attempted to touch on many of the issues in this area - issues around the future of metadata standards and library cataloguing practice, issues around ownership, authority and responsibility, issues around the impact of user-generated content, issues around Web 2.0, the Web architecture and the Semantic Web, issues around individual vs. institutional vs. national vs. international approaches to service provision.

In speaking first I allowed myself the luxury of being a little provocative and, as far as I can tell from subsequent discussion, that approach was well received.  Almost inevitably, I was probably a little too technical for some of the audience.  I'm a techie at heart and a firm believer that it is not possible to form a coherent strategic view in this area without having a good understanding of the underlying technology.  But perhaps I am also a little too keen to inflict my world-view on others. My apologies to anyone who felt lost or confused.

I won't repeat my whole presentation here.  My slides are available from Slideshare and a written paper will become available on the VALA Web site as soon as I get round to sending it to the conference organisers!

I can sum up my talk in three fairly simple bullet points:

  • Firstly, that our current preoccupation with the building and filling of 'repositories' (particularly 'institutional repositories') rather than the act of surfacing scholarly material on the Web means that we are focusing on the means rather than the end (open access).  Worse, we are doing so using language that is not intuitive to the very scholars whose practice we want to influence.
  • Secondly, that our focus on the 'institution' as the home of repository services is not aligned with the social networks used by scholars, meaning that we will find it very difficult to build tools that are compelling to those people we want to use them.  As a result, we resort to mandates and other forms of coercion in recognition that we have not, so far, built services that people actually want to use.  We have promoted the needs of institutions over the needs of individuals.  Instead, we need to focus on building and/or using global scholarly social networks based on global repository services.  Somewhat oddly, ArXiv (a social repository that predates the Web let alone Web 2.0) provides us with a good model, especially when combined with features from more recent Web 2.0 services such as Slideshare.
  • Finally, that the 'service oriented' approaches that we have tended to adopt in standards like the OAI-PMH, SRW/SRU and OpenURL sit uncomfortably with the 'resource oriented' approach of the Web architecture and the Semantic Web.  We need to recognise the importance of REST as an architectural style and adopt a 'resource oriented' approach at the technical level when building services.
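For what it's worth, the contrast in that last point can be sketched in a few lines of Python. The OAI-PMH request parameters (`verb`, `identifier`, `metadataPrefix`) are those defined by the protocol; the repository host, the `/id/eprint/` path pattern and the function names are assumptions made up for this illustration, and nothing here touches the network.

```python
from urllib.parse import urlencode


def oai_pmh_get_record_url(base_url, identifier, metadata_prefix="oai_dc"):
    """'Service oriented': one endpoint, and the request names an operation
    (the verb) to be carried out on an identifier passed as a parameter."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


def resource_request(site, item_id, media_type="application/rdf+xml"):
    """'Resource oriented': the item has its own HTTP URI, and the client asks
    for a representation of it through ordinary content negotiation.
    The /id/eprint/ path is a hypothetical URI pattern."""
    url = f"{site}/id/eprint/{item_id}"
    headers = {"Accept": media_type}
    return url, headers
```

In the first case the thing of interest is hidden inside a query against a service; in the second it is a first-class resource on the Web that anything else can link to.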

I'm pretty sure that this last point caused some confusion and is something that Pete or I need to return to in future blog entries.  Suffice to say at this point that adopting a 'resource oriented' approach at the technical level does not mean that one is not interested in 'services' at the business or function level.

[Image: artwork outside the State Library of Victoria]

February 06, 2008

Google, Social Graphs, privacy & the Web

This has already received a fair amount of coverage elsewhere (Techcrunch, Danny Ayers, Read-Write Web (1), Joshua Porter (1), Read-Write Web (2), Joshua Porter (2), to pick just a few) but I thought it was worth providing a quick pointer. Last week Google announced the availability of what they are calling their Social Graph API.

The YouTube video by Brad Fitzpatrick provides a good overview.

This is a Google-provided service which offers a (service-specific) query interface to a dataset that is generated by crawling data publicly available on the Web in the form of:

Result sets are returned in the form of JSON documents.

On the technical side, I have seen a few critical comments (see discussion on Semantic Web Interest Group IRC channel) around some points of respecting Web architecture principles (e.g. the conflation of (URIs for) people and (URIs for) documents (see the draft W3C TAG finding Dereferencing HTTP URIs) and what looks like the introduction of an unnecessary new URI scheme (see the draft W3C TAG finding URNs, Namespaces and Registries)). And some concerns are also voiced about introducing dependency on a centralised Google-provided service - though of course the data is created and held independently and other providers could aggregate that data and offer similar services, even using the same interface (though whether they will be able to do so as effectively as Google can, given their experience in this area, and/or attract the user base which a Google service inevitably will, remains to be seen). And of course there are the usual issues of spamming and trust and the significance of reciprocation: who says "PeteJ is friends with XYZ" and what does XYZ have to say about that?

Overall, however, I think the approach of such a high-profile provider exposing data gathered from distributed, devolved, openly available sources on the Web, rather than from the database of a single social networking service, is being seen as a significant development.

There are some thoughtful voices of caution, however. In a comment to Joshua Porter's first post listed above, Thomas Vanderwal notes

I am quite excited about this in a positive manner. I do have great trepidation as this is exactly the tool social engineering hackers have been hoping for and working toward.

and

The Google SocialGraph API is exposing everybody who has not thought through their privacy or exposing of their connections.

And in particular, a post by Danah Boyd encourages us to reflect on the social, political and ethical implications of aggregating this data and facilitating access to that aggregation in this way, and reminds us that as individuals we live within a set of power relationships which mean that some are more vulnerable than others to the use of such technologies:

Being socially exposed is AOK when you hold a lot of privilege, when people cannot hold meaningful power over you, or when you can route around such efforts. Such is the life of most of the tech geeks living in Silicon Valley. But I spend all of my time with teenagers, one of the most vulnerable populations because of their lack of agency (let alone rights). Teens are notorious for self-exposure, but they want to do so in a controlled fashion. Self-exposure is critical for the coming of age process - it's how we get a sense of who we are, how others perceive us, and how we fit into the world. We exposure during that time period in order to understand where the edges are. But we don't expose to be put at true risk. Forced exposure puts this population at a much greater risk, if only because their content is always taken out of context. Failure to expose them is not a matter of security through obscurity... it's about only being visible in context.

Even if - as Google take pains to emphasise is the case - the individual data sources are already "public", the merging of data sources, and the change of the context in which information is presented can be significant.

The opposing view is perhaps most vividly expressed in Tim O'Reilly's comment:

The counter-argument is that all this data is available anyway, and that by making it more visible, we raise people's awareness and ultimately their behavior. I'm in the latter camp. It's a lot like the evolutionary value of pain. Search creates feedback loops that allow us to learn from and modify our behavior. A false sense of security helps bad actors more than tools that make information more visible.

One of my tests for whether a Web 2.0 innovation is "good", despite the potential for abuse, is whether it makes us smarter.

I left this post half-finished at this point last night feeling very uneasy with what I perceived as an undertone of almost Darwinian "ruthlessness" in the O'Reilly position, but at the same time struggling to articulate an alternative that I was really convinced of.

So I was delighted this morning when, on opening up my Bloglines feeds, I found an excellent post by Dan Brickley which I think reflects some of the ambivalence I was feeling ("The end of privacy by obscurity should not mean the death of privacy. Privacy is not dead, and we will not get over it. But it does need to be understood in the context of the public record"), and, really, I can only recommend that you read the post in full because I think it's a very sensitive, measured contribution to the debate, based on Dan's direct experience of the issues arising from the deployment of these technologies over several years working on FOAF.

And, far from sitting on the fence, Dan concludes with very practical recommendations for action:

  • Best practice codes for those who expose, and those who aggregate, social Web data
  • Improved media literacy education for those who are unwittingly exposing too much of themselves online
  • Technology development around decentralised, non-public record communication and community tools (eg. via Jabber/XMPP)

Google's announcement of this API has certainly brought both the technical and the social issues to the attention of a wider audience, and sparked some important debate, and perhaps that in itself is a significant contribution in an area where the landscape suddenly seems to be shifting very quickly indeed.

And if I can unashamedly take the opportunity to make another plug for the activities of the Foundation, I'm sure there's plenty of food for thought here for anyone considering a proposal to the current Eduserv Research Grants call :-)

January 30, 2008

Metadata Standards Harmonization

Mikael Nilsson announced earlier this week the availability of a document produced by the EC-funded ProLearn project, with the title Harmonization of Metadata Standards, edited by Mikael with contributions from Ambjörn Naeve, Erik Duval, David Massart & myself (though I have to admit my own direct input to this paper was quite limited!).

The document analyses a number of metadata standards and seeks to elucidate the principles and frameworks which underpin those standards, and to highlight that it is the differences and incompatibilities in those principles and frameworks which ultimately create obstacles to the development of systems working across multiple standards. Until we meet the challenge of addressing these contradictions, by "harmonizing" our metadata standards, the effective exchange of metadata instances between systems based on different standards will always be fraught with difficulty.

The paper concludes with a "manifesto" of concrete points of action for the harmonization of metadata standards generally, with specific reference to the case of the IEEE Learning Object Metadata (LOM) standard and Dublin Core, in five areas:

  • Identification: The use of URIs as globally scoped identifiers for metadata terms.
  • Abstract Models: The synchronisation of standards at the level of their abstract models, rather than through (complex, lossy) mapping between instances of different, often incompatible, abstract models.
  • Vocabulary Models: Closely related to the previous point, since the type of metadata term to be described is determined by features of the abstract model, alignment of the ways "element vocabularies" are described, with a recommendation to use RDF Schema. (I think I would have liked to see a bit more qualification/elaboration of this point, and emphasis of the dependency on an RDF-compatible abstract model: the solution isn't, IMHO, as straightforward as producing an RDFS property description corresponding to each "element" of a vocabulary which was constructed for use in the context of a tree-based model - my old "hobby horse" that a "LOM data element" is quite a different sort of thing from a "Dublin Core element".)
  • Application Profile Models: A shared understanding of what constitutes a metadata application profile.
  • Metadata formats: Syntaxes must be grounded in the abstract model(s): it is the model which drives the representation in a concrete syntax.

The paper reprises and refines some of the themes that have been addressed in earlier papers (e.g. a paper at DC-2006 on metadata frameworks and a book chapter written around the same time), but I think it provides a nice distillation of those ideas, brings in some of the current context (including the sort of informal, "subjective" metadata surfaced in many "Web 2.0" contexts and Erik's recent work on "attention metadata"), and extends them to guidance to standards developers on some practical steps for action.

The paper concludes - and here I can hear the characteristically resilient and upbeat voice of Mikael, who is always keen to point out to me that the glass I see as half-empty is in fact half-full! :-) - :

Together, these two initiatives [the IEEE LOM/Dublin Core harmonization effort and the Resource Description & Access (RDA) work in the library community], both of which include important contributions from ProLEARN members, demonstrate important progress towards harmonization of several important metadata domains – generic metadata using Dublin Core, educational metadata, and library metadata, as well as a widening from the all-digital domain to the domain of physical artefacts (books).

Harmonizing metadata specifications in the way outlined in this document seems an overwhelming task, but the steady flow of important developments still makes the future seem bright.

Learning Materials & FRBR

JISC is currently funding a study, conducted by Phil Barker of JISC CETIS, to survey the requirements for a metadata application profile for learning materials held by digital repositories. Yesterday Phil posted an update on work to date, including a pointer to a (draft) document titled Learning Materials Application Profile Pre-draft Domain Model which 'suggests a "straw man" domain model for use during the project which, hopefully, will prove useful in the analysis of the metadata requirements'.

The document outlines two models: the first is of the operations applied to a learning object (based on the OAIS model) and the second is a (very outline) entity-relationship model for a learning resource - which is based on a subset of the Functional Requirements for Bibliographic Records (FRBR) model. As far as I can recall, this is the first time I've seen the FRBR model applied to the learning object space - though of course at least some of the resources which are considered "learning resources" are also described as bibliographic resources, and I think at least some, if not many, of the functions to be supported by "learning object metadata" are analogous to those to be supported by bibliographic metadata.
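For readers who haven't met FRBR, here is a very rough sketch of its four principal ("Group 1") entity types and the relationships between them. This illustrates FRBR in general, not the subset used in Phil's draft; the attribute names (`title`, `language`, `format`, `location`) are simplifications I've chosen for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Item:
    """A single exemplar, e.g. one file held in one repository."""
    location: str


@dataclass
class Manifestation:
    """A physical or digital embodiment; it 'is exemplified by' Items."""
    format: str
    items: List[Item] = field(default_factory=list)


@dataclass
class Expression:
    """A realization of a Work; it 'is embodied in' Manifestations."""
    language: str
    manifestations: List[Manifestation] = field(default_factory=list)


@dataclass
class Work:
    """The abstract intellectual creation; it 'is realized through' Expressions."""
    title: str
    expressions: List[Expression] = field(default_factory=list)


def all_items(work):
    """Walk Work -> Expression -> Manifestation -> Item and collect the Items."""
    return [
        item
        for expression in work.expressions
        for manifestation in expression.manifestations
        for item in manifestation.items
    ]
```

Even in this cut-down form you can see why the model is attractive for learning materials: the same "work" may exist in several languages and formats, with copies held in several places.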

I do have some quibbles with the model in the current draft. Without a fuller description of the functions to be supported, it's difficult to assess whether it meets those requirements - though  I recognise that, as I think the opening comment I cited above indicates, there's an element of "chicken and egg" involved in this process: you need to have at least an outline set of entity types before you can start talking about operations on instances of those types. Clearly a FRBR-based approach should facilitate interoperability between learning object repositories and systems based on FRBR or on FRBR-derivatives like the Eprints/Scholarly Works Application Profile (SWAP). I have to admit the way "Context" is modelled at present doesn't look quite right to me, and I'm not sure about the approach of collapsing the concepts of an individual agency and a class of agents into a single "Agent" entity type in the model. (For me the distinguishing characteristic of what the SWAP calls an "Agent" is that, while it encompasses both individuals and groups, an "Agent" is something which acts as a unit, and I'm not sure that applies in the same way to the intended audience for a resource.) The other aspect I was wondering about is the potential requirement to model whole-part relationships, which, AFAICT, are excluded from the current draft version. FRBR supports a range of variant whole-part relations between instances of the principal FRBR entity types, although in the case of the SWAP, I don't think any of them were used.

But I'm getting ahead of myself here really - and probably ending up sounding more negative than I intend! I think it's a positive development to see members of the "learning metadata community" exploring - critically - the usefulness of a model emerging from the library community. I need to read the draft more carefully and formulate my thoughts more coherently, but I'll be trying to send some comments to Phil.

January 24, 2008

OAI ORE specification roll-out meetings

The OAI ORE project is co-ordinating two open meetings to introduce the (forthcoming) beta versions of the set of specifications which the project has developed to describe aggregations of resources. (The current alpha versions are available at http://www.openarchives.org/ore/0.1/.)

The first meeting is in the USA on 3 March 2008 at Johns Hopkins University, Baltimore, MD. (Press release)

The second meeting is in the UK on 4 April 2008 at the University of Southampton in conjunction with the Open Repositories 2008 conference. (Press release)

Please note that, in both cases, spaces are limited and registration is required.

January 17, 2008

Updates to DCMI metadata vocabularies

On Monday, the Dublin Core Metadata Initiative announced a significant update to the descriptions of the DCMI vocabularies, reflected in the RDFS term descriptions available in the DCMI "namespace documents". There are two main changes:

  • the categorisation of what were previously called "encoding schemes" as either Vocabulary Encoding Schemes or Syntax Encoding Schemes;
  • the introduction of assertions of domain (rdfs:domain) and range (rdfs:range) relationships for those DCMI properties with URIs in the http://purl.org/dc/terms/ namespace

Given the wide variation in the existing use of the fifteen properties of the Dublin Core Metadata Element Set (with URIs in the http://purl.org/dc/elements/1.1/ namespace), it was decided that making range assertions for these properties might introduce problems for existing applications. So no domain/range assertions are made for those properties, and fifteen new, like-named properties have been defined, with URIs in the http://purl.org/dc/terms/ namespace, and these new properties are the subject of domain/range assertions.
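To show why a range assertion matters to applications, here is a minimal sketch of the inference it licenses. The range of dcterms:creator (dcterms:Agent) is as published by DCMI; the example.org URIs, the sample triples and the `infer_types` function are invented for illustration, and a real application would use an RDF toolkit rather than tuples.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
DC = "http://purl.org/dc/elements/1.1/"
DCTERMS = "http://purl.org/dc/terms/"

# Only the new dcterms: property carries a range assertion;
# the original dc: property deliberately has none.
RANGES = {DCTERMS + "creator": DCTERMS + "Agent"}

TRIPLES = [
    # Legacy usage: a plain string as the value of dc:creator.
    ("http://example.org/doc/1", DC + "creator", "Smith, J."),
    # New usage: the value is a resource identified by a URI.
    ("http://example.org/doc/1", DCTERMS + "creator", "http://example.org/id/smith"),
]


def infer_types(triples, ranges):
    """For each triple whose property has a declared range, conclude that
    the value is an instance of that range class (the rdfs:range rule)."""
    inferred = set()
    for _subject, predicate, obj in triples:
        if predicate in ranges:
            inferred.add((obj, RDF_TYPE, ranges[predicate]))
    return inferred
```

Had the same range been asserted for dc:creator, the rule would also conclude that the string "Smith, J." is an Agent, which is exactly the sort of problem for existing data that the decision avoids.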

This set of changes represents a considerable step forward in aligning DCMI's descriptions of its own metadata terms with the vocabulary model defined by the DCMI Abstract Model. It is the culmination of a good deal of effort by the DCMI Usage Board and the DCMI Architecture Forum, and in particular by Tom Baker, the chair of the Usage Board, who has done the lion's share of the work in preparing this set of documents and in ensuring that all changes have been documented in accordance with the UB's procedures.

A full description of the changes can be found in the document Revisions to DCMI Metadata Terms.

In addition, and to complement these changes, the document Expressing Dublin Core metadata using the Resource Description Framework (RDF) is now a DCMI Recommendation.

Flickr Commons

Via a tweet by @briankelly I discovered Flickr Commons, a collaboration between the Library of Congress and Flickr to "give you a taste of the hidden treasures in the huge Library of Congress collection" and to demonstrate "how your input of a tag or two can make the collection even richer".  There are more formal announcements here and here.

Brian's initial tweet generated a mini Twitter discussion (something that some people say Twitter isn't supposed to be used for though I tend to disagree).  The general consensus seemed to be that using the resources and tools of the private sector to widen access to public collections makes perfect sense, provided ownership of the data is retained - i.e. in this case it is OK because Flickr isn't Facebook! :-)  There are certainly some very, very obvious benefits in terms of visibility of content, size of audience, quality of user experience, and so on.

On that basis alone, this is a very interesting development and one that I'm sure many parts of the cultural heritage sector will be keeping a close eye on.  Congratulations to the Library of Congress and Flickr for getting their fingers out and doing something to bring these worlds together!  I'm guessing that the two collections that have been made available via Flickr so far are part of the American Memory collection - I haven't checked.  I'm also guessing that, like much of that collection, these images are effectively in the public domain?

As I've said before, what is frustrating for those of us in the UK about this development is that it is much harder to see this kind of thing happening here, where so many of our cultural collections are locked behind restrictive 'personal', 'educational' use licences.

[Image: Operating a hand drill at Vultee-Nashville]

It'll be fascinating to see what kinds of tags people add.  The Flickr policy statement - "Any Flickr member is able to add tags or comment on these collections. If you're a dork about it, shame on you. This is for the good of humanity, dude!!" - is short and to the point.  Like it!

I took a quick browse around the 1930s-40s in Color collection/set.  Here's a nice image (see right), now tagged with 'bandana', a word not in the original catalogue record as far as I can tell.  From there it is possible to navigate to other images in the collection with the same tag - there are three at the time of writing.  OK, so this isn't an earth-shattering example of user-generated content but you get the idea, and bandana researchers all over the world might well be hugely grateful to have three more resources at their disposal! :-)

It will also be interesting to see the kind of comments that people leave.  Hopefully we'll get beyond the use of 'wow' and 'awesome'!  Wouldn't it be great to see comments by the people (or their families or colleagues) in the photos?

Final thought... we've been making the point here for a while that Flickr is a repository and that the Flickr experience is a useful benchmark when we think about how repositories should look and feel - I think this kind of development makes that even more obvious.
