
February 19, 2009

What is ORE for, really?

Pete has rather nicely answered the question, "What is ORE, really?".  In response, I'm tempted to ask a slightly different question, "What is ORE for, really?".  In the ORE User Guide - Primer we find a 'Motivating Example' section which lays out some hard-to-reject statements about the importance of aggregations but which doesn't give us many verbs - it doesn't tell us what we can expect to be able to do to those aggregations, nor why we might want to.  The previous introductory section does propose three sample uses:

Because aggregations are not well-defined on the Web, we are limited in what we can do with them, especially in terms of the services or automated processes that make the Web useful. People who wish to save or print a multiple page document must manually click through each page and invoke the appropriate browser command. Programs that transfer multiple page documents among information systems must rely on the API's of the individual system architectures and their definition of document boundaries. Search engines must use heuristics to group individual Web pages into logical documents so that search results have the proper granularity.

On the face of it these are perfectly valid functional requirements but I think the underlying point that Pete makes in his post is that ORE, on its own, doesn't meet them.  The necessary knowledge that allows one bit of software to say, "ah, these are the pages of a document and I need to print them in this order" or "these are the boundaries of a document" or "it makes sense to group these individual Web pages in this way" based on the data it gets from another bit of software is not captured by ORE.  Life is not as simple as saying "here is an aggregation" because the aggregation might not be a set of printable pages from a document, or a set of Web pages, or a coherent set of anything else for that matter, and there is very little in ORE that tells you anything about the relationship(s) between the things in the aggregation or their relationship to the outside world.  And if ORE doesn't meet its own functional requirements particularly well, it is even further from the kind of functional requirements we envisaged in the work on SWAP: requirements like, "show me the latest freely available version of this research paper".

Now, I accept that ORE does provide a way of layering that additional information (which might be in the form of SWAP for example) over the top of the aggregation.  On that basis the pertinent questions, or so it seems to me, are "given that we probably need that extra level of information to do anything useful with the aggregation, is the information about the aggregation useful on its own?" and "does SWAP capture the right level of detail and is it realistic to expect real-world systems to handle this level of complexity?".

I think the jury is out on both.  (Note: I am certainly not arguing that SWAP is better than ORE - they are sufficiently different for that to be a pointless statement anyway and the bottom line is that I'm not completely sure that I'm convinced by either if I'm absolutely honest.)  I would say that in the world of learning objects there is quite a long history of treating things as reasonably unrefined aggregations (usually referred to as 'content packages') and that in that space the usefulness of that approach has been fairly minimal.

Software as a disservice

I note from the UK Federation home page that the:

federation uses the standards-based Shibboleth software, developed by the Internet2 community in the United States. Shibboleth defines a common framework for access management that is being adopted by education and commercial sectors across the world.

A point echoed on their How it works page:

The UK federation uses the standards based Shibboleth software, developed by the Internet 2 community in the United States to facilitate the sharing of web resources that are subject to access control.

How odd... I've always understood the Federation to be based on an open standard, SAML to be precise, not on a particular piece of software, open-source or otherwise, and indeed this point is confirmed in the Federation's technical recommendations:

The UK federation uses the Security Assertion Markup Language (SAML) standards for the communication of authentication, entitlement and attribute information. The core of the federation is implemented using the Shibboleth software from Internet2. It is recognised, however, that any particular software implementation may not be suitable for all participants, and federation members may deploy any software that meets their specific service goals.

A perfectly reasonable statement.

Interestingly, I am often guilty of confusing the two (and I see the same thing happening with colleagues here at Eduserv), using the word Shibboleth effectively as shorthand for 'a profile of SAML'.  This confusion is a mistake and does significant harm to the community IMHO.

Open-source is fine and dandy but open standards are much more important and the effective positioning of a particular open-source package into a pseudo-monopolistic position does nobody any favours.  That's the position we were trying to move away from as a community!  Shibboleth is to federated access management in the UK what Hoover used to be to vacuum cleaners.  This is great if you are trying to promote a single product but very poor if you are trying to build an open community.

February 17, 2009

What is ORE, really? (A personal viewpoint)

This is another post that I've had sitting around in draft for ages, but which some recent discussion has prompted me to dig out and try to finish. Chris Keene commented on my post of some time ago about the publication of OAI ORE specs, asking for some clarification on what it is that OAI ORE provides, "what ORE is", I suppose, and I promised I'd take a stab at answering. I guess I should emphasise that this is my personal view only, but here's my attempt at a response to Chris' questions.

is it a protocol like OAI-PMH or a file standard? I read a primer (somewhat quickly) and it seems to be almost a XML file specification to be read over HTTP, which describes a resource such as a repository? is that right?

I think it's helpful - and maybe why I think it's important will become clearer by the end of this post - to distinguish between the parts of the ORE specifications which are specific to ORE and the parts which provide guidance on how to apply principles and conventions which have been defined outside of the ORE context, are not dependent on the use of the ORE-specific parts of ORE, and are more general in their application. (The distinction I'm making here doesn't quite match the separation ORE itself makes between "Specifications" and "User Guides".)

Some parts of the ORE specifications are "ORE-specific": they define or describe things that aren't defined or described elsewhere. Those things are:

  1. A simple data model for the things ORE calls a Resource Map, an Aggregation and a Proxy. This is defined by the Abstract Data Model document. Here the term "data model" is used in the sense of a "model of (some part of) the world", a "domain model", if you like - though in the ORE case, it is intended to be quite a generally applicable one.
  2. An RDF vocabulary used, in association with terms from some existing RDF vocabularies, for representing instances of that model. This is defined in human-readable form by the Vocabulary document, and in machine-processable form by the RDF/XML "namespace document" http://www.openarchives.org/ore/terms/.
  3. A variant of what I might call - following the terminology used by Alistair Miles - a "Graph Profile", a specification of some structural constraints on an RDF graph which should be met if that graph is to serve as an ORE Resource Map, a set of "triple patterns", if you like, for the triples that make up an ORE Resource Map. This is defined in Section 6 of the Abstract Data Model document.
  4. A set of conventions for representing an ORE Resource Map as an Atom Entry Document, using the Atom Syndication Format. This is defined by the ORE document Resource Map Implementation in Atom.
  5. A set of conventions for disclosing and discovering ORE Resource Maps, defined by the document Resource Map Discovery. Some of these are applications of existing conventions, but as there are some ORE-specific aspects (e.g. the definition of http://www.openarchives.org/ore/html/ as an HTML profile specifying the use of "resourcemap" as an X/HTML link type), I'm including it in this list.

Those are the things I tend to focus on when I try to answer the question "What is ORE, really?"
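As a rough illustration of the first three items in that list, here is a minimal sketch in Python of the kind of triples one might expect in a very small Resource Map, with plain tuples standing in for an RDF library. The ore:describes and ore:aggregates terms come from the ORE vocabulary; the example.org URIs and the looks_like_resource_map helper are invented for illustration, and the check is my own simplification covering only a fraction of the constraints in the Abstract Data Model.

```python
ORE = "http://www.openarchives.org/ore/terms/"

# Hypothetical URIs, for illustration only.
REM = "http://example.org/rem/1"
AGG = "http://example.org/aggregation/1"

# A graph is modelled as a set of (subject, predicate, object) tuples.
triples = {
    (REM, ORE + "describes", AGG),
    (AGG, ORE + "aggregates", "http://example.org/paper.pdf"),
    (AGG, ORE + "aggregates", "http://example.org/paper.html"),
}


def looks_like_resource_map(graph, rem_uri):
    """Simplified check (an assumption of this sketch, not the full spec):
    the map describes exactly one aggregation, and that aggregation
    aggregates at least one resource."""
    described = {o for s, p, o in graph
                 if s == rem_uri and p == ORE + "describes"}
    if len(described) != 1:
        return False
    (agg,) = described
    return any(s == agg and p == ORE + "aggregates" for s, p, o in graph)


print(looks_like_resource_map(triples, REM))
```

Nothing in those triples says what the aggregated resources are or how they relate to each other; any such information has to be added with terms from other vocabularies.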

In addition to those ORE-specific elements, the ORE specifications also provide guidelines for how to make use of various other existing specifications and conventions when deploying the ORE model:

  1. The two documents, Resource Map Implementation in RDF/XML and Resource Map Implementation in RDFa, describe how to use those two existing syntaxes, defined by W3C Recommendations, to represent Resource Maps.
  2. The document HTTP Implementation describes how to apply the principles and patterns defined by the W3C TAG's httpRange-14 resolution and the Cool URIs for the Semantic Web document.

For the most part, these documents don't really provide new information, at least in the same way those noted above do: instead, they indicate how to apply some existing, more general specifications when making use of the ORE-specific specifications listed above.

That's not to say they aren't useful guidelines: they are, not least because they "contextualise" the general information provided by the more general specifications, and provide ORE-specific examples of their use. The ORE HTTP Implementation document selects from the patterns of the Cool URIs document and provides illustrations of their use for the URIs of Aggregations and Resource Maps.
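One of the patterns in the Cool URIs document, the 303 redirect, can be sketched in a few lines of Python. This is only an illustration of the general idea under my own assumptions: the URIs, the RESOURCE_MAPS table and the respond function are all hypothetical, and a real server would also deal with content negotiation and the alternative hash URI pattern.

```python
# Hypothetical mapping from Aggregation URIs to the Resource Maps
# that describe them.
RESOURCE_MAPS = {
    "http://example.org/aggregation/1": "http://example.org/rem/1",
}


def respond(request_uri):
    """Return (status, headers) for a GET request, following the 303 pattern."""
    if request_uri in RESOURCE_MAPS:
        # The Aggregation is not itself a document, so the server points
        # the client at a description of it instead of answering 200.
        return 303, {"Location": RESOURCE_MAPS[request_uri]}
    if request_uri in RESOURCE_MAPS.values():
        # The Resource Map is a document and can be served directly.
        return 200, {"Content-Type": "application/rdf+xml"}
    return 404, {}


print(respond("http://example.org/aggregation/1"))
```

Nothing in this sketch is specific to ORE: replace "Aggregation" and "Resource Map" with any non-document resource and a document describing it, and the same pattern applies.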

My main point here is that I think it's important - particularly for audiences who are perhaps encountering some of these more general principles and conventions for the first time in the specific context of ORE - to "decouple" these two aspects, and to make clear that the use of these principles and conventions is not dependent on the ORE-specific parts, and they can - and indeed should - be applied in other contexts too. More on that later.

To answer Chris' specific questions above: no, ORE isn't a protocol; no, it isn't (what I think of as) a "file standard", though it describes the use of some existing formats; and while ORE does deal with the description of things, the things it deals with are what it calls "aggregations", not "repositories", at least as that term is typically used in the OAI context, to refer to a system that supports some functions. The concept of a repository doesn't feature in ORE.

And I'm not sure how it fits in with OAI-PMH does it replace, or improve, or cater for different needs (they both seem to cater for getting an item from one system to another).

I think ORE is largely orthogonal to OAI-PMH. ORE was not designed to "replace" or "improve" OAI-PMH. ORE can be used independently of OAI-PMH, or, as I think the Discovery document illustrates, it can be used in the context of OAI-PMH, i.e. you could expose ORE Resource Maps as metadata records over OAI-PMH.

Having said that, I do think the approaches underpinning ORE provide at least some hints of how the sort of functionality which is currently provided by OAI-PMH in an RPC-like idiom, where a client "harvester" sends protocol-specific requests to a "repository", might be offered using a more "resource-oriented" approach. Here, I'm not using the term "resource-oriented" to highlight a distinction between "resource" and "metadata", but rather to emphasise the notion of treating all the "items of interest" to the application as "resources" in the sense that the Web Architecture uses that term, assigning them URIs, and supporting interaction using the uniform interface defined by the HTTP protocol. And those "items of interest" can include resources which are descriptions of other resources, and resources which are collections of resources - collections based on various criteria. Anyway, it isn't my intention here to embark on specifying an alternative approach to OAI-PMH. :-)

Chris also asked:

And what about things like SWAP and SWORD?

Let's take the case of SWORD first, as it's the one I know less about! :-) I'm not a SWORD/Atompub expert at all but I think ORE is independent of SWORD, but designed to be usable in the context of SWORD, i.e. in principle at least, an ORE Resource Map could form the subject of a SWORD "deposit". Richard Jones ponders three variant approaches, and there is some discussion on the OAI ORE Google Group.

The case of the Scholarly Works Application Profile (SWAP) raises some issues which I think illustrate some of the points I was making above about the wider applicability of some of the conventions used within ORE.

First, I think there are differences in "scope and purpose". SWAP focuses very specifically on the "eprint" and on supporting a more or less well-defined set of operations, particularly operations related to "versioning" and the various types of relationships between resources which one encounters when dealing with those issues; ORE focuses on a rather simpler, more generic concept of "aggregation" and membership of a set. Having said that, the ORE model can also be applied to the case of the eprint, and indeed some of the examples in the specifications and in supporting presentations use examples of applying ORE to eprint resources.

Second, again as noted above, ORE makes use of some general principles and patterns for exposing resources and resource descriptions on the Web. But those principles and patterns are equally applicable in the context of data models other than ORE; what ORE calls a "Resource Map" is a specialised case of an RDF graph, and the HTTP patterns for providing access to a Resource Map are applications of patterns which can be - and are - applied to provide access to data describing resources of any type - including resources of the type described by SWAP. It isn't necessary to make use of the ORE concept of the Aggregation to use those patterns.

Now then, it is true that the SWAP documentation does not make reference to these patterns, but that is probably because of two considerations. First, at the time of its development, the primary context of use considered was that of exposing data over OAI-PMH. Second, although the httpRange-14 resolution had been agreed, it hadn't been as widely disseminated/popularised  as it has been subsequently, particularly in the form of the Linked Data tutorial and the Cool URIs document. But as I discussed in a recent post, those same principles and patterns used in ORE can be applied to the FRBR case - and if SWAP was being developed now, I'm sure reference to those approaches would be included. (Well, they would if I had any input to the process!)

Third, picking up on my attempt above to identify what I think are the "core" characteristics of ORE, ORE and SWAP are based on two different "models of the world", both of which can be applied to the case of the eprint. From the perspective of the ORE model, the eprint is viewed as an aggregation made up of a number of component/member resources; with SWAP, the perspective is that of the FRBR model - a Work realised in one or more Expressions, each embodied in one or more Manifestations, each exemplified by one or more Items (possibly with relationships between this Work and other Works, between Expressions of the same or different Works, between Works, Expressions etc and Agents, and so on).

In the FRBR case, although, as in the ORE case, there are multiple related resources involved, there isn't necessarily a notion of "aggregation" involved: a FRBR Work (or indeed any of the FRBR Group 1 entities) may be a composite/aggregate resource, but it isn't necessarily the case. There is nothing in FRBR that treats, say, the set of all the Items which exemplify the Manifestations of the Expressions of a single Work as a single aggregate entity - but FRBR does allow for the expression of whole/part relationships between instances of the various Group 1 entities.
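The difference between the two "models of the world" can be made a little more concrete with a toy sketch in Python. The class and function names below are my own shorthand, not anything defined by FRBR, SWAP or ORE, and the URIs are made up; the point is only that in the FRBR-style model the set of all Items is something you derive by walking relationships, whereas in the ORE-style model the set of members is the thing you state directly.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Item:
    uri: str


@dataclass
class Manifestation:
    items: List[Item] = field(default_factory=list)


@dataclass
class Expression:
    manifestations: List[Manifestation] = field(default_factory=list)


@dataclass
class Work:
    expressions: List[Expression] = field(default_factory=list)


@dataclass
class Aggregation:
    members: List[str] = field(default_factory=list)


def items_of(work):
    """Walk Work -> Expression -> Manifestation -> Item, collecting Item URIs."""
    return [item.uri
            for expression in work.expressions
            for manifestation in expression.manifestations
            for item in manifestation.items]


# One Work, one Expression, two Manifestations with one Item each.
work = Work([Expression([
    Manifestation([Item("http://example.org/v1.pdf")]),
    Manifestation([Item("http://example.org/v1.html")]),
])])

# Treating those Items as a single aggregate is an extra modelling step.
as_aggregation = Aggregation(members=items_of(work))
print(as_aggregation.members)
```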

So, I think it is important to remember that the choice to use either ORE or SWAP to model an eprint is just that: a modeling choice, one which enables certain functionality on the basis of the data created. Depending on what we want to achieve with the data, different choices may be appropriate.

So to return to Chris' question, it seems to me the core difference between ORE and SWAP is that they offer different models which can be applied to the "eprint". And here, I think I'm revisiting the point that, quite some time ago now, Andy made in terms of contrasting what he called "compound objects" and "complex objects". I must admit I didn't and don't like the term "complex object" - if I describe a set and its members, I understand that the set is the "compound object", but if I describe a document and its three authors, or a FRBR Work, its Expressions, their Manifestations, their Items, and a number of related Agents, which one of them is the "complex object"? - but the point remains a good one: many of the functions we wish to perform rely on our capacity to represent relationships other than relationships of "aggregation" or "composition".

Of course, the ORE concept of the Resource Map does allow for the expression of any other types of relationship, in addition to the required ore:aggregates relationship (and I think using ORE and FRBR together would require some careful analysis, given the nature of whole/part relationships in FRBR); but one can also construct descriptions expressing other types of relationship, and make those descriptions available using the community-agreed conventions of the Cool URIs document, without using ORE.

So, that turned into another rather rambling post, and I'm not sure how much it helps, but that's my take on "what ORE is".

February 12, 2009

Clouds on the Horizon

I note that the NMC's Horizon Report for 2009 was published back in January, available as both a PDF file and in a rather nice Web version supporting online commentary.

The report discusses 6 topics (mobiles, cloud computing, geo-everything, the personal Web, semantic-aware applications, and smart objects) each of which it suggests will have an impact over the next 5 years.

I was drawn to the cloud computing section first, partly because of other interests here and partly because Larry Johnson (one of the co-PIs on the Horizon project) spoke on this very topic at our symposium last year.  On cloud computing, the report says:

Educational institutions are beginning to take advantage of ready-made applications hosted on a dynamic, ever-expanding cloud that enable end users to perform tasks that have traditionally required site licensing, installation, and maintenance of individual software packages. Email, word processing, spreadsheets, presentations, collaboration, media editing, and more can all be done inside a web browser, while the software and files are housed in the cloud. In addition to productivity applications, services like Flickr (http://www.flickr.com), YouTube (http://www.youtube.com), and Blogger (http://www.blogger.com), as well as a host of other browser-based applications, comprise a set of increasingly powerful cloud-based tools for almost any task a user might need to do.

Cloud-based applications can handle photo and video editing (see http://www.splashup.com for photos and http://www.jaycut.com for videos, to name just two examples) or publish presentations and slide shows (see http://www.slideshare.net or http://www.sliderocket.com). Further, it is very easy to share content created with these tools, both in terms of collaborating on its creation and distributing the finished work. Applications like those listed here can provide students and teachers with free or low-cost alternatives to expensive, proprietary productivity tools. Browser-based, thin-client applications are accessible with a variety of computer and even mobile platforms, making these tools available anywhere the Internet can be accessed. The shared infrastructure approaches embedded in the cloud computing concept offer considerable potential for large scale experiments and research that can make use of untapped processing power.

We are just beginning to see direct applications for teaching and learning other than the simple availability of platform-independent tools and scalable data storage. This set of technologies has clear potential to distribute applications across a wider set of devices and greatly reduce the overall cost of computing. The support for group work and collaboration at a distance embedded in many cloud-based applications could be a benefit applicable to many learning situations.

However, the report also notes that a level of caution is necessary:

The cloud does have certain drawbacks. Unlike traditional software packages that can be installed on a local computer, backed up, and are available as long as the operating system supports them, cloud-based applications are services offered by companies and service providers in real time. Entrusting your work and data to the cloud is also a commitment of trust that the service provider will continue to be there, even in face of changing market and other conditions. Nonetheless, the economics of cloud computing are increasingly compelling. For many institutions, cloud computing offers a cost-effective solution to the problem of how to provide services, data storage, and computing power to a growing number of Internet users without investing capital in physical machines that need to be maintained and upgraded on-site.

The report goes on to provide some examples of use.

I doubt that there will be much here that is exactly 'news' to regular readers of this blog (though this is not true for other sections of the report which cover areas that we don't really deal with here). On the other hand, it is good to see this stuff laid out in a relatively mainstream publication. I remain bemused (as I was last year) at the relatively low level of coverage this report gets in the UK and wonder, in a kind of off the top of my head way, whether a UK or European version of this report would be a worthwhile activity?

February 11, 2009

Two snippets of OpenID news

A couple of bits of OpenID-related news that are worth noting...

First, both Paypal and Facebook have recently joined the OpenID Foundation.  The two are interesting for different reasons.  Paypal, it seems to me, brings with it the functional requirements of an environment that is very different from OpenID's original, low-trust, arena of blog posting and commenting.  On the other hand Facebook brings a high commitment to usability and, despite a generally bad press (or perhaps because of a generally bad press?), it seems to me is actually making some of the right kind of noises around openness.

In short, these moves are very interesting in terms of the future of OpenID and, to a certain extent, bring with them the potential of a shot in the arm for the credibility of OpenID in the education space.

Second, the work that Plaxo have been doing, using OpenID and OAuth to streamline their use of Google as an OpenID provider (OP), also looks very interesting.  As the Wired article says:

This is momentum. All of a sudden, OpenID is looking like it has a very bright future.

Repository usability - take 2

OK... following my 'rant' yesterday about repository user-interface design generally (and, I suppose, the Edinburgh Research Archive in particular), Chris Rusbridge suggested I take a similar look at an ePrints.org-based repository and pointed to a research paper by Les Carr in the University of Southampton School of Electronics and Computer Science repository by way of example.  I'm happy to do so though I'm going to try and limit myself to a 10 minute survey of the kind I did yesterday.

The paper in question was originally published in The Computer Journal (Oxford University Press) and is available from http://comjnl.oxfordjournals.org/cgi/content/abstract/50/6/703 though I don't have the necessary access rights to see the PDF that OUP make available.  (In passing, it's good to see that OUP have little or no clue about Cool URIs, resorting instead to the totally useless (in Web terms at least) DOI as text string, "doi:10.1093/comjnl/bxm067" as their means of identification :-( ).

The jump-off page for the article in the repository is at http://eprints.ecs.soton.ac.uk/14352/, a URL that, while it isn't too bad, could probably be better.  How about replacing 'eprints.ecs' with 'research', for example, to guard against changes in repository content (things other than eprints) and organisational structure (the day Computer Science becomes a separate school)?

The jump-off page itself is significantly better in usability terms than the one I looked at yesterday.  The page <title> is set correctly for a start.  Hurrah!  Further, the link to the PDF of the paper is near the top of the page and a mouse-over pop-up shows clearly what you are going to get when you follow the link.  I've heard people bemoaning the use of pop-ups like this in usability terms in the past but I have to say, in this case, I think it works quite well.  On the downside, the link text is just 'PDF' which is less informative than it should be.

Following the abstract a short list of information about the paper is presented.  Author names are linked (good) though for some reason keywords are not (bad).  I have no idea what a 'Performance indicator' is in this context, even less so the value "EZ~05~05~11".  Similarly I don't see what use the ID Code is and I don't know if Last Modified refers to the paper or the information about the paper.  On that basis, I would suggest some mouse-over help text to explain these terms to end-users like myself.

The 'Look up in Google Scholar' link fails to deliver any useful results, though I'm not sure if that is a fault on the part of Google Scholar or the repository?  In any case, a bit of Ajax that indicated how many results that link was going to return would be nice (note: I have no idea off the top of my head if it is possible to do that or not).

Each of the references towards the bottom of the page has a 'SEEK' button next to it (why uppercase?).  As with my comments yesterday, this is a button that acts like a link (from my perspective as the end-user) so it is not clear to me why it has been implemented in the way it has (though I'm guessing that it is to do with limitations in the way Paracite, the target of the link, has been implemented).  My gut feeling is that there is something unRESTful in the way this is working, though I could be wrong.  In any case, it seems to be using an HTTP POST request where an HTTP GET would be more appropriate.
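To show what I mean by GET being more appropriate, here is a minimal Python sketch of a lookup expressed as a URL. The resolver address and the 'citation' parameter name are hypothetical and are not Paracite's actual interface; the point is simply that a safe, repeatable lookup can be given a URL of its own, which means it can sit behind an ordinary hypertext link, be bookmarked and be cached.

```python
from urllib.parse import urlencode

# Hypothetical resolver endpoint; not Paracite's real address.
RESOLVER = "http://resolver.example.org/search"


def lookup_url(citation):
    """Build a URL that identifies the lookup, suitable for a plain <a href>."""
    return RESOLVER + "?" + urlencode({"citation": citation})


url = lookup_url("Author, A. (2007) An example title")
print(url)
```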

There is no shortage of embedded metadata in the page, at least in terms of volume, though it is interesting that <meta name="DC.subject" ... > is provided whereas the far more useful <meta name="keywords" ... > is not.

The page also contains a large number of <link rel="alternate" ... > tags in the page header - matching the wide range of metadata formats available for manual export from the page (are end-users really interested in all this stuff?) - so many in fact, that I question how useful these could possibly be in any real-world machine-to-machine scenario.

Overall then, I think this is a pretty good HTML page in usability terms.  I don't know how far this is an "out of the box" ePrints.org installation or how much it has been customised but I suggest that it is something that other repository managers could usefully take a look at.

Usability and SEO don't centre around individual pages of course, so the kind of analysis that I've done here needs to be much broader in its reach, considering how the repository functions as a whole site and, ultimately, how the network of institutional repositories and related services (since that seems to be the architectural approach we have settled on) function in usability terms.

Once again, my fundamental point here is not about individual repositories.  My point is that I don't see the issues around "eprint repositories as a part of the Web" featuring high up the agenda of our discussions as a community (and I suggest the same is true of  learning object repositories), in part because we have allowed ourselves to get sidetracked by discussion of community-specific 'interoperability' solutions that we then tend to treat as some kind of magic bullet, rolling them out whenever someone questions one approach or another.

Even where usability and SEO are on the agenda (as appears to be the case here) it's not enough that individual repositories think about the issues, even if some or most make good decisions, because most end-users (i.e. researchers) need to work across multiple repositories (typically globally) and therefore we need the usability of the system as a whole to function correctly.  We therefore need to think about these issues as a community.

February 10, 2009

Repository usability

In his response to my previous post, Freedom, Google-juice and institutional mandates, Chris Rusbridge used one of his Ariadne articles as an illustrative example.

By way of, err... reward, I want to take a quick look (in what I'm going to broadly call 'usability' terms) at the way in which that article is handled by the Edinburgh Research Archive (ERA).  Note that I'm treating the ERA as an example here - I don't consider it to be significantly different to other institutional repositories and, on that basis, I assume that most of what I am going to say will also apply to other repository implementations.

Much of this is basic Web 101 stuff...

The original Ariadne article is at http://www.ariadne.ac.uk/issue46/rusbridge/ - an HTML document containing embedded links to related material in the References section (internally linked from the relevant passage in the text).  The version deposited into ERA is a 9 page PDF snapshot of the original article.  I assume that PDF has been used for preservation reasons, though I'm not sure.  Hypertext links in the original HTML continue to work in the PDF version.

So far, so good.  I would naturally tend to assume that the HTML version is more machine-readable than the PDF version and on that basis is 'better', though I admit that I can't provide solid evidence to back up that statement.

The repository 'jump-off' page for the article is at http://www.era.lib.ed.ac.uk/handle/1842/1476 though the page itself tells us (in a human-readable way) that we should use http://hdl.handle.net/1842/1476 for citation purposes.

So we already have 4 URLs for this article and no explicit machine-readable knowledge that they all identify the same resource.  Further, the URLs that 15 years of using a Web browser lead me to use most naturally (those of the jump-off page, the original Ariadne article or the PDF file) are not the one that the page asks me to use for citation purposes.  So, in Web usability terms, I would most naturally bookmark (e.g. using del.icio.us) the wrong URL for this article and where different scholars choose to bookmark different URLs, services like del.icio.us are unlikely to be able to tell that they are referring to the same thing (recent experience of Google Scholar notwithstanding).
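A toy Python sketch of the bookmarking problem follows. The list of bookmarks is made up, and the SAME_AS table is exactly the kind of equivalence that, as far as I can see, is not published anywhere in machine-readable form; I am assuming it here purely to show the difference it would make to a service counting bookmarks.

```python
from collections import Counter

# Hypothetical bookmarks made by four different readers of the same article.
bookmarks = [
    "http://www.era.lib.ed.ac.uk/handle/1842/1476",
    "http://hdl.handle.net/1842/1476",
    "http://www.ariadne.ac.uk/issue46/rusbridge/",
    "http://hdl.handle.net/1842/1476",
]

# The equivalence a service would need to be told about (assumed here).
SAME_AS = {
    "http://www.era.lib.ed.ac.uk/handle/1842/1476": "http://hdl.handle.net/1842/1476",
    "http://www.ariadne.ac.uk/issue46/rusbridge/": "http://hdl.handle.net/1842/1476",
}

# Without the mapping, each distinct URL is counted as a separate thing.
naive = Counter(bookmarks)

# With it, every bookmark is folded onto the URL recommended for citation.
merged = Counter(SAME_AS.get(url, url) for url in bookmarks)

print(len(naive), len(merged))
```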

OK, so now let's look more closely at the jump-off page...

Firstly, what is the page title (as contained in the HTML <title> tag)?  Something useful like "Excuse Me... Some Digital Preservation Fallacies?".  No, it's "Edinburgh Research Archive : Item 1842/1476". Nice!? Again, if I bookmark this page in del.icio.us, that is the label that is going to appear next to the URL, unless I manually edit it.

Secondly, what other metadata and/or micro-formats are embedded into this page?  All that nice rich Dublin Core metadata that is buried away inside the repository?  Nah.  Nothing.  A big fat zilch.  Not even any <meta name="keywords" ...> stuff.  I mean, come on.  The information is there on the page right in front of me... it's just not been marked up using even the most basic of HTML tags.  Most university Web site managers would get shot for turning out this kind of rubbish HTML.

Note I'm not asking for embedded Dublin Core metadata here - I'm asking for useful information to be embedded in useful (and machine-readable) ways where there are widely accepted conventions for how to do that.
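As a rough illustration of the kind of basic markup I mean, here is a minimal Python sketch that builds a descriptive <title> and a keywords <meta> tag from information a repository already holds.  The function name, the title layout and the keywords are my own assumptions for the sake of the example; this is not how ERA or any particular repository platform generates its pages.

```python
from html import escape


def render_head(title, keywords, site_name="Edinburgh Research Archive"):
    """Return an HTML <head> fragment with a descriptive title and keywords.

    Hypothetical helper: the layout of the title is an assumption.
    """
    # Put the article title first so that bookmarking tools pick up a useful label.
    page_title = f"{title} | {site_name}"
    keyword_list = ", ".join(keywords)
    return (
        "<head>\n"
        f"  <title>{escape(page_title)}</title>\n"
        f'  <meta name="keywords" content="{escape(keyword_list)}">\n'
        "</head>"
    )


head = render_head(
    "Excuse Me... Some Digital Preservation Fallacies?",
    ["digital preservation", "repositories"],  # hypothetical keywords
)
print(head)
```

Nothing here needs new standards or protocols; it only surfaces, in the page itself, information that is already displayed to human readers.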

So, let's look at those human-readable keywords again.  Why aren't they hyperlinked to all other entries in ERA that use those keywords (in the way that Flickr and most other systems do with tags)?  Yes, the institutional repository architectural approach means that we'd only get to see other stuff in ERA, not all that useful I'll grant you, but it would be better than nothing.

Similarly, what about linking the author's name to all other entries by that author?  Ditto with the publisher's name.  Let's encourage a bit of browsing here shall we?  This is supposed to be about resource discovery after all!

So finally, let's look at the links on the page.  There at the bottom is a link labelled 'View/Open' which takes me to the PDF file - phew, the thing I'm actually looking for!  Not the most obvious spot on the page but I got there in the end.  Unfortunately, I assume that every single other item in ERA uses that exact same link text for the PDF (or other format) files.  Link text is supposed to indicate what is being linked to - it's a kind of metadata for goodness sake.

And then, right at the bottom of the page, there's a button marked "Show full item record".  I have no idea what that is but I'll click on it anyway.  Oh, it's what other services call "Additional information".  But why use an HTML form button to hide a plain old hypertext link?  Strange or what?

OK, I apologise... I've lapsed into sarcasm for effect.  But the fact remains that repository jump-off pages are potentially some of the most important Web pages exposed by universities (this is core research business after all) yet they are nearly always some of the worst examples of HTML to be found on the academic Web.  I can draw no other conclusion than that the Web is seen as tangential in this space.

I've taken 10 minutes to look at these pages... I don't doubt that there are issues that I've missed.  Clearly, if one took time to look around at different repositories one would find examples that were both better and worse (I'm honestly not picking on ERA here... it just happened to come to hand).  But in general, this stuff is atrocious - we can and should do better.

Freedom, Google-juice and institutional mandates

[Note: This entry was originally posted on the 9th Feb 2009 but has been updated in light of comments.]

An interesting thread has emerged on the American Scientist Open Access Forum based on the assertion that in Germany "freedom of research forbids mandating on university level" (i.e. that a mandate to deposit all research papers in an institutional repository (IR) would not be possible legally).  Now, I'm not familiar with the background to this assertion and I don't understand the legal basis on which it is made.  But it did cause me to think about why there might be an issue related to academic freedom caused by IR deposit mandates by funders or other bodies.

In responding to the assertion, Bernard Rentier says:

No researcher would complain (and consider it an infringement upon his/ her academic freedom to publish) if we mandated them to deposit reprints at the local library. It would be just another duty like they have many others. It would not be terribly useful, needless to say, but it would not cause an uproar. Qualitatively, nothing changes. Quantitatively, readership explodes.

Quite right. Except that the Web isn't like a library so the analogy isn't a good one.

If we ignore the rarefied, and largely useless, world of resource discovery based on the OAI-PMH and instead consider the real world of full-text indexing, link analysis and, well... yes, Google then there is a direct and negative impact of mandating a particular place of deposit. For every additional place that a research paper surfaces on the Web there is a likely reduction in the Google-juice associated with each instance caused by an overall diffusion of inbound links.

So, for example, every researcher who would naturally choose to surface their paper on the Web in a location other than their IR (because they have a vibrant central (discipline-based) repository (CR) for example) but who is forced by mandate to deposit a second copy in their local IR will probably see a negative impact on the Google-juice associated with their chosen location.

Now, I wouldn't argue that this is an issue of academic freedom per se, and I agree with Bernard Rentier (earlier in his response) that the freedom to "decide where to publish is perfectly safe" (in the traditional academic sense of the word 'publish'). However, in any modern understanding of 'to publish' (i.e. one that includes 'making available on the Web') there is a compromise going on here.

The problem is that we continue to think about repositories as if they were 'part of a library', rather than as a 'true part of the fabric of the Web', a mindset that encourages us to try (and fail) to redefine the way the Web works (through the introduction of things like the OAI-PMH for example) and that leads us to write mandates that use words like 'deposit in a repository' (often without even defining what is meant by 'repository') rather than 'make openly available on the Web'.

In doing so I think we do ourselves, and the long term future of open access, a disservice.

Addendum (10 Feb 2009): In light of the comments so far (see below) I confess that I stand partially corrected.  It is clear that Google is able to join together multiple copies of research papers.  I'd love to know the heuristics they use to do this and I'd love to know how successful those heuristics are in the general case.  Nonetheless, on the basis that they are doing it, and on the assumption that in doing so they also combine the Google juice associated with each copy, I accept that my "dispersion of Google-juice" argument above is somewhat weakened.

There are other considerations however, not least the fact that the Web Architecture explicitly argues against URI aliases:

Good practice: Avoiding URI aliases
A URI owner SHOULD NOT associate arbitrarily different URIs with the same resource.

The reasons given align very closely to the ones I gave above, though couched in more generic language:

Although there are benefits (such as naming flexibility) to URI aliases, there are also costs. URI aliases are harmful when they divide the Web of related resources. A corollary of Metcalfe's Principle (the "network effect") is that the value of a given resource can be measured by the number and value of other resources in its network neighborhood, that is, the resources that link to it.

The problem with aliases is that if half of the neighborhood points to one URI for a given resource, and the other half points to a second, different URI for that same resource, the neighborhood is divided. Not only is the aliased resource undervalued because of this split, the entire neighborhood of resources loses value because of the missing second-order relationships that should have existed among the referring resources by virtue of their references to the aliased resource.
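The split described in the passage above can be illustrated with a small Python sketch.  The linking pages and link counts are invented purely for illustration, and treating all three URLs as aliases for one resource is an assumption carried over from the ERA example earlier in this entry.

```python
from collections import Counter

# Hypothetical inbound links: (linking page, URI it points at).
links = [
    ("page-a", "http://hdl.handle.net/1842/1476"),
    ("page-b", "http://www.era.lib.ed.ac.uk/handle/1842/1476"),
    ("page-c", "http://www.ariadne.ac.uk/issue46/rusbridge/"),
    ("page-d", "http://www.ariadne.ac.uk/issue46/rusbridge/"),
]

# Without any knowledge of aliases, each URI is counted separately.
per_uri = Counter(target for _, target in links)

# With a (hypothetical) alias table, every URI maps to one canonical URI.
canonical = "http://hdl.handle.net/1842/1476"
aliases = {target: canonical for _, target in links}
merged = Counter(aliases[target] for _, target in links)

print(max(per_uri.values()))  # best-linked single alias
print(merged[canonical])      # links once the aliases are joined up
```

The point of the sketch is only that the merged count is available to a service that knows the aliases; a service that does not know them sees several weakly linked resources instead of one well-linked one.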

Now, I think that some of the discussions around linked data are pushing at the boundaries of this guidance, particularly in the area of non-information resources.  Nonetheless, I think this is an area in which we have to tread carefully.  I stand by my original statement that we do not treat scholarly papers as though they are part of the fabric of the Web - we do not link between them in the way we link between other Web pages.  In almost all respects we treat them as bits of paper that happen to have been digitised and the culprits are PDF, the OAI-PMH, an over-emphasis on preservation and a collective lack of imagination about the potential transformative effect of the Web on scholarly communication.  We are tampering at the edges and the result is a mess.

February 06, 2009

Open orienteering

It seems to me that there is now quite a general acceptance of what the 'open access' movement is trying to achieve. I know that not everyone buys into that particular world-view but, for those of us that do, we know where we are headed and most of us will probably recognise it when we get there. Here, for example, is Yishay Mor writing to the open-science mailing list:

I would argue that there's a general principle to consider here. I hold that any data collected by public money should be made freely available to the public, for any use that contributes to the public good. Strikes me as a no-brainer, but of course - we have a long way to go.

A fairly straight-forward articulation of the open access position and a goal that I would thoroughly endorse.

The problem is that we don't always agree as a community about how best to get there.

I've been watching two debates flow past today, both showing some evidence of lack of consensus in the map reading department, though one much more long-standing than the other. Firstly, the old chestnut about the relative merits of central repositories vs. institutional repositories (initiated in part by Bernard Rentier's blog post, Institutional, thematic or centralised repositories?) but continued on various repository-related mailing lists (you know the ones!). Secondly, a newer debate about whether formal licences or community norms provide the best way to encourage the open sharing of research data by scientists and others, a debate which I tried to sum up in the following tweet:

@yishaym summary of open data debate... OD is good & needs to be encouraged - how best to do that? 1 licences (as per CC) or 2 social norms

It's great what can be done with 140 characters.

I'm more involved in the first than the second and therefore tend to feel more aggrieved at lack of what I consider to be sensible progress. In particular, I find the recurring refrain that we can join stuff back together using the OAI-PMH and therefore everything is going to be OK both tiresome and laughable.

If there's a problem here, and perhaps there isn't, then it is that the arguments and debates are taking place between people who ultimately want the same thing. I'm reminded of Monty Python's Life of Brian:

Brian: Excuse me. Are you the Judean People's Front?
Reg: Fuck off! We're the People's Front of Judea

It's like we all share the same religion but we disagree about which way to face while we are praying. Now, clearly, some level of debate is good. The point at which it becomes not good is when it blocks progress which is why, generally speaking, having made my repository-related architectural concerns known a while back, I try and resist the temptation to reiterate them too often.

Cameron Neylon has a nice summary of the licensing vs. norms debate on his blog. It's longer and more thoughtful than my tweet! This is a newer debate and I therefore feel more positive that it is able to go somewhere. My initial reaction was that a licensing approach is the most sensible way forward but having read through the discussion I'm no longer so sure.

So what's my point? I'm not sure really... but if I wake up in 4 years' time and the debate about licensing vs. norms is still raging, as has pretty much happened with the discussion around CRs vs. IRs, I'll be very disappointed.

February 05, 2009

httpRange-14, Cool URIs & FRBR

The W3C Technical Architecture Group's resolution to what had become known as "the httpRange-14 question" introduced a distinction between two subsets of resources. The first is the subset of resources for which representations may be served using the HTTP protocol, which the Architecture of the World Wide Web refers to as "information resources". The second, by implication at least, is a disjoint subset of resources which may be identified using the http URI scheme but which are not "representable": no representations of them are provided using the HTTP protocol, though they may be described by "information resources" identified by their own distinct URIs.

A subsequent note by Leo Sauermann and Richard Cyganiak of the W3C Semantic Web Education and Outreach (SWEO) Interest Group, Cool URIs for the Semantic Web provides an extended discussion of the issue, together with a set of "patterns" for assigning http URIs and for the appropriate responses to HTTP requests using such URIs. This document uses the terms "Web documents" and "real-world objects" to refer to the two classes of resources, noting that the latter class includes "real-world objects like people and cars, and even abstract ideas and non-existing things like a mythical unicorn".

The question raised by this division is where the boundary between the two classes lies. From the viewpoint of the consumer/user of URIs, the point is somewhat moot: they simply need to follow the information provided, in the form of responses to HTTP requests by the owner of the URI (or possibly also from metadata provided by other parties). Information about the nature of the resource can be provided both by HTTP response codes and by explicit descriptions of the resource. Following the httpRange-14 guideline, if the HTTP response to a GET is 2xx, then the resource identified by the URI is an information resource. I think it's worth emphasising the point that this is the only response code which allows the user to make a "positive" inference about resource type; if the response is 303 See Other, that in itself says nothing about the type of the resource.
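The one-way nature of that inference can be written down as a minimal Python sketch.  The function name and the returned labels are my own, not part of the TAG resolution or the Cool URIs document.

```python
def infer_resource_type(status_code):
    """Say what a GET response status alone tells a consumer about a resource.

    Hypothetical helper following the httpRange-14 guideline as described
    in the text: only a 2xx response supports a positive inference.
    """
    if 200 <= status_code < 300:
        return "information resource"
    # A 303 See Other (or anything else) says nothing about the resource type.
    return "unknown"


print(infer_resource_type(200))
print(infer_resource_type(303))
```

Note that a 303 does not mean "not an information resource"; it simply leaves the question open, and the consumer has to rely on any explicit description it is redirected to.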

The URI owner, on the other hand, needs to make the choice, for each resource, whether to provide a representation or not, based on their understanding of the nature of the resources they are exposing on the Web. The Architecture of the World Wide Web document offers the following (somewhat "slippery", to me!) criterion for a resource being an "information resource": The distinguishing characteristic of these resources is that all of their essential characteristics can be conveyed in a message.

I've been trying to think through how this set of conventions should be applied to the case of the Functional Requirements for Bibliographic Records (FRBR) and more specifically to the "FRBR Group 1 Entities", i.e. instances of the classes of Work, Expression, Manifestation and Item which FRBR uses to model the universe of resources described by bibliographic records.

The work on the development of the Scholarly Works Application Profile (SWAP) focused primarily on deployment in the context of the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). OAI-PMH provides an RPC-like layer on top of HTTP, and SWAP focuses on how to deliver descriptions of the SWAP/FRBR entities using that RPC layer, rather than the question of how those entities could be represented as Web resources.

I'm starting from the FRBR model here; I'm asking the question, "If I'm exposing on the Web a set of resources based on the FRBR model, are there any general rules for which of these resources are 'representable'?". I'm not trying to address the broader question of whether/how the distinctions made in the Web Architecture reflect, or can be defined in terms of, the FRBR model.

Taking the "easy" cases first, FRBR defines a Work as follows:

The first entity defined in the model is work: a distinct intellectual or artistic creation.

A work is an abstract entity; there is no single material object one can point to as the work. We recognize the work through individual realizations or expressions of the work, but the work itself exists only in the commonality of content between and among the various expressions of the work.

- FRBR Section 3.2.1

It seems fairly clear from this description that a FRBR Work is a "conceptual resource", like an idea. In the terms of the "Cool URIs" document, it is a "real-world object", albeit an abstract one, and not a "Web document". And on this basis, while a FRBR Work may be identified by an http URI, an HTTP server should not return a representation and a 200 status code in response to a GET for that URI, though the server may provide access (using one of the patterns provided in the Cool URIs document) to a description of the Work, a "Web document" itself identified by a distinct URI.

A similar argument can, I think, be made for the case of the FRBR Expression:

An expression is the specific intellectual or artistic form that a work takes each time it is "realized." Expression encompasses, for example, the specific words, sentences, paragraphs, etc. that result from the realization of a work in the form of a text, or the particular sounds, phrasing, etc. resulting from the realization of a musical work. The boundaries of the entity expression are defined, however, so as to exclude aspects of physical form, such as typeface and page layout, that are not integral to the intellectual or artistic realization of the work as such.

- FRBR Section 3.2.2

Again we're dealing with an "abstraction", albeit a more "specific", less "generic" one than a Work. And on this basis, like the Work, it falls into the category of "real-world objects", and again, while an Expression may be identified by an http URI and an HTTP server may provide access to a description of an Expression, it should not provide a representation of an Expression.

In considering the other two FRBR Group 1 entity types, Manifestation and Item, it is perhaps easiest to consider the application of FRBR to physical resources and to digital resources separately.

Considering the physical world first, it is perhaps helpful to consider the Item first, as it seems to me it also sheds some light on the nature of the Manifestation. The FRBR definition of Item is very much grounded in the physical:

The entity defined as item is a concrete entity. It is in many instances a single physical object (e.g., a copy of a one-volume monograph, a single audio cassette, etc.). There are instances, however, where the entity defined as item comprises more than one physical object (e.g., a monograph issued as two separately bound volumes, a recording issued on three separate compact discs, etc.).

- FRBR Section 3.2.4

These Items are the "real world objects" which traditionally libraries have been concerned with managing (acquiring, storing, maintaining, providing access to, distributing, disposing of). From the perspective of httpRange-14 and the "Cool URIs" document, then, these "real-world objects" may be described by Web documents, but they are not themselves Web documents. So a physical Item may be identified by an http URI, and an HTTP server may provide access to a description of such an Item, but it can't provide a representation of it.

Now take the case of the Manifestation:

The third entity defined in the model is manifestation: the physical embodiment of an expression of a work.

The entity defined as manifestation encompasses a wide range of materials, including manuscripts, books, periodicals, maps, posters, sound recordings, films, video recordings, CD-ROMs, multimedia kits, etc. As an entity, manifestation represents all the physical objects that bear the same characteristics, in respect to both intellectual content and physical form.

- FRBR Section 3.2.3

Again a Manifestation is dealing with physical form, but furthermore, a Manifestation is still an abstraction: its role in the FRBR model is to capture characteristics that are true of a set of individual Items which "exemplify" that Manifestation (even in the case where a unique Item is the sole exemplar of a Manifestation). Seen in this light, then, I think a Manifestation also falls into the category of things which may be described by one or more Web documents, but is not itself a Web document.

In turning to the context of the digital world, I think it's worth highlighting that although the FRBR specification contains some references to "electronic resources", the coverage of digital resources in the text is very limited, and indeed the introduction acknowledges that "the dynamic nature of entities recorded in digital formats" is one of the areas that require further analysis.

It seems relatively straightforward to transfer the concepts of Work and Expression into the digital sphere, as they are independent of the form in which content is "embodied".

The question of what constitutes a FRBR Item in the digital domain is rather more difficult to pin down, particularly since the FRBR document itself focuses exclusively on the physical in its discussion of the Item. Ingbert Floyd and Allen Renear take on this challenge in their poster, "What Exactly is an Item in the Digital World?" (ASIST, 2007).

In the physical world, they argue, the thing which carries information is the same thing for which information managers typically describe characteristics such as provenance, condition, and access restrictions - the attributes of the Item in FRBR. In the digital world, this is no longer true: information is carried by the physical state of some component of a computer system, something the authors call an instance of "patterned matter and energy" (PME) - but information managers rarely concern themselves with managing such entities and recording their attributes. Entities such as a "file", however, are the focus of management and description - but a digital "file" isn't really the "concrete entity" that FRBR calls an Item. Two approaches to the Item are possible, then: the Item-as-PME approach, which "maintains that a fundamental aspect of being an item is being a concrete physical thing", or the Item-as-"file" approach which adopts the pragmatic position that "items are the things, whatever their nature (physical, abstract, or metaphorical), which play the role in bibliographic control that FRBR assigns to items".

The question I'm posing here is, I think, a different, and narrower, one from the broader one grappled with by Floyd and Renear: if we are treating a FRBR Item as a Web resource, for the case of an exemplar of a Manifestation in digital format, is that resource an "information resource", for which a representation can be served? From the Web Architecture perspective, it seems to me that it is the case that "all of their essential characteristics can be conveyed in a message". The Scholarly Works Application Profile takes this approach: the copy of a PDF document available from an institutional repository server, or the copy of an mp3 file constituting an episode of a podcast, is the FRBR Item. These, after all, are the things which "play the role in bibliographic control that FRBR assigns to items".

A further issue here is that FRBR lists "Access Address (URL)" as an attribute of a Manifestation, rather than of an Item, and I'm not sure whether this is compatible with the SWAP approach.

The concept of Manifestation in the digital case seems the most difficult to categorise. On the one hand, as noted above, FRBR states that a Manifestation is an abstraction corresponding to a set of objects with the same characteristics of both form and content. On the other hand, it seems to me that one could argue that for Manifestations in digital form, it is true that "all of their essential characteristics can be conveyed in a message", since the notion of Manifestation encapsulates that of specific intellectual content "embodied" in a specific form. For consistency with the physical case, I guess the former would be best, but I'm not sure on this point.

So those rather lengthy musings might suggest the following (tentative, I hasten to add... I'm mostly just trying to think through my rationale here) approach to identifying and serving representations/descriptions of the FRBR entities, at least using the approach that SWAP takes to the Item:

Entity Type: HTTP Behaviour

Work: Identify using http URI. Provide description of Work. Either use a "hash URI", or respond to GET with 303 See Other and http URI of description (generic or using content negotiation).

Expression: Identify using http URI. Provide description of Expression. Either use a "hash URI", or respond to GET with 303 See Other and http URI of description (generic or using content negotiation).

Manifestation (Physical): Identify using http URI. Provide description of Manifestation. Either use a "hash URI", or respond to GET with 303 See Other and http URI of description (generic or using content negotiation).

Manifestation (Digital): Identify using http URI. Provide description of Manifestation. Either use a "hash URI", or respond to GET with 303 See Other and http URI of description (generic or using content negotiation).

Item (Physical): Identify using http URI. Provide description of Item. Either use a "hash URI", or respond to GET with 303 See Other and http URI of description (generic or using content negotiation).

Item (Digital): Identify using http URI. Provide representation of Item (respond to GET with 200 and representation).
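The tentative approach summarised above can be sketched in Python as follows.  This is only an illustration of the 303 pattern (it doesn't cover the "hash URI" alternative), and the Response class, the function name and the "/description" URI suffix are all hypothetical choices of mine rather than anything defined by FRBR, SWAP or the Cool URIs document.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Response:
    """A stripped-down stand-in for an HTTP response (hypothetical)."""
    status: int
    location: Optional[str] = None
    body: Optional[bytes] = None


def respond_to_get(entity_type, form, uri, content=None):
    """Decide how to answer a GET for a FRBR Group 1 entity.

    entity_type: "Work", "Expression", "Manifestation" or "Item"
    form: "physical", "digital", or None for Work and Expression
    """
    # Only a digital Item is treated as an information resource (the SWAP approach).
    if entity_type == "Item" and form == "digital":
        return Response(status=200, body=content)
    # Everything else: redirect to a separate document that describes the entity.
    # The "/description" suffix is an assumption made for this sketch.
    return Response(status=303, location=uri + "/description")


print(respond_to_get("Work", None, "http://example.org/work/1"))
print(respond_to_get("Item", "digital", "http://example.org/item/1", b"%PDF-"))
```

If the digital Manifestation were instead judged to be an information resource, only the first condition would need to change, which is perhaps a sign of how finely balanced that particular decision is.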

One final point.... The use of HTTP content negotiation on the Web introduces a dimension which I'm not sure sits very easily within the FRBR model. Using content negotiation, I may decide to expose a single resource on the Web, using a single URI, but configure my server so that, at any point in time, depending on a range of factors (the preferences of the client, the IP address of the client, etc.) it returns different representations of that resource - representations which may vary by (amongst other things) media type (format) or language. From the FRBR perspective, such variations would, I think, result in the creation of different Manifestations (for the media type case) or even different Expressions (for the language case). In the SWAP case, I think the implication is that Item representations should not vary, at least by media type or language.
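To make the content negotiation case concrete, here is a deliberately simplified Python sketch of a server choosing between representations of a single resource on the basis of the client's Accept header.  The function name is hypothetical, and the parsing ignores much of what the HTTP specification allows (partial wildcards such as text/*, specificity rules, malformed headers); it is a sketch of the idea, not a conforming implementation.

```python
def choose_media_type(accept_header, available):
    """Pick the available media type that the client prefers most (simplified)."""
    preferences = []
    for part in accept_header.split(","):
        fields = part.strip().split(";")
        media_type = fields[0].strip()
        q = 1.0  # a missing q parameter means full preference
        for param in fields[1:]:
            name, _, value = param.strip().partition("=")
            if name.strip() == "q":
                q = float(value)
        preferences.append((q, media_type))
    # Highest q first; the sort is stable, so header order breaks ties.
    preferences.sort(key=lambda pair: pair[0], reverse=True)
    for q, media_type in preferences:
        if q <= 0:
            continue  # q=0 means "not acceptable"
        if media_type in available:
            return media_type
        if media_type == "*/*" and available:
            return available[0]
    return None


formats = ["text/html", "application/pdf"]
print(choose_media_type("text/html;q=0.8, application/pdf", formats))
```

The same URI can therefore yield an HTML representation for one client and a PDF for another, which is exactly the variation that, in FRBR terms, would seem to cross Manifestation boundaries.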


