July 21, 2010

Getting techie... what questions should we be asking of publishers?

The Licence Negotiation team here are thinking about the kinds of technical questions they should be asking publishers and other content providers as part of their negotiations with them. The aim isn't to embed the answers to those questions in contractual clauses - rather, it is to build up a useful knowledge base of surrounding information that may be useful to institutions and others who are thinking about taking up a particular agreement.

My 'starter for 10' set of questions goes like this:

  • Do you make any commitment to the persistence of the URLs for your published content? If so, please give details. Do you assign DOIs to your published content? Are you members of CrossRef?
  • Do you support a search API? If so, what standard(s) do you support?
  • Do you support a metadata harvesting API? If so, what standard(s) do you support?
  • Do you expose RSS and/or Atom feeds for your content? If so, please describe what feeds you offer?
  • Do you expose any form of Linked Data about your published content? If so, please give details.
  • Do you generate OpenURLs as part of your web interface? Do you have a documented means of linking to your content based on bibliographic metadata fields? If so, please give details.
  • Do you support SAML (Service Provider) as a means of controlling access to your content? If so, which version? Are you a member of the UK Access Management Federation? If you also support other methods of access control, please give details.
  • Do you grant permission for the preservation of your content using LOCKSS, CLOCKSS and/or PORTICO? If so, please give details.
  • Do you have a statement about your support for the Web Accessibility Initiative (WAI)? If so, please give details?

Does this look like a reasonable and sensible set of questions for us to be asking of publishers? What have I missed? Something about open access perhaps?

December 03, 2009

On being niche

I spoke briefly yesterday at a pre-IDCC workshop organised by REPRISE.  I'd been asked to talk about Open, social and linked information environments, which resulted in a re-hash of the talk I gave in Trento a while back.

My talk didn't go too well to be honest, partly because I was on last and we were over-running so I felt a little rushed but more because I'd cut the previous set of slides down from 119 to 6 (4 really!) - don't bother looking at the slides, they are just images - which meant that I struggled to deliver a very coherent message.  I looked at the most significant environmental changes that have occurred since we first started thinking about the JISC IE almost 10 years ago.  The resulting points were largely the same as those I have made previously (listen to the Trento presentation) but with a slightly preservation-related angle:

  • the rise of social networks and the read/write Web, and a growth in resident-like behaviour, means that 'digital identity' and the identification of people have become more obviously important and will remain an important component of provenance information for preservation purposes into the future;
  • Linked Data (and the URI-based resource-oriented approach that goes with it) is conspicuous by its absence in much of our current digital library thinking;
  • scholarly communication is increasingly diffusing across formal and informal services both inside and outside our institutional boundaries (think blogging, Twitter or Google Wave for example) and this has significant implications for preservation strategies.

That's what I thought I was arguing anyway!

I also touched on issues around the growth of the 'open access' agenda, though looking at it now I'm not sure why because that feels like a somewhat orthogonal issue.

Anyway... the middle bullet has to do with being mainstream vs. being niche.  (The previous speaker, who gave an interesting talk about MyExperiment and its use of Linked Data, made a similar point).  I'm not sure one can really describe Linked Data as being mainstream yet, but one of the things I like about the Web Architecture and REST in particular is that they describe architectural approaches that haven proven to be hugely successful, i.e. they describe the Web.  Linked data, it seems to me, builds on these in very helpful ways.  I said that digital library developments often prove to be too niche - that they don't have mainstream impact.  Another way of putting that is that digital library activities don't spend enough time looking at what is going on in the wider environment.  In other contexts, I've argued that "the only good long-term identifier, is a good short-term identifier" and I wonder if that principle can and should be applied more widely.  If you are doing things on a Web-scale, then the whole Web has an interest in solving any problems - be that around preservation or anything else.  If you invent a technical solution that only touches on scholarly communication (for example) who is going to care about it in 50 or 100 years - answer, not all that many people.

It worries me, for example, when I see an architectural diagram (as was shown yesterday) which has channels labelled 'OAI-PMH', XML' and 'the Web'!

After my talk, Chris Rusbridge asked me if we should just get rid of the JISC IE architecture diagram.  I responded that I am happy to do so (though I quipped that I'd like there to be an archival copy somewhere).  But on the train home I couldn't help but wonder if that misses the point.  The diagram is neither here nor there, it's the "service-oriented, we can build it all", mentality that it encapsulates that is the real problem.

Let's throw that out along with the diagram.

November 23, 2009

Memento and negotiating on time

Via Twitter, initially in a post by Lorcan Dempsey, I came across the work of Herbert Van de Sompel and his comrades from LANL and Old Dominion University on the Memento project:

The project has since been the topic of an article in New Scientist.

The technical details of the Memento approach are probably best summarised in the paper "Memento: Time Travel for the Web", and Herbert has recently made available a presentation which I'll embed here, since it includes some helpful graphics illustrating some of the messaging in detail:

Memento seeks to take advantage of the Web Architecture concept that interactions on the Web are concerned with exchanging representations of resources. And for any single resource, representations may vary - at a single point in time, variant representations may be provided, e.g. in different formats or languages, and over time, variant representations may be provided reflecting changes in the state of the resource. The HTTP protocol incorporates a feature called content negotiation which can be used to determine the most appropriate representation of a resource - typically according to variables such as content type, language, character set or encoding. The innovation that Memento brings to this scenario is the proposition that content negotiation may also be applied to the axis of date-time. i.e. in the same way that a client might express a preference for the language of the representation based on a standard request header, it could also express a preference that the representation should reflect resource state at a specified point in time, using a custom accept header (X-Accept-Datetime).

More specifically, Memento uses a flavour of content negotiation called "transparent content negotiation" where the server provides details of the variant representations available, from which the client can choose. Slides 26-50 in Herbert's presentation above illustrate how this technique might be applied to two different cases: one in which the server to which the initial request is sent is itself capable of providing the set of time-variant representations, and a second in which that server does not have those "archive" capabilities but redirects to (a URI supported by) a second server which does.

This does seem quite an ingenious approach to the problem, and one that potentially has many interesting applications, several of which Herbert alludes to in his presentation.

What I want to focus on here is the technical approach, which did raise a question in my mind. And here I must emphasise that I'm really just trying to articulate a question that I've been trying to formulate and answer for myself: I'm not in a position to say that Memento is getting anything "wrong", just trying to compare the Memento proposition with my understanding of Web architecture and the HTTP protocol, or at least the use of that protocol in accordance with the REST architectural style, and understand whether there are any divergences (and if there are, what the implications are).

In his dissertation in which he defines the REST architectural style, Roy Fielding defines a resource as follows:

More precisely, a resource R is a temporally varying membership function MR(t), which for time t maps to a set of entities, or values, which are equivalent. The values in the set may be resource representations and/or resource identifiers. A resource can map to the empty set, which allows references to be made to a concept before any realization of that concept exists -- a notion that was foreign to most hypertext systems prior to the Web. Some resources are static in the sense that, when examined at any time after their creation, they always correspond to the same value set. Others have a high degree of variance in their value over time. The only thing that is required to be static for a resource is the semantics of the mapping, since the semantics is what distinguishes one resource from another.

On representations, Fielding says the following, which I think is worth quoting in full. The emphasis in the first and last sentences is mine.

REST components perform actions on a resource by using a representation to capture the current or intended state of that resource and transferring that representation between components. A representation is a sequence of bytes, plus representation metadata to describe those bytes. Other commonly used but less precise names for a representation include: document, file, and HTTP message entity, instance, or variant.

A representation consists of data, metadata describing the data, and, on occasion, metadata to describe the metadata (usually for the purpose of verifying message integrity). Metadata is in the form of name-value pairs, where the name corresponds to a standard that defines the value's structure and semantics. Response messages may include both representation metadata and resource metadata: information about the resource that is not specific to the supplied representation.

Control data defines the purpose of a message between components, such as the action being requested or the meaning of a response. It is also used to parameterize requests and override the default behavior of some connecting elements. For example, cache behavior can be modified by control data included in the request or response message.

Depending on the message control data, a given representation may indicate the current state of the requested resource, the desired state for the requested resource, or the value of some other resource, such as a representation of the input data within a client's query form, or a representation of some error condition for a response. For example, remote authoring of a resource requires that the author send a representation to the server, thus establishing a value for that resource that can be retrieved by later requests. If the value set of a resource at a given time consists of multiple representations, content negotiation may be used to select the best representation for inclusion in a given message.

So at a point in time t1, the "temporally varying membership function" maps to one set of values, and - in the case of a resource whose representations vary over time - at another point in time t2, it may map to another, different set of values. To take a concrete example, suppose at the start of 2009, I launch a "quote of the day", and I define a single resource that is my "quote of the day", to which I assign the URI http://example.org/qotd/. And I provide variant representations in XHTML and plain text. On 1 January 2009 (time t1), my quote is "From each according to his abilities, to each according to his needs", and I provide variant representations in those two formats, i.e. the set of values for 1 January 2009 is those two documents. On 2 January 2009 (time t2), my quote is "Those who do not move, do not notice their chains", and again I provide variant representations in those two formats, i.e. the set of values for 2 January 2009 (time t2) is two XHTML and plain text documents with different content from those provided at time t1.

So, moving on to that second piece of text I cited, my interpretation of the final sentence as it applies to HTTP (and, as I say, I could be wrong about this) would be that the RESTful use of the HTTP GET method is intended to retrieve a representation of the current state of the resource. It is the value set at that point in time which provides the basis for negotiation. So, in my example here, on 1 January 2009, I offer XHTML and plain text versions of my "From each according to his abilities..." quote via content negotiation, and on 2 January 2009, I offer XHTML and plain text versions of my "Those who do not move..." quotations. i.e. At two different points in time t1 and t2, different (sets of) representations may be provided for a single resource, reflecting the different state of that resource at those two different points in time, but at either of those points in time, the expectation is that each representation of the set available represents the state of the resource at that point in time, and only members of that set are available via content negotiation. So although representations may vary by language, content-type etc, they should be in some sense "equivalent" (Roy Fielding's term) in terms of their representation of the current state of the resource.

I think the Memento approach suggests that on 2 January 2009, I could, using the date-time-based negotiation convention, offer all four of those variants listed above (and on each day into the future, a set which increases in membership as I add new quotes). But it seems to me that is at odds with the REST style, because the Memento approach requires that representations of different states of the resource (i.e. the state of the resource at different points in time) are all made available as representations at a single point in time.

I appreciate that (even if my interpretation is correct, which it may not be) the constraints specified by the REST architectural style are just that: a set of constraints which, if observed, generate certain properties/characteristics in a system. And if some of those constraints are relaxed or ignored, then those properties change. My understanding is not good enough to pinpoint exactly what the implications of this particular point of divergence (if indeed it is one!) would be - though as Herbert notes in hs presentation, it would appear that there would be implications for cacheing.

But as I said, I'm really just trying to raise the questions which have been running around my head and which I haven't really been able to answer to my own satisfaction.

As an aside, I think Memento could probably achieve quite similar results by providing some metadata (or a link to another document providing that metadata) which expressed the relationships between the time-variant resource and all the time-specific variant resources, rather than seeking to manage this via HTTP content negotiation.

Postscript: I notice that, in the time it has taken me to draft this post, Mark Baker has made what I think is a similar point in a couple of messages (first, second) to the W3C public-lod mailing list.

August 20, 2009

What researchers think about data preservation and access

There's an interesting report in the current issues of Ariadne by Neil Beagrie, Robert Beagrie and Ian Rowlands, Research Data Preservation and Access: The Views of Researchers, fleshing out some of the data behind the UKRDS Report, which I blogged about a while back.

I have a minor quibble with the way the data has been presented in the report, in that it's not overly clear how the 179 respondents represented in Figure 1 have been split across the three broad areas (Sciences, Social Sciences, and Arts and Humanities) that appear in subsequent figures. One is left wondering how significant the number of responses in each of the 3 areas was?  I would have preferred to see Figure 1 organised in such a way that the 'departments and faculties' were grouped more obviously into the broad areas.

That aside, I think the report is well worth reading.  I'll just highlight what the authors perceive to be the emerging themes:

  • It is clear that different disciplines have different requirements and approaches to research data.
  • Current provision of facilities to encourage and ensure that researchers have data stores where they can deposit their valuable data for safe-keeping and for sharing, as appropriate, varies from discipline to discipline.
  • Local data management and preservation activity is very important with most data being held locally.
  • Expectations about the rate of increase in research data generated indicate not only higher data volumes but also an increase in different types of data and data generated by disciplines that have not until recently been producing volumes of digital output.
  • Significant gaps and areas of need remain to be addressed.

The Findings of the Scoping Study and Research Data Management Workshop (undertaken at the University of Oxford and part of the work that infomed the Ariadne article) provides an indication of the "top requirements for services to help [researchers] manage data more effectively":

  • Advice on practical issues related to managing data across their life cycle. This help would range from assistance in producing a data management/sharing plan; advice on best formats for data creation and options for storing and sharing data securely; to guidance on publishing and preserving these research data.
  • A secure and user-friendly solution that allows storage of large volume of data and sharing of these in a controlled fashion way allowing fine grain access control mechanisms.
  • A sustainable infrastructure that allows publication and long-term preservation of research data for those disciplines not currently served by domain specific services such as the UK Data Archive, NERC Data Centres, European Bioinformatics Institute and others.
  • Funding that could help address some of the departmental challenges to manage the research data that are being produced.

Pretty high level stuff so nothing particularly surprising there. It seems to me that some work drilling down into each of these areas might be quite useful.

June 19, 2009

Repositories and linked data

Last week there was a message from Steve Hitchcock on the UK [email protected] mailing list noting Tim Berners-Lee's comments that "giving people access to the data 'will be paradise'". In response, I made the following suggestion:

If you are going to mention TBL on this list then I guess that you really have to think about how well repositories play in a Web of linked data?

My thoughts... not very well currently!

Linked data has 4 principles:

  • Use URIs as names for things
  • Use HTTP URIs so that people can look up those names.
  • When someone looks up a URI, provide useful information.
  • Include links to other URIs. so that they can discover more things.

Of these, repositories probably do OK at 1 and 2 (though, as I’ve argued before, one might question the coolness of some of the http URIs in use and, I think, the use of cool URIs is implicit in 2).

3, at least according to TBL, really means “provide RDF” (or RDFa embedded into HTML I guess), something that I presume very few repositories do?

Given lack of 3, I guess that 4 is hard to achieve. Even if one was to ignore the lack of RDF or RDFa, the fact that content is typically served as PDF or MS formats probably means that links to other things are reasonably well hidden?

It’d be interesting (academically at least), and probably non-trivial, to think about what a linked data repository would look like? OAI-ORE is a helpful step in the right direction in this regard.

In response, various people noted that there is work in this area: Mark Diggory on work at DSpace, Sally Rumsey (off-list) on the Oxford University Research Archive and parallel data repository (DataBank), and Les Carr on the new JISC dotAC Rapid Innovation project. And I'm sure there is other stuff as well.

In his response, Mark Diggory said:

So the question of "coolness" of URI tends to come in second to ease of implementation and separation of services (concerns) in a repository. Should "Coolness" really be that important? We are trying to work on this issue in DSpace 2.0 as well.

I don't get the comment about "separation of services". Coolness of URIs is about persistence. It's about our long term ability to retain the knowledge that a particular URI identifies a particular thing and to interact with the URI in order to obtain a representation of it. How coolness is implemented is not important, except insofar as it doesn't impact on our long term ability to meet those two aims.

Les Carr also noted the issues around a repository minting URIs "for things it has no authority over (e.g. people's identities) or no knowledge about (e.g. external authors' identities)" suggesting that the "approach of dotAC is to make the repository provide URIs for everything that we consider significant and to allow an external service to worry about mapping our URIs to "official" URIs from various "authorities"". An interesting area.

As I noted above, I think that the work on OAI-ORE is an important step in helping to bring repositories into the world of linked data. That said, there was some interesting discussion on Twitter during the recent OAI6 conference about the value of ORE's aggregation model, given that distinct domains will need to layer their own (different) domain models onto those aggregations in order to do anything useful. My personal take on this is that it probably is useful to have abstracted out the aggregation model but that the hypothesis still to be tested that primitive aggregation is useful despite every domain needing own richer data and, indeed, that we need to see whether the way the ORE model gets applied in the field turns out to be sensible and useful.

March 20, 2009

Unlocking Audio

I spent the first couple of days this week at the British Library in London, attending the Unlocking Audio 2 conference.  I was there primarily to give an invited talk on the second day.

You might notice that I didn't have a great deal to say about audio, other than to note that what strikes me as interesting about the newer ways in which I listen to music online (specifically Blip.fm and Spotify) is that they are both highly social (almost playful) in their approach and that they are very much of the Web (as opposed to just being 'on' the Web).

What do I mean by that last phrase?  Essentially, it's about an attitude.  It's about seeing being mashed as a virtue.  It's about an expectation that your content, URLs and APIs will be picked up by other people and re-used in ways you could never have foreseen.  Or, as Charles Leadbeater put it on the first day of the conference, it's about "being an ingredient".

I went on to talk about the JISC Information Environment (which is surprisingly(?) not that far off its 10th birthday if you count from the initiation of the DNER), using it as an example of digital library thinking more generally and suggesting where I think we have parted company with the mainstream Web (in a generally "not good" way).  I noted that while digital library folks can discuss identifiers forever (if you let them!) we generally don't think a great deal about identity.  And even where we do think about it, the approach is primarily one of, "who are you and what are you allowed to access?", whereas on the social Web identity is at least as much about, "this is me, this is who I know, and this is what I have contributed". 

I think that is a very significant difference - it's a fundamentally different world-view - and it underpins one critical aspect of the difference between, say, Shibboleth and OpenID.  In digital libraries we haven't tended to focus on the social activity that needs to grow around our content and (as I've said in the past) our institutional approach to repositories is a classic example of how this causes 'social networking' issues with our solutions.

I stole a lot of the ideas for this talk, not least Lorcan Dempsey's use of concentration and diffusion.  As an aside... on the first day of the conference, Charles Leadbeater introduced a beach analogy for the 'media' industries, suggesting that in the past the beach was full of a small number of large boulders and that everything had to happen through those.  What the social Web has done is to make the beach into a place where we can all throw our pebbles.  I quite like this analogy.  My one concern is that many of us do our pebble throwing in the context of large, highly concentrated services like Flickr, YouTube, Google and so on.  There are still boulders - just different ones?  Anyway... I ended with Dave White's notions of visitors vs. residents, suggesting that in the cultural heritage sector we have traditionally focused on building services for visitors but that we need to focus more on residents from now on.  I admit that I don't quite know what this means in practice... but it certainly feels to me like the right direction of travel.

I concluded by offering my thoughts on how I would approach something like the JISC IE if I was asked to do so again now.  My gut feeling is that I would try to stay much more mainstream and focus firmly on the basics, by which I mean adopting the principles of linked data (about which there is now a TED talk by Tim Berners-Lee), cool URIs and REST and focusing much more firmly on the social aspects of the environment (OpenID, OAuth, and so on).

Prior to giving my talk I attended a session about iTunesU and how it is being implemented at the University of Oxford.  I confess a strong dislike of iTunes (and iTunesU by implication) and it worries me that so many UK universities are seeing it as an appropriate way forward.  Yes, it has a lot of concentration (and the benefits that come from that) but its diffusion capabilities are very limited (i.e. it's a very closed system), resulting in the need to build parallel Web interfaces to the same content.  That feels very messy to me.  That said, it was an interesting session with more potential for debate than time allowed.  If nothing else, the adoption of systems about which people can get religious serves to get people talking/arguing.

Overall then, I thought it was an interesting conference.  I suspect that my contribution wasn't liked by everyone there - but I hope it added usefully to the debate.  My live-blogging notes from the two days are here and here.

March 05, 2009

A National Research Data Service for the UK?

I attended the A National Research Data Service for the UK? meeting at the Royal Society in London last week and my live-blogged notes are available for those who want more detail.  Chris Rusbridge also blogged the day on the Digital Curation Blog - session 1, session 2, session 3 and session 4.  FWIW, I think that Chris's posts are more comprehensive and better than my live-blogged notes.

The day was both interesting and somewhat disappointing...

Interesting primarily because of the obvious political tension in the room (which I characterised on Twitter as a potential bun-fight between librarians and the rest but which in fact is probably better summed up as a lack of shared agreement around centralist (discipline-based) solutions vs. institutional solutions).

Disappointing because the day struck me more as a way of presenting a done-deal than as a real opportunity for debate.

The other thing that I found annoying was the constant parroting of the view that "researchers want to share their data openly" as though this is an obvious position.  The uncomfortable fact is that even the UKRDS report's own figures suggest that less than half (43%) of those surveyed "expressed the need to access other researchers' data" - my assumption therefore is that the proportion currently willing to share their data openly will be much smaller.

Don't take this as a vote against open access, something that I'm very much in favour of.  But, as we've found with eprint archives, a top-down "thou shalt deposit because it is good for you" approach doesn't cut it with researchers - it doesn't result in cultural change.  Much better to look for, and actively support, those areas where open sharing of data occurs naturally within a community or discipline, thus demonstrating its value to others.

That said, a much more fundamental problem facing the provision of collaborative services to the research community is that funding happens nationally but research happens globally (or at least across geographic/funding boundaries) - institutions are largely irrelevant whichever way you look at it [except possibly as an agent of long term preservation - added 6 March 2009].  Resolving that tension seems paramount to me though I have no suggestions as to how it can be done.  It does strike me however that shared discipline-based services come closer to the realities of the research world than do institutional services.

December 22, 2008

Web, scissors, stone

It strikes me that we continue to do a lot of stuff as though we lived in a paper-only world...

The whole scholarly communication cycle is a great example. Yes stuff surfaces on the Web - but only as PDF, a digital representation of traditional paper, and the way we cite and link between academic papers still happens as though we lived in a paper-based world by and large.

The JISC-funded Preservation of Web Resources Handbook [PDF], made available a few weeks back, is another nice example.  I have no idea whether it's any good or not! At over 100 A4 pages, it's impractical to read on screen and I don't really want to print out 50-odd sheets of paper just to read it either.

As funders ourselves, Eduserv is just as guilty as the JISC of funding people to produce long reports that it is difficult to do anything with other than convert to PDF and slap up on the Web as a single file.  I hate PDF! We need to get more imaginitive in the way we surface stuff. And we desperately need to get out of the mindset that more equals better.

We now regularly ask our projects to blog throughout the life of their work, with the expectation that doing so is good for engagement and results in something incremental and part of the fabric of the social Web. I'm sure the JISC do this as well. But even where we do that, we still often end up with a final deliverable that is the complete antithesis of what the Web is about - i.e. a long PDF file. See our series of snapshots of the use of Second Life in UK HE for an example (though we are currently in discussion about how we improve things going forward).  Is there any reason why a handbook isn't delivered as a proper Web document (and, no, I don't mean one of those horrible automated conversions) for example?

And more generally... what about funding documents tailored for use on a mobile device like an iPhone? What about CommonCraft-style videos? What about ongoing Twitter streams? And if the format really does demand a traditional 'book' (as might well be the case here) why not optimise it for one of the print on demand services so that people can end up with a nicely bound volume rather than some scrappily printed, stapled together collection of A4 pages?

And please don't give me that crap about PDF being the best preservation format. Sheesh!

December 18, 2008

JISC IE and e-Research Call briefing day

I attended the briefing day for the JISC's Information Environment and e-Research Call in London on Monday and my live-blogged notes are available on eFoundations LiveWire for anyone that is interested in my take on what was said.

Quite an interesting day overall but I was slightly surprised at the lack of name badges and a printed delegate list, especially given that this event brought together people from two previously separate areas of activity. Oh well, a delegate list is promised at some point.  I also sensed a certain lack of buzz around the event - I mean there's almost £11m being made available here, yet nobody seemed that excited about it, at least in comparison with the OER meeting held as part of the CETIS conference a few weeks back.  At that meeting there seemed to be a real sense that the money being made available was going to result in a real change of mindset within the community.  I accept that this is essentially second-phase money, building on top of what has gone before, but surely it should be generating a significant sense of momentum or something... shouldn't it?

A couple of people asked me why I was attending given that Eduserv isn't entitled to bid directly for this money and now that we're more commonly associated with giving grant money away rather than bidding for it ourselves.

The short answer is that this call is in an area that is of growing interest to Eduserv, not least because of the development effort we are putting into our new data centre capability.  It's also about us becoming better engaged with the community in this area.  So... what could we offer as part of a project team? Three things really: 

  • Firstly, we'd be very interested in talking to people about sustainable hosting models for services and content in the context of this call.
  • Secondly, software development effort, particularly around integration with Web 2.0 services.
  • Thirdly, significant expertise in both Semantic Web technologies (e.g. RDF, Dublin Core and ORE) and identity standards (e.g. Shibboleth and OpenID).

If you are interested in talking any of this thru further, please get in touch.

November 07, 2008

Some (more) thoughts on repositories

I attended a meeting of the JISC Repositories and Preservation Advisory Group (RPAG) in London a couple of weeks ago.  Part of my reason for attending was to respond (semi-formally) to the proposals being put forward by Rachel Heery in her update to the original Repositories Roadmap that we jointly authored back in April 2006.

It would be unfair (and inappropriate) for me to share any of the detail in my comments since the update isn't yet public (and I suppose may never be made so).  So other than saying that I think that, generally speaking, the update is a step in the right direction, what I want to do here is rehearse the points I made which are applicable to the repositories landscape as I see it more generally.  To be honest, I only had 5 minutes in which to make my comments in the meeting, so there wasn't a lot of room for detail in any case!

Broadly speaking, I think three points are worth making.  (With the exception of the first, these will come as no surprise to regular readers of this blog.)


There may well be some disagreement about this but it seems to me that the collection of material we are trying to put into institutional repositories of scholarly research publications is a reasonably well understood and measurable corpus.  It strikes me as odd therefore that the metrics we tend to use to measure progress in this space are very general and uninformative.  Numbers of institutions with a repository for example - or numbers of papers with full text.  We set targets for ourselves like, "a high percentage of newly published UK scholarly output [will be] made available on an open access basis" (a direct quote from the original roadmap).  We don't set targets like, "80% of newly published UK peer-reviewed research papers will be made available on an open access basis" - a more useful and concrete objective.

As a result, we have little or no real way of knowing if are actually making significant progress towards our goals.  We get a vague feel for what is happening but it is difficult to determine if we are really succeeding.

Clearly, I am ignoring learning object repositories and repositories of research data here because those areas are significantly harder, probably impossible, to measure in percentage terms.  In passing, I suggest that the issues around learning object repositories, certainly the softer issues like what motivates people to deposit, are so totally different from those around research repositories that it makes no sense to consider them in the same space anyway.

Even if the total number of published UK peer-reviewed research papers is indeed hard to determine, it seems to me that we ought to be able to reach some kind of suitable agreement about how we would estimate it for the purposes of repository metrics.  Or we could base our measurements on some agreed sub-set of all scholarly output - the peer-reviewed research papers submitted to the current RAE (or forthcoming REF) for example.

A glass half empty view of the world says that by giving ourselves concrete objectives we are setting ourselves up for failure.  Maybe... though I prefer the glass half full view that we are setting ourselves up for success.  Whatever... failure isn't really failure - it's just a convenient way of partitioning off those activities that aren't worth pursuing (for whatever reason) so that other things can be focused on more fully.  Without concrete metrics it is much harder to make those kinds of decisions.

The other issue around metrics is that if the goal is open access (which I think it is), as opposed to full repositories (which are just a means to an end) then our metrics should be couched in terms of that goal.  (Note that, for me at least, open access implies both good management and long-term preservation and that repositories are only one way of achieving that).

The bottom-line question is, "what does success in the repository space actually look like?".  My worry is that we are scared of the answers.  Perhaps the real problem here is that 'failure' isn't an option?

Executive summary: our success metrics around research publications should be based on a percentage of the newly published peer-reviewed literature (or some suitable subset thereof) being made available on an open access basis (irrespective of how that is achieved).

Emphasis on individuals

Across the board we are seeing a growing emphasis on the individual, on user-centricity and on personalisation (in its widest sense).  Personal Learning Environments, Personal Research Environments and the suite of 'open stack' standards around OpenID are good examples of this trend.  Yet in the repository space we still tend to focus most on institutional wants and needs.  I've characterised this in the past in terms of us needing to acknowledge and play to the real-world social networks adopted by researchers.  As long as our emphasis remains on the institution we are unlikely to bring much change to individual research practice.

Executive summary: we need to put the needs of individuals before the needs of institutions in terms of how we think about reaching open access nirvana.

Fit with the Web

I written and spoken a lot about this in the past and don't want to simply rehash old arguments.  That said, I think three things are worth emphasising:


Global discipline-based repositories are more successful at attracting content than institutional repositories.  I can say that with only minimal fear of contradiction because our metrics are so poor - see above :-).  This is no surprise.  It's exactly what I'd expect to see.  Successful services on the Web tend to be globally concentrated (as that term is defined by Lorcan Dempsey) because social networks tend not to follow regional or organisational boundaries any more.

Executive summary: we need to work out how to take advantage of global concentration more fully in the repository space.

Web architecture

Take three guiding documents - the Web Architecture itself, REST, and the principles of linked data.  Apply liberally to the content you have at hand - repository content in our case.  Sit back and relax. 

Executive summary: we need to treat repositories more like Web sites and less like repositories.

Resource discovery

On the Web, the discovery of textual material is based on full-text indexing and link analysis.  In repositories, it is based on metadata and pre-Web forms of citation.  One approach works, the other doesn't.  (Hint: I no longer believe in metadata as it is currently used in repositories).  Why the difference?  Because repositories of research publications are library-centric and the library world is paper-centric - oh, and there's the minor issue of a few hundred years of inertia to overcome.  That's the only explanation I can give anyway.  (And yes, since you ask... I was part of the recent movement that got us into this mess!). 

Executive summary: we need to 1) make sure that repository content is exposed to mainstream Web search engines in Web-friendly formats and 2) make academic citation more Web-friendly so that people can discovery repository content using everyday tools like Google.

Simple huh?!  No, thought not...

I realise that most of what I say above has been written (by me) on previous occasions in this blog.  I also strongly suspect that variants of this blog entry will continue to appear here for some time to come.

August 15, 2008

Preserving virtual worlds

The BBC have a short article about digital preservation entitled, Writing the history of virtual worlds.  Virtual worlds and other gaming environments, being highly dynamic in nature, bring with them special considerations in terms of long term preservation and the article describes an approach being used at the University of Texas involving interviews and story telling with both makers and users.

Belatedly, I also note that last week's Wallenburg Summer Institute at Standford University in the US included a workshop entitled Preserving Knowledge in Virtual Worlds.  Stanford are (or were?) partners in another virtual world preservation project, Preserving Virtual Worlds, led by Jerome McDonnough at the University of Illinois (someone who is probably better known by many readers as the technical architect of METS), which was funded by the Library of Congress a year or so ago.

At the risk of making a gross generalisation, it looks like the Texas work is attempting to preserve the 'experience' of virtual worlds, whereas the Illinois work has been focusing more on the content.  It strikes me that virtual worlds such as Second Life that are surrounded by a very significant level of blogging, image taking, video making and podcasting activity are being preserved indirectly (in some sense at least) through the preservation of that secondary material (I'm making the assumption here, possibly wrongly, that much of that associated material is making its way into the Internet Archive in one form or another)?

February 26, 2008

Preserving the ABC of scholarly communication

Somewhat belatedly, I've been re-reading Lorcan Dempsey's post from October last year, Quotes of the day (and other days?): persistent academic discourse, in which he ponders the role of academic blogs in scholarly discourse and the apparent lack of engagement by institutions in thinking about their preservation.

I like Grainne Conole's characterisation of the place of blogging in scholarly communication:

  • Academic paper: reporting of findings against a particular narrative, grounded in the literature and related work; style – formal, academic-speak
  • Conference presentation: awareness raising of the work, posing questions and issues about the work, style – entertaining, visual, informal
  • Blogging – snippets of the work, reflecting on particular issues, style – short, informal, reflective

(even though it would have been better in alphabetical order! :-) ) and I'm tempted to wonder whether and how this characterisation will change over the next few years, as blogging continues to grow in importance as a communication medium.

Lorcan ends with:

Universities and university libraries are recognizing that they have some responsibility to the curation of the intellectual outputs of their academics and students. So far, this has not generally extended to thinking about blogs. What, if anything, should the Open University or Harvard be doing to make sure that this valuable discourse is available to future readers as part of the scholarly record?

As I argued in my most recent post about repositories, I suspect that most academics would currently expect to host their blogs outside their institution.  (Note that I'm hypothesising here, since I haven't asked any real academics this question - however, the breadth and depth of external blog services seems so overwhelming that it would be hard for institutions to try to compel their academics to use an institutional blogging service IMHO). This leaves institutions (or anyone else for that matter) that want to curate the blogging component of their intellectual output with a problem.  Somehow, they have to aggregate their part of the externally held scholarly record into an internal form, such that they can curate it.

I don't see this as an impossible task - though clearly, there is a challenge here in terms of both technology and policy.

In the context of the debate about institutional repositories, my personal opinion is that this situation waters down the argument that repositories have to be institutional because that is the only way in which the scholarly record can be preserved.  Sorry, I just don't buy it.

August 06, 2007

Open, online journals != PDF ?

I note that Volume 2, Number 1 of the International Journal of Digital Curation (IJDC) has been announced with a healthy looking list of peer-reviewed articles.  Good stuff.

I mention this partly because I helped set up the technical infrastructure for the journal using the Open Journal System, an open source journal management and publishing system developed by the Public Knowledge Project, while I was still at UKOLN - so I have a certain fondness for it.

Odd though, for a journal that is only ever (as far as I know) intended to be published online, to offer the articles using PDF rather than HTML.  Doing so prevents any use of lightweight 'semantic' markup within the articles, such as microformats, and tends to make re-use of the content less easy.

In short, choosing to use PDF rather than HTML tends to make the content less open than it otherwise could be.  That feels wrong to me, especially for an open access journal!  One could just about justify this approach for a journal destined to be published both on paper and online (though even in that case I think it would be wrong) but surely not for an online-only 'open' publication?

May 03, 2007

An email snapshot of the UK

The BBC report that the British Library are running a project during May to capture a snapshot of the UK through email.

The library is asking everyone in the UK to forward an e-mail from their inbox or sent mail box representing their life or interests.  Alternatively, people can submit a specially-composed message.

Emails should be submitted to [email protected].

As an aside, I note that a Google search for this story (first heard on BBC Radio 4 this morning) turned up coverage by the BBC and the Guardian, but nothing on the British Library Web site :-(.

November 27, 2006

Reflections on 10 years of the institutional Web


I began writing this post when we first started the blog back in September - but then it got lost somehow and I never finished it off or published it.  More recently I was interviewed by phone as part of a study evaluating the UK eLib Programme (and in particular the impact of the MODELS project) and it reminded me about the conclusions I draw below.

I was asked earlier in the year to give the closing talk at the UK Institutional Web Managers Workshop, looking back over the past 10 years of the institutional Web.

I have to confess that I felt pretty intimidated by the scope of the talk, though I hope that the result was at least a little interesting.  I certainly found it useful to spend some time thinking about what had been happening over that kind of timeframe.

I suppose I reached two broad conclusions...

My first point was that there hasn't been enough engagement during that time between the UK digital library community (i.e. the community that grew out of the e-Lib Programme, the UK's first significant suite of digital library projects) and the UK Web manager community.  I don't mean that in a "they've been ignoring us" kind of way.  What I mean is that we need to continually remind ourselves that digital library activities (at least in the context of UK higher and further education) need to be firmly grounded in the realities of education service delivery - not least so that they remain relevant and practical.  A good example is metadata development and research (a topic close to my heart) which one could argue has had very limited impact on the issues that institutional Web manages care about.  As the provision of an institution's Web site became an increasingly marketing kind of activity, I think we lost a lot of the links between the webmaster, library and elearning communities that would otherwise have been very beneficial.  Having listened to a lot of very interesting debates on the UK webmaster forums over the years, it has always frustrated me that there seems to have been very little engagement by that community in elearning or eresearch issues as such - and the community is poorer for it.

My second point was that we are in danger of losing our own digital history - or at least that associated with the development of institutional Web sites in the UK.  For a community that generally accepts the importance of digital preservation, it seems to me that we are more talk than action!?  Looking back to the early days of institutional Web site, seeing what those sites looked like and what kinds of discussions went on at that time is surprisingly difficult.  Early mailing list archives got lost in the transition from mailbase.ac.uk to jiscmail.ac.uk, the Internat Archive didn't start until too late to capture some of the stuff, and so on.  Don't get me wrong, there is some interesting stuff still around - I showed some of Brian Kelly's early Web evangelism slides that he was using in 1995 during my talk and noted that they could more or less still be used today.  I also had some stuff in my personal email archives, but to my shame, even some of that got lost when I moved from UKOLN to the Eduserv Foundation :-(

So my major recommendation from the talk was that someone needs to capture some of this stuff as soon as possible, because, if we don't, then a significant part of the history of how university Web sites in the UK grew up and the outcomes of the eLib programme will be lost forever.

It used to be said that eLib took a "let a 1000 flowers bloom" kind of approach.  Undoubtedly true.  Yet somewhere along the line we may not have cultivated those flowers properly and, even if we did, we are in danger of losing our record of how pretty they looked.

Image: Cloud reflections at Nisqually National Wildlife Refuge, Washington, US. [May 2006]

November 15, 2006

8 out of 10 ten surveys aren't worth the paper they are written on

There's been widespread reporting about the recent BPI survey into attitudes on copyright protection in recorded music for songwriters and performers (e.g. BBC and Observer).  Amongst other things, the survey appears to suggests that:

62 per cent of those polled believe British artists should receive the same copyright protection as their US counterparts

and that

Just under 70 per cent of 18- to 29-year-olds hold that view, the highest of any age group surveyed.

I've had a quick look round and can't find any evidence of how the survey questions were phrased or what context they were put into, which to my mind means that it is pretty much impossible to draw anything more substantial from the survey than a few juicy headlines.

I did stumble across the BPI's Five reasons to support British music: one of the UK's most valuable industries during my search, which it seems to me presents a rather uncompelling case.

On the other side of the debate, the British Library IP Manifesto sensibly urges caution:

The copyright term for sound recordings should not be extended without empirical evidence of the benefits and due consideration of the needs of society as a whole.

I found it interesting to note that the BPI list being a "unique cultural asset" as the first reason for copyright extension, concluding that

... recordings are the cultural heirlooms of Britain. They should remain in possession of their original owners.

whereas the BL argues that longer copyright makes it difficult or impossible to preserve recordings for the future.



eFoundations is powered by TypePad