April 08, 2011

Scholarly communication, open access and disruption

I attended part of UKSG earlier this week, listening to three great presentations in the New approaches to research session on Monday afternoon (by Philip Bourne, Cameron Neylon and Bill Russell) and presenting first thing Tuesday morning in the Rethinking 'content' session.

(A problem with my hearing meant that I was very deaf for most of the time, making conversation in the noisy environment rather tiring, so I decided to leave the conference early Tuesday afternoon. Unfortunately, that meant that I didn't get much of an opportunity to network with people. If I missed you, sorry. Looking at the Twitter stream, it also meant that I missed what appear to have been some great presentations on the final day. Shame.)

Anyway, for what it's worth, my slides are below. I was speaking on the theme of 'open, social and linked', something that I've done before, so for regular readers of this blog there probably won't be too much in the way of news.

With respect to the discussion of 'social' and its impact on scholarly communication, there is room for some confusion because 'social' is often taken to mean "how does one use social media like Facebook, Twitter, etc. to support scholarly communication?". Whilst I accept that as a perfectly sensible question, it isn't quite what I meant in this talk. What I meant was that we need to better understand the drivers for social activity around research and research artefacts, probably broken down across the various activities that make up the scholarly research workflow/cycle, so that we can build tools that properly support that social activity. That is something I don't think we have yet got right, particularly in our provision of repositories. Indeed, as I argued in the talk, our institutional repository architecture is more or less in complete opposition to the social drivers at play in the research space. Anyway... you've heard all this from me before.

Cameron Neylon's talk was probably the best of the ones that I saw and I hope my talk picked up on some of the themes that he was developing. I'm not sure if Cameron's UKSG slides are available yet but there's a very similar set, The gatekeeper is dead, long live the gatekeeper, presented at the STM Innovation Seminar last December. Despite the number of slides, these are very quick to read through, and very understandable, even in the absence of any audio. On that basis, I won't recap them here. Slides 112 onwards give a nice summary: "we are the gatekeepers... enable, don't block... build platforms, not destinations... sell services, not content... don't think about filtering or control... enable discovery". These are strong messages for both the publishing community and libraries. All in all, his points about 'discovery deficit' rather than 'filter failure' felt very compelling to me.

On the final day there were talks about open access and changing subscription models, particularly from 'reader pays' to 'author pays', based partly on the recently released study commissioned by the Research Information Network (RIN), JISC, Research Libraries UK (RLUK), the Publishing Research Consortium (PRC) and the Wellcome Trust, Heading for the open road: costs and benefits of transitions in scholarly communications. We know that the web is disruptive to both publishers and libraries but it seemed to me (from afar) that the discussions at UKSG missed the fact that the web is potentially also disruptive to the process of scholarly communication itself. If all we do is talk about shifting the payment models within the confines of current peer-review process we are missing a trick (at least potentially).

What strikes me as odd, thinking back to that original hand-drawn diagram of the web done by Tim Berners-Lee, is that, while the web has disrupted almost every aspect of our lives to some extent, it has done relatively little to disrupt scholarly communication except in an 'at the margins' kind of way. Why is that the case? My contention is that there is such significant academic inertia to overcome, coupled with a relatively small and closed 'market', that the momentum of change hasn't yet grown sufficiently - but it will. The web was invented as a scholarly device, yet it has, in many ways, resulted in less transformation there than in most other fields. Strange?

Addendum: slides for Philip Bourne's talk are now available on Slideshare.

October 25, 2010

A few brief thoughts on iTunesU

The use of iTunesU by UK universities has come up in discussions a couple of times recently, on Brian Kelly's UK Web Focus blog (What Are UK Universities Doing With iTunesU? and iTunes U: an Institutional Perspective) and on the closed ALT-C discussion list. In both cases, as has been the case in previous discussions, my response has been somewhat cautious, an attitude that always seems to be interpreted as outright hostility for some reason.

So, just for the record, I'm not particularly negative about iTunesU and in some respects I am quite positive - if nothing else, I recognise that the adoption of iTunesU is a very powerful motivator for the generation of openly available content and that has got to be a good thing - but a modicum of scepticism is always healthy in my view (particularly where commercial companies are involved) and I do have a couple of specific concerns about the practicalities of how it is used:

  • Firstly that students who do not own Apple hardware and/or who choose not to use iTunes on the desktop are not disenfranchised in any way (e.g. by having to use a less functional Web interface). In general, the response to this is that they are not and, in the absence of any specific personal experience either way, I have to concede that to be the case.
  • Secondly (and related to the first point), that in an environment where most of the emphasis seems to be on the channel (iTunesU) rather than on the content (the podcasts), confusion isn't introduced as to how material is cited and referred to – i.e. do some lecturers only ever refer to 'finding stuff on iTunesU', while others offer a non-iTunesU Web URL, and others still remember to cite both? I'm interested in whether universities that have adopted iTunesU but also make the material available in other ways have managed to adopt a single way of citing the material on offer.

Both these concerns relate primarily to the use of iTunesU as a distribution channel for teaching and learning content within the institution. They apply much less to its use as an external 'marketing' channel. iTunesU seems to me (based on a gut feel more than on any actual numbers) to be a pretty effective way of delivering OER outside the institution and to have a solid 'marketing' win on the back of that. That said, it would be good to have some real numbers as confirmation (note that I don't just mean numbers of downloads here - I mean conversions into 'actions' (new students, new research opps, etc.)). Note that I also don't consider 'marketing' to be a dirty word (in this context) - actually, I guess this kind of marketing is going to become increasingly important to everyone in the HE sector.

There is a wider, largely religious, argument about whether "if you are not paying for it, you aren't the customer, you are part of the product" but HE has been part of the MS product for a long while now and, worse, we have paid for the privilege – so there is nothing particularly new there. It's not an argument that particularly bothers me one way or the other, provided that universities have their eyes open and understand the risks as well as the benefits. In general, I'm sure that they do.

On the other hand, while somebody always owns the channel, some channels seem to me to be more 'open' (I don't really want to use the word 'open' here because it is so emotive but I can't think of a better one) than others. So, for example, I think there are differences in an institution adopting YouTube as a channel as compared with adopting iTunesU as a channel and those differences are largely to do with the fit that YouTube has with the way the majority of the Web works.

September 01, 2010

Lessons of Intute

Many years ago now, back when I worked for UKOLN, I spent part of my time working on the JISC-funded Intute service (and the Resource Discovery Network (RDN) that went before it), a manually created catalogue of high-quality Internet resources. It was therefore with some interest that I read a retrospective about the service in the July issue of Ariadne. My involvement was largely with the technology used to bring together a pre-existing and disparate set of eLib 'subject gateways' into a coherent whole. I was, I suppose, Intute's original technical architect, though I doubt if I was ever formally given that title. Almost inevitably, it was a role that led to my involvement in discussions both within the service and with our funders (and reviewers) at the time about the value (i.e. the benefits vs the costs) of such a service - conversations that were, from my point of view, always quite difficult because they involved challenging ourselves about the impact of our 'home grown' resource discovery services against those being built outside the education sector - notably, but not exclusively, by Google :-). 

Today, Steve Hitchcock of Southampton posted his thoughts on the lessons we should draw from the history of Intute. They were posted originally to the jisc-repositories mailing list. I repeat the message, with permission and in its entirety, here:

I just read the obituary of Intute, and its predecessor JISC services, in Ariadne with interest and some sadness, as will others who have been involved with JISC projects over this extended period. It rightly celebrates the achievements of the service, but it is also balanced in seeking to learn the lessons for where it is now.

We must be careful to avoid partial lessons, however. The USP of Intute was 'quality' in its selection of online content across the academic disciplines, but ultimately the quest for quality was also its downfall:

"Our unique selling point of human selection and generation of descriptions of Web sites was a costly model, and seemed somewhat at odds with the current trend for Web 2.0 technologies and free contribution on the Internet. The way forward was not clear, but developing a community-generated model seemed like the only way to go."

http://www.ariadne.ac.uk/issue64/joyce-et-al/

Unfortunately it can be hard for those responsible for defining and implementing quality to trust others to adhere to the same standards: "But where does the librarian and the expert fit in all of this? Are we grappling with new perceptions of trust and quality?" It seems that Intute could not unravel this issue of quality and trust of the wider contributor community. "The market research findings did, however, suggest that a quality-assurance process would be essential in order to maintain trust in the service". It is not alone, but it is not hard to spot examples of massively popular Web services that found ways to trust and exploit community.

The key to digital information services is volume and speed. If you have these then you have limitless opportunities to filter 'quality'. This is not to undermine quality, but to recognise that first we have to reengineer the information chain. Paul Ginsparg reengineered this chain in physics, but he saw early on that it would be necessary to rebuild the ivory towers:

"It is clear, however, that the architecture of the information data highways of the future will somehow have to reimplement the protective physical and social isolation currently enjoyed by ivory towers and research laboratories."

http://arxiv.org/macros/blurb.tex

It was common at that time in 1994 to think that the content on the emerging Web was mostly rubbish and should be swept away to make space for quality assured content. A senior computer science professor said as much in IEEE Computer magazine, and as a naive new researcher I replied to say he was wrong and that speed changes everything.

Clearly we have volume of content across the Web; only now are we beginning to see the effect of speed with realtime information services.

If we are to salvage something from Intute, as seems to be the aim of the article, it must be to recognise the relations on the digital information axis between volume, speed and quality, not just the latter, even in the context of academic information services.

Steve Hitchcock

Steve's comments were made in the context of repositories but his final paragraph struck a chord with me more generally, in ways that I'm struggling to put into words.

My involvement with Intute ended some years ago and I can't comment on its recent history but, for me, there are also lessons in how we recognise, acknowledge and respond to changes in the digital environment beyond academia - changes that often have a much larger impact on our scholarly practices than those we initiate ourselves. And this is not a problem just for those of us working on developing the component services within our environment but for the funders of such activities.

May 05, 2010

The future of UK Dublin Core application profiles

I spent yesterday morning up at UKOLN (at the University of Bath) for a brief meeting about the future of JISC-funded Dublin Core application profile development in the UK.

I don't intend to report on the outcomes of the meeting here since it is not really my place to do so (I was just invited as an interested party and I assume that the outcomes of the meeting will be made public in due course). However, attending the meeting did make me think about some of the issues around the way application profiles have tended to be developed to date and these are perhaps worth sharing here.

By way of background, the JISC have been funding the development of a number of Dublin Core application profiles in areas such as scholarly works, images, time-based media, learning objects, GIS and research data over the last few years.  An application profile provides a model of some subset of the world of interest and an associated set of properties and controlled vocabularies that can be used to describe the entities in that model for the purposes of some application (or service) within a particular domain. The reference to Dublin Core implies conformance with the DCMI Abstract Model (which effectively just means use of the RDF model) and an inherent preference for the use of Dublin Core terms whenever possible.
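To make this a little more concrete, here is a minimal sketch (in Python, using rdflib) of the kind of description an application profile constrains: an entity from the profile's model, typed and described using Dublin Core terms, with one value drawn from a controlled vocabulary. The URIs, the ScholarlyWork class and the vocabulary are made-up examples, not taken from any of the JISC-funded profiles.

    # A minimal, illustrative description conforming to a hypothetical profile.
    # All URIs below are invented for the example.
    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import DCTERMS, RDF

    EX = Namespace("http://example.org/")          # hypothetical namespace
    work = URIRef("http://example.org/works/123")  # an entity from the profile's model

    g = Graph()
    g.bind("dcterms", DCTERMS)
    g.add((work, RDF.type, EX.ScholarlyWork))                  # entity type from the model
    g.add((work, DCTERMS.title, Literal("An example paper")))  # preferred Dublin Core term
    g.add((work, DCTERMS.creator, URIRef("http://example.org/people/jbloggs")))
    g.add((work, DCTERMS.subject, URIRef("http://example.org/vocab/metadata")))  # controlled vocabulary value
    print(g.serialize(format="turtle"))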

The meeting was intended to help steer any future UK work in this area.

I think (note that this blog post is very much a personal view) that there are two key aspects of the DC application profile work to date that we need to think about.

Firstly, DC application profiles are often developed by a very small number of interested parties (sometimes just two or three people), and engagement in the process by the wider community is quite hard to achieve. This isn't just a problem with the UK JISC-funded work on application profiles by the way. Almost all of the work undertaken within the DCMI community on application profiles suffers from the same problem - mailing lists and meetings with very little active engagement beyond a small core set of people.

Secondly, whilst the importance of enumerating the set of functional requirements that the application profile is intended to meet has not been underestimated, it is true to say that DC application profiles are often developed in the absence of an actual 'software application'. Again, this is also true of the application profile work being undertaken by the DCMI. What I mean here is that there is not a software developer actually trying to build something based on the application profile at the time it is being developed. This is somewhat odd (to say the least) given that they are called application profiles!

Taken together, these two issues mean that DC application profiles often take on a rather theoretical status - and an associated "wouldn't it be nice if" approach. The danger is a growth in the complexity of the application profile and a lack of any real business drivers for the work.

Speaking from the perspective of the Scholarly Works Application Profile (SWAP) (the only application profile for which I've been directly responsible), in which we adopted the use of FRBR, there was no question that we were working to a set of perceived functional requirements (e.g. "people need to be able to find the latest version of the current item"). However, we were not driven by the concrete needs of a software developer who was in the process of building something. We were in the situation where we could only assume that an application would be built at some point in the future (a UK repository search engine in our case). I think that the missing link to an actual application, with actual developers working on it, directly contributed to the lack of uptake of the resulting profile. There were other factors as well of course - the conceptual challenge of basing the work on FRBR and the fact that existing repository software was not RDF-ready for example - but I think that was the single biggest factor overall.

Oddly, I think JISC funding is somewhat to blame here because, in making funding available, JISC helps the community to side-step the part of the business decision-making that says, "what are the costs (in time and money) of developing, implementing and using this profile vs. the benefits (financial or otherwise) that result from its use?".

It is perhaps worth comparing current application profile work with other activities. Firstly, compare the progress of SWAP with the progress of the Common European Research Information Format (CERIF), about which the JISC recently reported:

EXRI-UK reviewed these approaches against higher education needs and recommended that CERIF should be the basis for the exchange of research information in the UK. CERIF is currently better able to encode the rich information required to communicate research information, and has the organisational backing of EuroCRIS, ensuring it is well-managed and sustainable.

I don't want to compare the merits of these two approaches at a technical level here. What is interesting however, is that if CERIF emerges as the mandated way in which research information is shared in the UK then there will be a significant financial driver to its adoption within systems in UK institutions. Research information drives a significant chunk of institutional funding which, in turn, drives compliance in various applications. If the UK research councils say, "thou shalt do CERIF", that is likely what institutions will do.  They'll have no real choice. SWAP has no such driver, financial or otherwise.

Secondly, compare the current development of Linked Data applications within the UK data.gov.uk initiative with the current application profile work. Current government policy in the UK effectively says, 'thou shalt do Linked Data' but isn't really any more prescriptive. It encourages people to expose their data as Linked Data and to develop useful applications based on that data. Ignoring any discussion about whether Linked Data is a good thing or not, what has resulted is largely ground-up. Individual developers are building stuff and, in the process, are effectively developing their own 'application profiles' (though they don't call them that) as part of exposing/using the Linked Data. This approach results in real activity. But it also brings with it the danger of redundancy, in that every application developer may model their Linked Data differently, inventing their own RDF properties and so on as they see fit.

As Paul Walk noted at the meeting yesterday, at some stage there will be a huge clean-up task to make any widespread sense of the UK government-related Linked Data that is out there. Well, yes... there will. Conversely, there will be no clean up necessary with SWAP because nobody will have implemented it.

Which situation is better!? :-)

I think the issue here is partly to do with setting the framework at the right level. In trying to specify a particular set of application profiles, the JISC is setting the framework very tightly - not just saying, "you must use RDF" or "you must use Dublin Core" but saying "you must use Dublin Core in this particular way". On the other hand, the UK government have left the field of play much more open. The danger with the DC application profile route is lack of progress. The danger with the government approach is too little consistency.

So, what are the lessons here? The first, I think, is that it is important to lobby for your preferred technical solution at a policy level as well as at a technical level. If you believe that a Linked Data-compliant Dublin Core application profile is the best technical way of sharing research information in the UK then it is no good just making that argument to software developers and librarians. Decisions made by the research councils (in this case) will be binding irrespective of technical merit and will likely trump any decisions made by people on the ground.

The second is that we have to understand the business drivers for the adoption, or not, of our technical solutions rather better than we do currently. Who makes the decisions? Who has the money? What motivates the different parties? Again, technically beautiful solutions won't get adopted if the costs of adoption are perceived to outweigh the benefits, or if the people who hold the purse strings don't see any value in spending their money in that particular way, or if people simply don't get it.

Finally, I think we need to be careful that centralised, top-down, initiatives (particularly those with associated funding) don't distort the environment to such an extent that the 'real' drivers, both financial and user-demand, can be ignored in the short term, leading to unsustainable situations in the longer term. The trick is to pump-prime those things that the natural drivers will support in the long term - not always an easy thing to pull off.

February 19, 2010

In the clouds

So, the Repositories and the Cloud meeting, jointly organised by ourselves and the JISC, takes place on Tuesday next week and I promised to write up my thoughts in advance.  Trouble is... I'm not sure I actually have any thoughts :-(

Let's start from the very beginning (it's a very good place to start)...

The general theory behind cloud solutions - in this case we are talking primarily about cloud storage solutions but I guess this applies more generally - is that you outsource parts of your business to someone else because:

  • they can do it better than you can,
  • they can do it more cheaply than you can,
  • they can do it in a more environmentally-friendly way than you can, or
  • you simply no longer wish to do it yourself for other reasons.

Seems simple enough and I guess that all of these apply to the issues at hand for the meeting next week, i.e. what use is there for utility cloud storage solutions for the data currently sitting in institutional repositories (and physically stored on disks inside the walls of the institution concerned).

Against that, there is a set of arguments or issues that militate against a cloud solution, such as:

  • security
  • data protection
  • sustainability
  • resilience
  • privacy
  • loss of local technical knowledge
  • ...

...you know the arguments.  Ultimately institutions are going to end up asking themselves questions like, "how important is this data to us?", "are we willing to hand it over to one or more cloud providers for long term storage?", "can we afford to continue to store this stuff for ourselves?", "what is our exit strategy in the future?", and so on.

Wrapped up in this will be issues about the specialness of the kind of stuff one typically finds in institutional repositories - either because of volume of data (large research data-sets for example), or because stuff is seen as being especially important for various reasons (it's part of the scholarly record for example).

None of which is particularly helpful in terms of where the meeting will take us!  I certainly don't expect any actual answers to come out of it, but I am expecting a good set of discussions about current capabilities (what the current tools can do), about policy issues, and about where we are likely to go in the future.

One of the significant benefits the current interest in cloud solutions brings is the abstraction of the storage layer from the repository services.  Even if I never actually make use of Amazon S3, I might still get significant benefit from the cloud storage mindset because my internal repository 'storage' layer is separated from the rest of the software.  That means that I can do things like sharing data across multiple internal stores, sharing data across multiple external stores, or some combination of both, much more easily.  It also potentially opens up the market to competing products.
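By way of illustration, here is a rough Python sketch of what that separation might look like - a minimal storage interface, a local-disk implementation and a composite store that mirrors writes across several back-ends, cloud or otherwise. The class and method names are invented and not based on any particular repository platform.

    # Illustrative only: the repository code talks to BlobStore and doesn't
    # care where the bytes actually live.
    from abc import ABC, abstractmethod
    from pathlib import Path

    class BlobStore(ABC):
        @abstractmethod
        def put(self, key: str, data: bytes) -> None: ...

        @abstractmethod
        def get(self, key: str) -> bytes: ...

    class LocalDiskStore(BlobStore):
        def __init__(self, root: str) -> None:
            self.root = Path(root)

        def put(self, key: str, data: bytes) -> None:
            path = self.root / key
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_bytes(data)

        def get(self, key: str) -> bytes:
            return (self.root / key).read_bytes()

    class MirroredStore(BlobStore):
        """Writes to several stores, e.g. local disk plus an external cloud store."""
        def __init__(self, *stores: BlobStore) -> None:
            self.stores = stores

        def put(self, key: str, data: bytes) -> None:
            for store in self.stores:
                store.put(key, data)

        def get(self, key: str) -> bytes:
            return self.stores[0].get(key)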

So, I think this space has wider implications than a simple, "should I use cloud storage?" approach might imply.

From an Eduserv point of view, both as a provider of not-for-profit services to the public, health and education sectors and as an organisation with a brand spanking new data centre, I don't think there's any secret in the fact that we want to understand whether there is anything useful we can bring to this space - as a provider of cloud storage solutions that are significantly closer to the community than the utility providers are, for example.  That's not to say that we have such an offer currently - but it is the kind of thing we are interested in thinking about.

I don't particularly buy into the view that the cloud is nothing new.  Amazon S3 and its ilk didn't exist 10 years ago and there's a reason for that.  As markets and technology have matured new things have become possible.  But that, on its own, isn't a reason to play in the cloud space. So, I suppose that the real question for the meeting next week is, "when, if ever, is the right time to move to cloud storage solutions for repository content... and why?" - both from a practical and a policy viewpoint.

I don't know the answers to those questions but I'm looking forward to finding out more about it next week.

February 11, 2010

Repositories and the Cloud - tell us your views

It's now a little over a week to go until the Repositories and the Cloud event (jointly organised by Eduserv and the JISC) takes place in London.  The event is sold out (sorry to those of you that haven't got a place) and we have a full morning of presentations from DuraSpace, Microsoft and EPrints and an afternoon of practical experience (Terry Harmer of the Belfast eScience Centre) and parallel discussion groups looking at both policy and technical issues.

To those of you that are coming, please remember that the afternoon sessions are for discussion.  We want you to get involved, to share your thoughts and to challenge the views of other people at the event (in the nicest way possible of course).  We'd love to know what you think about repositories and the cloud (note that, by that phrase, I mean the use of utility cloud providers as back-end storage for repository-like services).  Please share your thoughts below, or blog them using the event tag - 'repcloud' - or just bring them with you on the day!

I will share my thoughts separately here next week but let me be honest... I don't actually know what I think about the relationship between repositories and the cloud.  I'm coming to the meeting with an open mind.  As a community, we now have some experience of the policy and technical issues in our use of the cloud for things like undergraduate email but I expect the policy issues and technical requirements around repositories to be significantly different.  On that basis, I am really looking forward to the meeting.

The chairs of the two afternoon sessions, Paul Miller (paul.miller (at) cloudofdata.com) who is leading the policy session and Brad McLean (bmclean (at) fedora-commons.org) who is leading the technical session, would also like to hear your views on what you hope their sessions will cover.  If you have ideas please get in touch, either through the comments form below, via Twitter (using the '#repcloud' hashtag) or by emailing them directly.

Thanks.

December 21, 2009

Scanning horizons for the Semantic Web in higher education

The week before last I attended a couple of meetings looking at different aspects of the use of Semantic Web technologies in the education sector.

On the Wednesday, I was invited to a workshop of the JISC-funded ResearchRevealed project at ILRT in Bristol. From the project weblog:

ResearchRevealed [...] has the core aim of demonstrating a fine-grained, access controlled, view layer application for research, built over a content integration repository layer. This will be tested at the University of Bristol and we aim to disseminate open source software and findings of generic applicability to other institutions.

ResearchRevealed will enhance ways in which a range of user stakeholder groups can gain up-to-date, accurate integrated views of research information and thus use existing institutional, UK and potentially global research information to better effect.

I'm not formally part of the project, but Nikki Rogers of ILRT mentioned it to me at the recent VoCamp Bristol meeting, and I expressed a general interest in what they were doing; they were also looking for some concrete input on the use of Dublin Core vocabularies in some of their candidate approaches.

This was the third in a series of small workshops, attended by representatives of the project from Bristol, Oxford and Southampton, and the aim was to make progress on defining a "core Research ontology". The morning session circled mainly around usage scenarios (support for the REF (and other "impact" assessment exercises), building and sustaining cross-institutional collaboration etc), and the (somewhat blurred) boundaries between cross-institutional requirements and institution-specific ones; what data might be aggregated, what might be best "linked to"; and the costs/benefits of rich query interfaces (e.g. SPARQL endpoints) v simpler literal- or URI-based lookups. In the afternoon, Nick Gibbins from the University of Southampton walked through a draft mapping of the CERIF standard to RDF developed by the dotAC project. This focused attention somewhat and led to some - to me - interesting technical discussions about variant ways of expressing information with differing degrees of precision/flexibility. I had to leave before the end of the meeting, but I hope to be able to continue to follow the project's progress, and contribute where I can.

A long train journey later, the following day I was at a meeting in Glasgow organised by the CETIS Semantic Technologies Working Group to discuss the report produced by the recent JISC-funded Semtech project, and to try to identify potential areas for further work in that area by CETIS and/or JISC. Sheila MacNeill from CETIS liveblogged proceedings here. Thanassis Tiropanis from the University of Southampton presented the project report, with a focus on its "roadmap for semantic technology adoption". The report argues that, in the past, the adoption of semantic technologies may have been hindered by a tendency towards a "top-down" approach requiring the widespread agreement on ontologies; in contrast the "linked data" approach encourages more of a "bottom-up" style in which data is first made available as RDF, and then later application-specific or community-wide ontologies are developed to enable more complex reasoning across the base data (which may involve mapping that initial data to those ontologies as they emerge). While I think there's a slight risk of overstating the distinction - in my experience many "linked data" initiatives do seem to demonstrate a good deal of thinking about the choice of RDF vocabularies and compatibility with other datasets - and I guess I see rather more of a continuum, it's probably a useful basis for planning. The report recommends a graduated approach which focusses initially on the development of this "linked data field" - in particular where there are some "low-hanging fruit" cases of data already made available in human-readable form which could relatively easily be made available in RDF, especially using RDFa.

One of the issues I was slightly uneasy with in the Glasgow meeting was that occasionally there were mentions of delivering "interoperability" (or "data interoperability") without really saying what was meant by that - and I say this as someone who used to have the I-word in my job title ;-) I feel we probably need to be clearer, and more precise, about what different "semantic technologies" (for want of a better expression) enable. What does the use of RDF provide that, say, XML typically doesn't? What does, e.g., RDF Schema add to that picture? What about convergence on shared vocabularies? And so on. Of course, the learners, teachers, researchers and administrators using the systems don't need to grapple with this, but it seems to me such aspects do need to be conveyed to the designers and developers, and perhaps more importantly - as Andy highlighted in his report of related discussions at the CETIS conference - to those who plan and prioritise and fund such development activity. (As an aside, I think this is also something of an omission in the current version of the DCMI document on "Interoperability Levels": it tells me what characterises each level, and how I can test for whether an application meets the requirements of the level, but it doesn't really tell me what functionality each level provides/enables, or why I should consider level n+1 rather than level n.)

Rather by chance, I came across a recent presentation by Richard Cyganiak to the Vienna Linked Data Camp, which I think addresses some similar questions, albeit from a slightly different starting point: Richard asks the questions, "So, if we have linked data sources, what's stopping the development of great apps? What else do we need?", and highlights various dimensions of "heterogeneity" which may exist across linked data sources (use of identifiers, differences in modelling, differences in RDF vocabularies used, differences in data quality, differences in licensing, and so on).

Finally, I noticed that last Friday, Paul Miller (who was also at the CETIS meeting) announced the availability of a draft of a "Horizon Scan" report on "Linked Data" which he has been working on for JISC, as part of the background for a JISC call for projects in this area some time early in 2010. It's a relatively short document (hurrah for short reports!) but I've only had time for a quick skim through. It aims for some practical recommendations, ranging from general guidance on URI creation and the use of RDFa to more specific actions on particular resources/datasets. And here I must reiterate what Paul says in his post - it's a draft on which he is seeking comments, not the final report, and none of those recommendations have yet been endorsed by JISC. (If you have comments on the document, I suggest that you submit them to Paul (contact details here or comment on his post) rather than commenting on this post.)

In short, it's encouraging to see the active interest in this area growing within the HE sector. On reading Paul's draft document, I was struck by the difference between the atmosphere now (both at the Semtech meeting, and more widely) and what Paul describes as the "muted" conclusions of Brian Matthews' 2005 survey report on Semantic Web Technologies for JISC Techwatch. Of course, many of the challenges that Andy mentioned in his report of the CETIS conference session remain to be addressed, but I do sense that there is a momentum here - an excitement, even - which I'm not sure existed even eighteen months ago. It remains to be seen whether and how that enthusiasm translates into applications of benefit to the educational community, but I look forward to seeing how the upcoming JISC call, and the projects it funds, contribute to these developments.

December 04, 2009

Moving beyond the typical 15% deposit level

In an email to the [email protected] mailing list, Steve Hitchcock writes:

... authors of research papers everywhere want "to reach the eyes and minds of peers, fellow esoteric scientists and scholars the world over, so that they can build on one another's contributions in that cumulative, collaborative enterprise called learned inquiry."

[This] belief was founded on principle, but also on observed practice, that in 1994 we saw authors spontaneously making their papers available on the Web. From those small early beginnings we just assumed the practice would grow. Why wouldn't it? The Web was new, and open, and people were learning quickly how they could make use of it. Our instincts about the Web were not wrong. Since then, writing to the Web has become even easier.

So this is the powerful idea ..., and what we haven't yet understood is why, beyond the typical 15% deposit level, self-archiving does not happen without mandates. The passage of 15 years should tell us something about the other 85% of authors. Do they not share this belief? Does self-archiving not serve the purpose? ...

This is the part that needs to be re-examined, the idea, and why it has yet to awaken and enthuse our colleagues, as it has us, to the extent we envisaged. Might we have misunderstood and idealised the process of 'learned inquiry'?

I completely agree.

In passing, I'd be interested to know what uptake of Mendeley is like, and whether it looks likely to make any in-roads into the 85%, either as an adjunct to institutional repositories or as an alternative?

December 03, 2009

On being niche

I spoke briefly yesterday at a pre-IDCC workshop organised by REPRISE.  I'd been asked to talk about Open, social and linked information environments, which resulted in a re-hash of the talk I gave in Trento a while back.

My talk didn't go too well to be honest, partly because I was on last and we were over-running so I felt a little rushed but more because I'd cut the previous set of slides down from 119 to 6 (4 really!) - don't bother looking at the slides, they are just images - which meant that I struggled to deliver a very coherent message.  I looked at the most significant environmental changes that have occurred since we first started thinking about the JISC IE almost 10 years ago.  The resulting points were largely the same as those I have made previously (listen to the Trento presentation) but with a slightly preservation-related angle:

  • the rise of social networks and the read/write Web, and a growth in resident-like behaviour, means that 'digital identity' and the identification of people have become more obviously important and will remain an important component of provenance information for preservation purposes into the future;
  • Linked Data (and the URI-based resource-oriented approach that goes with it) is conspicuous by its absence in much of our current digital library thinking;
  • scholarly communication is increasingly diffusing across formal and informal services both inside and outside our institutional boundaries (think blogging, Twitter or Google Wave for example) and this has significant implications for preservation strategies.

That's what I thought I was arguing anyway!

I also touched on issues around the growth of the 'open access' agenda, though looking at it now I'm not sure why because that feels like a somewhat orthogonal issue.

Anyway... the middle bullet has to do with being mainstream vs. being niche.  (The previous speaker, who gave an interesting talk about MyExperiment and its use of Linked Data, made a similar point).  I'm not sure one can really describe Linked Data as being mainstream yet, but one of the things I like about the Web Architecture and REST in particular is that they describe architectural approaches that have proven to be hugely successful, i.e. they describe the Web.  Linked data, it seems to me, builds on these in very helpful ways.  I said that digital library developments often prove to be too niche - that they don't have mainstream impact.  Another way of putting that is that digital library activities don't spend enough time looking at what is going on in the wider environment.  In other contexts, I've argued that "the only good long-term identifier is a good short-term identifier" and I wonder if that principle can and should be applied more widely.  If you are doing things on a Web-scale, then the whole Web has an interest in solving any problems - be that around preservation or anything else.  If you invent a technical solution that only touches on scholarly communication (for example) who is going to care about it in 50 or 100 years - answer, not all that many people.

It worries me, for example, when I see an architectural diagram (as was shown yesterday) which has channels labelled 'OAI-PMH', 'XML' and 'the Web'!

After my talk, Chris Rusbridge asked me if we should just get rid of the JISC IE architecture diagram.  I responded that I am happy to do so (though I quipped that I'd like there to be an archival copy somewhere).  But on the train home I couldn't help but wonder if that misses the point.  The diagram is neither here nor there, it's the "service-oriented, we can build it all", mentality that it encapsulates that is the real problem.

Let's throw that out along with the diagram.

November 23, 2009

Memento and negotiating on time

Via Twitter, initially in a post by Lorcan Dempsey, I came across the work of Herbert Van de Sompel and his comrades from LANL and Old Dominion University on the Memento project:

The project has since been the topic of an article in New Scientist.

The technical details of the Memento approach are probably best summarised in the paper "Memento: Time Travel for the Web", and Herbert has recently made available a presentation which I'll embed here, since it includes some helpful graphics illustrating some of the messaging in detail:

Memento seeks to take advantage of the Web Architecture concept that interactions on the Web are concerned with exchanging representations of resources. And for any single resource, representations may vary - at a single point in time, variant representations may be provided, e.g. in different formats or languages, and over time, variant representations may be provided reflecting changes in the state of the resource. The HTTP protocol incorporates a feature called content negotiation which can be used to determine the most appropriate representation of a resource - typically according to variables such as content type, language, character set or encoding. The innovation that Memento brings to this scenario is the proposition that content negotiation may also be applied to the axis of date-time, i.e. in the same way that a client might express a preference for the language of the representation based on a standard request header, it could also express a preference that the representation should reflect resource state at a specified point in time, using a custom accept header (X-Accept-Datetime).
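For illustration, a client request under this scheme might look something like the following minimal sketch (Python, using the requests library). The URL is made up, and the header name is simply the one proposed by the Memento work, as described above.

    # A hypothetical Memento-style request: an ordinary GET plus a custom
    # accept header asking for the representation as it was at a given time.
    import requests

    response = requests.get(
        "http://example.org/qotd/",  # hypothetical resource
        headers={"X-Accept-Datetime": "Thu, 01 Jan 2009 12:00:00 GMT"},
    )
    print(response.status_code)
    print(response.headers.get("Content-Type"))
    print(response.text)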

More specifically, Memento uses a flavour of content negotiation called "transparent content negotiation" where the server provides details of the variant representations available, from which the client can choose. Slides 26-50 in Herbert's presentation above illustrate how this technique might be applied to two different cases: one in which the server to which the initial request is sent is itself capable of providing the set of time-variant representations, and a second in which that server does not have those "archive" capabilities but redirects to (a URI supported by) a second server which does.

This does seem quite an ingenious approach to the problem, and one that potentially has many interesting applications, several of which Herbert alludes to in his presentation.

What I want to focus on here is the technical approach, which did raise a question in my mind. And here I must emphasise that I'm really just trying to articulate a question that I've been trying to formulate and answer for myself: I'm not in a position to say that Memento is getting anything "wrong", just trying to compare the Memento proposition with my understanding of Web architecture and the HTTP protocol, or at least the use of that protocol in accordance with the REST architectural style, and understand whether there are any divergences (and if there are, what the implications are).

In his dissertation in which he defines the REST architectural style, Roy Fielding defines a resource as follows:

More precisely, a resource R is a temporally varying membership function MR(t), which for time t maps to a set of entities, or values, which are equivalent. The values in the set may be resource representations and/or resource identifiers. A resource can map to the empty set, which allows references to be made to a concept before any realization of that concept exists -- a notion that was foreign to most hypertext systems prior to the Web. Some resources are static in the sense that, when examined at any time after their creation, they always correspond to the same value set. Others have a high degree of variance in their value over time. The only thing that is required to be static for a resource is the semantics of the mapping, since the semantics is what distinguishes one resource from another.

On representations, Fielding says the following, which I think is worth quoting in full. The emphasis in the first and last sentences is mine.

REST components perform actions on a resource by using a representation to capture the current or intended state of that resource and transferring that representation between components. A representation is a sequence of bytes, plus representation metadata to describe those bytes. Other commonly used but less precise names for a representation include: document, file, and HTTP message entity, instance, or variant.

A representation consists of data, metadata describing the data, and, on occasion, metadata to describe the metadata (usually for the purpose of verifying message integrity). Metadata is in the form of name-value pairs, where the name corresponds to a standard that defines the value's structure and semantics. Response messages may include both representation metadata and resource metadata: information about the resource that is not specific to the supplied representation.

Control data defines the purpose of a message between components, such as the action being requested or the meaning of a response. It is also used to parameterize requests and override the default behavior of some connecting elements. For example, cache behavior can be modified by control data included in the request or response message.

Depending on the message control data, a given representation may indicate the current state of the requested resource, the desired state for the requested resource, or the value of some other resource, such as a representation of the input data within a client's query form, or a representation of some error condition for a response. For example, remote authoring of a resource requires that the author send a representation to the server, thus establishing a value for that resource that can be retrieved by later requests. If the value set of a resource at a given time consists of multiple representations, content negotiation may be used to select the best representation for inclusion in a given message.

So at a point in time t1, the "temporally varying membership function" maps to one set of values, and - in the case of a resource whose representations vary over time - at another point in time t2, it may map to another, different set of values. To take a concrete example, suppose at the start of 2009, I launch a "quote of the day", and I define a single resource that is my "quote of the day", to which I assign the URI http://example.org/qotd/. And I provide variant representations in XHTML and plain text. On 1 January 2009 (time t1), my quote is "From each according to his abilities, to each according to his needs", and I provide variant representations in those two formats, i.e. the set of values for 1 January 2009 is those two documents. On 2 January 2009 (time t2), my quote is "Those who do not move, do not notice their chains", and again I provide variant representations in those two formats, i.e. the set of values for 2 January 2009 (time t2) is two XHTML and plain text documents with different content from those provided at time t1.

So, moving on to that second piece of text I cited, my interpretation of the final sentence as it applies to HTTP (and, as I say, I could be wrong about this) would be that the RESTful use of the HTTP GET method is intended to retrieve a representation of the current state of the resource. It is the value set at that point in time which provides the basis for negotiation. So, in my example here, on 1 January 2009, I offer XHTML and plain text versions of my "From each according to his abilities..." quote via content negotiation, and on 2 January 2009, I offer XHTML and plain text versions of my "Those who do not move..." quotation. That is, at two different points in time t1 and t2, different (sets of) representations may be provided for a single resource, reflecting the different state of that resource at those two different points in time, but at either of those points in time, the expectation is that each representation in the set available represents the state of the resource at that point in time, and only members of that set are available via content negotiation. So although representations may vary by language, content type, etc., they should be in some sense "equivalent" (Roy Fielding's term) in terms of their representation of the current state of the resource.
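A toy sketch of that reading, using the quote-of-the-day example (all dates and content hypothetical): the value set varies over time, but negotiation only ever selects from the set that is current when the request is handled.

    # Each date maps to that day's value set; negotiation picks only from
    # the set that is current at the time of the request.
    from datetime import date

    value_sets = {
        date(2009, 1, 1): {
            "application/xhtml+xml": "<p>From each according to his abilities...</p>",
            "text/plain": "From each according to his abilities...",
        },
        date(2009, 1, 2): {
            "application/xhtml+xml": "<p>Those who do not move, do not notice their chains.</p>",
            "text/plain": "Those who do not move, do not notice their chains.",
        },
    }

    def negotiate(today, accept):
        current = value_sets[today]  # only today's representations are on offer
        return current.get(accept, current["text/plain"])

    print(negotiate(date(2009, 1, 2), "text/plain"))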

I think the Memento approach suggests that on 2 January 2009, I could, using the date-time-based negotiation convention, offer all four of those variants listed above (and on each day into the future, a set which increases in membership as I add new quotes). But it seems to me that is at odds with the REST style, because the Memento approach requires that representations of different states of the resource (i.e. the state of the resource at different points in time) are all made available as representations at a single point in time.

I appreciate that (even if my interpretation is correct, which it may not be) the constraints specified by the REST architectural style are just that: a set of constraints which, if observed, generate certain properties/characteristics in a system. And if some of those constraints are relaxed or ignored, then those properties change. My understanding is not good enough to pinpoint exactly what the implications of this particular point of divergence (if indeed it is one!) would be - though as Herbert notes in his presentation, it would appear that there would be implications for caching.

But as I said, I'm really just trying to raise the questions which have been running around my head and which I haven't really been able to answer to my own satisfaction.

As an aside, I think Memento could probably achieve quite similar results by providing some metadata (or a link to another document providing that metadata) which expressed the relationships between the time-variant resource and all the time-specific variant resources, rather than seeking to manage this via HTTP content negotiation.
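Something along those lines might look like the following rough sketch, with entirely invented link relation names (this is not any agreed vocabulary): the time-variant resource simply advertises its time-specific variants and leaves ordinary content negotiation alone.

    # Hypothetical: the time-variant resource lists its time-specific variants
    # in a Link header (or an equivalent metadata document).
    mementos = {
        "2009-01-01": "http://example.org/qotd/2009-01-01",
        "2009-01-02": "http://example.org/qotd/2009-01-02",
    }

    link_header = ", ".join(
        f'<{uri}>; rel="time-specific-variant"; datetime="{day}"'
        for day, uri in mementos.items()
    )
    print("Link:", link_header)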

Postscript: I notice that, in the time it has taken me to draft this post, Mark Baker has made what I think is a similar point in a couple of messages (first, second) to the W3C public-lod mailing list.

October 14, 2009

Open, social and linked - what do current Web trends tell us about the future of digital libraries?

About a month ago I travelled to Trento in Italy to speak at a Workshop on Advanced Technologies for Digital Libraries organised by the EU-funded CACOA project.

My talk was entitled "Open, social and linked - what do current Web trends tell us about the future of digital libraries?" and I've been holding off blogging about it or sharing my slides because I was hoping to create a slidecast of them. Well... I finally got round to it and here is the result:

Like any 'live' talk, there are bits where I don't get my point across quite as I would have liked but I've left things exactly as they came out when I recorded it. I particularly like my use of "these are all very bog standard... err... standards"! :-)

Towards the end, I refer to David White's 'visitors vs. residents' stuff, about which I note he has just published a video. Nice one.

Anyway... the talk captures a number of threads that I've been thinking and speaking about for the last while. I hope it is of interest.

September 16, 2009

Edinburgh publish guidance on research data management

The University of Edinburgh has published some local guidance about the way that research data should be managed, Research data management guidance, covering How to manage research data and Data sharing and preservation, as well as detailing local training, support and advice options.

One assumes that this kind of thing will become much more common at universities over the next few years.

Having had a very quick look, it feels like the material is more descriptive than prescriptive - which isn't meant as a negative comment, it just reflects the current state of play. The section on Data documentation & metadata for example, gives advice as simple as:

Have you created a "readme.txt" file to describe the contents of files in a folder? Such a simple act can be invaluable at a later date.

but also provides a link to the UK Data Archive's guidance on Data Documentation and Metadata, which at first sight appears hugely complex. I'm not sure what your average researcher will make of it.

(In passing, I note that the UKDA seem to be promoting the use of the Data Documentation Initiative standard at what they call the 'catalogue' level, a standard that I've not come across before but one that appears to be rooted firmly outside the world of linked data, which is a shame.)

Similarly, the section on Methods for data sharing lists a wide range of possible options (from "posting on a University website" thru to "depositing in a data repository") without being particularly prescriptive about which is better and why.

(As a second aside, I am continually amazed by this firm distinction in the repository world between 'posting on the website' and 'depositing in a repository' - from the perspective of the researcher, both can, and should, achieve the same aims, i.e. improved management, more chance of persistence and better exposure.)

As we have found with repositories of research publications, it seems to me that research data repositories (the Edinburgh DataShare in this case) need to hide much of this kind of complexity, and do most of the necessary legwork, in order to turn what appears to be a simple and obvious 'content management' workflow (from the point of view of the individual researcher) into a well managed, openly shared, long term resource for the community.

August 20, 2009

What researchers think about data preservation and access

There's an interesting report in the current issue of Ariadne by Neil Beagrie, Robert Beagrie and Ian Rowlands, Research Data Preservation and Access: The Views of Researchers, fleshing out some of the data behind the UKRDS Report, which I blogged about a while back.

I have a minor quibble with the way the data has been presented in the report, in that it's not overly clear how the 179 respondents represented in Figure 1 have been split across the three broad areas (Sciences, Social Sciences, and Arts and Humanities) that appear in subsequent figures. One is left wondering how significant the number of responses in each of the three areas was.  I would have preferred to see Figure 1 organised in such a way that the 'departments and faculties' were grouped more obviously into the broad areas.

That aside, I think the report is well worth reading.  I'll just highlight what the authors perceive to be the emerging themes:

  • It is clear that different disciplines have different requirements and approaches to research data.
  • Current provision of facilities to encourage and ensure that researchers have data stores where they can deposit their valuable data for safe-keeping and for sharing, as appropriate, varies from discipline to discipline.
  • Local data management and preservation activity is very important with most data being held locally.
  • Expectations about the rate of increase in research data generated indicate not only higher data volumes but also an increase in different types of data and data generated by disciplines that have not until recently been producing volumes of digital output.
  • Significant gaps and areas of need remain to be addressed.

The Findings of the Scoping Study and Research Data Management Workshop (undertaken at the University of Oxford and part of the work that informed the Ariadne article) provides an indication of the "top requirements for services to help [researchers] manage data more effectively":

  • Advice on practical issues related to managing data across their life cycle. This help would range from assistance in producing a data management/sharing plan; advice on best formats for data creation and options for storing and sharing data securely; to guidance on publishing and preserving these research data.
  • A secure and user-friendly solution that allows storage of large volumes of data and sharing of these in a controlled fashion, allowing fine-grained access control mechanisms.
  • A sustainable infrastructure that allows publication and long-term preservation of research data for those disciplines not currently served by domain specific services such as the UK Data Archive, NERC Data Centres, European Bioinformatics Institute and others.
  • Funding that could help address some of the departmental challenges to manage the research data that are being produced.

Pretty high level stuff so nothing particularly surprising there. It seems to me that some work drilling down into each of these areas might be quite useful.

July 20, 2009

On names

There was a brief exchange of messages on the jisc-repositories mailing list a couple of weeks ago concerning the naming of authors in institutional repositories.  When I say naming, I really mean identifying, because a name, as in a string of characters, doesn't guarantee any kind of uniqueness - even locally, let alone globally.

The thread started from a question about how to deal with the situation where one author writes under multiple names (is that a common scenario in academic writing?) but moved on to a more general discussion about how one might assign identifiers to people.

I quite liked Les Carr's suggestion:

Surely the appropriate way to go forward is for repositories to start by locally choosing a scheme for identifying individuals (I suggest coining a URI that is grounded in some aspect of the institution's processes). If we can export consistently referenced individuals, then global services can worry about "equivalence mechanisms" to collect together all the various forms of reference that.

This is the approach taken by the Resist Knowledgebase, which is the foundation for the (just started) dotAC JISC Rapid Innovation project.

(Note: I'm assuming that when Les wrote 'URI' he really meant 'http URI').
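
By way of illustration (this is my own sketch rather than anything from the thread, and all of the URIs are made up), the 'coin locally, link globally' approach might look something like this in RDF terms, using rdflib:

    # A rough sketch of "coin locally, link globally" using rdflib.
    # All of the URIs here are hypothetical examples, not real identifiers.
    from rdflib import Graph, Literal, URIRef
    from rdflib.namespace import FOAF, OWL

    g = Graph()

    # The repository coins its own http URI for a person, grounded in some
    # institutional process (the staff directory, say).
    local_person = URIRef("http://repository.example.ac.uk/id/person/ajc")
    g.add((local_person, FOAF.name, Literal("A. J. Carter")))

    # A downstream aggregation service (not the repository) records the
    # equivalences it has worked out between that URI and others in the wild.
    g.add((local_person, OWL.sameAs,
           URIRef("http://other.example.org/id/authors/carter-a-j")))

    print(g.serialize(format="turtle"))

The important point, for me, is that the owl:sameAs assertions can live with an aggregator rather than with the repository that coined the URI in the first place.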

Two other pieces of current work seem relevant and were mentioned in the discussion. Firstly, the JISC-funded Names project, which is working on a pilot Names Authority Service. Secondly, the RLG Networking Names report.  I might be misunderstanding the nature of these bits of work but both seem to me to be advocating rather centralised, registry-like, approaches. For example, both talk about centrally assigning identifiers to people.

As an aside, I'm constantly amazed by how many digital library initiatives end up looking and feeling like registries. It seems to be the DL way... metadata registries, metadata schema registries, service registries, collection registries. You name it and someone in a digital library will have built a registry for it.

My favoured view is that the Web is the registry. Assign identifiers at source, then aggregate appropriately if you need to work across stuff (as Les suggests above).  The <sameAs> service is a nice example of this:

The Web of Data has many equivalent URIs. This service helps you to find co-references between different data sets.

As Hugh Glaser says in a discussion about the service:

Our strong view is that the solution to the problem of having all these URIs is not to generate another one. And I would say that with services of this type around, there is no reason.

In thinking about some of the issues here I had cause to go back and re-read a really interesting interview by Martin Fenner with Geoffrey Bilder of CrossRef (from earlier this year).  Regular readers will know that I'm not the world's biggest fan of the DOI (on which CrossRef is based), partly for technical reasons and partly on governance grounds, but let's set that aside for the moment.  In describing CrossRef's "Contributor ID" project, Geoff makes the point that:

... “distributed” begets “centralized”. For every distributed service created, we’ve then had to create a centralized service to make it useable again (ICANN, Google, Pirate Bay, CrossRef, DOAJ, ticTocs, WorldCat, etc.). This gets us back to square one and makes me think the real issue is - how do you make the centralized system that eventually emerges accountable?

I think this is a fair point but I also think there is a very significant architectural difference between a centralised service that aggregates identifiers and other information from a distributed base of services, in order to provide some useful centralised function for example, vs. a centralised service that assigns identifiers which it then pushes out into the wider landscape. It seems to me that only the former makes sense in the context of the Web.

June 19, 2009

Repositories and linked data

Last week there was a message from Steve Hitchcock on the UK [email protected] mailing list noting Tim Berners-Lee's comments that "giving people access to the data 'will be paradise'". In response, I made the following suggestion:

If you are going to mention TBL on this list then I guess that you really have to think about how well repositories play in a Web of linked data?

My thoughts... not very well currently!

Linked data has 4 principles:

  • Use URIs as names for things
  • Use HTTP URIs so that people can look up those names.
  • When someone looks up a URI, provide useful information.
  • Include links to other URIs, so that they can discover more things.

Of these, repositories probably do OK at 1 and 2 (though, as I’ve argued before, one might question the coolness of some of the http URIs in use and, I think, the use of cool URIs is implicit in 2).

3, at least according to TBL, really means “provide RDF” (or RDFa embedded into HTML I guess), something that I presume very few repositories do?

Given lack of 3, I guess that 4 is hard to achieve. Even if one was to ignore the lack of RDF or RDFa, the fact that content is typically served as PDF or MS formats probably means that links to other things are reasonably well hidden?

It’d be interesting (academically at least), and probably non-trivial, to think about what a linked data repository would look like? OAI-ORE is a helpful step in the right direction in this regard.
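
As a purely hypothetical illustration of what 3 and 4 might mean for a single eprint (this is my sketch, not anything that any current repository actually serves, and the URIs are invented), the RDF returned from an item's jump-off URI might look something like the following, here built with rdflib:

    # Sketch of the kind of RDF a repository might serve for an eprint's
    # jump-off URI, touching principles 3 (useful information) and 4 (links
    # to other URIs). All URIs are invented for illustration.
    from rdflib import Graph, Literal, URIRef
    from rdflib.namespace import DCTERMS, FOAF

    g = Graph()
    eprint = URIRef("http://repository.example.ac.uk/id/eprint/1234")
    author = URIRef("http://repository.example.ac.uk/id/person/ajc")
    published_version = URIRef("http://journal.example.com/articles/5678")

    g.add((eprint, DCTERMS.title, Literal("An example research paper")))
    g.add((eprint, DCTERMS.creator, author))                  # a link, not just a name string
    g.add((author, FOAF.name, Literal("A. J. Carter")))
    g.add((eprint, DCTERMS.isVersionOf, published_version))   # a link out to the wider Web

    # Served via content negotiation (or carried as RDFa in the HTML page),
    # this is the sort of thing a linked data client would hope to find.
    print(g.serialize(format="turtle"))

The point is simply that the description relates things by URI rather than by string, so that a client following its nose has somewhere to go next.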

In response, various people noted that there is work in this area: Mark Diggory on work at DSpace, Sally Rumsey (off-list) on the Oxford University Research Archive and parallel data repository (DataBank), and Les Carr on the new JISC dotAC Rapid Innovation project. And I'm sure there is other stuff as well.

In his response, Mark Diggory said:

So the question of "coolness" of URI tends to come in second to ease of implementation and separation of services (concerns) in a repository. Should "Coolness" really be that important? We are trying to work on this issue in DSpace 2.0 as well.

I don't get the comment about "separation of services". Coolness of URIs is about persistence. It's about our long term ability to retain the knowledge that a particular URI identifies a particular thing and to interact with the URI in order to obtain a representation of it. How coolness is implemented is not important, except insofar as it doesn't impact on our long term ability to meet those two aims.

Les Carr also noted the issues around a repository minting URIs "for things it has no authority over (e.g. people's identities) or no knowledge about (e.g. external authors' identities)" suggesting that the "approach of dotAC is to make the repository provide URIs for everything that we consider significant and to allow an external service to worry about mapping our URIs to "official" URIs from various "authorities"". An interesting area.

As I noted above, I think that the work on OAI-ORE is an important step in helping to bring repositories into the world of linked data. That said, there was some interesting discussion on Twitter during the recent OAI6 conference about the value of ORE's aggregation model, given that distinct domains will need to layer their own (different) domain models onto those aggregations in order to do anything useful. My personal take on this is that it probably is useful to have abstracted out the aggregation model, but the hypothesis that a primitive aggregation is useful, even though every domain needs to add its own richer model on top, still has to be tested and, indeed, we need to see whether the way the ORE model gets applied in the field turns out to be sensible and useful.

March 20, 2009

Unlocking Audio

I spent the first couple of days this week at the British Library in London, attending the Unlocking Audio 2 conference.  I was there primarily to give an invited talk on the second day.

You might notice that I didn't have a great deal to say about audio, other than to note that what strikes me as interesting about the newer ways in which I listen to music online (specifically Blip.fm and Spotify) is that they are both highly social (almost playful) in their approach and that they are very much of the Web (as opposed to just being 'on' the Web).

What do I mean by that last phrase?  Essentially, it's about an attitude.  It's about seeing being mashed as a virtue.  It's about an expectation that your content, URLs and APIs will be picked up by other people and re-used in ways you could never have foreseen.  Or, as Charles Leadbeater put it on the first day of the conference, it's about "being an ingredient".

I went on to talk about the JISC Information Environment (which is surprisingly(?) not that far off its 10th birthday if you count from the initiation of the DNER), using it as an example of digital library thinking more generally and suggesting where I think we have parted company with the mainstream Web (in a generally "not good" way).  I noted that while digital library folks can discuss identifiers forever (if you let them!) we generally don't think a great deal about identity.  And even where we do think about it, the approach is primarily one of, "who are you and what are you allowed to access?", whereas on the social Web identity is at least as much about, "this is me, this is who I know, and this is what I have contributed". 

I think that is a very significant difference - it's a fundamentally different world-view - and it underpins one critical aspect of the difference between, say, Shibboleth and OpenID.  In digital libraries we haven't tended to focus on the social activity that needs to grow around our content and (as I've said in the past) our institutional approach to repositories is a classic example of how this causes 'social networking' issues with our solutions.

I stole a lot of the ideas for this talk, not least Lorcan Dempsey's use of concentration and diffusion.  As an aside... on the first day of the conference, Charles Leadbeater introduced a beach analogy for the 'media' industries, suggesting that in the past the beach was full of a small number of large boulders and that everything had to happen through those.  What the social Web has done is to make the beach into a place where we can all throw our pebbles.  I quite like this analogy.  My one concern is that many of us do our pebble throwing in the context of large, highly concentrated services like Flickr, YouTube, Google and so on.  There are still boulders - just different ones?  Anyway... I ended with Dave White's notions of visitors vs. residents, suggesting that in the cultural heritage sector we have traditionally focused on building services for visitors but that we need to focus more on residents from now on.  I admit that I don't quite know what this means in practice... but it certainly feels to me like the right direction of travel.

I concluded by offering my thoughts on how I would approach something like the JISC IE if I was asked to do so again now.  My gut feeling is that I would try to stay much more mainstream and focus firmly on the basics, by which I mean adopting the principles of linked data (about which there is now a TED talk by Tim Berners-Lee), cool URIs and REST and focusing much more firmly on the social aspects of the environment (OpenID, OAuth, and so on).

Prior to giving my talk I attended a session about iTunesU and how it is being implemented at the University of Oxford.  I confess a strong dislike of iTunes (and iTunesU by implication) and it worries me that so many UK universities are seeing it as an appropriate way forward.  Yes, it has a lot of concentration (and the benefits that come from that) but its diffusion capabilities are very limited (i.e. it's a very closed system), resulting in the need to build parallel Web interfaces to the same content.  That feels very messy to me.  That said, it was an interesting session with more potential for debate than time allowed.  If nothing else, the adoption of systems about which people can get religious serves to get people talking/arguing.

Overall then, I thought it was an interesting conference.  I suspect that my contribution wasn't liked by everyone there - but I hope it added usefully to the debate.  My live-blogging notes from the two days are here and here.

March 05, 2009

A National Research Data Service for the UK?

I attended the A National Research Data Service for the UK? meeting at the Royal Society in London last week and my live-blogged notes are available for those who want more detail.  Chris Rusbridge also blogged the day on the Digital Curation Blog - session 1, session 2, session 3 and session 4.  FWIW, I think that Chris's posts are more comprehensive and better than my live-blogged notes.

The day was both interesting and somewhat disappointing...

Interesting primarily because of the obvious political tension in the room (which I characterised on Twitter as a potential bun-fight between librarians and the rest but which in fact is probably better summed up as a lack of shared agreement around centralist (discipline-based) solutions vs. institutional solutions).

Disappointing because the day struck me more as a way of presenting a done-deal than as a real opportunity for debate.

The other thing that I found annoying was the constant parroting of the view that "researchers want to share their data openly" as though this is an obvious position.  The uncomfortable fact is that even the UKRDS report's own figures suggest that less than half (43%) of those surveyed "expressed the need to access other researchers' data" - my assumption therefore is that the proportion currently willing to share their data openly will be much smaller.

Don't take this as a vote against open access, something that I'm very much in favour of.  But, as we've found with eprint archives, a top-down "thou shalt deposit because it is good for you" approach doesn't cut it with researchers - it doesn't result in cultural change.  Much better to look for, and actively support, those areas where open sharing of data occurs naturally within a community or discipline, thus demonstrating its value to others.

That said, a much more fundamental problem facing the provision of collaborative services to the research community is that funding happens nationally but research happens globally (or at least across geographic/funding boundaries) - institutions are largely irrelevant whichever way you look at it [except possibly as an agent of long term preservation - added 6 March 2009].  Resolving that tension seems paramount to me though I have no suggestions as to how it can be done.  It does strike me however that shared discipline-based services come closer to the realities of the research world than do institutional services.

March 04, 2009

How uncool? Repository URIs...

I've been using the OpenDOAR API to take a quick look at the coolness of the URIs that people in the UK are assigning to their repositories.  Why does coolness matter?  Because uncool URIs are more likely to break than cool URIs and broken URIs for the content of academic scholarly repositories will probably cause disruption to the smooth flow of scholarly communication at some point in the future.

So what is an uncool URI?  An uncool URI is one that is unlikely to be persistent, typically because the person who first assigned it didn't think hard enough about likely changes in organisational structure, policy or technology and the impact that changes in those areas might have on the persistence of the URI into the future.

In short - URIs don't break... people break them and, usually, the seeds for that breakage are sown at the point that a URI is minted.

OK, so first... hats off to the OpenDOAR team for providing such an easy to use API, one which made it simple to get at the data I was interested in - the URIs of the home pages of all the institutional repositories in the UK - by using the following link:

http://www.opendoar.org/api13.php?co=gb&rt=2&show=min

This provides a list of 107 repositories (as of 23 Feb 2009) as an XML file. Here's just the URIs of the repository home pages, broken out into the first part of the domain name, the institutional part of the domain name, the port, and the rest of the path (as a CSV file for easy loading into a spreadsheet).

In the following analysis, I'm making the assumption that the URI of the repository home page is carried thru into the URIs of all the items within that repository and that, if the home page URI is uncool then it is likely that the URIs for everything within that repository will be likewise.  This feels like a reasonable assumption to me, though I haven't actually checked it out.
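
For what it's worth, the analysis itself amounts to little more than the following (a sketch along the lines of what I did rather than the exact script; in particular, the XML element name is an assumption about the api13 output):

    # Rough sketch: fetch the OpenDOAR list of UK repositories and break each
    # home-page URL into the parts discussed below.
    import csv
    import sys
    import urllib.request
    import xml.etree.ElementTree as ET
    from urllib.parse import urlparse

    API = "http://www.opendoar.org/api13.php?co=gb&rt=2&show=min"

    with urllib.request.urlopen(API) as response:
        tree = ET.parse(response)

    writer = csv.writer(sys.stdout)
    writer.writerow(["first_label", "rest_of_host", "port", "path"])

    # 'rUrl' is an assumption about the element carrying the repository home
    # page URL in the api13 XML; adjust to whatever the feed actually returns.
    for url_element in tree.iter("rUrl"):
        parsed = urlparse(url_element.text.strip())
        first_label, _, rest_of_host = parsed.hostname.partition(".")
        writer.writerow([first_label, rest_of_host, parsed.port or 80, parsed.path])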

So... what do we find?

Looking at the first part of the domain name, we find 7 institutions using 'dspace' as part of the repository domain name and 35 using 'eprints'.  Both are, presumably, derived from the technology being used as the repository platform.  Building this information into the URL is potentially uncool (because that technology might well be changed in the future).  Now, in both cases, I suspect that institutions might argue that they would stick with their use of 'eprints' and 'dspace' (particularly the former) even if the underlying technology was to change (on the basis that these terms have become somewhat generic).  I'm not totally convinced by that argument, though I think it holds more water in the case of 'eprints' than it does in the case of 'dspace' but in any case, I would argue that this is something definitely worth thinking about.

Note that 10 institutions (with some cross-over between the two counts) have built 'dspace' into the path part of the repository URL, which is uncool for the same reasons.

3 institutions have built a non-standard port (i.e. not port 80) into their repository URL.  Whilst this isn't necessarily uncool, it does warrant a question as to why it has been done and whether it will cause maintenance headaches into the future.

Looking at the path part of the URLs, 3 institutions have built the underlying technology (.htm, .php and .aspx) into their URLs - again, this is uncool because of the likelihood of future changes in technology.

A small number of institutions have built the library into their repository URLs.  This is probably OK but reflects a commitment to organisational structure that may not be warranted longer term?

Similarly, a larger number have built a, err..., 'jazzy' project name into their repository URL.  I would have thought it might be safer to stick to 'functional' labels like 'research' than, say, 'opus', at least for the URLs, since this seems less likely to change because of political or other organisational issues into the future.

Finally, 4 institutions have outsourced their repository to openrepository.com, resulting in URLs under that domain.  Outsourcing is good (I say that not least because I work for a charity that is in that business!) but I would strongly suggest outsourcing in a form that keeps your institutional domain name as part of the URL so that your URLs don't break if your host goes bust or you decide to move things back into the institution or to another provider.

Overall then, it's another 'could try harder' mid-term report from me to the Repository 101 course members.

February 19, 2009

What is ORE for, really?

Pete has rather nicely answered the question, "What is ORE, really?".  In response, I'm tempted to ask a slightly different question, "What is ORE for, really?".  In the ORE User Guide - Primer we find a 'Motivating Example' section which lays out some hard-to-reject statements about the importance of aggregations but which doesn't give us many verbs - it doesn't tell us what it is we can expect to be able to do to those aggregations, nor why we might want to.  The previous introductory section does propose three sample uses:

Because aggregations are not well-defined on the Web, we are limited in what we can do with them, especially in terms of the services or automated processes that make the Web useful. People who wish to save or print a multiple page document must manually click through each page and invoke the appropriate browser command. Programs that transfer multiple page documents among information systems must rely on the API's of the individual system architectures and their definition of document boundaries. Search engines must use heuristics to group individual Web pages into logical documents so that search results have the proper granularity.

On the face of it these are perfectly valid functional requirements but I think the underlying point that Pete makes in his post is that ORE, on its own, doesn't meet them.  The necessary knowledge that allows one bit of software to say, "ah, these are the pages of a document and I need to print them in this order" or "these are the boundaries of a document" or "it makes sense to group these individual Web pages in this way" based on the data it gets from another bit of software is not captured by ORE.  Life is not as simple as saying "here is an aggregation" because the aggregation might not be a set of printable pages from a document, or a set of Web pages, or a coherent set of anything else for that matter and there is very little in ORE that tells you anything about the relationship(s) between the things in the aggregation or their relationship to the outside world.  And if ORE doesn't meet its own functional requirements particularly well, it is even further from the kind of functional requirements we envisaged in the work on SWAP.  Requirements like, "show me the latest freely available version of this research paper".

Now, I accept that ORE does provide a way of layering that additional information (which might be in the form of SWAP for example) over the top of the aggregation.  On that basis the pertinent questions, or so it seems to me, are "given that we probably need that extra level of information to do anything useful with the aggregation, is the information about the aggregation useful on its own?" and "does SWAP capture the right level of detail and is it realistic to expect real-world systems to handle this level of complexity?".

I think the jury is out on both.  (Note: I am certainly not arguing that SWAP is better than ORE - they are sufficiently different for that to be a pointless statement anyway and the bottom line is that I'm not completely sure that I'm convinced by either if I'm absolutely honest.)  I would say that in the world of learning objects there is quite a long history of treating things as reasonably unrefined aggregations (usually referred to as 'content packages') and that in that space the usefulness of that approach has been fairly minimal.

February 17, 2009

What is ORE, really? (A personal viewpoint)

This is another post that I've had sitting around in draft for ages, but which some recent discussion has prompted me to dig out and try to finish. Chris Keene commented on my post of some time ago about the publication of OAI ORE specs, asking for some clarification on what it is that OAI ORE provides, "what ORE is", I suppose, and I promised I'd take a stab at answering. I guess I should emphasise that this is my personal view only, but here's my attempt at a response to Chris' questions.

is it a protocol like OAI-PMH or a file standard? I read a primer (somewhat quickly) and it seems to be almost a XML file specification to be read over HTTP, which describes a resource such as a repository? is that right?

I think it's helpful - and maybe why I think it's important will become clearer by the end of this post - to distinguish between the parts of the ORE specifications which are specific to ORE and the parts which provide guidance on how to apply principles and conventions which have been defined outside of the ORE context, are not dependent on the use of the ORE-specific parts of ORE, and are more general in their application. (The distinction I'm making here doesn't quite match the separation ORE itself makes between "Specifications" and "User Guides".)

Some parts of the ORE specifications are "ORE-specific": they define or describe things that aren't defined or described elsewhere. Those things are:

  1. A simple data model for the things ORE calls a Resource Map, an Aggregation and a Proxy. This is defined by the Abstract Data Model document. Here the term "data model" is used in the sense of a "model of (some part of) the world", a "domain model", if you like - though in the ORE case, it is intended to be quite a generally applicable one.
  2. An RDF vocabulary used, in association with terms from some existing RDF vocabularies, for representing instances of that model. This is defined in human-readable form by the Vocabulary document, and in machine-processable form by the RDF/XML "namespace document" http://www.openarchives.org/ore/terms/.
  3. A variant of what I might call - following the terminology used by Alistair Miles - a "Graph Profile", a specification of some structural constraints on an RDF graph which should be met if that graph is to serve as an ORE Resource Map, a set of "triple patterns", if you like, for the triples that make up an ORE Resource Map. This is defined in Section 6 of the Abstract Data Model document.
  4. A set of conventions for representing an ORE Resource Map as an Atom Entry Document, using the Atom Syndication Format. This is defined by the ORE document Resource Map Implementation in Atom.
  5. A set of conventions for disclosing and discovering ORE Resource Maps, defined by the document Resource Map Discovery. Some of these are applications of existing conventions, but as there are some ORE-specific aspects (e.g. the definition of http://www.openarchives.org/ore/html/ as an HTML profile specifying the use of "resourcemap" as an X/HTML link type), I'm including it in this list.

Those are the things I tend to focus on when I try to answer the question "What is ORE, really?"
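
To make that a little more concrete, here is a minimal sketch of a Resource Map and its Aggregation expressed as triples (my own example, not one taken from the ORE documents, and the URIs are invented):

    # Minimal sketch of an ORE Resource Map: the resource map describes an
    # aggregation, which in turn aggregates some member resources.
    # The URIs are invented for illustration.
    from rdflib import Graph, Namespace, URIRef
    from rdflib.namespace import DCTERMS

    ORE = Namespace("http://www.openarchives.org/ore/terms/")

    rem = URIRef("http://repository.example.ac.uk/eprint/1234/rem")  # the Resource Map
    agg = URIRef("http://repository.example.ac.uk/eprint/1234/agg")  # the Aggregation

    g = Graph()
    g.add((rem, ORE.describes, agg))
    g.add((agg, ORE.isDescribedBy, rem))
    g.add((agg, ORE.aggregates, URIRef("http://repository.example.ac.uk/eprint/1234/paper.pdf")))
    g.add((agg, ORE.aggregates, URIRef("http://repository.example.ac.uk/eprint/1234/data.csv")))
    g.add((rem, DCTERMS.creator, URIRef("http://repository.example.ac.uk/id/person/ajc")))

    print(g.serialize(format="turtle"))

Anything beyond that ore:describes / ore:aggregates structure - the types of the aggregated resources, the relationships between them, and so on - has to come from other vocabularies layered on top.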

In addition to those ORE-specific elements, the ORE specifications also provide guidelines for how to make use of various other existing specifications and conventions when deploying the ORE model:

  1. The two documents, Resource Map Implementation in RDF/XML and Resource Map Implementation in RDFa, describe how to use those two existing syntaxes, defined by W3C Recommendations, to represent Resource Maps.
  2. The document HTTP Implementation describes how to apply the principles and patterns defined by the W3C TAG's httpRange-14 resolution and the Cool URIs for the Semantic Web document.

For the most part, these documents don't really provide new information, at least in the same way those noted above do: instead, they indicate how to apply some existing, more general specifications when making use of the ORE-specific specifications listed above.

That's not to say they aren't useful guidelines: they are, not least because they "contextualise" the general information provided by the more general specifications, and provide ORE-specific examples of their use. The ORE HTTP Implementation document selects from the patterns of the Cool URIs document and provides illustrations of their use for the URIs of Aggregations and Resource Maps.

My main point here is that I think it's important - particularly for audiences who are perhaps encountering some of these more general principles and conventions for the first time in the specific context of ORE - to "decouple" these two aspects, and to make clear that the use of these principles and conventions is not dependent on the ORE-specific parts, and they can - and indeed should - be applied in other contexts too. More on that later.

To answer Chris' specific questions above: no, ORE isn't a protocol; no, it isn't (what I think of as) a "file standard", though it describes the use of some existing formats; and while ORE does deal with the description of things, the things it deals with are what it calls "aggregations", not "repositories", at least as that term is typically used in the OAI context, to refer to a system that supports some functions. The concept of a repository doesn't feature in ORE.

And I'm not sure how it fits in with OAI-PMH does it replace, or improve, or cater for different needs (they both seem to cater for getting an item from one system to another).

I think ORE is largely orthogonal to OAI-PMH. ORE was not designed to "replace" or "improve" OAI-PMH. ORE can be used independently of OAI-PMH, or, as I think the Discovery document illustrates, it can be used in the context of OAI-PMH, i.e. you could expose ORE Resource Maps as metadata records over OAI-PMH.

Having said that, I do think the approaches underpinning ORE provide at least some hints of how the sort of functionality which is currently provided by OAI-PMH in an RPC-like idiom, where a client "harvester" sends protocol-specific requests to a "repository", might be offered using a more "resource-oriented" approach. Here, I'm not using the term "resource-oriented" to highlight a distinction between "resource" and "metadata", but rather to emphasise the notion of treating all the "items of interest" to the application as "resources" in the sense that the Web Architecture uses that term, assigning them URIs, and supporting interaction using the uniform interface defined by the HTTP protocol. And those "items of interest" can include resources which are descriptions of other resources, and resources which are collections of resources - collections based on various criteria. Anyway, it isn't my intention here to embark on specifying an alternative approach to OAI-PMH. :-)

Chris also asked:

And what about things like SWAP and SWORD?

Let's take the case of SWORD first, as it's the one I know less about! :-) I'm not a SWORD/Atompub expert at all, but I think ORE is independent of SWORD while being designed to be usable in the context of SWORD, i.e. in principle at least, an ORE Resource Map could form the subject of a SWORD "deposit". Richard Jones ponders three variant approaches, and there is some discussion on the OAI ORE Google Group.

The case of the Scholarly Works Application Profile (SWAP) raises some issues which I think illustrate some of the points I was making above about the wider applicability of some of the conventions used within ORE.

First, I think there are differences in "scope and purpose". SWAP focuses very specifically on the "eprint" and on supporting a more or less well-defined set of operations, particularly operations related to "versioning" and the various types of relationships between resources which one encounters when dealing with those issues; ORE focuses on a rather simpler, more generic concept of "aggregation" and membership of a set. Having said that, the ORE model can also be applied to the case of the eprint, and indeed some of the examples in the specifications and in supporting presentations use examples of applying ORE to eprint resources.

Second, again as noted above, ORE makes use of some general principles and patterns for exposing resources and resource descriptions on the Web. But those principles and patterns are equally applicable in the context of data models other than ORE; what ORE calls a "Resource Map" is a specialised case of an RDF graph, and the HTTP patterns for providing access to a Resource Map are applications of patterns which can be - and are - applied to provide access to data describing resources of any type - including resources of the type described by SWAP. It isn't necessary to make use of the ORE concept of the Aggregation to use those patterns.

Now then, it is true that the SWAP documentation does not make reference to these patterns, but that is probably because of two considerations. First, at the time of its development, the primary context of use considered was that of exposing data over OAI-PMH. Second, although the httpRange-14 resolution had been agreed, it hadn't been as widely disseminated/popularised  as it has been subsequently, particularly in the form of the Linked Data tutorial and the Cool URIs document. But as I discussed in a recent post, those same principles and patterns used in ORE can be applied to the FRBR case - and if SWAP was being developed now, I'm sure reference to those approaches would be included. (Well, they would if I had any input to the process!)

Third, picking up on my attempt above to identify what I think are the "core" characteristics of ORE, ORE and SWAP are based on two different "models of the world", both of which can be applied to the case of the eprint. From the perspective of the ORE model, the eprint is viewed as an aggregation made up of a number of component/member resources; with SWAP, the perspective is that of the FRBR model - a Work realised in one or more Expressions, each embodied in one or more Manifestations, each exemplified by one or more Items (possibly with relationships between this Work and other Works, between Expressions of the same or different Works, between Works, Expressions etc and Agents, and so on).

In the FRBR case, although, as in the ORE case, there are multiple related resources involved, there isn't necessarily a notion of "aggregation" involved: a FRBR Work (or indeed any of the FRBR Group 1 entities) may be a composite/aggregate resource, but it isn't necessarily the case. There is nothing in FRBR that treats, say, the set of all the Items which exemplify the Manifestations of the Expressions of a single Work as a single aggregate entity - but FRBR does allow for the expression of whole/part relationships between instances of the various Group 1 entities.
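
By way of a purely illustrative sketch (the URIs, and the choice of an RDF expression of FRBR, are mine rather than anything prescribed by ORE or SWAP), the same eprint might be expressed under the two models along these lines:

    # Illustrative only: the same eprint seen as an ORE Aggregation and as a
    # FRBR-style chain of entities. All URIs are invented; the FRBR vocabulary
    # used here is an assumed RDF expression of FRBR, not part of SWAP itself.
    from rdflib import Graph, Namespace, URIRef

    ORE = Namespace("http://www.openarchives.org/ore/terms/")
    FRBR = Namespace("http://purl.org/vocab/frbr/core#")

    base = "http://repository.example.ac.uk/eprint/1234/"
    g = Graph()

    # The ORE view: a flat aggregation of member resources.
    agg = URIRef(base + "agg")
    g.add((agg, ORE.aggregates, URIRef(base + "accepted-version.pdf")))
    g.add((agg, ORE.aggregates, URIRef(base + "published-version.pdf")))

    # The FRBR/SWAP-ish view: Work -> Expression -> Manifestation -> Item.
    work = URIRef(base + "work")
    expression = URIRef(base + "expression/accepted")
    manifestation = URIRef(base + "manifestation/pdf")
    item = URIRef(base + "accepted-version.pdf")
    g.add((work, FRBR.realization, expression))
    g.add((expression, FRBR.embodiment, manifestation))
    g.add((manifestation, FRBR.exemplar, item))

    print(g.serialize(format="turtle"))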

So, I think it is important to remember that the choice to use either ORE or SWAP to model an eprint is just that: a modeling choice, one which enables certain functionality on the basis of the data created. Depending on what we want to achieve with the data, different choices may be appropriate.

So to return to Chris' question, it seems to me the core difference between ORE and SWAP is that they offer different models which can be applied to the "eprint". And here, I think I'm revisiting the point that, quite some time ago now, Andy made in terms of contrasting what he called "compound objects" and "complex objects". I must admit I didn't and don't like the term "complex object" - if I describe a set and its members, I understand that the set is the "compound object", but if I describe a document and its three authors, or a FRBR Work, its Expressions, their Manifestations, their Items, and a number of related Agents, which one of them is the "complex object"? - but the point remains a good one: many of the functions we wish to perform rely on our capacity to represent relationships other than relationships of "aggregation" or "composition".

Of course, the ORE concept of the Resource Map does allow for the expression of any other types of relationship, in addition to the required ore:aggregates relationship (and I think using ORE and FRBR together would require some careful analysis, given the nature of whole/part relationships in FRBR); but one can also construct descriptions expressing other types of relationship, and make those descriptions available using the community-agreed conventions of the Cool URIs document, without using ORE.

So, that turned into another rather rambling post, and I'm not sure how much it helps, but that's my take on "what ORE is".

February 11, 2009

Repository usability - take 2

OK... following my 'rant' yesterday about repository user-interface design generally (and, I suppose, the Edinburgh Research Archive in particular), Chris Rusbridge suggested I take a similar look at an ePrints.org-based repository and pointed to a research paper by Les Carr in the University of Southampton School of Electronics and Computer Science repository by way of example.  I'm happy to do so though I'm going to try and limit myself to a 10 minute survey of the kind I did yesterday.

The paper in question was originally published in The Computer Journal (Oxford University Press) and is available from http://comjnl.oxfordjournals.org/cgi/content/abstract/50/6/703 though I don't have the necessary access rights to see the PDF that OUP make available.  (In passing, it's good to see that OUP have little or no clue about Cool URIs, resorting instead to the totally useless (in Web terms at least) DOI as text string, "doi:10.1093/comjnl/bxm067" as their means of identification :-( ).

The jump-off page for the article in the repository is at http://eprints.ecs.soton.ac.uk/14352/, a URL that, while it isn't too bad, could probably be better.  How about replacing 'eprints.ecs' by 'research', for example, to mitigate against changes in repository content (things other than eprints) and organisational structure (the day Computer Science becomes a separate school)?

The jump-off page itself is significantly better in usability terms than the one I looked at yesterday.  The page <title> is set correctly for a start.  Hurrah!  Further, the link to the PDF of the paper is near the top of the page and a mouse-over pop-up shows clearly what you are going to get when you follow the link.  I've heard people bemoaning the use of pop-ups like this in usability terms in the past but I have to say, in this case, I think it works quite well.  On the downside, the link text is just 'PDF' which is less informative than it should be.

Following the abstract, a short list of information about the paper is presented.  Author names are linked (good) though for some reason keywords are not (bad).  I have no idea what a 'Performance indicator' is in this context, even less so the value "EZ~05~05~11".  Similarly, I don't see what use the ID Code is and I don't know whether Last Modified refers to the paper or the information about the paper.  On that basis, I would suggest some mouse-over help text to explain these terms to end-users like myself.

The 'Look up in Google Scholar' link fails to deliver any useful results, though I'm not sure if that is a fault on the part of Google Scholar or the repository?  In any case, a bit of Ajax that indicated how many results that link was going to return would be nice (note: I have no idea off the top of my head if it is possible to do that or not).

Each of the references towards the bottom of the page has a 'SEEK' button next to it (why uppercase?).  As with my comments yesterday, this is a button that acts like a link (from my perspective as the end-user) so it is not clear to me why it has been implemented in the way it has (though I'm guessing that it is to do with limitations in the way Paracite (the target of the link) has been implemented).  My gut feeling is that there is something unRESTful in the way this is working, though I could be wrong.  In any case, it seems to be using an HTTP POST request where an HTTP GET would be more appropriate?

There is no shortage of embedded metadata in the page, at least in terms of volume, though it is interesting that <meta name="DC.subject" ... > is provided whereas the far more useful <meta name="keywords" ... > is not.

The page also contains a large number of <link rel="alternate" ... > tags in the page header - matching the wide range of metadata formats available for manual export from the page (are end-users really interested in all this stuff?) - so many in fact, that I question how useful these could possibly be in any real-world machine-to-machine scenario.

Overall then, I think this is a pretty good HTML page in usability terms.  I don't know how far this is an "out of the box" ePrints.org installation or how much it has been customised but I suggest that it is something that other repository managers could usefully take a look at.

Usability and SEO don't centre around individual pages of course, so the kind of analysis that I've done here needs to be much broader in its reach, considering how the repository functions as a whole site and, ultimately, how the network of institutional repositories and related services (since that seems to be the architectural approach we have settled on) function in usability terms.

Once again, my fundamental point here is not about individual repositories.  My point is that I don't see the issues around "eprint repositories as a part of the Web" featuring high up the agenda of our discussions as a community (and I suggest the same is true of  learning object repositories), in part because we have allowed ourselves to get sidetracked by discussion of community-specific 'interoperability' solutions that we then tend to treat as some kind of magic bullet, rolling them out whenever someone questions one approach or another.

Even where usability and SEO are on the agenda (as appears to be the case here), it's not enough that individual repositories think about the issues, even if some or most make good decisions, because most end-users (i.e. researchers) need to work across multiple repositories (typically globally) and therefore we need the system as a whole to work well in usability terms.  We therefore need to think about these issues as a community.

February 10, 2009

Repository usability

In his response to my previous post, Freedom, Google-juice and institutional mandates, Chris Rusbridge used one of his Ariadne articles as an illustrative example.

By way of, err... reward, I want to take a quick look (in what I'm going to broadly call 'usability' terms) at the way in which that article is handled by the Edinburgh Research Archive (ERA).  Note that I'm treating the ERA as an example here - I don't consider it to be significantly different to other institutional repositories and, on that basis, I assume that most of what I am going to say will also apply to other repository implementations.

Much of this is basic Web 101 stuff...

The original Ariadne article is at http://www.ariadne.ac.uk/issue46/rusbridge/ - an HTML document containing embedded links to related material in the References section (internally linked from the relevant passage in the text).  The version deposited into ERA is a 9 page PDF snapshot of the original article.  I assume that PDF has been used for preservation reasons, though I'm not sure.  Hypertext links in the original HTML continue to work in the PDF version.

So far, so good.  I would naturally tend to assume that the HTML version is more machine-readable than the PDF version and on that basis is 'better', though I admit that I can't provide solid evidence to back up that statement.

The repository 'jump-off' page for the article is at http://www.era.lib.ed.ac.uk/handle/1842/1476 though the page itself tells us (in a human-readable way) that we should use http://hdl.handle.net/1842/1476 for citation purposes.

So we already have 4 URLs for this article and no explicit machine-readable knowledge that they all identify the same resource.  Further, the URLs that 15 years of using a Web browser lead me to use most naturally (those of the jump-off page, the original Ariadne article or the PDF file) are not the one that the page asks me to use for citation purposes.  So, in Web usability terms, I would most naturally bookmark (e.g. using del.icio.us) the wrong URL for this article and where different scholars choose to bookmark different URLs, services like del.icio.us are unlikely to be able to tell that they are referring to the same thing (recent experience of Google Scholar notwithstanding).

OK, so now let's look more closely at the jump-off page...

Firstly, what is the page title (as contained in the HTML <title> tag)?  Something useful like "Excuse Me... Some Digital Preservation Fallacies?".  No, it's "Edinburgh Research Archive : Item 1842/1476". Nice!? Again, if I bookmark this page in del.icio.us, that is the label that is going to appear next to the URL, unless I manually edit it.

Secondly, what other metadata and/or micro-formats are embedded into this page?  All that nice rich Dublin Core metadata that is buried away inside the repository?  Nah.  Nothing.  A big fat zilch.  Not even any <meta name="keywords" ...> stuff.  I mean, come on.  The information is there on the page right in front of me... it's just not been marked up using even the most basic of HTML tags.  Most university Web site managers would get shot for turning out this kind of rubbish HTML.

Note that I'm not asking for embedded Dublin Core metadata here - I'm asking for useful information to be embedded in useful (and machine-readable) ways where there are widely accepted conventions for how to do that.
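
The kind of check I'm doing here is easy to automate, incidentally. A rough sketch along the following lines - not a polished tool, and the example URL in the final comment is just the page discussed above - is enough to see what a jump-off page actually exposes to robots and bookmarking services:

    # Rough sketch: report the <title> and any <meta name="..."> tags that a
    # repository jump-off page exposes.
    from html.parser import HTMLParser
    import urllib.request

    class JumpOffPageAudit(HTMLParser):
        def __init__(self):
            super().__init__()
            self.in_title = False
            self.title = ""
            self.meta = []

        def handle_starttag(self, tag, attrs):
            if tag == "title":
                self.in_title = True
            elif tag == "meta":
                attributes = dict(attrs)
                if "name" in attributes:
                    self.meta.append((attributes["name"], attributes.get("content", "")))

        def handle_endtag(self, tag):
            if tag == "title":
                self.in_title = False

        def handle_data(self, data):
            if self.in_title:
                self.title += data

    def audit(url):
        with urllib.request.urlopen(url) as response:
            page = response.read().decode("utf-8", errors="replace")
        parser = JumpOffPageAudit()
        parser.feed(page)
        print("title:", parser.title.strip() or "(none)")
        if not parser.meta:
            print("no <meta name=...> tags at all")
        for name, content in parser.meta:
            print("meta:", name, "=", content)

    # audit("http://www.era.lib.ed.ac.uk/handle/1842/1476")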

So, let's look at those human-readable keywords again.  Why aren't they hyperlinked to all other entries in ERA that use those keywords (in the way that Flickr and most other systems do with tags)?  Yes, the institutional repository architectural approach means that we'd only get to see other stuff in ERA, not all that useful I'll grant you, but it would be better than nothing.

Similarly, what about linking the author's name to all other entries by that author?  Ditto with the publisher's name.  Let's encourage a bit of browsing here, shall we?  This is supposed to be about resource discovery after all!

So finally, let's look at the links on the page.  There at the bottom is a link labelled 'View/Open' which takes me to the PDF file - phew, the thing I'm actually looking for!  Not the most obvious spot on the page but I got there in the end.  Unfortunately, I assume that every single other item in ERA uses that exact same link text for the PDF (or other format) files.  Link text is supposed to indicate what is being linked to - it's a kind of metadata for goodness sake.

And then, right at the bottom of the page, there's a button marked "Show full item record".  I have no idea what that is but I'll click on it anyway.  Oh, it's what other services call "Additional information".  But why use an HTML form button to hide a plain old hypertext link?  Strange or what?

OK, I apologise... I've lapsed into sarcasm for effect.  But the fact remains that repository jump-off pages are potentially some of the most important Web pages exposed by universities (this is core research business after all) yet they are nearly always some of the worst examples of HTML to be found on the academic Web.  I can draw no other conclusion than that the Web is seen as tangential in this space.

I've taken 10 minutes to look at these pages... I don't doubt that there are issues that I've missed.  Clearly, if one took time to look around at different repositories one would find examples that were both better and worse (I'm honestly not picking on ERA here... it just happened to come to hand).  But in general, this stuff is atrocious - we can and should do better.

Freedom, Google-juice and institutional mandates

[Note: This entry was originally posted on the 9th Feb 2009 but has been updated in light of comments.]

An interesting thread has emerged on the American Scientist Open Access Forum based on the assertion that in Germany "freedom of research forbids mandating on university level" (i.e. that a mandate to deposit all research papers in an institutional repository (IR) would not be possible legally).  Now, I'm not familiar with the background to this assertion and I don't understand the legal basis on which it is made.  But it did cause me to think about why there might be an issue related to academic freedom caused by IR deposit mandates by funders or other bodies.

In responding to the assertion, Bernard Rentier says:

No researcher would complain (and consider it an infringement upon his/ her academic freedom to publish) if we mandated them to deposit reprints at the local library. It would be just another duty like they have many others. It would not be terribly useful, needless to say, but it would not cause an uproar. Qualitatively, nothing changes. Quantitatively, readership explodes.

Quite right. Except that the Web isn't like a library so the analogy isn't a good one.

If we ignore the rarefied, and largely useless, world of resource discovery based on the OAI-PMH and instead consider the real world of full-text indexing, link analysis and, well... yes, Google then there is a direct and negative impact of mandating a particular place of deposit. For every additional place that a research paper surfaces on the Web there is a likely reduction in the Google-juice associated with each instance caused by an overall diffusion of inbound links.

So, for example, every researcher who would naturally choose to surface their paper on the Web in a location other than their IR (because they have a vibrant central (discipline-based) repository (CR) for example) but who is forced by mandate to deposit a second copy in their local IR will probably see a negative impact on the Google-juice associated with their chosen location.

Now, I wouldn't argue that this is an issue of academic freedom per se, and I agree with Bernard Rentier (earlier in his response) that the freedom to "decide where to publish is perfectly safe" (in the traditional academic sense of the word 'publish'). However, in any modern understanding of 'to publish' (i.e. one that includes 'making available on the Web') then there is a compromise going on here.

The problem is that we continue to think about repositories as if they were 'part of a library', rather than as a 'true part of the fabric of the Web', a mindset that encourages us to try (and fail) to redefine the way the Web works (through the introduction of things like the OAI-PMH for example) and that leads us to write mandates that use words like 'deposit in a repository' (often without even defining what is meant by 'repository') rather than 'make openly available on the Web'.

In doing so I think we do ourselves, and the long term future of open access, a disservice.

Addendum (10 Feb 2009): In light of the comments so far (see below) I confess that I stand partially corrected.  It is clear that Google is able to join together multiple copies of research papers.  I'd love to know the heuristics they use to do this and I'd love to know how successful those heuristics are in the general case.  Nonetheless, on the basis that they are doing it, and on the assumption that in doing so they also combine the Google juice associated with each copy, I accept that my "dispersion of Google-juice" argument above is somewhat weakened.

There are other considerations however, not least the fact that the Web Architecture explicitly argues against URI aliases:

Good practice: Avoiding URI aliases
A URI owner SHOULD NOT associate arbitrarily different URIs with the same resource.

The reasons given align very closely to the ones I gave above, though couched in more generic language:

Although there are benefits (such as naming flexibility) to URI aliases, there are also costs. URI aliases are harmful when they divide the Web of related resources. A corollary of Metcalfe's Principle (the "network effect") is that the value of a given resource can be measured by the number and value of other resources in its network neighborhood, that is, the resources that link to it.

The problem with aliases is that if half of the neighborhood points to one URI for a given resource, and the other half points to a second, different URI for that same resource, the neighborhood is divided. Not only is the aliased resource undervalued because of this split, the entire neighborhood of resources loses value because of the missing second-order relationships that should have existed among the referring resources by virtue of their references to the aliased resource.

Now, I think that some of the discussions around linked data are pushing at the boundaries of this guidance, particularly in the area of non-information resources.  Nonetheless, I think this is an area in which we have to tread carefully.  I stand by my original statement that we do not treat scholarly papers as though they are part of the fabric of the Web - we do not link between them in the way we link between other Web pages.  In almost all respects we treat them as bits of paper that happen to have been digitised and the culprits are PDF, the OAI-PMH, an over-emphasis on preservation and a collective lack of imagination about the potential transformative effect of the Web on scholarly communication.  We are tampering at the edges and the result is a mess.

February 06, 2009

Open orienteering

It seems to me that there is now quite a general acceptance of what the 'open access' movement is trying to achieve. I know that not everyone buys into that particular world-view but, for those of us that do, we know where we are headed and most of us will probably recognise it when we get there. Here, for example, is Yishay Mor writing to the open-science mailing list:

I would argue that there's a general principle to consider here. I hold that any data collected by public money should be made freely available to the public, for any use that contributes to the public good. Strikes me as a no-brainer, but of course - we have a long way to go.

A fairly straight-forward articulation of the open access position and a goal that I would thoroughly endorse.

The problem is that we don't always agree as a community about how best to get there.

I've been watching two debates flow past today, both showing some evidence of lack of consensus in the map reading department, though one much more long-standing than the other. Firstly, the old chestnut about the relative merits of central repositories vs. institutional repositories (initiated in part by Bernard Rentier's blog post, Institutional, thematic or centralised repositories?) but continued on various repository-related mailing lists (you know the ones!). Secondly, a newer debate about whether formal licences or community norms provide the best way to encourage the open sharing of research data by scientists and others, a debate which I tried to sum up in the following tweet:

@yishaym summary of open data debate... OD is good & needs to be encouraged - how best to do that? 1 licences (as per CC) or 2 social norms

It's great what can be done with 140 characters.

I'm more involved in the first than the second and therefore tend to feel more aggrieved at lack of what I consider to be sensible progress. In particular, I find the recurring refrain that we can join stuff back together using the OAI-PMH and therefore everything is going to be OK both tiresome and laughable.

If there's a problem here, and perhaps there isn't, then it is that the arguments and debates are taking place between people who ultimately want the same thing. I'm reminded of Monty Python's Life of Brian:

Brian: Excuse me. Are you the Judean People's Front?
Reg: Fuck off! We're the People's Front of Judea

It's like we all share the same religion but we disagree about which way to face while we are praying. Now, clearly, some level of debate is good. The point at which it becomes not good is when it blocks progress which is why, generally speaking, having made my repository-related architectural concerns known a while back, I try and resist the temptation to reiterate them too often.

Cameron Neylon has a nice summary of the licensing vs. norms debate on his blog. It's longer and more thoughtful than my tweet! This is a newer debate and I therefore feel more positive that it is able to go somewhere. My initial reaction was that a licensing approach is the most sensible way forward but having read through the discussion I'm no longer so sure.

So what's my point? I'm not sure really... but if I wake up in 4 years' time and the debate about licensing vs. norms is still raging, as has pretty much happened with the discussion around CRs vs. IRs, I'll be very disappointed.

February 05, 2009

httpRange-14, Cool URIs & FRBR

The W3C Technical Architecture Group's resolution of what had become known as "the httpRange-14 question" introduced a distinction between the subset of resources for which representations may be served using the HTTP protocol - a subset which the Architecture of the World Wide Web refers to as "information resources" - and, by implication at least, a disjoint subset of resources which may be identified using the http URI scheme but which are not "representable": no representations are provided for them using the HTTP protocol, though they may be described by "information resources" identified by their own distinct URIs.

A subsequent note by Leo Sauermann and Richard Cyganiak of the W3C Semantic Web Education and Outreach (SWEO) Interest Group, Cool URIs for the Semantic Web, provides an extended discussion of the issue, together with a set of "patterns" for assigning http URIs and for the appropriate responses to HTTP requests using such URIs. This document uses the terms "Web documents" and "real-world objects" to refer to the two classes of resources, noting that the latter class includes "real-world objects like people and cars, and even abstract ideas and non-existing things like a mythical unicorn".

The question raised by this division is where the boundary between the two classes lies. From the viewpoint of the consumer/user of URIs, the point is somewhat moot: they simply need to follow the information provided, in the form of responses to HTTP requests by the owner of the URI (or possibly also from metadata provided by other parties). Information about the nature of the resource can be provided both by HTTP response codes and by explicit descriptions of the resource. Following the httpRange-14 guideline, if the HTTP response to a GET is 2xx, then the resource identified by the URI is an information resource. I think it's worth emphasising the point that this is the only response code which allows the user to make a "positive" inference about resource type; if the response is 303 See Other, that in itself says nothing about the type of the resource.
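
To make that concrete, here's a rough sketch in Python of the only "positive" inference a client can draw from the status code alone. It is an illustration only: the URI is made up, and I'm using the requests library purely for convenience.

import requests

# A hypothetical URI - it might identify a document, a person or a FRBR Work.
uri = "http://example.org/id/some-resource"

# Don't follow redirects automatically: the 303 itself is the interesting signal.
response = requests.get(uri, allow_redirects=False)

if 200 <= response.status_code < 300:
    # Per the httpRange-14 resolution, a 2xx response means the URI
    # identifies an "information resource".
    print("Information resource; representation received.")
elif response.status_code == 303:
    # A 303 says nothing about the type of the resource; it simply points
    # at a description held at another URI.
    print("No inference about type; description at:", response.headers.get("Location"))
else:
    # Any other response leaves the question of the resource's type open.
    print("No positive inference from status", response.status_code)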

The URI owner, on the other hand, needs to make the choice, for each resource, whether to provide a representation or not, based on their understanding of the nature of the resources they are exposing on the Web. The Architecture of the World Wide Web document offers the following (somewhat "slippery", to me!) criterion for a resource being an "information resource": The distinguishing characteristic of these resources is that all of their essential characteristics can be conveyed in a message.

I've been trying to think through how this set of conventions should be applied to the case of the Functional Requirements for Bibliographic Records (FRBR) and more specifically to the "FRBR Group 1 Entities", i.e. instances of the classes of Work, Expression, Manifestation and Item which FRBR uses to model the universe of resources described by bibliographic records.

The work on the development of the Scholarly Works Application Profile (SWAP) focused primarily on deployment in the context of the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). OAI-PMH provides an RPC-like layer on top of HTTP, and SWAP focuses on how to deliver descriptions of the SWAP/FRBR entities using that RPC layer, rather than the question of how those entities could be represented as Web resources.

I'm starting from the FRBR model here; I'm asking the question, "If I'm exposing on the Web a set of resources based on the FRBR model, are there any general rules for which of these resources are 'representable'?". I'm not trying to address the broader question of whether/how the distinctions made in the Web Architecture reflect, or can be defined in terms of, the FRBR model.

Taking the "easy" cases first, FRBR defines a Work as follows:

The first entity defined in the model is work: a distinct intellectual or artistic creation.

A work is an abstract entity; there is no single material object one can point to as the work. We recognize the work through individual realizations or expressions of the work, but the work itself exists only in the commonality of content between and among the various expressions of the work.

- FRBR Section 3.2.1

It seems fairly clear from this description that a FRBR Work is a "conceptual resource", like an idea. In the terms of the "Cool URIs" document, it is a "real-world object", albeit an abstract one, and not a "Web document". And on this basis, while a FRBR Work may be identified by an http URI, an HTTP server should not return a representation and a 200 status code in response to a GET for that URI, though the server may provide access (using one of the patterns provided in the Cool URIs document) to a description of the Work, a "Web document" itself identified by a distinct URI.

A similar argument can, I think, be made for the case of the FRBR Expression:

An expression is the specific intellectual or artistic form that a work takes each time it is "realized." Expression encompasses, for example, the specific words, sentences, paragraphs, etc. that result from the realization of a work in the form of a text, or the particular sounds, phrasing, etc. resulting from the realization of a musical work. The boundaries of the entity expression are defined, however, so as to exclude aspects of physical form, such as typeface and page layout, that are not integral to the intellectual or artistic realization of the work as such.

- FRBR Section 3.2.2

Again we're dealing with an "abstraction", albeit a more "specific", less "generic" one than a Work. And on this basis, like the Work, it falls into the category of "real-world objects", and again, while an Expression may be identified by an http URI and an HTTP server may provide access to a description of an Expression, it should not provide a representation of an Expression.

In considering the other two FRBR Group 1 entity types, Manifestation and Item, it is perhaps easiest to consider the application of FRBR to physical resources and to digital resources separately.

Considering the physical world first, it is perhaps helpful to consider the Item first, as it seems to me it also sheds some light on the nature of the Manifestation. The FRBR definition of Item is very much grounded in the physical:

The entity defined as item is a concrete entity. It is in many instances a single physical object (e.g., a copy of a one-volume monograph, a single audio cassette, etc.). There are instances, however, where the entity defined as item comprises more than one physical object (e.g., a monograph issued as two separately bound volumes, a recording issued on three separate compact discs, etc.).

- FRBR Section 3.2.4

These Items are the "real world objects" which traditionally libraries have been concerned with managing (acquiring, storing, maintaining, providing access to, distributing, disposing of). From the perspective of httpRange-14 and the "Cool URIs" document, then, these "real-world objects" may be described by Web documents, but they are not themselves Web documents. So a physical Item may be identified by an http URI, and an HTTP server may provide access to a description of such an Item, but it can't provide a representation of it.

Now take the case of the Manifestation:

The third entity defined in the model is manifestation: the physical embodiment of an expression of a work.

The entity defined as manifestation encompasses a wide range of materials, including manuscripts, books, periodicals, maps, posters, sound recordings, films, video recordings, CD-ROMs, multimedia kits, etc. As an entity, manifestation represents all the physical objects that bear the same characteristics, in respect to both intellectual content and physical form.

- FRBR Section 3.2.3

Again, a Manifestation is concerned with physical form, but a Manifestation is still an abstraction: its role in the FRBR model is to capture characteristics that are true of a set of individual Items which "exemplify" that Manifestation (even in the case where a unique Item is the sole exemplar of a Manifestation). Seen in this light, then, I think a Manifestation also falls into the category of things which may be described by one or more Web documents, but is not itself a Web document.

In turning to the context of the digital world, I think it's worth highlighting that although the FRBR specification contains some references to "electronic resources", the coverage of digital resources in the text is very limited, and indeed the introduction acknowledges that "the dynamic nature of entities recorded in digital formats" is one of the areas that require further analysis.

It seems relatively straightforward to transfer the concepts of Work and Expression into the digital sphere, as they are independent of the form in which content is "embodied".

The question of what constitutes a FRBR Item in the digital domain is rather more difficult to pin down, particularly since the FRBR document itself focuses exclusively on the physical in its discussion of the Item. Ingbert Floyd and Allen Renear take on this challenge in their poster, "What Exactly is an Item in the Digital World?" (ASIST, 2007).

In the physical world, they argue, the thing which carries information is the same thing for which information managers typically describe characteristics such as provenance, condition, and access restrictions - the attributes of the Item in FRBR. In the digital world, this is no longer true: information is carried by the physical state of some component of a computer system, something the authors call an instance of "patterned matter and energy" (PME) - but information managers rarely concern themselves with managing such entities and recording their attributes. Entities such as a "file", however, are the focus of management and description - but a digital "file" isn't really the "concrete entity" that FRBR calls an Item. Two approaches to the Item are possible, then: the Item-as-PME approach, which "maintains that a fundamental aspect of being an item is being a concrete physical thing", or the Item-as-"file" approach which adopts the pragmatic position that "items are the things, whatever their nature (physical, abstract, or metaphorical), which play the role in bibliographic control that FRBR assigns to items".

The question I'm posing here is, I think, a different, and narrower, one than that grappled with by Floyd and Renear: if we are treating a FRBR Item as a Web resource, for the case of an exemplar of a Manifestation in digital format, is that resource an "information resource", for which a representation can be served? From the Web Architecture perspective, it seems to me that it is: "all of their essential characteristics can be conveyed in a message". The Scholarly Works Application Profile takes this approach: the copy of a PDF document available from an institutional repository server, or the copy of an mp3 file constituting an episode of a podcast, is the FRBR Item. These, after all, are the things which "play the role in bibliographic control that FRBR assigns to items".

A further issue here is that FRBR lists "Access Address (URL)" as an attribute of a Manifestation, rather than of an Item, and I'm not sure whether this is compatible with the SWAP approach.

The concept of Manifestation in the digital case seems the most difficult to categorise. On the one hand, as noted above, FRBR states that a Manifestation is an abstraction corresponding to a set of objects with the same characteristics of both form and content. On the other hand, it seems to me that one could argue that for Manifestations in digital form, it is true that "all of their essential characteristics can be conveyed in a message", since the notion of Manifestation encapsulates that of specific intellectual content "embodied" in a specific form. For consistency with the physical case, I guess the former would be best, but I'm not sure on this point.

So those rather lengthy musings might suggest the following (tentative, I hasten to add... I'm mostly just trying to think through my rationale here) approach to identifying and serving representations/descriptions of the FRBR entities, at least using the approach that SWAP takes to the Item:

Entity Type - HTTP Behaviour

  • Work: identify using an http URI; provide a description of the Work - either use a "hash URI", or respond to a GET with 303 See Other and the http URI of the description (generic, or selected using content negotiation).
  • Expression: identify using an http URI; provide a description of the Expression - either use a "hash URI", or respond to a GET with 303 See Other and the http URI of the description (generic, or selected using content negotiation).
  • Manifestation (physical): identify using an http URI; provide a description of the Manifestation - either use a "hash URI", or respond to a GET with 303 See Other and the http URI of the description (generic, or selected using content negotiation).
  • Manifestation (digital): identify using an http URI; provide a description of the Manifestation - either use a "hash URI", or respond to a GET with 303 See Other and the http URI of the description (generic, or selected using content negotiation).
  • Item (physical): identify using an http URI; provide a description of the Item - either use a "hash URI", or respond to a GET with 303 See Other and the http URI of the description (generic, or selected using content negotiation).
  • Item (digital): identify using an http URI; provide a representation of the Item (respond to a GET with 200 and a representation).
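
As a sanity check on that list, here is a minimal sketch of the server-side behaviour it implies - Python standard library only, with an entirely made-up URI space, so very much an illustration of the pattern rather than a recipe:

from wsgiref.simple_server import make_server

# Hypothetical URI space: everything except the digital Item is treated as a
# "real-world object" and gets a 303 redirect to a description of itself.
NON_INFORMATION_PATHS = {
    "/id/work/1",            # FRBR Work
    "/id/expression/1",      # FRBR Expression
    "/id/manifestation/1",   # FRBR Manifestation (physical or digital)
    "/id/item/1-physical",   # physical FRBR Item
}

# A digital Item: serve a representation directly with a 200.
DIGITAL_ITEMS = {
    "/item/1-digital.pdf": (b"%PDF-1.4 ...", "application/pdf"),
}

def app(environ, start_response):
    path = environ["PATH_INFO"]
    if path in NON_INFORMATION_PATHS:
        # The 303 pattern from the Cool URIs note: redirect to a description,
        # itself a distinct "Web document" with its own URI.
        start_response("303 See Other", [("Location", "/doc" + path)])
        return [b""]
    if path in DIGITAL_ITEMS:
        body, media_type = DIGITAL_ITEMS[path]
        start_response("200 OK", [("Content-Type", media_type)])
        return [body]
    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"Not found"]

if __name__ == "__main__":
    make_server("", 8000, app).serve_forever()

(The descriptions themselves, at the /doc/... URIs, would be served as ordinary 200 responses; I've left them out for brevity.)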

One final point.... The use of HTTP content negotiation on the Web introduces a dimension which I'm not sure sits very easily within the FRBR model. Using content negotiation, I may decide to expose a single resource on the Web, using a single URI, but configure my server so that, at any point in time, depending on a range of factors (the preferences of the client, the IP address of the client, etc.) it returns different representations of that resource - representations which may vary by (amongst other things) media type (format) or language. From the FRBR perspective, such variations would, I think, result in the creation of different Manifestations (for the media type case) or even different Expressions (for the language case). In the SWAP case, I think the implication is that Item representations should not vary, at least by media type or language.
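
As a crude illustration of that tension (the file names below are made up), content negotiation boils down to something like this on the server side - and each of the alternatives being chosen between would, on the reading above, correspond to a distinct FRBR Manifestation:

# Two alternative representations of "the same" resource, selected by media type.
ALTERNATIVES = {
    "application/pdf": "paper.pdf",
    "text/html": "paper.html",
}

def choose_representation(accept_header):
    # Very crude Accept handling: the first listed type we can satisfy wins
    # (a real implementation would honour q-values properly).
    for candidate in accept_header.split(","):
        media_type = candidate.split(";")[0].strip()
        if media_type in ALTERNATIVES:
            return ALTERNATIVES[media_type]
    return None

print(choose_representation("text/html,application/xhtml+xml;q=0.9"))  # -> paper.html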

December 18, 2008

JISC IE and e-Research Call briefing day

I attended the briefing day for the JISC's Information Environment and e-Research Call in London on Monday and my live-blogged notes are available on eFoundations LiveWire for anyone that is interested in my take on what was said.

Quite an interesting day overall but I was slightly surprised at the lack of name badges and a printed delegate list, especially given that this event brought together people from two previously separate areas of activity. Oh well, a delegate list is promised at some point.  I also sensed a certain lack of buzz around the event - I mean there's almost £11m being made available here, yet nobody seemed that excited about it, at least in comparison with the OER meeting held as part of the CETIS conference a few weeks back.  At that meeting there seemed to be a real sense that the money being made available was going to result in a real change of mindset within the community.  I accept that this is essentially second-phase money, building on top of what has gone before, but surely it should be generating a significant sense of momentum or something... shouldn't it?

A couple of people asked me why I was attending given that Eduserv isn't entitled to bid directly for this money and now that we're more commonly associated with giving grant money away rather than bidding for it ourselves.

The short answer is that this call is in an area that is of growing interest to Eduserv, not least because of the development effort we are putting into our new data centre capability.  It's also about us becoming better engaged with the community in this area.  So... what could we offer as part of a project team? Three things really: 

  • Firstly, we'd be very interested in talking to people about sustainable hosting models for services and content in the context of this call.
  • Secondly, software development effort, particularly around integration with Web 2.0 services.
  • Thirdly, significant expertise in both Semantic Web technologies (e.g. RDF, Dublin Core and ORE) and identity standards (e.g. Shibboleth and OpenID).

If you are interested in talking any of this thru further, please get in touch.

November 28, 2008

SWORD Facebook application & "social deposit"

Last week, Stuart Lewis of Aberystwyth University announced the availability of his Facebook repository deposit application, which makes use of the SWORD AtomPub profile. Stuart's post appeared just a day before a post by Les Carr in which he includes a presentation on "leveraging" the value of items once they are in a repository, by providing "feeds" of various flavours and/or supporting the embedding of deposited items in other externally-created items.
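
For anyone who hasn't looked under the SWORD bonnet, the deposit itself is "just" an AtomPub POST of the file to a collection URI taken from the repository's SWORD service document. The following is a rough sketch only - the endpoint, credentials and packaging identifier are made up, and the X- headers reflect my reading of the SWORD 1.x profile, so check the profile before relying on any of it:

import requests

# Hypothetical collection URI, advertised in the repository's SWORD service document.
collection = "https://repository.example.org/sword/deposit/collection-1"

with open("paper.pdf", "rb") as f:
    response = requests.post(
        collection,
        data=f,
        auth=("depositor", "secret"),  # made-up credentials
        headers={
            "Content-Type": "application/pdf",
            "Content-Disposition": "attachment; filename=paper.pdf",
            # SWORD 1.x extension headers - the values here are illustrative only.
            "X-Packaging": "http://purl.org/net/sword-types/METSDSpaceSIP",
            "X-On-Behalf-Of": "author@example.org",
        },
    )

# A successful deposit returns an Atom entry describing the newly created item.
print(response.status_code, response.headers.get("Location"))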

Stuart describes the SWORD Facebook application as enabling what he calls "social deposit":

Being able to deposit from within a site such as Facebook would enable what I’m going to call the Social Deposit. What does a social deposit look like? Well, it has the following characteristics:

  • It takes place within a social networking type site such as Facebook.
  • The deposit is performed by the author of a work, not a third party.
  • Once the deposit has taken place, messages and updates are provided stating that the user has performed the deposit.
  • Friends and colleagues of the depositor will see that a deposit has taken place, and can read what has been deposited if they want to.
  • Friends and colleagues of the depositor can comment on the deposit.

So the social deposit takes place within the online social surroundings of a depositor, rather than from within a repository. By doing so, the depositor can leverage the power of their social networks so that their friends and colleagues can be informed about the deposit.

It occurred to me it would be interesting to compare the approach Stuart has taken in the SWORD Facebook app with the approach taken in "deposit" tools typically used with - highly "social" - "repositories" like Flickr (e.g. the Flickr Uploadr client) or the approach sometimes used with weblogs (e.g. blogging clients like Windows Live Writer).

The actions of posting images to my Flickr collection or posting entries to my weblog are both "deposit" actions to my "repositories". As a result of that "deposit", the availability of my newly deposited resources - my images, my weblog posts - is "notified" to members of my various "social network(s)", either through some mechanism internal to the target system or (as Les's presentation illustrates) through approaches based on feeds "out of" the repository:

  • my "internal-to-Flickr" network of Flickr contacts;
  • the network of people who aren't my Flickr contacts but subscribe to my personal Flickr feed, or to tag-based or group-based Flickr feeds I add to;
  • the network of people who subscribe to my weblog feed, or to one of my pull-my-stuff-together aggregation feeds.

And so on....

The point I wanted to highlight here is - as Stuart notes above - that the "social" aspect isn't directly associated with the "deposit" action: the Flickr uploader (AFAIK) doesn't interact with my Flickr contact list to ping my contacts; Windows Live Writer doesn't know anything about who out there in the blogosphere has subscribed to my weblog. Using these tools, deposit itself is an "individual" rather than a "social" action, if you like. Rather, the social aspect is supported from the "output"/"publication" features of the repository.

In contrast, if I understand Stuart's description of the Facebook deposit app correctly, the "social" dimension here is based on the context of the "deposit" action. Here, the "deposit" tool - Stuart's Fb app - is "socially aware", in the sense that it, rather than the target repository, is responsible for creating notifications in a feed - and the readership of that feed is shaped by the context of the deposit action rather than by the context of "publication": it's my network of Fb friends who see the notifications, not my network of Flickr contacts.

Though of course it may be that the repository I target using the Fb deposit app also enables all the sort of personal-/tag-/group-based output feed functionality I describe above for the Flickr/weblog cases. And I may well take my personal repository feed and "pipe it in to" a social network service - if I still bothered with Facebook (I don't, but that's another story!), I might be using a Flickr Fb app or a weblog app to add notifications to my Fb news feed! So these scenarios aren't exclusive, by any means.

I'm not sure I have any real conclusions here, tbh, and just to be clear, I certainly don't mean to sound negative about the development. Quite the contrary, it provides a very vivid example of how the different aspects of repository use can straddle different application contexts and how the SWORD protocol can be deployed within those different contexts. I think it also provides an illustration of Paul Walk's point about separating out some of our repository concerns (though I note that Paul's model does see the "source repository" as a provider of feeds).

It's certainly worth exploring the different dimensions of the "sociality" of the two approaches. I guess I'm arguing that (to me) "social deposit" isn't a substitute for the socialness that comes with the sort of "output" features Les describes - but it may well turn out to be a useful complement.

November 14, 2008

On sharing...

Great post from Scott Leslie on EdTechPost, Planning to share versus just sharing, about why institutional approaches to sharing so often fail.  The post is primarily about initiatives around sharing learning content but my suspicion is that it applies much more widely and (I think) endorses a lot of the things I've been saying about needing to understand, and play to, the real social networks that researchers use when we are thinking about repositories.

Here are a couple of quotes:

...grow your network by sharing, not planning to share or deciding who to share with; the tech doesn’t determine the sharing - if you want to share, you will; weave your network by sharing what you can, and they will share what they can - people won’t share [without a lot of added incentives] stuff that’s not easy or compelling for them to share. Create virtuous cycles that amplify network effects. Given the right ’set,’ simple tech is all they need to get started.

Talking about traditional institutional approaches to sharing, he says:

In my experience, a ton of time goes into defining ahead of time what is to be shared. Often with little thought to whether it’s actually something that is easy for them to share. And always, because its done ahead of time, with the assumption that it will be value, not because someone is asking for it, right then, with a burning need. Maybe I’m being too harsh, but my experience over a decade consulting and working on these kinds of projects is that I’m not. Someone always thinks that defining these terms ahead of time is a good idea. And my experience is that you then get people not sharing very much, because to do so takes extra effort, and that what does get shared doesn’t actually get used, because despite what they said while they were sitting in the requirements gathering sessions, they didn’t actually know what the compelling need was, it just sounded like a good idea at the time.

Furthermore:

The institutional approach, in my experience, is driven by people who will end up not being the ones doing the actual sharing nor producing what is to be shared. They might have the need, but they are acting on behalf of some larger entity.

And:

...much time goes into finding the right single “platform” to collaborate in (and somehow it always ends up to blame - too clunky, too this, too that.) And because typically the needs for the platform have been defined by the collective’s/collaboration’s needs, and not each of the individual users/institutions, what results is a central “bucket” that people are reluctant to contribute to, that is secondary to their ‘normal’ workflow, and that results in at least some of the motivation (of getting some credit, because even those of us who give things away still like to enjoy some recognition) being diminshed. And again, in my experience, in not a whole lot of sharing going on.

Is this stuff ringing any repository bells for people?

November 07, 2008

Some (more) thoughts on repositories

I attended a meeting of the JISC Repositories and Preservation Advisory Group (RPAG) in London a couple of weeks ago.  Part of my reason for attending was to respond (semi-formally) to the proposals being put forward by Rachel Heery in her update to the original Repositories Roadmap that we jointly authored back in April 2006.

It would be unfair (and inappropriate) for me to share any of the detail in my comments since the update isn't yet public (and I suppose may never be made so).  So other than saying that I think that, generally speaking, the update is a step in the right direction, what I want to do here is rehearse the points I made which are applicable to the repositories landscape as I see it more generally.  To be honest, I only had 5 minutes in which to make my comments in the meeting, so there wasn't a lot of room for detail in any case!

Broadly speaking, I think three points are worth making.  (With the exception of the first, these will come as no surprise to regular readers of this blog.)

Metrics

There may well be some disagreement about this but it seems to me that the collection of material we are trying to put into institutional repositories of scholarly research publications is a reasonably well understood and measurable corpus.  It strikes me as odd therefore that the metrics we tend to use to measure progress in this space are very general and uninformative.  Numbers of institutions with a repository for example - or numbers of papers with full text.  We set targets for ourselves like, "a high percentage of newly published UK scholarly output [will be] made available on an open access basis" (a direct quote from the original roadmap).  We don't set targets like, "80% of newly published UK peer-reviewed research papers will be made available on an open access basis" - a more useful and concrete objective.

As a result, we have little or no real way of knowing if we are actually making significant progress towards our goals.  We get a vague feel for what is happening but it is difficult to determine if we are really succeeding.

Clearly, I am ignoring learning object repositories and repositories of research data here because those areas are significantly harder, probably impossible, to measure in percentage terms.  In passing, I suggest that the issues around learning object repositories, certainly the softer issues like what motivates people to deposit, are so totally different from those around research repositories that it makes no sense to consider them in the same space anyway.

Even if the total number of published UK peer-reviewed research papers is indeed hard to determine, it seems to me that we ought to be able to reach some kind of suitable agreement about how we would estimate it for the purposes of repository metrics.  Or we could base our measurements on some agreed sub-set of all scholarly output - the peer-reviewed research papers submitted to the current RAE (or forthcoming REF) for example.

A glass half empty view of the world says that by giving ourselves concrete objectives we are setting ourselves up for failure.  Maybe... though I prefer the glass half full view that we are setting ourselves up for success.  Whatever... failure isn't really failure - it's just a convenient way of partitioning off those activities that aren't worth pursuing (for whatever reason) so that other things can be focused on more fully.  Without concrete metrics it is much harder to make those kinds of decisions.

The other issue around metrics is that if the goal is open access (which I think it is), as opposed to full repositories (which are just a means to an end) then our metrics should be couched in terms of that goal.  (Note that, for me at least, open access implies both good management and long-term preservation and that repositories are only one way of achieving that).

The bottom-line question is, "what does success in the repository space actually look like?".  My worry is that we are scared of the answers.  Perhaps the real problem here is that 'failure' isn't an option?

Executive summary: our success metrics around research publications should be based on a percentage of the newly published peer-reviewed literature (or some suitable subset thereof) being made available on an open access basis (irrespective of how that is achieved).

Emphasis on individuals

Across the board we are seeing a growing emphasis on the individual, on user-centricity and on personalisation (in its widest sense).  Personal Learning Environments, Personal Research Environments and the suite of 'open stack' standards around OpenID are good examples of this trend.  Yet in the repository space we still tend to focus most on institutional wants and needs.  I've characterised this in the past in terms of us needing to acknowledge and play to the real-world social networks adopted by researchers.  As long as our emphasis remains on the institution we are unlikely to bring much change to individual research practice.

Executive summary: we need to put the needs of individuals before the needs of institutions in terms of how we think about reaching open access nirvana.

Fit with the Web

I've written and spoken a lot about this in the past and don't want to simply rehash old arguments.  That said, I think three things are worth emphasising:

Concentration

Global discipline-based repositories are more successful at attracting content than institutional repositories.  I can say that with only minimal fear of contradiction because our metrics are so poor - see above :-).  This is no surprise.  It's exactly what I'd expect to see.  Successful services on the Web tend to be globally concentrated (as that term is defined by Lorcan Dempsey) because social networks tend not to follow regional or organisational boundaries any more.

Executive summary: we need to work out how to take advantage of global concentration more fully in the repository space.

Web architecture

Take three guiding documents - the Web Architecture itself, REST, and the principles of linked data.  Apply liberally to the content you have at hand - repository content in our case.  Sit back and relax. 

Executive summary: we need to treat repositories more like Web sites and less like repositories.

Resource discovery

On the Web, the discovery of textual material is based on full-text indexing and link analysis.  In repositories, it is based on metadata and pre-Web forms of citation.  One approach works, the other doesn't.  (Hint: I no longer believe in metadata as it is currently used in repositories).  Why the difference?  Because repositories of research publications are library-centric and the library world is paper-centric - oh, and there's the minor issue of a few hundred years of inertia to overcome.  That's the only explanation I can give anyway.  (And yes, since you ask... I was part of the recent movement that got us into this mess!). 

Executive summary: we need to 1) make sure that repository content is exposed to mainstream Web search engines in Web-friendly formats and 2) make academic citation more Web-friendly so that people can discover repository content using everyday tools like Google.

Simple huh?!  No, thought not...

I realise that most of what I say above has been written (by me) on previous occasions in this blog.  I also strongly suspect that variants of this blog entry will continue to appear here for some time to come.

October 21, 2008

ORE 1.0 published

I'm pleased to note that, at the end of last week, Carl Lagoze and Herbert Van de Sompel announced the publication of version 1.0 of the OAI ORE specifications. I was travelling for most of the week, and had very little time to keep up with email, so the last minute dotting of i's and crossing of t's fell to the other editors and I'm grateful for their efforts in pulling things together.

(Of course, we're already noticing various minor things which need correcting!)

I think the main changes from the previous (0.9) release are:

As it happened, I was talking about ORE in a presentation last week (more on that in a follow-up post) and I expressed the opinion then that, leaving aside for a moment the core ORE model of Aggregations and Aggregated Resources, I think one of the significant contributions of ORE may turn out to be its emphasis on what I think of as a "resource-centric" approach and (at least some of) the conventions of the Semantic Web and "Linked Data" communities. In particular, I think this is a potentially important change for the "Open Archives"/"eprint repository" community, where to a large extent - not entirely, but to a large extent - repository developments on the Web have been conditioned by the more "service-oriented" framework of the OAI-PMH protocol and an emphasis on XML and XML Schema. It's also probably fair to say that I don't think the ORE project really started from this perspective, but rather things evolved and shifted - perhaps not always in a straight line! - in this direction as the work proceeded.

The ORE model itself is quite general in nature, and, as Herbert acknowledges in a presentation here (a nice set of slides which provides a good overview in itself, I think), it's not easy to predict how ORE might be applied: a number of experimental/test applications are noted in that presentation, but many others are possible. For my own part, I'm particularly interested in seeing how/whether ORE can be used in association with other models, like FRBR.

September 30, 2008

Open Science

Via Richard Akerman on Science Library Pad I note that a presentation made to a British Library Board awayday (on 23rd Sept), The Future of Research (Science and Technology), by Carole Goble is now available on Slideshare:

The presentation looks at the way in which scientific and technology-related research is changing, particularly thru the use of the Web to support open, data-driven research - essentially enabling a more immediate, transparent and repeatable approach to science.

The ideas around open science are interesting.  Coincidentally, a few Eduserv bods met with Cameron Neylon yesterday and he talked us thru some of the work going on around blog-driven open labbooks and the like.  Good stuff.  Whatever one thinks about the success or otherwise of institutional repositories as an agent of change in scholarly communication there seems little doubt that the 'open' movement is where things are headed because it is such a strong enabler of collaboration and communication.

Slide 24 of the presentation above introduces the notion that open "methods are scientific commodities".  Obvious really, but something I hadn't really thought about.  I note that there seem to be some potential overlaps here with the approaches to sharing pedagogy between lecturers/teachers enabled by standards such as Learning Design - "pedagogies as learning commodities" perhaps? - though I remain somewhat worried about how complex these kinds of things can get in terms of mark-up languages.

The presentation ends with some thoughts about the impact that this new user-centric (scientist-centric) world of personal research environments has on libraries:

  • We don’t come to the library, it comes to us.
  • We don’t use just one library or one source.
  • We don’t use just one tool!
  • Library services embedded in our toolkits, workbenches, browsers, authoring tools.

I find the closing scenario (slide 67) somewhat contrived:

Prior to leaving home Paul, a Manchester graduate student, syncs his iPhone with the latest papers, delivered overnight by the library via a news syndication feed. On the bus he reviews the stream, selecting a paper close to his interest in HIV-1 proteases. The data shows apparent anomalies with his own work, and the method, an automated script, looks suspect. Being on-line he notices that a colleague in Madrid has also discovered the same paper through a blog discussion and they Instant Message, annotating the results together. By the time the bus stops he has recomputed the results, proven the anomaly, made a rebuttal in the form of a pubcast to the Journal Editor, sent it to the journal and annotated the article with a comment and the pubcast. [Based on an original idea by Phil Bourne]

If nothing else, it is missing any reference to Twitter (see the MarsPhoenix Twitter feed for example) and Second Life! :-).  That said, there is no doubt that the times they are a'changing.

My advice?  You'd better start swimming or you'll sink like a stone :-)

September 18, 2008

Worlds apart together

Sometimes things just seem to come together in odd ways!

Take this afternoon for example...

On the one hand, the jisc-repositories mailing list came briefly to life with a discussion about the legality of storing images of people without having explicitly gained their permission.  A variety of viewpoints came forth, both for and against, which I would broadly categorise (very unfairly!) as common sense vs. legal sense.

Meanwhile, at almost exactly the same time in another corner of the universe, James Clay was waving his mobile phone/video camera around indiscriminately during question time at the MoLeNET conference, broadcasting all and sundry live to qik.com and challenging (in quite an "in your face" way) the assembled panel to comment on the impact of mobile technology on the delivery of learning in FE.

The sound isn't brilliant throughout, but it's worth watching.

I don't know what point I'm making here other than to note the obvious - that nothing is straight-forward and that the 'net continues to change, and change us, in quite fundamental ways.

August 20, 2008

Directory of repository-related blogs

The JISC-funded Repositories Support Project has developed quite a nice list of repository-related blogs (and other RSS feeds).

Worth taking a look and suggesting additional feeds if they are missing.

They provide an OPML file which means that everything listed here should be aggregatable (is that a word!?) but I had a quick go using Yahoo Pipes and failed miserably I'm afraid to say.  Not sure if that is my fault or not but I seem to recall having problems before with large OPML files in Yahoo Pipes so perhaps there is some built-in limitation?
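
For what it's worth, a few lines of Python get most of the way there without Pipes. This is just a sketch: it assumes the feedparser library is installed, and the OPML file name is made up.

import xml.etree.ElementTree as ET

import feedparser  # third-party library: pip install feedparser

# Pull the feed URLs out of the OPML file (the file name is hypothetical) and
# print the most recent item from each feed - a crude stand-in for an aggregator.
tree = ET.parse("rsp-repository-blogs.opml")
feed_urls = [
    outline.get("xmlUrl")
    for outline in tree.iter("outline")
    if outline.get("xmlUrl")
]

for url in feed_urls:
    feed = feedparser.parse(url)
    if feed.entries:
        print(feed.feed.get("title", url), "-", feed.entries[0].title)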

August 05, 2008

ORE and Atom

At the end of last week, Herbert Van de Sompel posted an important proposal to the OAI ORE Google Group, suggesting significant changes in the way ORE Resource Maps are represented using Atom.

The proposal has two key components:

  • To express an ORE Aggregation at the level of an Atom Entry, rather than (as in the current draft) at the level of an Atom Feed
  • To convey ORE-specific relationship types using add-ons/extensions, rather than by making ORE-specific interpretations of pre-existing Atom relationship types

There are some details still to be worked out, particularly on the second point, and, especially given that this is a significant change at quite a late stage in the development of the specifications, the project is looking for feedback on the proposal.
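
To make the first point a little more concrete, here's a rough sketch (in Python, with made-up URIs) of what an entry-level Aggregation might look like. The ORE terms namespace is the published one, but the choice of link relation is my guess at the intent of the proposal, so treat it as illustrative rather than definitive.

import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
ORE = "http://www.openarchives.org/ore/terms/"  # ORE terms namespace

# One Atom entry per Aggregation (rather than one feed), with the aggregated
# resources carried as typed links on the entry.
entry = ET.Element("{%s}entry" % ATOM)
ET.SubElement(entry, "{%s}id" % ATOM).text = "http://example.org/rem/aggregation-1"
ET.SubElement(entry, "{%s}title" % ATOM).text = "An example Aggregation"

for resource in ("http://example.org/aggregation-1/paper.pdf",
                 "http://example.org/aggregation-1/dataset.csv"):
    # The relation name below is an assumption on my part - check the spec.
    ET.SubElement(entry, "{%s}link" % ATOM, rel=ORE + "aggregates", href=resource)

print(ET.tostring(entry, encoding="unicode"))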

If possible, please respond to the OAI ORE Google Group, rather than by commenting here :-)

July 18, 2008

Does metadata matter?

This is a 30 minute slidecast (using 130 slides), based on a seminar I gave to Eduserv staff yesterday lunchtime.  It tries to cover a broad sweep of history from library cataloguing, thru the Dublin Core, Web search engines, IEEE LOM, the Semantic Web, arXiv, institutional repositories and more.

It's not comprehensive - so it will probably be easy to pick holes in if you so choose - but how could it be in 30 minutes?!

The focus is ultimately on why Eduserv should be interested in 'metadata' (and surrounding areas), to a certain extent trying to justify why the Foundation continues to have a significant interest in this area.  To be honest, it's probably weakest in its conclusions about whether, or why, Eduserv should retain that interest in the context of the charitable services that we might offer to the higher education community.

Nonetheless, I hope it is of interest (and value) to people.  I'd be interested to know what you think.

As an aside, I found that the Slideshare slidecast editing facility was mostly pretty good (this is the first time I've used it), but that it seemed to struggle a little with the very large number of slides and the quickness of some of the transitions.

AtomPub Video Tutorial

From Joe Gregorio of Google, a short video introduction to the Atom Publishing Protocol (RFC 5023):

Which, following Tim Bray's exhortation, I shall henceforth refer to only as "AtomPub".

June 16, 2008

Web 2.0 and repositories - have we got our repository architecture right?

For the record... this is the presentation I gave at the Talis Xiphos meeting last week, though to be honest, with around 1000 Slideshare views in the first couple of days (presumably thanks to a blog entry by Lorcan Dempsey and it being 'featured' by the Slideshare team) I guess that most people who want to see it will have done so already:

Some of my more recent presentations have followed the trend towards a more "picture-rich, text-poor" style of presentation slides.  For this presentation, I went back towards a more text-centric approach - largely because that makes the presentation much more useful to those people who only get to 'see' it on Slideshare and it leads to a more useful slideshow transcript (as generated automatically by Slideshare).

As always, I had good intentions around turning it into a slidecast but it hasn't happened yet, and may never happen to be honest.  If it does, you'll be the first to know ;-) ...

After I'd finished the talk on the day there was some time for Q&A.  Carsten Ulrich (one of the other speakers) asked the opening question, saying something along the lines of, "Thanks for the presentation - I didn't understand a word you were saying until slide 11".  Well, it got a good laugh :-).  But the point was a serious one... Carsten admitted that he had never really understood the point of services like arXiv until I said it was about "making content available on the Web".

OK, it's a sample of one... but this endorses the point I was making in the early part of the talk - that the language we use around repositories simply does not make sense to ordinary people and that we need to try harder to speak their language.

June 10, 2008

Talis Xiphos Research Day

This blog entry was used to host a live blog report from the Talis Xiphos Research Day, held at the Talis offices outside Birmingham on 10 June 2008.

Note that this page, and the live blog it contains, has been edited after the event to correct minor typos and so on.  However, no substantial changes have been made.

The day's agenda was:

09:00 Coffee & Registration
09:30 Welcome & Introduction
09:45 Peter Murray-Rust, University of Cambridge - Data-Driven research.
10:30 Andy Powell, Eduserv - Web 2.0 and repositories - have we got our repository architecture right?
11:15 Coffee
11:30 Carsten Ulrich, Shanghai Jiao Tong University - Why Web 2.0 is Good for Learning and for Research: Principles and Prototypes.
12:15 Alan Masson, Senior Lecturer in Learning Technologies, University of Ulster - Formalising the informal - using a Hybrid Learning Model to Describe Learning Practices.
13:00 Lunch
13:45 Chris Clarke, Talis - Project Xulu - Creating a Social Network from a Web of Scholarly Data.
14:15 Alan Trow-Poole and Ian Corns, Talis - Project Zephyr: Letting Students Weave Their Own Path.
14:45 Attendee discussion and feedback
16:00 Close

As you may note from the live blog transcript, there was a small modification to the afternoon agenda, adding Nadeem Shabir (Talis) at the end of the programme.

Clearly, I was unable to live blog myself.  Rob Styles and Nadeem Shabir added comments to the live blogging system as my talk progressed.  Unfortunately, comments in the Coveritlive system that I was using needed to be approved by the live blog author in real time and I was not available to do so, meaning that their comments only became visible after I had finished my talk and had started live blogging the next speaker.  Apologies that things go slightly out of order at that point.

Overall, the day was very enjoyable and interesting, with talks covering a range of topics relevant to learning and research.

Talis Xiphos Research Day (06/10/2008)
Powered by: CoveritLive
9:36
Andy Powell -  ok, Peter Murray Rust (PMR) is just starting...
9:37
[Comment From AdrianStevenson]
Don't suppose there's any audio feed?
9:38
Andy Powell -  @ade no, sorry, no audio - i may try setting something up later, but too rushed now with my talk coming up next
9:40
Andy Powell -  PMR saying that publishers are primarily about preventing access to content - then asking if there are any publishers in the room - which there are! lol
9:40
Adrian Stevenson -  @AndyPowell OK. Shame I didn't spot this event. as would have attended An audio recording at least would be great. Slides anywhere?
9:40
Andy Powell -  my slides are at http://tinyurl.com/4fehq8
9:40
Adrian Stevenson -  cheers
9:40
Andy Powell -  not sure about others
9:41
Andy Powell -  PMR not using slides as such
9:42
Andy Powell -  PMR currently outlining topic of his talk - open data, repositories, etheses, semantic data, science commons, ...
9:42
Andy Powell -  openNoteBook science
9:44
Andy Powell -  showing graph of atmospheric carbon dioxide
9:45
Andy Powell -  arguing that ability to share digital copies of data and analysis (a graph in this case) very important
9:46
Andy Powell -  scientific publication = discourse (human readable), embedded facts, etc. but use of PDF prevents machine re-use
9:47
Adrian Stevenson -  Cheers for this micro blogging btw Andy. Don't feel obliged to do it all day!
9:47
Andy Powell -  i find it helpful for me :-)
9:47
[Comment From Paul Miller]
Presentations should go up at www.talis.com/xiphos/events shortly.
9:49
Andy Powell -  in chemistry, some data available thru some chemical databases (repositories) but embedded chemical formula, etc. only available in human-readable form
9:49
Adrian Stevenson -  Ta Paul
9:50
Andy Powell -  patent office is looking at sharing semantic data
9:50
Andy Powell -  this disrupts gatekeepers who currently re-purpose non-semantic data for industry
9:51
Andy Powell -  gatekeepers currently quite powerful
9:52
Andy Powell -  talking about hangovers from the paper age of publishing - from the victorian age - still affecting the way publishing happens today - e.g. publishers asking for graphs to be removed from papers to save "space"
9:53
Andy Powell -  talking about the importance of publishing data as well as results - so that results can be verified, re-tested, etc.
9:53
Adrian Stevenson -  Anybody there able to cover Andy's talk? Unless he's going to do both ..
9:54
Andy Powell -  now showing 'real' scientists at work - journal of visualized experiments www.jove.com
9:55
Andy Powell -  reports of experiments using video
9:55
Andy Powell -  and text explanations
9:57
Andy Powell -  noting importance of publishing science in sufficient detail that it can be repeated
9:57
Andy Powell -  big science epitomises the data deluge and data is well supported
9:58
Andy Powell -  long-tail science also generates lots of data
9:59
Andy Powell -  err, no, i aint going to do both - but i'm just saying the same old stuff :-) repositories, yada yada, web 2.0 yada yada, semantic web yada yada - that kind of thing
10:00
Andy Powell -  most science done in the long tail - small lab work and so on
10:00
Andy Powell -  how do we deal with the long tail of science - not currently well catered for
10:00
Andy Powell -  repositories don't currently meet the needs of long tail science
10:01
Andy Powell -  paul might cover my bit - paul, you can comment without moderation - hint hint
10:02
Andy Powell -  domain repositories currently cater better for long tail science - you don't use your institutional repository
10:02
Andy Powell -  30% of scientists have lost their data at some point
10:03
Andy Powell -  showing a typical thesis
10:03
Andy Powell -  graphs, molecules, equations, ...
10:04
Andy Powell -  PDF == reading with boxing gloves on
10:04
Andy Powell -  can't get at the interesting data that is embedded in the document
10:04
Andy Powell -  showing robot (called oscar) that reads word files and extracts interesting data
10:05
Andy Powell -  xml exposed by word much better than pdf in terms of processability
10:06
Andy Powell -  table constructed automatically from text of document
10:07
Andy Powell -  can process large part of the chemical literature using this kind of technology

10:07
Andy Powell -  but... copyright (to publishers) prevents re-use
10:08
[Comment From PeteJ]
This sounds similar content to PM-R's OR08 keynote session. Not that that's bad thing. Just sayin'.
10:09
Andy Powell -  demoing another tool called CrystalEye - wwmm.ch.cam.ac.uk/crystaleye (I think)
10:10
Andy Powell -  grabs ToC for latest issue of journal and then draws molecules that are not in the original publication based on info extracted
10:10
PeteJ -  Memories of fierce MSWord v PDF arguments between text miners & preservationists @ OR08
10:11
Andy Powell -  PMRs robots could go thru whole chemical literature and open it up - but strong business/publisher lobby prevents it
10:11
Andy Powell -  now showing what can be done with theses
10:13
Andy Powell -  chemistry "not much fun to read in bed"! lol
10:13
Andy Powell -  you heard it here first
10:14
Andy Powell -  showing more automatic analysis of research data (based on RDF data I think)
10:15
Andy Powell -  i'm getting a bit lost here - talking about chemical data which i don't understand - think the underlying message is that a picture is worth a thousand words - but the way we share words prevents people from drawing pictures easily
10:16
Andy Powell -  ah... now onto repositories
10:16
Andy Powell -  asking about sourceforge and eclipse
10:17
Andy Powell -  sourceforge is a repository - for managing computer code
10:17
Andy Powell -  collaborative environment
10:17
Andy Powell -  i.e. it supports social interactions
10:18
[Comment From Guest]
@andypowell - reading chemistry in bed probably wasn't what was making the floorboards squeak in your hotel...
10:18
[Comment From Silversprite]
Bloated PDF can be nightmare for researchers handicapped by tide-dependent broadband :-(
10:19
Andy Powell -  sorry - i'm being slow to approve comments!   concentrating on presentation too hard
10:20
Andy Powell -  talking about bioclipse - open source tool
10:23
Andy Powell -  now moving on to science commons
10:23
Andy Powell -  outgrowth of creative commons
10:24
[Comment From Paul Miller]
Wondering if Owen can take over when Andy speaks, next?
10:24
Andy Powell -  possibly... but think i'm going to present from my own machine
10:25
Andy Powell -  need to protect our data (as being open) before others come along and "steal" it
10:25
Andy Powell -  explicitly flag data as being "open"
10:25
Andy Powell -  talking about publishers again - it's NOT their data!
10:29
Andy Powell -  in Q and A
10:29
Andy Powell -  I'm up next... so live blogging will stop for next 45 minutes
10:30
[Comment From Chris Keene]
Thanks Andy, hopefully someone will blog a summary of your presentation to go along with the slides.
10:33
Adrian Stevenson -  Don't forget to hit the Audacity record button Andy
11:23
Andy Powell -  ok, i'm back now... but we are going into coffee so there'll be a brief pause
11:38
Andy Powell -  right... next session about to start
11:38
Andy Powell -  Carsten Ulrich, Shanghai Jiao Tong University - Why Web 2.0 is Good for Learning and for Research: Principles and Prototypes
11:39
[Comment From Nadeem Shabir]
Andy Powell starts his talk ... on Web 2.0 and Institutional Repositories
11:39
[Comment From Nadeem Shabir]
andy talking about how we need to get the architecture for repositories right
11:39
[Comment From Rob Styles]
"PDF is a cul-de-sac" nice quote.
11:39
[Comment From Rob Styles]
"we tend to focus on service-oriented approaches" that is we focus on services on the content, rather than the content itself
11:39
[Comment From Rob Styles]
Flickr, YouTube et al - successful repositories, based on the social activity surrounding the content
11:39
[Comment From Rob Styles]
institutional repositories don't match the social networks of researchers, which are subject-based, cross-institutional and global.
11:39
[Comment From Rob Styles]
Andy suggesting that global subject-centric repositories as a possible solution
11:39
[Comment From Rob Styles]
nice quote "Thou Shalt Deposit" to force content into repositories that would otherwise remain empty
11:39
[Comment From Rob Styles]
arxiv.org as good example of how to do it, but started before we knew how to scale things like that
11:39
[Comment From Rob Styles]
Q: Why do blogs work when institutional repositories don't?
11:39
[Comment From Rob Styles]
very difficult to have a real conversation about what's best as many just want to get the next step - opening up the research using repositories and OAI-PMH - done and embedded.
11:39
[Comment From Rob Styles]
slideshare shown as an example of a web2.0 repository
11:39
[Comment From Rob Styles]
Summary: Go Simple - RSS, Tagging, full-text indexing, microformats (maybe)
11:39
[Comment From Rob Styles]
Alternatively, we look to the semantic web and add real meaning
11:39
[Comment From Rob Styles]
references to FRBR, SWAP, ORE and DCAM
11:40
[Comment From Rob Styles]
Digital Photography and Flickr fundamentally changed the nature of photography - the growth of something different and new,
11:40
[Comment From Rob Styles]
whereas scholarly publication is taking what we have done on paper and replicating it on the web.
11:40
[Comment From Rob Styles]
Carsten: "ah, so repositories are about making content available on the web"
11:40
[Comment From Rob Styles]
"I didn't get what arxiv.org was for, but had no problem understanding slideshare"
11:40
[Comment From Rob Styles]
andy: complexity is because we live in this hybrid web/paper world
11:41
Andy Powell -  apologies to rob and nadeem who tried to cover my talk - but i hadn't approved them in advance so the rest of you didn't see it
11:41
Andy Powell -  wish coveritlive had an 'approve all' mode
11:44
Andy Powell -  Carsten talking about web2.0 and learning
11:44
Andy Powell -  web 2.0 as a research tool
11:45
Andy Powell -  going to cover 3 examples i think
11:46
Andy Powell -  describing pedagogy associated with traditional teaching methods - teacher imparting knowledge thru lectures - little opportunity for discussion by students
11:46
Adrian Stevenson -  Cheers to Rob  and Nadeem. Doesn't really matter wasn't live for us remoters.
11:46
Andy Powell -  arguing that LMS (learning management systems) are teacher centred
11:47
Andy Powell -  intelligent tutoring systems try to model the knowledge of experts - e.g. in physics
11:48
Andy Powell -  based on cognitive learning theories
11:48
Andy Powell -  can recognise when topics have been understood by the student - but very expensive
11:48
Andy Powell -  still based on the idea that expert has knowledge to impart to others
11:49
Andy Powell -  citing stephen downes as example of theorists that have embraced power of web 2.0
11:50
Andy Powell -  quoting confucius: "tell me and i'll forget - show me and i may remember - involve me and i'll know it forever" (sorry, paraphrasing)
11:50
Andy Powell -  suggesting that there has been little analysis of the pedagogical value of Web 2.0
11:51
Andy Powell -  what is web 2.0?   tim o'reilly and so on...
11:51
Andy Powell -  participation
11:51
Andy Powell -  important in learning as way for students to express themselves and as new tools for teaching
11:52
Andy Powell -  "participation" as in expressing themselves in pictures, videos, etc.
11:52
Andy Powell -  facilitates "constructivist" learning
11:53
Andy Powell -  @silversprite session will remain on efoundations
11:53
Andy Powell -  requires open approach
11:54
PeteJ -  Jumping back a bit to Andy's presentation... while SlideShare, Flickr et al do the cross-institutional/social bit, they do tend to divide the world up by resource type. My "scholarly works" include papers, videos, still images, audio etc w relationships between them that cut across resource type boundaries. (This video is delivery of those slides based on that paper etc).
11:54
Andy Powell -  web 2.0 increases audience - but needs to be exploited
11:54
Andy Powell -  @petej yes, agreed
11:55
Andy Powell -  web 2.0 - huge variety of data available - thru apis - often annotated - increasingly semantic linked data
11:56
Andy Powell -  interesting for teaching - gives ability to re-combine vast array of material from real networks - real contexts
11:56
Andy Powell -  open linked data - big potential for learning
11:57
Andy Powell -  architecture of assembly - access via apis - functionality via widgets
11:57
Andy Powell -  in education can build prototypes very quickly - e.g. using yahoo pipes
11:58
[Comment From Silversprite]
When the day is done, is there a way of getting this whole "Cover It Live" session (not in PDF!)?
11:58
Andy Powell -  talking about personal learning environments
11:58
Andy Powell -  showing scott wilson's diagram
11:59
Andy Powell -  PLE for language learning built from iGoogle - drag and drop to build it
11:59
Andy Powell -  web 2.0 = perpetual beta
12:00
Andy Powell -  in education the improvements in functionality can be good but also distracting
12:00
Andy Powell -  e.g. need to adapt manuals, etc.
12:00
Andy Powell -  but developers open to feedback - which is good
12:00
Andy Powell -  e.g. asking twitter developers to open up channels explicitly for learning application
12:01
Andy Powell -  web 2.0 independent access to data - long tail - lightweight models
12:01
Andy Powell -  principles of web2.0 enable social and active learning
12:02
Andy Powell -  best used for t&l when you exploit these features = active role for teachers
12:03
Andy Powell -  web 2.0 for research - multitude of services with architecture of assembly - easy to combine stuff - quick prototyping
12:03
Andy Powell -  good for assessing hypotheses
12:03
Andy Powell -  ok, 2 examples...
12:04
Andy Powell -  1st example: learning resource creation - intention not to overload lecturers
12:04
Andy Powell -  authoring learning resources is hard, time consuming and costly
12:04
Andy Powell -  how can we help them?
12:05
Andy Powell -  hypothesis that social bookmarking should help them
12:05
Andy Powell -  get lecturers to use del.icio.us to bookmark resources - but with predefined tags for concepts, subjects, instructional types and difficulty/level
12:07
Andy Powell -  used this to extend the LMS by embedding links from del.icio.us - very low cost to implement prototype
12:07
Andy Powell -  why? - because of the del.icio.us api
12:08
Andy Powell -  feedback from lecturers was +ve - lecturer suggested also allowing students to tag resources
12:08
Andy Powell -  but students don't tag resources if textbook is good enough - so plan to use this approach on course where textbook material isn't rich enough
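To make that concrete, here's a rough sketch (mine, not anything shown in the talk) of how links bookmarked against a predefined tag vocabulary might be pulled from a del.icio.us-style JSON feed and turned into an embeddable list for the LMS - the feed URL, field names and tags are all illustrative assumptions:

```python
# Rough sketch only: the feed URL, the "u"/"d" field names and the tag
# vocabulary are assumptions, not details taken from the talk.
import json
import urllib.request

FEED = "http://feeds.delicious.com/v2/json/{user}/{tags}"  # assumed endpoint

def course_links(user, concept_tag, level_tag):
    """Fetch a lecturer's bookmarks for one concept/level and return an HTML list."""
    url = FEED.format(user=user, tags="+".join([concept_tag, level_tag]))
    with urllib.request.urlopen(url) as resp:
        bookmarks = json.load(resp)  # assumed: list of {"u": link, "d": title, ...}
    items = ['<li><a href="%s">%s</a></li>' % (b["u"], b["d"]) for b in bookmarks]
    return "<ul>\n%s\n</ul>" % "\n".join(items)

# e.g. embed the 'recursion' links pitched at beginners into a course page
print(course_links("example-lecturer", "recursion", "level-beginner"))
```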
12:09
Andy Powell -  2nd example: using microblogging for language learning
12:09
Andy Powell -  vocational learners shy, seldom active, limited time
12:09
Andy Powell -  goal was to provide practice possibilities
12:10
Andy Powell -  microblogging (twitter?) increasing sense of community
12:10
Andy Powell -  encouraged participation by reducing transactional distance to teacher - quick and easy way of active participation
12:11
Andy Powell -  yes, using twitter for this
12:11
Andy Powell -  example shown is EFL (English as a foreign language) courses
12:11
Andy Powell -  implemented by downloading all twitter contributions - grades based on number of contributions - not quality of contributions
12:12
Andy Powell -  made use of twitter api - but also needed screen scraping because of limitations in twitter api
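As a rough illustration of that grade-by-count step (definitely not the actual implementation), assuming the course updates have already been dumped to a JSON file of user/text records:

```python
# Illustrative sketch: file name, record fields and grade bands are invented.
import json
from collections import Counter

with open("course_tweets.json") as f:          # assumed dump of downloaded updates
    tweets = json.load(f)                      # e.g. [{"user": "anna", "text": "..."}, ...]

counts = Counter(t["user"] for t in tweets)    # contributions per student

def grade(n):
    # purely illustrative banding - the talk only says quantity, not quality, counted
    if n >= 50:
        return "A"
    if n >= 20:
        return "B"
    return "C" if n > 0 else "did not participate"

for user, n in counts.most_common():
    print(user, n, grade(n))
```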
12:13
Andy Powell -  98 out of 110 students participated - 5574 twitter updates over 7 weeks
12:13
Andy Powell -  teachers also contributed - ~3 updates per day
12:13
Andy Powell -  questionnaire at end of experiment
12:14
Andy Powell -  most students liked it - only 5% of students were against the use of twitter on the course
12:14
Andy Powell -  "twitter is the same as the schoolyard"
12:14
Andy Powell -  50% students said that they communicated not just with each other but with other native english speakers thru twitter
12:15
Andy Powell -  but - correction of mistakes typically not done
12:15
Andy Powell -  lessons learned...
12:15
Andy Powell -  web 2.0 can stimulate learning and participation
12:15
Andy Powell -  use of twitter continued after course had finished
12:16
Andy Powell -  social dimension was very important
12:16
Andy Powell -  students encouraged each other to participate
12:16
Andy Powell -  though sometimes students reverted to native language
12:16
Andy Powell -  teacher as both moderator and participator
12:17
Andy Powell -  little use of mobile devices - too many updates from twitter
12:17
Andy Powell -  no students integrated twitter into their blogs (they weren't shown how to do this)
12:18
Andy Powell -  now going to talk about Totuba Toolkit - start-up company in Shanghai
12:18
Andy Powell -  seeking feedback about whether what they are doing sounds useful
12:19
Andy Powell -  toolkit for creating and storing "notes"
12:19
Andy Powell -  automated intelligent suggestions for related resources
12:19
Andy Powell -  find stuff and bookmark it
12:20
Andy Powell -  add additional info and notes - what kind of resource it is
12:20
Andy Powell -  ability to export reference list of bookmarked resources
12:21
Andy Powell -  visualise what has been collected in different ways - e.g. as knowledge map
12:23
Andy Powell -  goal of totuba is to facilitate process of learning and research by removing unnecessary steps - automate integration work - make it easier to find associated resources and peers
12:24
Andy Powell -  lessons learned again...   about use of Web 2.0 for elearning generally
12:24
Andy Powell -  architecture of assembly good - prototyping good
12:24
Andy Powell -  but - reliance on third party api
12:25
Andy Powell -  yet another login - but e.g. open social graph, and openid should help with this
12:25
Andy Powell -  not all functionality available via the api
12:25
Andy Powell -  web 2.0 less suitable for "designed instruction"
12:25
Andy Powell -  good at building community
12:26
Andy Powell -  but have to be prepared for side-effects - e.g. in the 2nd example students started using twitter avatar image to share photos!
12:26
Andy Powell -  requires active teacher - stimulating use of tools
12:26
Andy Powell -  again - noting reliance on third-party tools
12:27
Andy Powell -  linked open data is still for experts
12:28
Andy Powell -  slideshare.net/ullrich for the slides
12:28
Andy Powell -  ok, now taking questions...
12:30
Andy Powell -  PMR: HE is about getting a degree - not about learning - web 2.0 great for learning, poor for assessment ??
12:31
Andy Powell -  Alan Masson, Senior Lecturer in Learning Technologies, University of Ulster - Formalising the informal - using a Hybrid Learning Model to Describe Learning Practices - up next
12:34
Andy Powell -  technology provides opportunities but need to work out how to enable teachers and learners to take advantage of it
12:34
Andy Powell -  going to be talking about Hybrid Learning Model (HLM) and implications of its use to help lecturers reflect on their teaching practice
12:35
Andy Powell -  facilitating "learner centred" reflective practice by teachers - but need to change teaching practice in order for this to happen - not easy to do
12:36
Andy Powell -  need to describe current practice - disseminate new practice - ensuring learner is "core"
12:36
Andy Powell -  developed "modeling framework" to achieve this
12:37
Andy Powell -  practitioners have a comfort zone - focus on content and assessment
12:38
Andy Powell -  learning design - i.e. IMS LD spec. - provides basis for work
12:38
Andy Powell -  structure within which content and assessment can be placed
12:38
Andy Powell -  formal schemas and vocabs
12:39
Andy Powell -  But LD not reflective in nature - it's about design - UI of tools not yet mature - beta interfaces
12:40
Andy Powell -  HLM brings together "8LEM" model (Uni of Liege) and "Closed set of learning verbs" (Sue Bennett, Uni of Wollongong)
12:40
Andy Powell -  focus on interactions between participants
12:41
Andy Powell -  describing 8LEM - see http://cetl.ulster.ac.uk/elearning/index.php?page=8LEM-1 for details
12:42
Andy Powell -  now looking at the learning verbs
12:42
Andy Powell -  need to see slides for this bit!
12:43
Andy Powell -  we have been given cards in our delegate packs apparently
12:44
Andy Powell -  have used the cards with lecturers - facilitated, informal context (to improve reflection) with the model transcribed into a relevant data grid
12:45
Andy Powell -  how are cards used??
12:45
Andy Powell -  teaching staff bring along a set of objectives
12:45
Andy Powell -  1:1 sessions lasting 45 mins to 1 hour
12:46
Andy Powell -  describing cards being used to facilitate understanding about what lecturer is trying to achieve (in a specific lesson)
12:48
Andy Powell -  trying to capture what the teacher thinks they are doing and what they expect the students to be doing
12:49
Andy Powell -  link these activities to resources and something else - urghh, i'm struggling here
12:50
Andy Powell -  HLM results in text based grid and animated activity plan - but staff also like to see a mindmap
12:51
Andy Powell -  showing a completed grid - presented as an animated walk-thru
12:51
Andy Powell -  walk-thru shows teacher's role, learner's role, and what will be learned
12:52
Andy Powell -  intention is to help students understand the process rather than just the outcomes
12:53
Andy Powell -  result is that lecturers are formalising processes that haven't been articulated before
12:53
Andy Powell -  they are creating artifacts that formalise what they do - but also that challenge what they are doing
12:54
Andy Powell -  i.e. that challenge their teaching values
12:55
Andy Powell -  this model potentially helps bridge divide between "woolly" teaching practice (what actually happens in the classroom) and highly formalised constructs such as IMS LD
12:56
Andy Powell -  benefit of the model is that small chunks of structured information provide very useful building blocks for teachers
12:58
Andy Powell -  practitioner feedback about development and use of the model +ve - e.g. "encouraged me to think about learner's perspective rather than just focusing on the teacher"
1:00
Andy Powell -  describing evaluation by learners of the use of the model - intention was to help year 1 students adapt to the new learning environment of uni
1:02
Andy Powell -  student: "helped me to bring everything together and know what is expected of me"
1:03
Andy Powell -  verbs helped students understand what processes were expected of them
1:08
Andy Powell -  sorry... i'm still here... but struggling to get my head round some of this... my problem, not the speaker's
1:10
Andy Powell -  use cases in which this approach is expected to be relevant...
1:11
Andy Powell -  raising awareness of learner perspective in teaching and learning processes
1:11
Andy Powell -  reflecting on and reviewing current practice
1:11
Andy Powell -  planning and designing course materials
1:11
Andy Powell -  providing reference framework for course administration
1:12
Andy Powell -  assisting students to adapt to new learning situations
1:13
Andy Powell -  summary - light model - easy to capture stuff - focus on practice - focus on learner perspective - multiple use cases - +ve evaluations so far - model formally embedded into Uni of Ulster (thru staff induction) - formalising the informal
1:15
Andy Powell -  one of the drivers for this is the need to bring in a more diverse range of students and address issues around retention
1:15
Andy Powell -  also trying to address cultural issues between student base (partic. new students) and older base of existing staff
1:17
Andy Powell -  it'll be lunch in a minute... back in a while
1:20
Andy Powell -  assessment of model follows: "usability, use, impact" track - project is in 1st year - so not into 'impact' phase yet
1:20 [Be Right Back Countdown] 30 minutes
2:00
Chris Keene -  thinking Chris Clarke should just be taking the stage?
2:06
Andy Powell -  ok, i'm back from lunch - sorry we are running slightly late now
2:07
Andy Powell -  Chris Clarke, Talis - Project Xiphos - Creating a Social Network from a Web of Scholarly Data
2:07
Andy Powell -  describing the early web
2:08
Andy Powell -  what we have today is a web of documents - millions of documents - but they are human oriented - machines can't understand them
2:09
Chris Keene -  yes, how dare you take lunch away from your laptop :)
2:09
Andy Powell -  looking at a 'simple' google query "how many people were evacuated during hurricane katrina"
2:09
Andy Powell -  Google gives a fairly decent answer - from wikipedia page
2:10
Andy Powell -  but - more by luck than judgement
2:11
Andy Powell -  one of the problems with the web is that the meaning of links is hidden - not machine-understandable - makes page rank less useful than it might be
2:11
Andy Powell -  semantics of links cannot be determined by machines
2:11
Andy Powell -  arguing that we need a machine-readable web
2:12
Andy Powell -  back to wikipedia page about katrina - it contains lots of assertions about facts and so on
2:12
Andy Powell -  DBpedia has derived 218 million assertions out of wikipedia
2:13
Andy Powell -  www.powerset.com have built a user-experience out of the DBpedia data - which is openly available
2:13
Andy Powell -  gives much better result than that obtained by simple Google search
2:15
Andy Powell -  http://dbpedia.org/
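For anyone who wants to poke at this themselves, a minimal sketch of asking DBpedia's public SPARQL endpoint for some of those extracted assertions from Python - the choice of property here is just an illustrative assumption:

```python
# Minimal sketch: queries DBpedia's public SPARQL endpoint; the property
# chosen (the English abstract) is just one illustrative example.
import json
import urllib.parse
import urllib.request

ENDPOINT = "http://dbpedia.org/sparql"
QUERY = """
SELECT ?abstract WHERE {
  <http://dbpedia.org/resource/Hurricane_Katrina>
      <http://dbpedia.org/ontology/abstract> ?abstract .
  FILTER (lang(?abstract) = "en")
}
"""

params = urllib.parse.urlencode({
    "query": QUERY,
    "format": "application/sparql-results+json",
})
with urllib.request.urlopen(ENDPOINT + "?" + params) as resp:
    results = json.load(resp)

for row in results["results"]["bindings"]:
    print(row["abstract"]["value"][:200])
```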
2:15
Andy Powell -  just like web of documents, the web of data is distributed
2:15
Andy Powell -  lots of participants
2:16
Andy Powell -  describing what talis is - i'll spare you the details - mentioning the talis platform
2:16
Andy Powell -  "doing the heavy lifting for the semantic web"
2:16
Andy Powell -  chris going to talk about project xiphos
2:17
Andy Powell -  what can we (talis) do using a web of scholarly data?
2:17
Andy Powell -  given metadata for 500 articles by a friendly company - sorry, i missed the name of the company
2:18
Andy Powell -  developed visualisation of the relationships between those articles
2:19
Rob Styles -  The visualisation is Relation Browser by Moritz Stefaner
2:19
Andy Powell -  thanks rob
2:19
Andy Powell -  different colors representing different relationships between entities in the data
2:19
Andy Powell -  19800 distinct articles - mainly thru citations
2:20
Andy Powell -  21029 people
2:20
Andy Powell -  instigated a small talis project to investigate how this data can be made useful
2:20
Rob Styles -  4 people for one month, for all design and coding
2:21
Andy Powell -  developed a scholarly social network prototype
2:22
Andy Powell -  demoing from PoV of real female researcher who is listed in the sample graph
2:22
Andy Powell -  search for "flowers" (a person's name - the name of one of her collaborators)

2:23
Andy Powell -  results categorised into 4 tabs - things, people, subjects, collections
2:23
Andy Powell -  TP Flower (the person she is looking for) appears under 'people' tab
2:24
Andy Powell -  Xiphos can pre-build 'home' page for people based on knowledge in the sample graph
2:24
Andy Powell -  'home' page shows 'work', 'knows' and 'collections' tabs
2:25
Andy Powell -  the xiphos system then allows end-users to augment the computed information by hand - e.g. by clicking on an 'i know this person' link
2:26
Andy Powell -  system prompts for registration info, then tries to marry up name and other info against knowledge in the graph
2:26
Andy Powell -  takes newly registered user to their new homepage - a rich page because of knowledge in the graph
2:27
Andy Powell -  though it might also contain some errors because of fuzziness in the graph
2:28
Andy Powell -  info includes the person's 'network' - 4 types of relationships - people you know, people you cite, people who cite you, people that you are watching
2:28
Andy Powell -  the last of these gives people a way of "watching" (i.e. tracking) someone, without indicating that you formally know them
2:29
Andy Powell -  "watching" someone is a one way relationship i.e. you know who you are watching but you don't know who is watching you
2:30
Andy Powell -  clicking on a 'work' (i.e. a publication) takes you to a page for that work - overview, citations, cited by, collections
2:31
Andy Powell -  also offers a thumbnail preview of the document itself - but small enough not to break copyright (arguably)
2:31
Andy Powell -  navigate thru citations both inbound and outbound
2:32
Rob Styles -  The thumbnails in the prototype were generated with permission ;-)
2:32
Andy Powell -  collections give a way to organise stuff in ways that are relevant to the end user
2:33
Andy Powell -  collections can be watched
2:33
Andy Powell -  people can be members of collections i think
2:33
Andy Powell -  ??
2:33
Rob Styles -  did he mention you can add people to collections
2:33
Andy Powell -  not sure
2:34
Andy Powell -  can also organise stuff by events
2:35
Rob Styles -  well, people can be collaborators on the creation and management of a collection, they can also be an entry in a collection - as in a collection of people of interest
2:35
Andy Powell -  (events not implemented - just wire-framed currently)
2:35
Andy Powell -  also offers 'repository' functionality in the form of a 'Vault'
2:36
Andy Powell -  bit like sourceforge - but could offer view across distributed set of repositories
2:36
Andy Powell -  why a social network?
2:37
Andy Powell -  because it encourages users to clean/correct the data in ways that can't be done purely automatically
2:37
Andy Powell -  what else can be done with the graph?
2:38
Andy Powell -  want other players to be able to build stuff on it
2:39
Andy Powell -  encouraging people to think about "if you own metadata, what is its place in the web of data?"
2:39
Andy Powell -  talis.com/xiphos for full details
2:40
Andy Powell -  Q: what is your business model?   none currently - this is a prototype - but could think about pay-per-view - open access
2:40
Andy Powell -  PaulM: this work is about showing what is possible - new business models may emerge
2:41
Andy Powell -  Talis looking for/hoping for new data set that can be made more openly available
2:41
Andy Powell -  source code is available
2:41
Andy Powell -  or can be made available
2:42
Andy Powell -  Q: platform is a place to put data - but what stops it from becoming yet another silo - what still needs to be put in place to be able to work across different 'platform'-like services
2:43
Andy Powell -  robots will gather it all in - by following links in the data - the linked open data community are working on this
2:44
Andy Powell -  Q: do tools exist right now?
2:44
Andy Powell -  ask Tom Heath :-)
2:44
Andy Powell -  Both Google and Yahoo are working in this space
2:45
Andy Powell -  Q: how are modifications to the graph treated - in terms of versioning?
2:45
Andy Powell -  stored in such a way that changes can be rolled back
2:47
Andy Powell -  next up...
2:47
Andy Powell -  Ian Corns, Talis - Project Zephyr: Letting Students Weave Their Own Path
2:48
Andy Powell -  disconnection of teaching/learning styles - students of google generation
2:49
Andy Powell -  digital natives - desire multimedia environment - always connected - real and virtual in parallel
2:49
Andy Powell -  multi-plexing
2:50
Andy Powell -  disliked activities are simply skipped - huh??
2:50
Andy Powell -  they are content producers - particularly in a "rip, mix and burn" context
2:51
Andy Powell -  52% of first year undergrads are mature students (>21) but characteristics of google generation now more widely shared
2:51
Andy Powell -  now talking about resource lists - existing product is talis list
2:52
Andy Powell -  take a reading list from a lecturer and represent it electronically
2:52
Andy Powell -  access to resources more seamless for students - management of resources better for library
2:53
Andy Powell -  but - talis list not meeting the needs of google gen. students
2:53
Andy Powell -  project zephyr intended to overcome this
2:54
Andy Powell -  web of scholarly data - the reading list provides a significant way of linking resources - very interesting and valuable semantics
2:54
Andy Powell -  talis list is seen as a library application - major hurdles in getting lecturers to engage with it
2:55
Andy Powell -  student gets most benefit - costs lie with lecturer
2:55
Andy Powell -  need to find a way to provide lecturer with some value
2:55
Andy Powell -  zephyr intended to do this
2:57
Andy Powell -  1st benefit to lecturers: by creating list in one place (zephyr) it can be surfaced in multiple places (facebook, library, etc.)
2:58
Andy Powell -  2nd benefit to lecturers: improve quality and depth of reading lists - e.g. see which resources have already been used elsewhere in other lists
2:59
Andy Powell -  3rd benefit: connecting the student with the lecturer - social networking - allowing student to ask questions, provide feedback, etc. - also connecting to peers
2:59
Andy Powell -  typical list has 500 items in it??   did he really say that!?
3:00
Andy Powell -  student experience... talis list is very much in the 'web of documents' vein
3:01
Andy Powell -  now showing screen shots of zephyr
3:02
Andy Powell -  lists presented thru navigable interface based on uni organisation hierarchy
3:02
Andy Powell -  having selected a list - student is shown a list of books and other stuff
3:03
Andy Powell -  can then filter by type
3:03
Andy Powell -  can also view a superset of multiple lists
3:04
Andy Powell -  can spot trends in terms of resources that are used in multiple modules
3:04
Andy Powell -  student can prioritise reading based on this
3:04
Andy Powell -  preserve structure of list as created by the lecturer
3:05
Andy Powell -  drill down into individual resources - including reviewing and rating by end-users
3:05
[Comment From Owen Stephens]
500 items not atypical - especially in humanities/arts (although sounds high as an average - science lists tend to be v short)
3:06
Andy Powell -  can also add resources into groups for collaborative activities, assignments, etc.
3:09
Andy Powell -  now looks like we have Nadeem Shabir (Talis) - not on programme i think
3:12
Andy Powell -  Note that Owen Stephens has been blogging day's talks at http://www.meanboyfriend.com/overdue_ideas/
3:12
Andy Powell -  presentation title is Open World Thinking
3:13
Andy Powell -  asking "what is the most widely used resource in first year undergrad comp sci courses?"
3:13
Andy Powell -  how can we answer that question?
3:13
Andy Powell -  data buried inside institutions
3:13
Andy Powell -  HE is silos within silos
3:15
Andy Powell -  institutions have been walled gardens - therefore solutions sold into institutions by third parties reflect that
3:15
Andy Powell -  the value in being open not recognised
3:15
Andy Powell -  not a technical problem
3:16
Andy Powell -  need a fundamental shift in thinking about openness
3:16
Andy Powell -  linked data is not just the semantic web done right - it is the web done right
3:17
Andy Powell -  designing for appropriation
3:17
Andy Powell -  us census data was made open without a specific application in mind
3:18
Andy Powell -  openness of description = coming to an agreed view on ways of describing things
3:18
Andy Powell -  then you can share stuff - integrate stuff - relate stuff
3:19
Andy Powell -  open descriptions + dereferenceable uris gives you interoperability for free
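A tiny illustration of that point (mine, not the speaker's): two independently published descriptions that agree on URIs and vocabularies merge into a single graph with no bespoke integration code. This uses the rdflib package; the URIs and data are invented:

```python
# Invented example data; requires the rdflib package.
from rdflib import Graph, URIRef

library_data = """
@prefix dct: <http://purl.org/dc/terms/> .
<http://example.org/id/book/1> dct:title "Weaving the Web" .
"""

reading_list_data = """
@prefix dct: <http://purl.org/dc/terms/> .
<http://example.org/id/book/1> dct:isReferencedBy <http://example.org/id/list/cs101> .
"""

g = Graph()
g.parse(data=library_data, format="turtle")
g.parse(data=reading_list_data, format="turtle")

# everything now known about the book, from both sources, with no mapping step
book = URIRef("http://example.org/id/book/1")
for predicate, obj in g.predicate_objects(book):
    print(predicate, obj)
```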
3:19
Andy Powell -  ontologies - formal representation of set of concepts within a domain
3:19
Andy Powell -  foaf - sioc - skos
3:20
Andy Powell -  e.g. the hybrid learning model (HLM) (shown earlier) provides us with a simple ontology
3:21
Andy Powell -  Talis ontologies include...
3:22
Andy Powell -  Academic Institutions Internal Structures Ontology
3:22
Andy Powell -  Generic Lifecycle (workflow) Ontology
3:22
Andy Powell -  Resource List Ontology (sioc, bibo, foaf)
3:22
Andy Powell -  see www.vocab.org
3:22
Andy Powell -  Openness of access...
3:23
Andy Powell -  anywhere, anytime, anyhow (i.e. not necessarily within a web browser)
3:23
Andy Powell -  this is a key to personalised learning
3:25
Andy Powell -  don't sell applications anymore - build contextualised views on web of data
3:26
Andy Powell -  it's only by being more open that the first question can be answered
3:26
Andy Powell -  openness is key to being able to rip, mix and burn
3:27
Andy Powell -  Xiphos is being built from ground up to embrace these kinds of questions
3:31
Andy Powell -  last presentation has finished - we are going into open discussion session - am going to sign off shortly

June 04, 2008

ORE Implementer Community Wiki

A very quick addendum to my post yesterday about the Beta ORE specs: the ORE Technical Committee has set up a "community wiki" for the use of implementers examining and using these specs, to build up a collection of notes of experiences, reflections, "best practice", etc, and generally for sharing any other useful supplementary materials. The structure and content are fairly skeletal at the moment, but will be expanded over the coming days.

Thanks to Rob Sanderson of the University of Liverpool for setting this up.

June 03, 2008

Beta Release of ORE Specifications and User Guides

You've probably seen this announcement on various mailing lists by now, but yesterday Carl Lagoze and Herbert Van de Sompel announced the publication of "beta" versions of the specifications being developed by the OAI ORE project:

Over the past eighteen months the Open Archives Initiative <http://www.openarchives.org/> (OAI), in a project called Object Reuse and Exchange <http://www.openarchives.org/ore/> (OAI-ORE), has gathered international experts from the publishing, web, library, and eScience community to develop standards for the identification and description of aggregations of online information resources.  These aggregations, sometimes called compound digital objects, may combine distributed resources with multiple media types including text, images, data, and video.  The goal of these standards is to expose the rich content in these aggregations to applications that support authoring, deposit, exchange, visualization, reuse, and preservation.  Although a motivating use case for the work is the changing nature of scholarship and scholarly communication, and the need for cyberinfrastructure to support that scholarship, the intent of the effort is to develop standards that generalize across all web-based information including the increasing popular social networks of "web 2.0".

I'm a member of the "editorial group" of the Technical Committee which worked on the documents. To be honest, I've struggled to find the time to make as much input as I'd have liked over the last couple of weeks, so I'm grateful to the other members of that group for getting content into shape for this release.

May 02, 2008

SWAP and ORE

There's an interesting mini-thread on the jisc-repositories list, started by Google's announcement that it is dropping support for OAI-PMH, which Paul Walk blogged recently.

I reproduce my contribution here, partly because I haven't been blogging much lately and it seems a shame to waste the text :-) and partly because email messages have a tendency to disappear into the ether.

It seems to me that Google's lack of support for the OAI-PMH is largely a non-event - because their support for it was (ultimately) a non-event.  They never supported it fully in any case AFAIK and, in some cases at least, support was broken because they didn't recognise links higher in the server tree than the OAI base URL.

It highlights the fact that the OAI-PMH will never be a mainstream Web protocol, but so what... I think we spotted that anyway!

There are technical reasons why the OAI-PMH was always going to struggle (I say that only with the benefit of hindsight) because of its poor fit with the Web Architecture.  Whilst I don't suppose that directly factored into Google's thinking in any sense, I think it is worth remembering.

On the 'social' thing I very strongly agree and I've argued several times in the past that we need to stop treating stores of content purely as stores of content and think about the social networks that need to build up around them.  It seems to me that the OAI-PMH has never been a useful step in that direction in the way that, say, RSS has been in the context of blogging.

Simple DC suffers from being both too complex (i.e. more complex than RSS) and too simple (i.e. not rich enough to meet some scholarly functional requirements).  Phil Cross suggests that we need to move towards a more complex solution, i.e. SWAP.  OAI-ORE takes a different but similar step in the direction of complexity - though it is probably less conceptually challenging than SWAP in many ways.  ORE's closeness to Atom might be its saving grace - on the other hand, its differences from Atom might be its undoing.  Come back in 3 years' time and I'll tell you which! :-)

I like SWAP because I like FRBR... and whenever I've sat down and worked with FRBR I've been totally sold on how well it models the bibliographic world.  But, and it's a very big but, however good the model is, SWAP is so conceptually challenging that it is hard to see it being adopted easily.

For me, I think the bottom line question is, "do SWAP or ORE help us build social networks around content?".  If the answer is "no", and I guess in reality I think the answer might well be "no", then we are focusing our attention in the wrong place.

More positively, I note that "SWAP and ORE" has quite a nice ring to it! :-)

April 15, 2008

IMLS Digital Collections & Content

Another somewhat belated post.... Andy and I both get occasional invitations to be members of advisory/steering groups for various programmes and projects operating in the areas in which we have an interest. I'm currently a member of the Advisory Group for the second phase of the Digital Collections and Content project which is funded by the Institute of Museum and Library Services and led by a team at the University of Illinois at Urbana-Champaign. Given the UK focus of the Foundation, it's probably slightly unusual for me to take on such a role for a US project, but it combines a number of our interests - repositories, resource discovery, metadata, the use of cultural heritage resources for learning and research, and I have also worked with some members of the project team in the past in the development of the Dublin Core Collections Application Profile.

The group met recently in Chicago, and although I wasn't able to attend the meeting in person, I managed to join in by phone for a couple of hours. One area in which the project seems to be doing some interesting work is in the relationships between collection-level description and item description, and in particular the use of algorithms/rules by which item-level metadata might be inferred from collection-level metadata.

The project is also exploring how collection-level metadata might be presented more effectively during searching, particularly to provide contextual information for individual items.

April 14, 2008

Open Repositories 2008

I spent a large part of the week before last (Tuesday, Wednesday & Friday) at the Open Repositories 2008 conference at the University of Southampton.

There were somewhere around 400 delegates there, I think, which I guess is an indicator of the considerable current level of interest around the R-word. Interestingly, if I recall conference chair Les Carr's introductory summary of stats correctly, nearly a quarter of these had described themselves as "developers", so the repository sphere has become a locus for debate around technical issues, as well as the strategic, policy and organisational aspects. The JISC Common Repository Interfaces Group (CRIG) had a visible presence at the conference, thanks to the efforts of David Flanders and his comrades, centred largely around the "Repository Challenge" competition (won by Dave Tarrant, Ben O’Steen and Tim Brody with their "Mining with ORE" entry).

The higher than anticipated number of people did make for some rather crowded sessions at times. There was a long queue for registration, though that was compensated for by the fact that I came away from that process with exactly two small pieces of paper: a name badge inside an envelope on which were printed the login details for the wireless network. (With hindsight, I could probably have done with a one page schedule of what was on in which location - there probably was one which I missed picking up!) Conference bags (in a rather neat "vertical" style which my fashion-spotting companions reliably informed me was a "man bag") were available, but optional. (I was almost tempted, as I do sport such an accessory at weekends, and it was black rather than dayglo orange, but decided to resist on the grounds that there was a high probability of it ending up in the hotel wastepaper bin as I packed up to leave.) Nul points, however, to those advertisers who thought it was a good idea to litter every desktop surface in the crowded lecture theatre with their glossy propaganda, with the result that a good proportion of it ended up on the floor as (newly manbagged-up) delegates squeezed their way to their seats.

The opening keynote was by Peter Murray-Rust of the Unilever Centre for Molecular Informatics, University of Cambridge. With some technical glitches to contend with - which must have been quite daunting in the circumstances; Peter has posted a quick note on his view of the experience ("I have no idea what I said" :-)) - Peter delivered a somewhat "non-linear" but always engaging and entertaining overview of the role of repositories for scientific data. He noted the very real problem that while ever increasing quantities of data are being generated, very little of it is being successfully captured, stored and made accessible to others. Peter emphasised that any attempt to capture this data effectively must fit in with the existing working practices of scientists, and must be perceived as supporting the primary aims of the scientist, rather than introducing new tasks which might be regarded as tangential to those aims. And the practices of those scientists may, in at least some areas of scientific research, be highly "locally focused", i.e. the scientists see their "allegiances" as primarily to a small team with whom data is shared - at least in the first instance - an approach categorised as "long tail science" (a term attributed to Peter's colleague Jim Downing). Peter supported his discussion with examples drawn from several different e-Chemistry projects and initiatives, including the impressive OSCAR-3 text mining software which extracts descriptions of chemical compounds from documents.

Most of the remainder of the Tuesday and Wednesday I spent in paper sessions. The presentation I enjoyed most was probably the one by Jane Hunter from the University of Queensland on the work of the HarvANA project on a distributed approach to annotation and tagging of resources from the Picture Australia collection (in the first instance at least - at the end, Jane whipped through a series of examples of applying the same techniques to other resources). Jane covered a model for annotation and tagging based on the W3C Annotea model, a technical architecture for gathering and merging distributed annotations/taggings (using OAI-PMH to harvest from targets at quite short time intervals (though those intervals could be extended if preferred/required)), browser-based plug-in tools to perform annotation/tagging, and also touched on the relationships between tagging and formally-defined ontologies. The HarvANA retrieval system currently uses an ontology to enhance tag-based retrieval - "ontology-based or ontology-directed folksonomy" - but the tags provided could also contribute to the development/refinement of that ontology, "folksonomy-directed ontology". Although it was in many ways a repository-centric approach and Jane focused on the use of existing, long-established technologies, she also succeeded in placing repositories firmly in the context of the Web: as systems which enable us to expose collections of resources (and collections of descriptions of those resources), which then enter the Web of relationships with other resources managed and exposed by other systems - here, the collections of annotations exposed by the Annotea servers, but potentially other collections too.
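By way of illustration, a minimal sketch of the kind of OAI-PMH ListRecords request that this style of frequent, incremental harvesting relies on - the base URL is hypothetical, though the verb, parameters and namespaces are standard OAI-PMH 2.0:

```python
# Sketch only: the base URL is hypothetical; a real harvester would also
# handle resumption tokens and errors.
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

BASE = "http://annotations.example.org/oai"   # hypothetical OAI-PMH endpoint
OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

params = urllib.parse.urlencode({
    "verb": "ListRecords",
    "metadataPrefix": "oai_dc",
    "from": "2008-04-01",                      # only changes since the last harvest
})
with urllib.request.urlopen(BASE + "?" + params) as resp:
    tree = ET.parse(resp)

for record in tree.iter(OAI + "record"):
    identifier = record.findtext(".//" + OAI + "identifier")
    title = record.findtext(".//" + DC + "title")
    print(identifier, title)
```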

At Wednesday lunch time, (once I managed to find the room!) I contributed to a short "birds of a feather" session co-ordinated by Rosemary Russell of UKOLN and Julie Allinson of the University of York on behalf of the Dublin Core Scholarly Communications Community. We focused mainly on the Scholarly Works Application Profile and its adoption of a FRBR-based model, and talked around the extension of that approach to other resource types which is under consideration in a number of sibling projects currently being funded by JISC. (Rather frustratingly for me, this meeting clashed with another BoF session on Linked Data which I would really have liked to attend!)

I should also mention the tremendously entertaining presentation by Johan Bollen of the Los Alamos National Laboratory on the research into usage metrics carried out by the MESUR project. Yes, I know, "tremendously entertaining" and "usage statistics" aren't the sort of phrases I expect to see used in close proximity either. Johan's base premise was, I think, that seeking to illustrate impact through blunt "popularity" measures was inadequate, and he drew a distinction between citation - the resources which people announce in public that they have read - and usage - the actual resources they have downloaded. Based on a huge dataset of usage statistics provided by a range of popular publishers and aggregators, he explored a variety of other metrics, comparing the (surprisingly similar) rankings of journals obtained via several of these metrics with the rankings provided by the citation-based Thomson impact factor. I'm not remotely qualified to comment on the appropriateness of Johan's choice of algorithms, but the fact that Johan kept a large audience engaged at the end of a very long day was a tribute to his skill as a presenter. (Though I'd still take issue with the Britney (popular but insubstantial?)/Big Star (low-selling but highly influential/lauded by the cognoscenti) opposition: nothing by Big Star can compare with the strutting majesty of "Toxic". No, not even "September Gurls".)

On the Friday, I attended the OAI ORE Information Day, but I'll make that the subject of a separate post.

All in all - give or take a few technical hiccups - it was a successful conference, I think (and thanks to Les and his team for their hard work) - perhaps more so in terms of the "networking" that took place around the formal sessions, and the general "buzz" there seemed to be around the place, than because of any ground-breaking presentations.

And yet, and yet... at the end of the week I did come away from some of the sessions with my niggling misgivings about the "repository-centric" nature of much of the activity I heard described slightly reinforced. Yes, I know: what did I expect to hear at a conference called "Open Repositories"?! :-) But I did feel an awful lot of the emphasis was on how "repository systems" communicate with each other (or how some other app communicates with one repository system and then with another repository system ) e.g. how can I "get something out" of your repository system and "put it into" my repository system, and so on. It seems to me that - at the technical level at least - we need to focus less on seeing repository systems as "specific" and "different" from other Web applications, and focus more on commonalities. Rather than concentrating on repository interfaces we should ensure that repository systems implement the uniform interface defined by the RESTful use of the HTTP protocol. And then we can shift our focus to our data, and to

  • the models or ontologies (like FRBR and the CIDOC Conceptual Reference Model, or even basic one-object-is-made-available-in-multiple-formats models) which condition/determine the sets of resources we expose on the Web, and see the use of those models as choices we make rather than something "technologically determined" ("that's just what insert-name-of-repository-software-app-of-choice does");
  • the practical implementation of formalisms like RDF which underpin the structure of our representations describing instances of the entities defined by those models, through the adoption of conventions such as those advocated by the Linked Data community

In this world, the focus shifts to "Open (Managed) Collections" (or even "Open Linked Collections"), collections of documents, datasets, images, of whatever resources we choose to model and expose to the world. And as a consumer of those resources  I (and, perhaps more to the point, my client applications) really don't need to know whether the system that manages and exposes those collections is a "repository" or a "content management system" or something else (or if the provider changes that system from one day to the next): they apply the same principles to interactions with those resources as they do to any other set of resources on the Web.

March 05, 2008

Concentration and diffusion - the two ways of Web 2.0

Lorcan Dempsey has now blogged his ideas around two key aspects of Web 2.0, concentration and diffusion, The two ways of Web 2.0, which I referred to in my keynote at VALA 2008 but was unable to cite properly.

As I said in my talk, I think these two concepts are very helpful as we think about the impact of Web 2.0 on the kinds of online services we build and use in the education space.

February 26, 2008

Preserving the ABC of scholarly communication

Somewhat belatedly, I've been re-reading Lorcan Dempsey's post from October last year, Quotes of the day (and other days?): persistent academic discourse, in which he ponders the role of academic blogs in scholarly discourse and the apparent lack of engagement by institutions in thinking about their preservation.

I like Grainne Conole's characterisation of the place of blogging in scholarly communication:

  • Academic paper: reporting of findings against a particular narrative, grounded in the literature and related work; style – formal, academic-speak
  • Conference presentation: awareness raising of the work, posing questions and issues about the work, style – entertaining, visual, informal
  • Blogging – snippets of the work, reflecting on particular issues, style – short, informal, reflective

(even though it would have been better in alphabetical order! :-) ) and I'm tempted to wonder whether and how this characterisation will change over the next few years, as blogging continues to grow in importance as a communication medium.

Lorcan ends with:

Universities and university libraries are recognizing that they have some responsibility to the curation of the intellectual outputs of their academics and students. So far, this has not generally extended to thinking about blogs. What, if anything, should the Open University or Harvard be doing to make sure that this valuable discourse is available to future readers as part of the scholarly record?

As I argued in my most recent post about repositories, I suspect that most academics would currently expect to host their blogs outside their institution.  (Note that I'm hypothesising here, since I haven't asked any real academics this question - however, the breadth and depth of external blog services seems so overwhelming that it would be hard for institutions to try to compel their academics to use an institutional blogging service IMHO). This leaves institutions (or anyone else for that matter) that want to curate the blogging component of their intellectual output with a problem.  Somehow, they have to aggregate their part of the externally held scholarly record into an internal form, such that they can curate it.

I don't see this as an impossible task - though clearly, there is a challenge here in terms of both technology and policy.

In the context of the debate about institutional repositories, my personal opinion is that this situation waters down the argument that repositories have to be institutional because that is the only way in which the scholarly record can be preserved.  Sorry, I just don't buy it.

February 21, 2008

Linked Data (and repositories, again)

This is another one of those posts that started life in the form of various drafts which I didn't publish because I thought they weren't quite "finished", but then seemed to become slightly redundant because anything of interest had already been said by lots of other people who were rather more on the ball than I was. But as there seems to be a rapid growth of interest in this area at the moment, and as it ties in with some of the themes Andy highlights in his recent posts about his presentation at VALA 2008, I thought I'd make an effort to try to pull some of these fragments together.

If I'd got round to compiling my year-end Top 5 Technical Documents list for 2007 (whaddya mean, you don't have a year-end Top 5 Technical Documents list?), my number one would have been How to Publish Linked Data on the Web by Chris Bizer, Richard Cyganiak and Tom Heath.

In short, the document fleshes out the principles Tim Berners-Lee sketches in his Linked Data note - essentially the foundational principles for the Semantic Web. As Berners-Lee notes

The Semantic Web isn't just about putting data on the web. It is about making links, so that a person or machine can explore the web of data.  With linked data, when you have some of it, you can find other, related, data. (emphasis added)

And the key to realising this, argues Berners-Lee, lies in following four base rules:

  1. Use URIs as names for things.
  2. Use HTTP URIs so that people can look up those names.
  3. When someone looks up a URI, provide useful information.
  4. Include links to other URIs. so that they can discover more things.

Bizer, Cyganiak & Heath present linked data as a combination of key concepts from the Web Architecture on the one hand (including the TAG's resolution to the httpRange-14 issue) and the RDF data model on the other, and distill them into a form which is on the one hand clear and concise, and on the other backed up by effective, practical guidelines for their application. While many of those guidelines are available in some form elsewhere (e.g. in TAG findings or in notes such as Cool URIs...), it's extremely helpful to have these ideas collated and presented in a very practically focused style.
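To make that concrete, here's a small client-side sketch of the pattern those guidelines describe: ask for RDF at a (non-information) resource URI and let the server's httpRange-14-style 303 redirect take you to a description document. The URI is a real DBpedia identifier, though the exact redirect target and formats on offer may vary:

```python
# Client-side sketch; urllib follows the 303 redirect for us.
import urllib.request

uri = "http://dbpedia.org/resource/Tim_Berners-Lee"
req = urllib.request.Request(uri, headers={"Accept": "application/rdf+xml"})

with urllib.request.urlopen(req) as resp:
    print("Asked for:    ", uri)
    print("Redirected to:", resp.geturl())                 # the description document
    print("Content type: ", resp.headers.get("Content-Type"))
    rdf_xml = resp.read()                                   # RDF/XML, ready to parse
```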

As an aside, in the course of assembling those guidelines, they suggest that some of those principles might benefit from some qualification, in particular the use of URI aliases, which the Web Architecture document suggests are best avoided. For the authors,

URI aliases are common on the Web of Data, as it can not realistically be expected that all information providers agree on the same URIs to identify a non-information resources. URI aliases provide an important social function to the Web of Data as they are dereferenced to different descriptions of the same non-information resource and thus allow different views and opinions to be expressed. (emphasis added)

I'm prompted to mention Linked Data now in part by Andy's emphasis on Web Architecture and Semantic Web technologies, but also by a post by Mike Bergman a couple of weeks ago, reflecting on the growth in the quantity of data now available following the principles and conventions recommended by the Bizer, Cyganiak & Heath paper. In his post, Bergman includes a copy of a graphic from Richard Cyganiak providing a "bird's-eye view" of the Linked Data landscape, and highlighting the principal sources by domain or provider.

"What's wrong with that picture?", as they say. I was struck (but not really surprised) by the absence - with the exception of the University of Southampton's Department of Electronics & Computer Science - of any of the data about researchers and their outputs that is being captured and exposed on the Web by the many "repository" systems of various hues within the UK education sector. While in at least some cases institutions (or trans-institutional communities) are having a modicum of success in capturing that data, it seems to me that the ways in which it is typically made available to other applications mean that it is less visible and less usable than it might be.

Or, to borrow an expression used by Paul Miller of Talis in a post  on Nodalities, we need to think about how to make sure our repository systems are not simply "on the Web" but firmly "of the Web" - and the practices of the growing Linked Data community, it seems to me, provide a firm foundation for doing that.

February 20, 2008

Repositories follow-up - global vs. institutional

There have been a number of responses to my VALA 2008 keynote on the future of repositories, which Brian Kelly has helpfully summarised to a large extent in a post on his blog.  There are several themes here, which probably need to be separated out for further discussion.  One such is my emphasis on building 'global' (as opposed to 'institutional') repository services.

Before I do that however, I just want to clarify one thing.  Mike Ellis suggests that he is "bemused as to *why* repositories (at all)".  I'll leave others to answer that.  Suffice to say that I was not intending to argue that the management of scholarly stuff (and the workflows around that stuff) is unimportant.  Of course it is important.  Just that our emphasis should not be on the particular kinds of systems that we choose to use to undertake that management, but on the bigger objective of open access and how whatever systems we put in place surface content on the Web and support the construction of compelling scholarly social networks.  I am perfectly happy that some people will build systems that they choose to call repositories.  Others will build content management systems.  Still others something else.  The labeling is almost irrelevant (except insofar as it doesn't get in the way of communicating the broader 'open access' message).

OK, back to the issue of global vs. institutional services.  Rachel Heery says:

I don’t really see that there is conflict between encouraging more content going into institutional repositories and ambitions to provide more Web 2.0 type services on top of aggregated IR content. Surely these things go together?

Paul Walk makes a similar point in his blogged response:

The half sentence I don’t quite buy is the “global repository services”. Why can’t we “focus on building and/or using global scholarly social networks” (which I support) based on institutional repository services? We don’t have a problem with institutional web sites do we? Or institutional library OPACs? We have certainly managed to network the latter on a global scale, and built interesting services around this...

Yes, point(s) taken... though I think that the institutional Web site and the OPAC are not primarily 'social networks' (and even if they are, the network they are serving is largely institutionally focussed) so there is a difference.  As I argued in the original blog entry, scholarly social networks are global in nature (or at least extra-institutional).

Of course, the blogosphere is a good example of a global social network being layered on top of a distributed base of content.  On the face of it this seems to argue against my 'global repository' view.  So what is different?  Well, to be honest I'm not sure.  Clearly, the blogosphere is not built out of 'institutional' blog services and my strong suspicion is that if we approached academic blogging in the same way we approach academic repositories we would rapidly kill off its future as a means of scholarly communication :-) .  Long live an open, free market approach to the provision of blogs!  God help us if institutions start trying to lay down the law about when and where its members can blog.  There is a role for institutional blogging services but only as part of a wider landscape of options where individuals can pick and choose a solution that is most appropriate to them.

And that is one of my fundamental points about repositories I guess...  when institutional repositories stop being an option that individuals can choose to make use of and instead become the only option on the table because that is what mandates and policies say must be used, we have a problem.  Instead we need to focus on making scholarly content available on the Web in whatever form makes sense to individual scholars.  My strong suspicion is that if someone came along and built a global research repository, let's call it ResearchShare for the sake of argument (though I'm aware that name is taken), and styled its features after the likes of Slideshare, we would end up with something far more compelling to individual scholars than current institutional offerings.

Note that I'm not being overly dogmatic here.  In my view there are as many routes to open access as there are ways of surfacing content on the Web.  If individual scholars want to do their own thing that's fine by me, provided they do it in a way that ensures their content is at a reasonably persistent URI and is indexed by Google and the like.

This leaves institutions with the problem of picking up the pieces of the multiple ways in which individual scholars choose to surface their scholarly content on the Web.  Well sorry guys... get used to it!

Overall, I don't disagree much with Stu Weibel's take on this.  It's a complex area with lots of competing interests, some rather entrenched.  As Stu notes:

It is still possible that another entirely different model will emerge... more in-the-cloud. A distributed model does seem to complicate curation, (and that institutional reputation thing), but I wouldn't count it out just yet. Still, some institution has to take care of this stuff... responsibility involves the attachement to artifacts, even if they are bitstreams.

February 13, 2008

Repositories thru the looking glass

I spent last week in Melbourne, Australia at the VALA 2008 Conference - my first trip over to Australia and one that I thoroughly enjoyed.  Many thanks to all those locals and non-locals that made me feel so welcome.

I was there, first and foremost, to deliver the opening keynote, using it as a useful opportunity to think and speak about repositories (useful to me at least - you'll have to ask others that were present as to whether it was useful for anyone else).

It strikes me that repositories are of interest not just to those librarians in the academic sector who have direct responsibility for the development and delivery of repository services.  Rather they represent a microcosm of the wider library landscape - a useful case study in the way the Web is evolving, particularly as manifest through Web 2.0 and social networking, and what impact those changes have on the future of libraries, their spaces and their services.

My keynote attempted to touch on many of the issues in this area - issues around the future of metadata standards and library cataloguing practice, issues around ownership, authority and responsibility, issues around the impact of user-generated content, issues around Web 2.0, the Web architecture and the Semantic Web, issues around individual vs. institutional vs. national, vs. international approaches to service provision.

In speaking first I allowed myself the luxury of being a little provocative and, as far as I can tell from subsequent discussion, that approach was well received.  Almost inevitably, I was probably a little too technical for some of the audience.  I'm a techie at heart and a firm believer that it is not possible to form a coherent strategic view in this area without having a good understanding of the underlying technology.  But perhaps I am also a little too keen to inflict my world-view on others. My apologies to anyone who felt lost or confused.

I won't repeat my whole presentation here.  My slides are available from Slideshare and a written paper will become available on the VALA Web site as soon as I get round to sending it to the conference organisers!

I can sum up my talk in three fairly simple bullet points:

  • Firstly, that our current preoccupation with the building and filling of 'repositories' (particularly 'institutional repositories') rather than the act of surfacing scholarly material on the Web means that we are focusing on the means rather than the end (open access).  Worse, we are doing so using language that is not intuitive to the very scholars whose practice we want to influence.
  • Secondly, that our focus on the 'institution' as the home of repository services is not aligned with the social networks used by scholars, meaning that we will find it very difficult to build tools that are compelling to the very people we want to use them.  As a result, we resort to mandates and other forms of coercion in recognition that we have not, so far, built services that people actually want to use.  We have promoted the needs of institutions over the needs of individuals.  Instead, we need to focus on building and/or using global scholarly social networks based on global repository services.  Somewhat oddly, ArXiv (a social repository that predates the Web, let alone Web 2.0) provides us with a good model, especially when combined with features from more recent Web 2.0 services such as Slideshare.
  • Finally, that the 'service oriented' approaches that we have tended to adopt in standards like the OAI-PMH, SRW/SRU and OpenURL sit uncomfortably with the 'resource oriented' approach of the Web architecture and the Semantic Web.  We need to recognise the importance of REST as an architectural style and adopt a 'resource oriented' approach at the technical level when building services.

I'm pretty sure that this last point caused some confusion and is something that Pete or I need to return to in future blog entries.  Suffice to say at this point that adopting a 'resource oriented' approach at the technical level does not mean that one is not interested in 'services' at the business or function level.
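
A concrete (if simplified) illustration of the difference may help.  The sketch below, in Python using the requests library, contrasts a 'service oriented' OAI-PMH interaction, where a single endpoint is driven by verbs and parameters, with a 'resource oriented' interaction, where the item has its own URI that clients simply dereference, negotiating for the representation they want.  The endpoint and identifiers here are invented purely for the purposes of the example.

```python
import requests

# Hypothetical URIs, for illustration only.
OAI_PMH_ENDPOINT = "http://repository.example.org/oai"
ITEM_URI = "http://repository.example.org/id/item/1234"

# 'Service oriented': one endpoint, with the verb and parameters selecting
# the operation (here, fetching a Dublin Core record for a given identifier).
oai_response = requests.get(
    OAI_PMH_ENDPOINT,
    params={
        "verb": "GetRecord",
        "identifier": "oai:repository.example.org:1234",
        "metadataPrefix": "oai_dc",
    },
)
print(oai_response.status_code, oai_response.headers.get("Content-Type"))

# 'Resource oriented': the item is a first-class Web resource; clients GET
# its URI and use content negotiation to ask for the form they need.
rdf_response = requests.get(ITEM_URI, headers={"Accept": "application/rdf+xml"})
print(rdf_response.status_code, rdf_response.headers.get("Content-Type"))
```

The point is less about the particular calls than about the shape of the interaction: in the second case the scholarly item sits directly in the Web, at a URI that other resources can link to, cache and annotate.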

[Image: artwork outside the State Library of Victoria]

January 30, 2008

Learning Materials & FRBR

JISC is currently funding a study, conducted by Phil Barker of JISC CETIS, to survey the requirements for a metadata application profile for learning materials held by digital repositories. Yesterday Phil posted an update on work to date, including a pointer to a (draft) document titled Learning Materials Application Profile Pre-draft Domain Model which 'suggests a "straw man" domain model for use during the project which, hopefully, will prove useful in the analysis of the metadata requirements'.

The document outlines two models: the first is of the operations applied to a learning object (based on the OAIS model) and the second is a (very outline) entity-relationship model for a learning resource - which is based on a subset of the Functional Requirements for Bibliographic Records (FRBR) model. As far as I can recall, this is the first time I've seen the FRBR model applied to the learning object space - though of course at least some of the resources which are considered "learning resources" are also described as bibliographic resources, and I think at least some, if not many, of the functions to be supported by "learning object metadata" are analogous to those to be supported by bibliographic metadata.

I do have some quibbles with the model in the current draft. Without a fuller description of the functions to be supported, it's difficult to assess whether it meets those requirements - though I recognise that, as I think the opening comment I cited above indicates, there's an element of "chicken and egg" involved in this process: you need to have at least an outline set of entity types before you can start talking about operations on instances of those types. Clearly a FRBR-based approach should facilitate interoperability between learning object repositories and systems based on FRBR or on FRBR-derivatives like the Eprints/Scholarly Works Application Profile (SWAP).

I have to admit the way "Context" is modelled at present doesn't look quite right to me, and I'm not sure about the approach of collapsing the concepts of an individual agent and a class of agents into a single "Agent" entity type in the model. (For me the distinguishing characteristic of what the SWAP calls an "Agent" is that, while it encompasses both individuals and groups, an "Agent" is something which acts as a unit, and I'm not sure that applies in the same way to the intended audience for a resource.) The other aspect I was wondering about is the potential requirement to model whole-part relationships, which, AFAICT, are excluded from the current draft version. FRBR supports a range of variant whole-part relations between instances of the principal FRBR entity types, although in the case of the SWAP, I don't think any of them were used.
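
For readers who haven't met FRBR before, the sketch below gives a very rough feel for the kind of entity types being discussed, expressed as Python dataclasses. The names and attributes are mine, purely for illustration, and are not taken from Phil's draft; the sketch also tries to show why an "Agent" (something that acts as a unit) and an intended audience (a class of people) feel like different kinds of thing, and where whole-part relationships might fit.

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative FRBR-style entity types; NOT the draft's actual model.

@dataclass
class Agent:
    """Something that acts as a unit: a person or a group (e.g. an author)."""
    name: str

@dataclass
class AudienceClass:
    """A class of people a resource is intended for; it does not act as a unit."""
    description: str  # e.g. "first-year undergraduate chemists"

@dataclass
class Work:
    """The abstract intellectual creation."""
    title: str
    creators: List[Agent] = field(default_factory=list)
    # Whole-part relationships (here between Works) - the kind of relation
    # apparently excluded from the current draft.
    parts: List["Work"] = field(default_factory=list)

@dataclass
class Expression:
    """A realisation of a Work (e.g. a particular version or translation)."""
    work: Work
    language: str

@dataclass
class Manifestation:
    """An embodiment of an Expression (e.g. a PDF or a SCORM package)."""
    expression: Expression
    format: str

@dataclass
class Item:
    """A single exemplar of a Manifestation (e.g. one copy in a repository)."""
    manifestation: Manifestation
    location: str
```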

But I'm getting ahead of myself here really - and probably ending up sounding more negative than I intend! I think it's a positive development to see members of the "learning metadata community" exploring - critically - the usefulness of a model emerging from the library community. I need to read the draft more carefully and formulate my thoughts more coherently, but I'll be trying to send some comments to Phil.

January 24, 2008

OAI ORE specification roll-out meetings

The OAI ORE project is co-ordinating two open meetings to introduce the (forthcoming) beta versions of the set of specifications which the project has developed to describe aggregations of resources.  (The current alpha versions are available at http://www.openarchives.org/ore/0.1/.)
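
For anyone who hasn't looked at the drafts, the core idea is that an aggregation of Web resources (an article together with its datasets, say) is given a URI of its own and described by a "resource map".  The following is a purely illustrative sketch in Python using rdflib, based on the ORE vocabulary as later standardised at http://www.openarchives.org/ore/terms/ - the alpha drafts may differ in detail, and the URIs below are invented.

```python
from rdflib import Graph, Namespace, URIRef, RDF

ORE = Namespace("http://www.openarchives.org/ore/terms/")

# Invented URIs for illustration: an aggregation (an article plus its
# dataset) and the resource map that describes it.
aggregation = URIRef("http://repository.example.org/aggregation/1234")
resource_map = URIRef("http://repository.example.org/rem/1234")
parts = [
    URIRef("http://repository.example.org/item/1234/article.pdf"),
    URIRef("http://repository.example.org/item/1234/dataset.csv"),
]

g = Graph()
g.bind("ore", ORE)

# The resource map describes the aggregation...
g.add((resource_map, RDF.type, ORE.ResourceMap))
g.add((resource_map, ORE.describes, aggregation))

# ...and the aggregation aggregates its constituent resources.
g.add((aggregation, RDF.type, ORE.Aggregation))
for part in parts:
    g.add((aggregation, ORE.aggregates, part))

print(g.serialize(format="turtle"))
```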

The first meeting is in the USA on 3 March 2008 at Johns Hopkins University, Baltimore, MD. (Press release)

The second meeting is in the UK on 4 April 2008 at the University of Southampton in conjunction with the Open Repositories 2008 conference. (Press release)

Please note that, in both cases, spaces are limited and registration is required.
