I attended part of UKSG earlier this week, listening to three great presentations in the New approaches to research session on Monday afternoon (by Philip Bourne, Cameron Neylon and Bill Russell) and presenting first thing Tuesday morning in the Rethinking 'content' session.
(A problem with my hearing meant that I was very deaf for most of the time, making conversation in the noisy environment rather tiring, so I decided to leave the conference early Tuesday afternoon. Unfortunately, that meant that I didn't get much of an opportunity to network with people. If I missed you, sorry. Looking at the Twitter stream, it also meant that I missed what appear to have been some great presentations on the final day. Shame.)
Anyway, for what it's worth, my slides are below. I was speaking on the theme of 'open, social and linked', something that I've done before, so for regular readers of this blog there probably won't be too much in the way of news.
With respect to the discussion of 'social' and its impact on scholarly communication, there is room for some confusion because 'social' is often taken to mean, "how does one use social media like Facebook, Twitter, etc. to support scholarly communication?". Whilst I accept that as a perfectly sensible question, it isn't quite what I meant in this talk. What I meant was that we need to better understand the drivers for social activity around research and research artefacts, which probably needs breaking down into the various activities that make up the scholarly research workflow/cycle, in order that we can build tools that properly support that social activity. That is something that I don't think we have yet got right, particularly in our provision of repositories. Indeed, as I argued in the talk, our institutional repository architecture is more or less in complete opposition to the social drivers at play in the research space. Anyway... you've heard all this from me before.
Cameron Neylon's talk was probably the best of the ones that I saw and I hope my talk picked up on some of the themes that he was developing. I'm not sure if Cameron's UKSG slides are available yet but there's a very similar set, The gatekeeper is dead, long live the gatekeeper, presented at the STM Innovation Seminar last December. Despite the number of slides, these are very quick to read thru, and very understandable, even in the absence of any audio. On that basis, I won't re-cap them here. Slides 112 onwards give a nice summary: "we are the gatekeepers... enable, don't block... build platforms, not destinations... sell services, not content... don't think about filtering or control... enable discovery". These are strong messages for both the publishing community and libraries. All in all, his points about 'discovery deficit' rather than 'filter failure' felt very compelling to me.
On the final day there were talks about open access and changing subscription models, particularly from 'reader pays' to 'author pays', based partly on the recently released study commissioned by the Research Information Network (RIN), JISC, Research Libraries UK (RLUK), the Publishing Research Consortium (PRC) and the Wellcome Trust, Heading for the open road: costs and benefits of transitions in scholarly communications. We know that the web is disruptive to both publishers and libraries but it seemed to me (from afar) that the discussions at UKSG missed the fact that the web is potentially also disruptive to the process of scholarly communication itself. If all we do is talk about shifting the payment models within the confines of current peer-review process we are missing a trick (at least potentially).
What strikes me as odd, thinking back to that original hand-drawn diagram of the web done by Tim Berners-Lee, is that, while the web has disrupted almost every aspect of our lives to some extent, it has done relatively little to disrupt scholarly communication except in an 'at the margins' kind of way. Why is that the case? My contention is that there is such a significant academic inertia to overcome, coupled with a relatively small and closed 'market', that the momentum of change hasn't yet grown sufficiently - but it will. The web was invented as a scholarly device, yet it has, in many ways, resulted in less transformation there than in most other fields. Strange?
The use of iTunesU by UK universities has come up in discussions a couple of times recently, on Brian Kelly's UK Web Focus blog (What Are UK Universities Doing With iTunesU? and iTunes U: an Institutional Perspective) and on the closed ALT-C discussion list. In both cases, as has been the case in previous discussions, my response has been somewhat cautious, an attitude that always seems to be interpreted as outright hostility for some reason.
So, just for the record, I'm not particularly negative about iTunesU and in some respects I am quite positive - if nothing else, I recognise that the adoption of iTunesU is a very powerful motivator for the generation of openly available content and that has got to be a good thing - but a modicum of scepticism is always healthy in my view (particularly where commercial companies are involved) and I do have a couple of specific concerns about the practicalities of how it is used:
Firstly that students who do not own Apple hardware and/or who choose not to use iTunes on the desktop are not disenfranchised in any way (e.g. by having to use a less functional Web interface). In general, the response to this is that they are not and, in the absence of any specific personal experience either way, I have to concede that to be the case.
Secondly (and related to the first point), that in an environment where most of the emphasis seems to be on the channel (iTunesU) rather than on the content (the podcasts), that confusion isn't introduced as to how material is cited and referred to – i.e. do some lecturers only ever refer to 'finding stuff on iTunesU', while others offer a non-iTunesU Web URL, and others still remember to cite both? I'm interested in whether universities who have adopted iTunesU but who also make the material available in other ways have managed to adopt a single way of citing the material that is on offer?
Both these concerns relate primarily to the use of iTunesU as a distribution channel for teaching and learning content within the institution. They apply much less to its use as an external 'marketing' channel. iTunesU seems to me (based on gut feel more than on any actual numbers) to be a pretty effective way of delivering OER outside the institution and to have a solid 'marketing' win on the back of that. That said, it would be good to have some real numbers as confirmation (note that I don't just mean numbers of downloads here - I mean conversions into 'actions' (new students, new research opps, etc.)). Note that I also don't consider 'marketing' to be a dirty word (in this context) - actually, I guess this kind of marketing is going to become increasingly important to everyone in the HE sector.
There is a wider, largely religious, argument about whether "if you are not paying for it, you aren't the customer, you are part of the product" but HE has been part of the MS product for a long while now and, worse, we have paid for the privilege – so there is nothing particularly new there. It's not an argument that particularly bothers me one way or the other, provided that universities have their eyes open and understand the risks as well as the benefits. In general, I'm sure that they do.
On the other hand, while somebody always owns the channel, some channels seem to me to be more 'open' (I don't really want to use the word 'open' here because it is so emotive but I can't think of a better one) than others. So, for example, I think there are differences in an institution adopting YouTube as a channel as compared with adopting iTunesU as a channel and those differences are largely to do with the fit that YouTube has with the way the majority of the Web works.
As mentioned previously, I spoke at the FAM10 conference in Cardiff last week, standing in for another speaker who couldn't make it and using material crowdsourced from my previous post, Key trends in education - a crowdsource request, to inform some of what I was talking about. The slides and video from my talk follow:
As it turns out, describing the key trends is much easier than thinking about their impact on federated access management - I suppose I should have spotted this in advance - so the tail end of the talk gets rather weak and wishy-washy. And you may disagree with my interpretation of the key trends anyway. But in case it is useful, here's a summary of what I talked about. Thanks to those of you who contributed comments on my previous post.
By way of preface, it seems to me that the core working assumptions of the UK Federation have been with us for a long time - like, at least 10 years or so - essentially going back to the days of the centrally-funded Athens service. Yet over those 10 years the Internet has changed in almost every respect. Ignoring the question of whether those working assumptions still make sense today, I think it certainly makes sense to ask ourselves about what is coming down the line and whether our assumptions are likely to still make sense over the next 5 years or so. Furthermore, I would argue that federated access management as we see it today in education, i.e. as manifested thru our use of SAML, shows a rather uncomfortable fit with the wider (social) web that we see growing up around us.
And so... to the trends...
The most obvious trend is the current financial climate, which won't be with us for ever of course, but which is likely to cause various changes while it lasts and where the consequences of those changes, university funding for example, may well be with us much longer than the current crisis. In terms of access management, one impact of the current belt-tightening is that making a proper 'business case' for various kinds of activities, both within institutions and nationally, will likely become much more important. In my talk, I noted that submissions to the UCISA Award for Excellence (which we sponsor) often carry no information about staff costs, despite an explicit request in the instructions to entrants to indicate both costs and benefits. My point is not that institutions are necessarily making the wrong decisions currently but that the basis for those decisions, in terms of cost/benefit analysis, will probably have to become somewhat more rigorous than has been the case to date. Ditto for the provision of national solutions like the UK Federation.
More generally, one might argue that growing financial pressure will encourage HE institutions into behaving more and more like 'enterprises'. My personal view is that this will be pretty strongly resisted, by academics at least, but it may have some impact on how institutions think about themselves.
Secondly, there is the related trend towards outsourcing and shared services, with the outsourcing of email and other apps to Google being the most obvious example. Currently that is happening most commonly with student email but I see no reason why it won't spread to staff email as well in due course. At the point that an institution has outsourced all its email to Google, can one assume that it has also outsourced at least part of its 'identity' infrastructure as well? So, for example, at the moment we typically see SAML call-backs being used to integrate Google mail back into institutional 'identity' and 'access management' systems (you sign into Google using your institutional account) but one could imagine this flipping around such that access to internal systems is controlled via Google - a 'log in with Google' button on the VLE for example. Eric Sachs, of Google, has recently written about OpenID in the Enterprise SaaS market, endorsing this view of Google as an outsourced identity provider.
Thirdly, there is the whole issue of student expectations. I didn't want to talk to this in detail but it seems obvious that an increasingly 'open' mashed and mashable experience is now the norm for all of us - and that will apply as much to the educational content we use and make available as it does to everything else. Further, the mashable experience is at least as much about being able to carry our identities relatively seamlessly across services as it is about the content. Again, it seems unclear to me that SAML fits well into this kind of world.
There are two other areas where our expectations and reality show something of a mismatch. Firstly, our tightly controlled, somewhat rigid approach to access management and security is at odds with the rather fuzzy (or at least fuzzily interpreted) licences negotiated by Eduserv and JISC Collections for the external content to which we have access. And secondly, our over-arching sense of the need for user privacy (the need to prevent publishers from cross-referencing accesses to different resources by the same user for example) is holding back the development of personalised services and runs somewhat counter to the kinds of things we see happening in mainstream services.
Fourthly, there's the whole growth of mobile - the use of smart-phones, mobile handsets, iPhones, iPads and the rest of it - and the extent to which our access management infrastructure works (or not) in that kind of 'app'-based environment.
Then there is the 'open' agenda, which carries various aspects to it - open source, open access, open science, and open educational resources. It seems to me that the open access movement cuts right to the heart of the primary use-case for federated access management, i.e. controlling access to published scholarly literature. But, less directly, the open science movement, in part, pushes researchers towards the use of more open 'social' web services for their scholarly communication where SAML is not typically the primary mechanism used to control access.
Similarly, the emerging personal learning environment (PLE) meme (a favourite of educational conferences currently), where lecturers and students work around their institutional VLE by choosing to use a mix of external social web services (Flickr, Blogger, Twitter, etc.) again encourages the use of external services that are not impacted by our choices around the identity and access management infrastructure and over which we have little or no control. I was somewhat sceptical about the reality of the PLE idea until recently. My son started at the City of Bath College - his letter of introduction suggested that he create himself a Google Docs account so that he could do his work there and submit it using email or Facebook. I doubt this is college policy but it was a genuine example of the PLE in practice so perhaps my scepticism is misplaced.
We also have the changing nature of the relationship between students and institutions - an increasingly mobile and transitory student body, growing disaggregation between the delivery of learning and accreditation, a push towards overseas students (largely for financial reasons), and increasing collaboration between institutions (both for teaching and research) - all of which have an impact on how students see their relationship with the institution (or institutions) with whom they have to deal. Will the notion of a mandated 3 or 4 year institutional email account still make sense for all (or even most) students in 5 or 10 years' time?
In a similar way, there's the changing customer base for publishers of academic content to deal with. At the Eduserv Symposium last year, for example, David Smith of CABI described how they now find that having exposed much of their content for discovery via Google they have to deal with accesses from individuals who are not affiliated with any institution but who are willing to pay for access to specific papers. Their access management infrastructure has to cope with a growing range of access methods that sit outside the 'educational' space. What impact does this have on their incentives for conforming to education-only norms?
And finally there's the issue of usability, and particularly the 'where are you from' discovery problem. Our traditional approach to this kind of problem is to build a portal and try and control how the user gets to stuff, such that we can generate 'special' URLs that get them to their chosen content in such a way that they can be directed back to us seamlessly in order to login. I hate portals, at least insofar as they have become an architectural solution, so the less said the better. As I said in my talk, WAYFless URLs are an abomination in architectural terms, saved only by the fact that they work currently. In my presentation I played up the alternative usability work that the Kantara ULX group have been doing in this area, which it seems to me is significantly better than what has gone before. But I learned at the conference that Shibboleth and the UK WAYF service have both also been doing work in this area - so that is good. My worry though is that this will remain an unsolvable problem, given the architecture we are presented with. (I hope I'm wrong but that is my worry). As a counterpoint, in the more... err... mainstream world we are seeing a move towards what I call the 'First Bus' solution (on the basis that in many UK cities you only see buses run by the First Group (despite the fact that bus companies are supposed to operate in a free market)) where you only see buttons to log in using Google, Facebook and one or two others.
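For what it's worth, the mechanics of a WAYFless URL are simple enough to sketch. The example below follows the common Shibboleth SP session-initiator convention (a Login endpoint taking `entityID` and `target` parameters); the hostnames and paths are hypothetical, and real providers vary in the exact URL shape they expect:

```python
from urllib.parse import urlencode

def wayfless_url(sp_login_endpoint, idp_entity_id, target_resource):
    """Build a WAYFless URL: a link that pre-selects the user's IdP,
    skipping the 'where are you from' step entirely."""
    query = urlencode({
        "entityID": idp_entity_id,   # the home institution's IdP
        "target": target_resource,   # the content the user actually wants
    })
    return f"{sp_login_endpoint}?{query}"

url = wayfless_url(
    "https://content.example.com/Shibboleth.sso/Login",  # hypothetical SP
    "https://idp.example.ac.uk/shibboleth",              # hypothetical IdP entityID
    "https://content.example.com/journal/article123",
)
```

Which perhaps illustrates why they feel architecturally awkward: the choice of IdP is baked into every individual link, rather than being resolved by the infrastructure itself.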
I'm not suggesting that this is the right solution - just noting that it is one strategy for dealing with an otherwise difficult usability problem.
Note that we are also seeing some consolidation around technology - notably OpenID and OAuth - though often in ways that hide it from public view (e.g. hidden behind a 'login with Google' or 'login with Facebook' button).
Which essentially brings me to my concluding screen - you know, the one where I talk about all the implications of the trends above - which is where I have less to say than I should! Here's the text more-or-less copy-and-pasted from my final slide:
‘education’ is a relatively small fish in a big pond (and therefore can't expect to drive the agenda)
mainstream approaches will win (in the end) - ignoring the difficult question of defining what is mainstream
the current financial climate will have an effect somewhere
HE institutions are probably becoming more enterprise-like but they are still not totally like commercial organisations and they tend to occupy an uncomfortable space between the ‘enterprise’ and the ‘social web’ driven by different business needs (cf. the finance system vs PLEs and open science)
the relationships between students (and staff) and institutions are changing
In his opening talk at FAM10 the day before, David Harrison had urged the audience to become leaders in the area of federated access management. In a sense I want the same. But I also want us, as a community, to become followers - to accept that things happen outside our control and to stop fighting against them the whole time.
Unfortunately, that's a harder rallying call to make!
Your comments on any/all of the above are very much welcomed.
Last week I attended an invite-only meeting at the JISC offices in London, notionally entitled a "JISC IE Technical Review" but in reality a kind of technical advisory group for the JISC and RLUK Resource Discovery Taskforce Vision [PDF], about which the background blurb says:
The JISC and RLUK Resource Discovery Taskforce was formed to focus on defining the requirements for the provision of a shared UK resource discovery infrastructure to support research and learning, to which libraries, archives, museums and other resource providers can contribute open metadata for access and reuse.
The morning session felt slightly weird (to me), a strange time-warp back to the kinds of discussions we had a lot of as the UK moved from the eLib Programme, thru the DNER (briefly) into what became known (in the UK) as the JISC Information Environment - discussions about collections and aggregations and metadata harvesting and ... well, you get the idea.
In the afternoon we were split into breakout groups and I ended up in the one tasked with answering the question "how do we make better websites in the areas covered by the Resource Discovery Taskforce?", a slightly strange question now I look at it but one that was intended to stimulate some pragmatic discussion about what content providers might actually do.
Our group started from the principles of Linked Data - assign 'http' URIs to everything of interest, serve useful content (both human-readable and machine-processable (structured according to the RDF model)) at those URIs, and create lots of links between stuff (internal to particular collections, across collections and to other stuff). OK... we got slightly more detailed than that but it was a fairly straight-forward view that Linked Data would help and was the right direction to go in. (Actually, there was a strongly expressed view that simply creating 'http' URIs for everything and exposing human-readable content at those URIs would be a huge step forward).
Then we had a discussion about what the barriers to adoption might be - the problems of getting buy-in from vendors and senior management, the need to cope with a non-obvious business model (particularly in the current economic climate), the lack of technical expertise (not to mention semantic expertise) in parts of those sectors, the endless discussions that might take place about how to model the data in RDF, the general perception that Semantic Web is permanently just over the horizon and so on.
And, in response, we considered the kinds of steps that JISC (and its partners) might have to undertake to build any kind of political momentum around this idea.
To cut a long story short, we more-or-less convinced ourselves out of a purist Linked Data approach as a way forward, instead preferring a 4 layer model of adoption, with increasing levels of semantic richness and machine-processability at each stage:
1. expose data openly in any format available (.csv files, HTML pages, MARC records, etc.)
2. assign 'http' URIs to things of interest in the data, expose it in any format available (.csv files, HTML pages, etc.) and serve useful content at each URI
3. assign 'http' URIs to things of interest in the data, expose it as XML and serve useful content at each URI
4. assign 'http' URIs to things of interest in the data and expose Linked Data (as per the discussion above).
These would not be presented as steps to go thru (do 1, then 2, then 3, ...) but as alternatives with increasing levels of semantic value. Good practice guidance would encourage the adoption of option 4, laying out the benefits of such an approach, but the alternatives would provide lower barriers to adoption and offer a simpler 'sell' politically.
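To make the levels concrete, here's a minimal sketch (Python, standard library only) of the same catalogue record at the lowest and highest levels. The URIs, namespace and author link are all illustrative assumptions on my part, not recommendations:

```python
# The same catalogue record at level 1 (any available format) and level 4
# (Linked Data). All URIs and vocabulary choices below are illustrative only.

record = {"id": "b1234", "title": "On the Origin of Species", "creator": "Charles Darwin"}

# Level 1: expose the data openly in whatever format is to hand (here, CSV).
csv_line = ",".join(record[k] for k in ("id", "title", "creator"))

# Levels 2 and 3: additionally assign an 'http' URI to the thing the record
# describes, and serve useful content at that URI.
book_uri = f"http://books.example.ac.uk/id/{record['id']}"  # hypothetical namespace

# Level 4: expose Linked Data - RDF triples (serialised here as N-Triples),
# using Dublin Core terms and linking out to other stuff.
author_uri = "http://people.example.org/charles-darwin"     # hypothetical link target
triples = [
    (book_uri, "http://purl.org/dc/terms/title", '"' + record["title"] + '"'),
    (book_uri, "http://purl.org/dc/terms/creator", author_uri),
]

def as_term(obj):
    # Literals are already quoted; URIs get angle brackets.
    return obj if obj.startswith('"') else "<" + obj + ">"

ntriples = "\n".join(f"<{s}> <{p}> {as_term(o)} ." for s, p, o in triples)
```

The jump in value between the first and last lines is the point: the CSV carries the facts, but only the triples carry the links.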
The heterogeneity of data being exposed would leave a significant implementation challenge for the aggregation services attempting to make use of it and the JISC (and partners) would have to fund some pretty convincing demonstrators of what might usefully be achieved.
One might characterise these approaches as 'data.glam.uk' (echoing 'data.gov.uk' but where 'glam' is short for 'galleries, libraries, archives and museums') and/or Digital UK (echoing the pragmatic approaches being successfully adopted by the Digital NZ activity in New Zealand).
Despite my reservations about the morning session, the day ended up being quite a useful discussion. That said, I remain somewhat uncomfortable with its outcomes. I'm a purist at heart and the 4 levels above are anything but pure. To make matters worse, I'm not even sure that they are pragmatic. The danger is that people will adopt only the lowest, least semantic, option and think they've done what they need to do - something that I think we are seeing some evidence of happening within data.gov.uk.
Perhaps even more worryingly, having now stepped back from the immediate talking-points of the meeting itself, I'm not actually sure we are addressing a real user need here any more - the world is so different now than it was when we first started having conversations about exposing cultural heritage collections on the Web, particularly library collections - conversations that essentially pre-dated Google, Google Scholar, Amazon, WorldCat, CrossRef, ... the list goes on. Do people still get agitated by, for example, the 'book discovery' problem in the way they did way back then? I'm not sure... but I don't think I do. At the very least, the book 'discovery' problem has largely become an 'appropriate copy' problem - at least for most people? Well, actually, let's face it... for most people the book 'discovery' and 'appropriate copy' problems have been solved by Amazon!
I also find the co-location of libraries, museums and archives, in the context of this particular discussion, rather uncomfortable. If anything, this grouping serves only to prolong the discussion and put off any decision making?
Overall then, I left the meeting feeling somewhat bemused about where this current activity has come from and where it is likely to go.
In a step towards openness, the UK has opened up its data to be interoperable with the Attribution Only license (CC BY). The National Archives, a department responsible for “setting standards and supporting innovation in information and records management across the UK,” has realigned the terms and conditions of data.gov.uk to accommodate this shift. Data.gov.uk is “an online point of access for government-held non-personal data.” All content on the site is now available for reuse under CC BY. This step expresses the UK’s commitment to opening its data, as they work towards a Creative Commons model that is more open than their former Click-Use Licenses.
This feels like a very significant move - and one that I hadn't fully appreciated in the recent buzz around data.gov.uk.
Jane Park ends her piece by suggesting that "the UK as well as other governments move in the future towards even fuller openness and the preferred standard for open data via CC Zero". Indeed, I'm left wondering about the current move towards CC-BY in relation to the work undertaken a while back by Talis to develop the Open Data Commons Public Domain Dedication and Licence.
In general factual data does not convey any copyrights, but it may be subject to other rights such as trade mark or, in many jurisdictions, database right. Because factual data is not usually subject to copyright, the standard Creative Commons licenses are not applicable: you can’t grant the exclusive right to copy the facts if that right isn’t yours to give. It also means you cannot add conditions such as share-alike.
Waivers, on the other hand, are a voluntary relinquishment of a right. If you waive your exclusive copyright over a work then you are explicitly allowing other people to copy it and you will have no claim over their use of it in that way. It gives users of your work huge freedom and confidence that they will not be pursued for license fees in the future.
Ian Davis' post gives detailed technical information about how such waivers can be used.
We also used the 'machine-readable' phrase in the original DNER Technical Architecture, the work that went on to underpin the JISC Information Environment, though I think we went on to use both 'machine-understandable' and 'machine-processable' in later work (both even more of a mouthful), usually with reference to what we loosely called 'metadata'. We also used 'm2m - machine to machine' a lot, a phrase introduced by Lorcan Dempsey I think. Remember that this was back in 2001, well before the time when the idea of offering an open API had become as widespread as it is today.
All these terms suffer, it seems to me, from emphasising the 'readability' and 'processability' of data over its 'linkedness'. Linkedness is what makes the Web what it is. With hindsight, the major thing that our work on the JISC Information Environment got wrong was to play down the importance of the Web, in favour of a set of digital library standards that focused on sharing 'machine-readable' content for re-use by other bits of software.
Looking at things from the perspective of today, the terms 'Linked Data' and 'Web of Data' both play up the value in content being inter-linked as well as it being what we might call machine-readable.
For example, if we think about open access scholarly communication, the JISC Information Environment (in line with digital libraries more generally) promotes the sharing of content largely through the harvesting of simple DC metadata records, each of which typically contains a link to a PDF copy of the research paper, which, in turn, carries only human-readable citations to other papers. The DC part of this is certainly MRD (machine-readable data)... but, overall, the result isn't very inter-linked or Web-like. How much better would it have been to focus some effort on getting more Web links between papers embedded into the papers themselves - using what we would now loosely call a 'microformat'? One of the reasons I like some of the initiatives around the DOI (though I don't like the DOI much as a technology), CrossRef springs to mind, is that they potentially enable a world where we have the chance of real, solid, persistent Web links between scholarly papers.
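As a sketch of what 'Web links between papers embedded in the papers themselves' might look like in practice, the fragment below pulls DOI-based citation links out of a paper's HTML using only the Python standard library. The `class="citation"` convention and the example DOIs are purely illustrative assumptions, not an established microformat:

```python
from html.parser import HTMLParser

class CitationLinkExtractor(HTMLParser):
    """Collect DOIs from citation links in a paper's HTML - the sort of
    machine-followable inter-linking argued for above."""
    def __init__(self):
        super().__init__()
        self.dois = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "a" and "citation" in a.get("class", "").split():
            href = a.get("href", "")
            if href.startswith("https://doi.org/"):
                self.dois.append(href.removeprefix("https://doi.org/"))

# Hypothetical fragment of a paper's HTML, with citations marked up as links.
paper_html = """
<p>As shown previously
  <a class="citation" href="https://doi.org/10.1000/example.1">[1]</a>,
  and confirmed by
  <a class="citation" href="https://doi.org/10.1000/example.2">[2]</a> ...</p>
"""

parser = CitationLinkExtractor()
parser.feed(paper_html)
# parser.dois now holds the cited DOIs, resolvable via the DOI infrastructure
```

The point being that once each citation carries a resolvable identifier in the document itself, anyone can build the citation graph directly from the papers, with no separate metadata harvesting step required.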
RDF, of course, offers the possibility of machine-readability, machine-processable semantics, and links to other content - which is why it is so important and powerful and why initiatives like data.gov.uk need to go beyond the CSV and XML files of this world (which some people argue are good enough) and get stuff converted into RDF form.
As an aside, DCMI have done some interesting work on Interoperability Levels for Dublin Core Metadata. While this work is somewhat specific to DC metadata I think it has some ideas that could be usefully translated into the more general language of the Semantic Web and Linked Data (and probably to the notions of the Web of Data and MRD).
Mike, I think, would probably argue that this is all the musing of a 'purist' and that purists should be ignored - and he might well be right. I certainly agree with the main thrust of the presentation that we need to 'set our data free', that any form of MRD is better than no MRD at all, and that any API is better than no API. But we also need to remember that it is fundamentally the hyperlink that has made the Web what it is and that those forms of MRD that will be of most value to us will be those, like RDF, that strongly promote the linkability of content, not just to other content but to concepts and people and places and everything else.
The labels 'Linked Data' and 'Web of Data' are both helpful in reminding us of that.
As I'm sure everyone knows by now, the UK Government's data.gov.uk site was formally launched yesterday to a significant fanfare on Twitter and elsewhere. There's not much I can add other than to note that I think this initiative is a very good thing and I hope that we can contribute more in the future than we have done to date.
In truth, I’ve been waiting for Joe Bloggs on the street to mention in passing – “Hey, just yesterday I did ‘x’ online” and have it be one of those new ‘Services’ that has been developed from the release of our data. (Note: A Joe Bloggs who is not related to Government or those who encircle Government. A real true independent Citizen.)
It may be a long wait.
The reality is that releasing the data is a small step in a long walk that will take many years to see any significant value. Sure there will be quick wins along the way – picking on MP’s expenses is easy. But to build something sustainable, some series of things that serve millions of people directly, will not happen overnight. And the reality is, as Tom Loosemore pointed out at the London Data Store launch, that it won’t be a sole developer who ultimately brings it to fruition.
Sir Tim said ordinary citizens will be able to use the data in conjunction with Ordnance Survey maps to show the exact location of road works that are completely unnecessary and are only being carried out so that some lazy, stupid bastard with a pension the size of Canada can use up his budget before the end of March.
The information could also be used to identify Britain's oldest pothole, how much business it has generated for its local garage and why in the name of holy buggering fuck it has never, ever been fixed.
And, while we are on the subject of maps and so on, today's posting to the Ernest Marples Blog, Postcode Petition Response — Our Reply, makes for an interesting read about the government's somewhat un-joined-up response to a petition to "encourage the Royal Mail to offer a free postcode database to non-profit and community websites":
The problem is that the licence was formed to suit industry. To suit people who resell PAF data, and who use it to save money and do business. And that’s fine — I have no problem with industry, commercialism or using public data to make a profit.
But this approach belongs to a different age. One where the only people who needed postcode data were insurance and fulfilment companies. Where postcode data was abstruse and obscure. We’re not in that age any more.
We’re now in an age where a motivated person with a laptop can use postcode data to improve people’s lives. Postcomm and the Royal Mail need to confront this and change the way that they do things. They may have shut us down, but if they try to sue everyone who’s scraping postcode data from Google, they’ll look very foolish indeed.
Finally — and perhaps most importantly — we need a consistent and effective push from the top. Number 10’s right hand needs to wake up and pay attention to the fantastic things the left hand’s doing.
1. Public data will be published in reusable, machine-readable form
2. Public data will be available and easy to find through a single easy to use online access point (http://www.data.gov.uk/)
3. Public data will be published using open standards and following the recommendations of the World Wide Web Consortium
4. Any 'raw' dataset will be represented in linked data form
5. More public data will be released under an open licence which enables free reuse, including commercial reuse
6. Data underlying the Government's own websites will be published in reusable form for others to use
7. Personal, classified, commercially sensitive and third-party data will continue to be protected.
(Bullet point numbers added by me.)
I'm assuming that "linked data" in point 4 actually means "Linked Data", given reference to W3C recommendations in point 3.
There's also a slight tension between points 4 and 5, if only because the use of the phrase, "more public data will be released under an open licence", in point 5 implies that some of the linked data made available as a result of point 4 will be released under a closed licence. One can argue about whether that breaks the 'rules' of Linked Data but it seems to me that it certainly runs counter to the spirit of both Linked Data and what the government says it is trying to do here.
That's a pretty minor point though and, overall, this is a welcome set of principles.
Linked Data, of course, implies URIs and good practice suggests Cool URIs as the basic underlying principle of everything that will be built here. This applies to all government content on the Web, not just to the data being exposed thru this particular initiative. One of the most common forms of uncool URI to be found on the Web in government circles is the technology-specific .aspx suffix... hey, I work for an organisation that has historically provided the technology to mint a great deal of these (though I think we do a better job now). It's worth noting, for example, that the two URIs that I use above to cite the Putting the Frontline First document both end in .aspx - ironic huh?
I'm not suggesting that cool URIs are easy, but there are some easy wins and getting the message across about not embedding technology into URIs is one of the easier ones... or so it seems to me anyway.
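To make the "easy win" concrete, here is a trivial sketch of the kind of check I have in mind. The suffix list and example URIs are entirely made up for illustration; the point is simply that a technology-specific file extension in a URI is mechanically detectable, and therefore avoidable:

```python
from urllib.parse import urlparse

# File extensions that leak implementation technology into a URI.
# This list is illustrative, not exhaustive.
TECHNOLOGY_SUFFIXES = (".aspx", ".php", ".jsp", ".cgi", ".do")

def is_cool(uri: str) -> bool:
    """Return False if the URI path ends in a technology-specific suffix."""
    path = urlparse(uri).path.lower()
    return not path.endswith(TECHNOLOGY_SUFFIXES)

print(is_cool("http://example.gov.uk/reports/putting-frontline-first"))  # True
print(is_cool("http://example.gov.uk/reports/view.aspx"))                # False
```

Of course, the hard part of cool URIs is the commitment to keep them stable over time, not the spelling; but not minting the uncool ones in the first place costs almost nothing.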
My talk didn't go too well to be honest, partly because I was on last and we were over-running so I felt a little rushed but more because I'd cut the previous set of slides down from 119 to 6 (4 really!) - don't bother looking at the slides, they are just images - which meant that I struggled to deliver a very coherent message. I looked at the most significant environmental changes that have occurred since we first started thinking about the JISC IE almost 10 years ago. The resulting points were largely the same as those I have made previously (listen to the Trento presentation) but with a slightly preservation-related angle:
the rise of social networks and the read/write Web, and a growth in resident-like behaviour, means that 'digital identity' and the identification of people have become more obviously important and will remain an important component of provenance information for preservation purposes into the future;
Linked Data (and the URI-based resource-oriented approach that goes with it) is conspicuous by its absence in much of our current digital library thinking;
scholarly communication is increasingly diffusing across formal and informal services both inside and outside our institutional boundaries (think blogging, Twitter or Google Wave for example) and this has significant implications for preservation strategies.
That's what I thought I was arguing anyway!
I also touched on issues around the growth of the 'open access' agenda, though looking at it now I'm not sure why because that feels like a somewhat orthogonal issue.
Anyway... the middle bullet has to do with being mainstream vs. being niche. (The previous speaker, who gave an interesting talk about MyExperiment and its use of Linked Data, made a similar point). I'm not sure one can really describe Linked Data as being mainstream yet, but one of the things I like about the Web Architecture and REST in particular is that they describe architectural approaches that have proven to be hugely successful, i.e. they describe the Web. Linked Data, it seems to me, builds on these in very helpful ways. I said that digital library developments often prove to be too niche - that they don't have mainstream impact. Another way of putting that is that digital library activities don't spend enough time looking at what is going on in the wider environment. In other contexts, I've argued that "the only good long-term identifier, is a good short-term identifier" and I wonder if that principle can and should be applied more widely. If you are doing things on a Web-scale, then the whole Web has an interest in solving any problems - be that around preservation or anything else. If you invent a technical solution that only touches on scholarly communication (for example) who is going to care about it in 50 or 100 years? Answer: not all that many people.
It worries me, for example, when I see an architectural diagram (as was shown yesterday) which has channels labelled 'OAI-PMH', 'XML' and 'the Web'!
After my talk, Chris Rusbridge asked me if we should just get rid of the JISC IE architecture diagram. I responded that I am happy to do so (though I quipped that I'd like there to be an archival copy somewhere). But on the train home I couldn't help but wonder if that misses the point. The diagram is neither here nor there, it's the "service-oriented, we can build it all", mentality that it encapsulates that is the real problem.
great technology panorama painted by @billt in closing talk at #cetis09
And it was a great panorama - broad, interesting and entertainingly delivered. It was a good performance and I am hugely in awe of people who can give this kind of presentation. However, what the talk didn't do was move from the "this is where technology has come from, this is where it is now and this is where it is going" kind of stuff to the "and this is what it means for education in the future". Which was a shame because in questioning after his talk Thompson did make some suggestions about the future of print news media (not surprising for someone now at the BBC) and I wanted to hear similar views about the future of teaching, learning and research.
As Oleg Liber pointed out in his question after the talk, universities, and the whole education system around them, are lumbering beasts that will be very slow to change in the face of anything. On that basis, whilst it is interesting to note (for example) that we can now just about store a bit on an atom (meaning that we could potentially store a digital version of all human output on something the weight of a human body), that we can pretty much wire things directly into the human retina, and that Africa will one day overtake 'digital' Britain in the broadband stakes, there comes a "so what?" moment where one is left wondering what it all actually means.
As an aside, and on a more personal note, I suggest that my daughter's experience of university (she started at Sheffield Hallam in September) is not actually going to be very different to my own, 30-odd years ago. Lectures don't seem to have changed much. Project work doesn't seem to have changed much. Going out drinking doesn't seem to have changed much. She did meet all her 'hall' flat-mates via Facebook before she arrived in Sheffield I suppose :-) - something I never had the opportunity to do (actually, I never even got a place in hall). There is a big difference in how it is all paid for of course but the interesting question is how different university will be for her children. If the truth is, "not much", then I'm not sure why we are all bothering.
At one point, just after the bit about storing a digital version of all human output I think, Thompson did throw out the question, "...and what does that mean for copyright law?". He didn't give us an answer. Well, I don't know either to be honest... though it doesn't change the fact that creative people need to be rewarded in some way for their endeavours I guess. But the real point here is that the panorama of technological change that Thompson painted for us, interesting as it was, begs some serious thinking about what the future holds. Maybe Thompson was right to lay out the panorama and leave the serious thinking to us?
He was surprisingly positive about Linked Data, suggesting that the time is now right for this to have a significant impact. I won't disagree because I've been making the same point myself in various fora, though I tend not to shout it too loudly because I know that the Semantic Web has a history of not quite making it. Indeed, the two parallel sessions that I attended during the conference, University API and the Giant Global Graph both focused quite heavily on the kinds of resources that universities are sitting on (courses, people/expertise, research data, publications, physical facilities, events and so on) that might usefully be exposed to others in some kind of 'open' fashion. And much of the debate, particularly in the second session (about which there are now some notes), was around whether Linked Data (i.e. RDF) is the best way to do this - a debate that we've also seen played out recently on the uk-government-data-developers Google Group.
The three primary issues seemed to be:
Why should we (universities) invest time and money exposing our data in the hope that people will do something useful/interesting/of value with it when we have many other competing demands on our limited resources?
Why should we take the trouble to expose RDF when it's arguably easier for both the owner and the consumer of the data to expose something simpler like a CSV file?
Why can't the same ends be achieved by offering one or more services (i.e. a set of one or more APIs) rather than the raw data itself?
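The second question is worth making concrete. Here is a rough sketch of the same (invented) record as a CSV row and as RDF in Turtle. All the URIs except the FOAF one are hypothetical, minted purely for illustration; the point is that in the RDF version 'University of Bath' stops being an opaque string and becomes a link that others can reuse and say more about:

```python
import csv
import io

# A minimal record as it might appear in a CSV dump (hypothetical data).
csv_data = "name,institution\nJ. Bloggs,University of Bath\n"
row = next(csv.DictReader(io.StringIO(csv_data)))

# The same record expressed as RDF triples in Turtle. The example.org URIs
# are invented; only foaf:name is a real, widely used vocabulary term.
turtle = f"""\
<http://example.org/person/j-bloggs>
    <http://xmlns.com/foaf/0.1/name> "{row['name']}" ;
    <http://example.org/terms/institution> <http://example.org/org/university-of-bath> .
"""
print(turtle)
```

The CSV is undeniably easier to produce and to consume in a spreadsheet; what the RDF buys you is that both the person and the institution are now addressable things on the Web, rather than strings whose meaning lives in a column header.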
In the ensuing debate about the why and the how, there was a strong undercurrent of, "two years ago SOA was all the rage, now Linked Data is all the rage... this is just a fashion thing and in two years time there'll be something else". I'm not sure that we (or at least I) have a well honed argument against this view but, for me at least, it lies somewhere in the fit with resource-orientation, with the way the Web works, with REST, and with the Web Architecture.
On the issue of the length of time it is taking for the Semantic Web to have any kind of mainstream impact, Ian Davis has an interesting post, Make or Break for the Semantic Web?, arguing that this is not unusual for standards track work:
Technology, especially standards track work, takes years to cross the chasm from early adopters (the technology enthusiasts and visionaries) to the early majority (the pragmatists). And when I say years, I mean years. Take CSS for example. I’d characterise CSS as having crossed the chasm and it’s being used by the early majority and making inroads into the late majority. I don’t think anyone would seriously argue that CSS is not here to stay.
According to this semi-official history of CSS the first proposal was in 1994, about 13 years ago. The first version that was recognisably the CSS we use today was CSS1, issued by the W3C in December 1996. This was followed by CSS2 in 1998, the year that also saw the founding of the Web Standards Project. CSS 2.1 is still under development, along with portions of CSS3.
Paul Walk has also written an interesting post, Linked, Open, Semantic?, in which he argues that our discussions around the Semantic Web and Linked Data tend to mix up three memes (open data, linked data and semantics) in rather unhelpful ways. I tend to agree, though I worry that Paul's proposed distinction between Linked Data and the Semantic Web is actually rather fuzzier than we may like.
On balance, I feel a little uncomfortable that I am not able to offer a better argument against the kinds of anti-Linked Data views expressed above. I think I understand the issues (or at least some of them) pretty well but I don't have them to hand in a kind of this is why Linked Data is the right way forward 'elevator pitch'.
Something to work on I guess!
[Image: a slide from Bill Thompson's closing keynote to the CETIS 2009 Conference]
My talk was entitled "Open, social and linked - what do current Web trends tell us about the future of digital libraries?" and I've been holding off blogging about it or sharing my slides because I was hoping to create a slidecast of them. Well... I finally got round to it and here is the result:
Like any 'live' talk, there are bits where I don't get my point across quite as I would have liked but I've left things exactly as they came out when I recorded it. I particularly like my use of "these are all very bog standard... err... standards"! :-)
Towards the end, I refer to David White's 'visitors vs. residents' stuff, about which I note he has just published a video. Nice one.
Anyway... the talk captures a number of threads that I've been thinking and speaking about for the last while. I hope it is of interest.
One assumes that this kind of thing will become much more common at universities over the next few years.
Having had a very quick look, it feels like the material is more descriptive than prescriptive - which isn't meant as a negative comment, it just reflects the current state of play. The section on Data documentation & metadata for example, gives advice as simple as:
Have you created a "readme.txt" file to describe the contents of files in a folder? Such a simple act can be invaluable at a later date.
but also provides a link to the UK Data Archive's guidance on Data Documentation and Metadata, which at first sight appears hugely complex. I'm not sure what your average researcher will make of it.
(In passing, I note that the UKDA seem to be promoting the use of the Data Documentation Initiative standard at what they call the 'catalogue' level, a standard that I've not come across before but one that appears to be rooted firmly outside the world of linked data, which is a shame.)
Similarly, the section on Methods for data sharing lists a wide range of possible options (from "posting on a University website" thru to "depositing in a data repository") without being particularly prescriptive about which is better and why.
(As a second aside, I am continually amazed by this firm distinction in the repository world between 'posting on the website' and 'depositing in a repository' - from the perspective of the researcher, both can, and should, achieve the same aims, i.e. improved management, more chance of persistence and better exposure.)
As we have found with repositories of research publications, it seems to me that research data repositories (the Edinburgh DataShare in this case) need to hide much of this kind of complexity, and do most of the necessary legwork, in order to turn what appears to be a simple and obvious 'content management' workflow (from the point of view of the individual researcher) into a well managed, openly shared, long term resource for the community.
I have a minor quibble with the way the data has been presented in the report, in that it's not overly clear how the 179 respondents represented in Figure 1 have been split across the three broad areas (Sciences, Social Sciences, and Arts and Humanities) that appear in subsequent figures. One is left wondering how significant the number of responses in each of the three areas was. I would have preferred to see Figure 1 organised in such a way that the 'departments and faculties' were grouped more obviously into the broad areas.
That aside, I think the report is well worth reading. I'll just highlight what the authors perceive to be the emerging themes:
It is clear that different disciplines have different requirements and approaches to research data.
Current provision of facilities to encourage and ensure that researchers have data stores where they can deposit their valuable data for safe-keeping and for sharing, as appropriate, varies from discipline to discipline.
Local data management and preservation activity is very important with most data being held locally.
Expectations about the rate of increase in research data generated indicate not only higher data volumes but also an increase in different types of data and data generated by disciplines that have not until recently been producing volumes of digital output.
Significant gaps and areas of need remain to be addressed.
Advice on practical issues related to managing data across their life cycle. This help would range from assistance in producing a data management/sharing plan; advice on best formats for data creation and options for storing and sharing data securely; to guidance on publishing and preserving these research data.
A secure and user-friendly solution that allows storage of large volumes of data and sharing of these in a controlled way, allowing fine-grained access control mechanisms.
A sustainable infrastructure that allows publication and long-term preservation of research data for those disciplines not currently served by domain specific services such as the UK Data Archive, NERC Data Centres, European Bioinformatics Institute and others.
Funding that could help address some of the departmental challenges to manage the research data that are being produced.
Pretty high level stuff so nothing particularly surprising there. It seems to me that some work drilling down into each of these areas might be quite useful.
The blog entry is a few days old so apologies if you've already seen it, but one advantage of coming late to the discussion is that there is an interesting thread of comments on the post, in particular around the value/complexity of RDF vs. other data encoding formats.
Worth a look, and worth taking the time to comment if you can. I think this is a useful and interesting development and something to be encouraged.
While I probably do spend longer than is healthy in front of a PC on a typical weekend, I have to admit to a fairly high level of resistance to attending "work-related" events at weekends, especially if travel is involved. My Saturdays are for friends, footy, films, & music, possibly accompanied by beer, ideally in some combination.
The morning session featured three presentations from people working in the development/aid sector. Mark Charmer talked about AKVO, and its mission to facilitate connections between funders and projects in the area of water and sanitation, and to streamline reporting by projects (through support for submission of updates by SMS). Vinay Gupta described the use of wiki technology to build Appropedia, a collection of articles on "appropriate technology" and related aid/development issues, including project histories and detailed "how-to"-style information. The third session was a collaboration between Karin Christiansen, on the Publish What You Fund campaign to promote greater access to information about aid, and Simon Parrish on the work of Aidinfo to develop standards for the sharing of such information.
One recurring theme in these presentations was that of valuable information - from records of practical project experience "on the ground" to records of funding by global agencies - being "locked away" from, or at least only partially accessible to, the parties who would most benefit from it. The other fascinating (to me, at least) element was the emphasis on the growing ubiquity of mobile technology: while I'm accustomed to this in the UK, I was still quite taken aback by the claim (I think, by Mark) that in the near future there will be large sections of the world's population who have access to a mobile phone, but not to a toilet.
The main part of the day was dedicated to the "Open Spaces" session of short presentations. Initially, IIRC, these had been programmed as two parallel sessions in which the speakers were allocated 10 minutes each. On the day, the decision was taken to merge them into a single session with (nearly 20, I think?) speakers delivering very short "lightning" talks. We were offered the opportunity to vote on this, I hasten to add, and at the time avoiding missing out on contributions had seemed like a Good Idea, if time permitted. But with hindsight, I'm not sure it was the right choice: it led to a situation in which speakers had to deliver their content in less time than they had anticipated (and some adjusted better than others), there was little time for discussion, and the pace and diversity of the contributions, some slightly technical, but mostly focusing more on social/cultural aspects, did make it rather difficult for me to identify common threads.
The next slot was dedicated to the relationship between Open Data and Linked Data and the Semantic Web, with short, largely non-technical, presentations by Tom Scott of the BBC, Jeni Tennison, and Leigh Dodds of Talis. Maybe it was just because I was familiar with the topic, but it felt to me that this part of the day worked well, and the cohesive theme enabled speakers to build on each other's contributions.
Although I quite enjoyed the linked data talks, it's probably true to say that - Leigh's announcement aside - they didn't really introduce me to anything I didn't know already - but there again, I probably wasn't the primary target audience.
The day ended with a presentation by David Bollier, author of Viral Spiral, on the "sharing economy". Unfortunately, things were over-running slightly at that point, and I only caught the first few minutes before I had to leave for my train home - which was a pity as I think that session probably did consolidate some of the issues related to business models which had been touched on in some of the short talks.
Overall, I suppose I came away feeling the event might have benefited from a slightly tighter focus, maybe building around the content of the two themed sessions. Having said that, I recognise that the call for contributions had been explicitly very "open", and the event did attract a very mixed audience, many probably with quite different expectations from my own! :-)
I spent the first couple of days this week at the British Library in London, attending the Unlocking Audio 2 conference. I was there primarily to give an invited talk on the second day.
You might notice that I didn't have a great deal to say about audio, other than to note that what strikes me as interesting about the newer ways in which I listen to music online (specifically Blip.fm and Spotify) is that they are both highly social (almost playful) in their approach and that they are very much of the Web (as opposed to just being 'on' the Web).
What do I mean by that last phrase? Essentially, it's about an attitude. It's about seeing being mashed as a virtue. It's about an expectation that your content, URLs and APIs will be picked up by other people and re-used in ways you could never have foreseen. Or, as Charles Leadbeater put it on the first day of the conference, it's about "being an ingredient".
I went on to talk about the JISC Information Environment (which is surprisingly(?) not that far off its 10th birthday if you count from the initiation of the DNER), using it as an example of digital library thinking more generally and suggesting where I think we have parted company with the mainstream Web (in a generally "not good" way). I noted that while digital library folks can discuss identifiers forever (if you let them!) we generally don't think a great deal about identity. And even where we do think about it, the approach is primarily one of, "who are you and what are you allowed to access?", whereas on the social Web identity is at least as much about, "this is me, this is who I know, and this is what I have contributed".
I think that is a very significant difference - it's a fundamentally different world-view - and it underpins one critical aspect of the difference between, say, Shibboleth and OpenID. In digital libraries we haven't tended to focus on the social activity that needs to grow around our content and (as I've said in the past) our institutional approach to repositories is a classic example of how this causes 'social networking' issues with our solutions.
I stole a lot of the ideas for this talk, not least Lorcan Dempsey's use of concentration and diffusion. As an aside... on the first day of the conference, Charles Leadbeater introduced a beach analogy for the 'media' industries, suggesting that in the past the beach was full of a small number of large boulders and that everything had to happen through those. What the social Web has done is to make the beach into a place where we can all throw our pebbles. I quite like this analogy. My one concern is that many of us do our pebble throwing in the context of large, highly concentrated services like Flickr, YouTube, Google and so on. There are still boulders - just different ones? Anyway... I ended with Dave White's notions of visitors vs. residents, suggesting that in the cultural heritage sector we have traditionally focused on building services for visitors but that we need to focus more on residents from now on. I admit that I don't quite know what this means in practice... but it certainly feels to me like the right direction of travel.
I concluded by offering my thoughts on how I would approach something like the JISC IE if I was asked to do so again now. My gut feeling is that I would try to stay much more mainstream and focus firmly on the basics, by which I mean adopting the principles of linked data (about which there is now a TED talk by Tim Berners-Lee), cool URIs and REST and focusing much more firmly on the social aspects of the environment (OpenID, OAuth, and so on).
Prior to giving my talk I attended a session about iTunesU and how it is being implemented at the University of Oxford. I confess a strong dislike of iTunes (and iTunesU by implication) and it worries me that so many UK universities are seeing it as an appropriate way forward. Yes, it has a lot of concentration (and the benefits that come from that) but its diffusion capabilities are very limited (i.e. it's a very closed system), resulting in the need to build parallel Web interfaces to the same content. That feels very messy to me. That said, it was an interesting session with more potential for debate than time allowed. If nothing else, the adoption of systems about which people can get religious serves to get people talking/arguing.
Overall then, I thought it was an interesting conference. I suspect that my contribution wasn't liked by everyone there - but I hope it added usefully to the debate. My live-blogging notes from the two days are here and here.
The day was both interesting and somewhat disappointing...
Interesting primarily because of the obvious political tension in the room (which I characterised on Twitter as a potential bun-fight between librarians and the rest but which in fact is probably better summed up as a lack of shared agreement around centralist (discipline-based) solutions vs. institutional solutions).
Disappointing because the day struck me more as a way of presenting a done-deal than as a real opportunity for debate.
The other thing that I found annoying was the constant parroting of the view that "researchers want to share their data openly" as though this is an obvious position. The uncomfortable fact is that even the UKRDS report's own figures suggest that less than half (43%) of those surveyed "expressed the need to access other researchers' data" - my assumption therefore is that the proportion currently willing to share their data openly will be much smaller.
Don't take this as a vote against open access, something that I'm very much in favour of. But, as we've found with eprint archives, a top-down "thou shalt deposit because it is good for you" approach doesn't cut it with researchers - it doesn't result in cultural change. Much better to look for, and actively support, those areas where open sharing of data occurs naturally within a community or discipline, thus demonstrating its value to others.
That said, a much more fundamental problem facing the provision of collaborative services to the research community is that funding happens nationally but research happens globally (or at least across geographic/funding boundaries) - institutions are largely irrelevant whichever way you look at it [except possibly as an agent of long term preservation - added 6 March 2009]. Resolving that tension seems paramount to me though I have no suggestions as to how it can be done. It does strike me however that shared discipline-based services come closer to the realities of the research world than do institutional services.
[Note: This entry was originally posted on the 9th Feb 2009 but has been updated in light of comments.]
An interesting thread has emerged on the American Scientist Open Access Forum based on the assertion that in Germany "freedom of research forbids mandating on university level" (i.e. that a mandate to deposit all research papers in an institutional repository (IR) would not be possible legally). Now, I'm not familiar with the background to this assertion and I don't understand the legal basis on which it is made. But it did cause me to think about why there might be an issue related to academic freedom caused by IR deposit mandates by funders or other bodies.
In responding to the assertion, Bernard Rentier says:
No researcher would complain (and consider it an infringement upon his/ her academic freedom to publish) if we mandated them to deposit reprints at the local library. It would be just another duty like they have many others. It would not be terribly useful, needless to say, but it would not cause an uproar. Qualitatively, nothing changes. Quantitatively, readership explodes.
Quite right. Except that the Web isn't like a library so the analogy isn't a good one.
If we ignore the rarefied, and largely useless, world of resource discovery based on the OAI-PMH and instead consider the real world of full-text indexing, link analysis and, well... yes, Google then there is a direct and negative impact of mandating a particular place of deposit. For every additional place that a research paper surfaces on the Web there is a likely reduction in the Google-juice associated with each instance caused by an overall diffusion of inbound links.
So, for example, every researcher who would naturally choose to surface their paper on the Web in a location other than their IR (because they have a vibrant central (discipline-based) repository (CR) for example) but who is forced by mandate to deposit a second copy in their local IR will probably see a negative impact on the Google-juice associated with their chosen location.
Now, I wouldn't argue that this is an issue of academic freedom per se, and I agree with Bernard Rentier (earlier in his response) that the freedom to "decide where to publish is perfectly safe" (in the traditional academic sense of the word 'publish'). However, in any modern understanding of 'to publish' (i.e. one that includes 'making available on the Web') then there is a compromise going on here.
The problem is that we continue to think about repositories as if they were 'part of a library', rather than as a 'true part of the fabric of the Web', a mindset that encourages us to try (and fail) to redefine the way the Web works (through the introduction of things like the OAI-PMH for example) and that leads us to write mandates that use words like 'deposit in a repository' (often without even defining what is meant by 'repository') rather than 'make openly available on the Web'.
In doing so I think we do ourselves, and the long term future of open access, a disservice.
Addendum (10 Feb 2009): In light of the comments so far (see below) I confess that I stand partially corrected. It is clear that Google is able to join together multiple copies of research papers. I'd love to know the heuristics they use to do this and I'd love to know how successful those heuristics are in the general case. Nonetheless, on the basis that they are doing it, and on the assumption that in doing so they also combine the Google-juice associated with each copy, I accept that my "dispersion of Google-juice" argument above is somewhat weakened.
Interestingly, the W3C's Architecture of the World Wide Web offers directly relevant guidance:

Good practice: Avoiding URI aliases
A URI owner SHOULD NOT associate arbitrarily different URIs with the same resource.
The reasons given align very closely to the ones I gave above, though couched in more generic language:
Although there are benefits (such as naming flexibility) to URI aliases, there are also costs. URI aliases are harmful when they divide the Web of related resources. A corollary of Metcalfe's Principle (the "network effect") is that the value of a given resource can be measured by the number and value of other resources in its network neighborhood, that is, the resources that link to it.
The problem with aliases is that if half of the neighborhood points to one URI for a given resource, and the other half points to a second, different URI for that same resource, the neighborhood is divided. Not only is the aliased resource undervalued because of this split, the entire neighborhood of resources loses value because of the missing second-order relationships that should have existed among the referring resources by virtue of their references to the aliased resource.
Now, I think that some of the discussions around linked data are pushing at the boundaries of this guidance, particularly in the area of non-information resources. Nonetheless, I think this is an area in which we have to tread carefully. I stand by my original statement that we do not treat scholarly papers as though they are part of the fabric of the Web - we do not link between them in the way we link between other Web pages. In almost all respects we treat them as bits of paper that happen to have been digitised and the culprits are PDF, the OAI-PMH, an over-emphasis on preservation and a collective lack of imagination about the potential transformative effect of the Web on scholarly communication. We are tinkering at the edges and the result is a mess.
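The "divided neighbourhood" effect that the W3C guidance describes is easy to sketch. Here is a minimal illustration (the URIs and linking sites are entirely hypothetical) of how inbound links split across two aliases of the same paper, and how canonicalising the aliases recombines them:

```python
from collections import Counter

# Hypothetical inbound links: (target URI, linking site)
links = [
    ("http://example.org/ir/paper123", "blog-a"),
    ("http://repo.example.net/abs/123", "blog-b"),
    ("http://example.org/ir/paper123", "blog-c"),
]

# Both URIs are aliases for the same underlying paper
canonical = {
    "http://example.org/ir/paper123": "paper-123",
    "http://repo.example.net/abs/123": "paper-123",
}

# Count inbound links per URI, then per canonicalised resource
split = Counter(uri for uri, _ in links)
merged = Counter(canonical[uri] for uri, _ in links)

# No single alias sees the whole neighbourhood...
assert max(split.values()) == 2
# ...but the canonicalised resource attracts all three inbound links
assert merged["paper-123"] == 3
```

Any link-counting measure of value (Google-juice included) computed per-URI undervalues the aliased paper, which is exactly the argument above.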
It seems to me that there is now quite a general acceptance of what the 'open access' movement is trying to achieve. I know that not everyone buys into that particular world-view but, for those of us that do, we know where we are headed and most of us will probably recognise it when we get there. Here, for example, is Yishay Mor writing to the open-science mailing list:
I would argue that there's a general principle to consider here. I hold that any data collected by public money should be made freely available to the public, for any use that contributes to the public good. Strikes me as a no-brainer, but of course - we have a long way to go.
A fairly straightforward articulation of the open access position and a goal that I would thoroughly endorse.
The problem is that we don't always agree as a community about how best to get there.
I've been watching two debates flow past today, both showing some evidence of lack of consensus in the map reading department, though one much more long-standing than the other. Firstly, the old chestnut about the relative merits of central repositories vs. institutional repositories (initiated in part by Bernard Rentier's blog post, Institutional, thematic or centralised repositories?) but continued on various repository-related mailing lists (you know the ones!). Secondly, a newer debate about whether formal licences or community norms provide the best way to encourage the open sharing of research data by scientists and others, a debate which I tried to sum up in the following tweet:
@yishaym summary of open data debate... OD is good & needs to be encouraged - how best to do that? 1 licences (as per CC) or 2 social norms
If there's a problem here, and perhaps there isn't, then it is that the arguments and debates are taking place between people who ultimately want the same thing. I'm reminded of Monty Python's Life of Brian:
Brian: Excuse me. Are you the Judean People's Front?
Reg: Fuck off! We're the People's Front of Judea
It's like we all share the same religion but we disagree about which way to face while we are praying. Now, clearly, some level of debate is good. The point at which it becomes not good is when it blocks progress, which is why, generally speaking, having made my repository-related architectural concerns known a while back, I try to resist the temptation to reiterate them too often.
Cameron Neylon has a nice summary of the licensing vs. norms debate on his blog. It's longer and more thoughtful than my tweet! This is a newer debate and I therefore feel more positive that it is able to go somewhere. My initial reaction was that a licensing approach is the most sensible way forward but having read through the discussion I'm no longer so sure.
So what's my point? I'm not sure really... but if I wake up in 4 years time and the debate about licensing vs. norms is still raging, as has pretty much happened with the discussion around CRs vs. IRs, I'll be very disappointed.
a plan to secure Britain’s place at the forefront of the global digital economy. The interim report contains more than 20 recommendations, including specific proposals on:
next generation networks
universal access to broadband
the creation of a second public service provider of scale
the modernisation of wireless radio spectrum holdings
a digital future for radio
a new deal for digital content rights
enhancing the digital delivery of public services
I haven't read the full report, much of which is about greater roll-out of broadband connectivity, but I have taken a look through section 3, entitled Digital Content [PDF], which is the part that interests me most.
Here's a Wordle of just that section:
And a few word counts:
open (1) (but not in the context of 'open content')
rights / rightsholders (37)
I'll leave you to draw your own conclusions.... suffice to say, I would have preferred to see at least some discussion about the benefits that open digital content can bring to the economy.
I don't want to comment in too much detail on this story since I freely admit to not having properly done my homework, but I will note that my default position on this kind of issue is that we (yes, all of us) are better off where data is made available on an 'open' rather than a 'proprietary' basis, and I think that view of the world definitely applies in this case.
The Guardian story is somewhat simplistic, IMHO, not on the question of 'open' vs. 'closed' but on how easy it would be for such data, assuming that it was to be made openly available, to get into search engines (by which I assume the article really means Google?) in a meaningful way. Flooding the Web with multiple copies of metadata about multiple copies of books is non-trivial to get right (just think of the issues around sensibly assigning 'http' URIs to this kind of stuff for example) such that link counting, ranking of books vs. other Web resources, and providing access to appropriate copies can be done sensibly. There has to be some point of 'concentration' (to use Lorcan Dempsey's term) around which such things can happen - whether that is provided by Google, Amazon, Open Library, OCLC, Talis, the Library of Congress or someone else. Too many points of concentration and you have a problem... or so it seems to me.
Quite an interesting day overall but I was slightly surprised at the lack of name badges and a printed delegate list, especially given that this event brought together people from two previously separate areas of activity. Oh well, a delegate list is promised at some point. I also sensed a certain lack of buzz around the event - I mean there's almost £11m being made available here, yet nobody seemed that excited about it, at least in comparison with the OER meeting held as part of the CETIS conference a few weeks back. At that meeting there seemed to be a real sense that the money being made available was going to result in a real change of mindset within the community. I accept that this is essentially second-phase money, building on top of what has gone before, but surely it should be generating a significant sense of momentum or something... shouldn't it?
A couple of people asked me why I was attending given that Eduserv isn't entitled to bid directly for this money and now that we're more commonly associated with giving grant money away rather than bidding for it ourselves.
The short answer is that this call is in an area that is of growing interest to Eduserv, not least because of the development effort we are putting into our new data centre capability. It's also about us becoming better engaged with the community in this area. So... what could we offer as part of a project team? Three things really:
Firstly, we'd be very interested in talking to people about sustainable hosting models for services and content in the context of this call.
Secondly, software development effort, particularly around integration with Web 2.0 services.
Thirdly, significant expertise in both Semantic Web technologies (e.g. RDF, Dublin Core and ORE) and identity standards (e.g. Shibboleth and OpenID).
If you are interested in talking any of this thru further, please get in touch.
The presentation looks at the way in which scientific and technology-related research is changing, particularly thru the use of the Web to support open, data-driven research - essentially enabling a more immediate, transparent and repeatable approach to science.
The ideas around open science are interesting. Coincidentally, a few Eduserv bods met with Cameron Neylon yesterday and he talked us thru some of the work going on around blog-driven open labbooks and the like. Good stuff. Whatever one thinks about the success or otherwise of institutional repositories as an agent of change in scholarly communication there seems little doubt that the 'open' movement is where things are headed because it is such a strong enabler of collaboration and communication.
Slide 24 of the presentation above introduces the notion that open "methods are scientific commodities". Obvious really, but something I hadn't really thought about. I note that there seem to be some potential overlaps here with the approaches to sharing pedagogy between lecturers/teachers enabled by standards such as Learning Design - "pedagogies as learning commodities" perhaps? - though I remain somewhat worried about how complex these kinds of things can get in terms of mark-up languages.
The presentation ends with some thoughts about the impact that this new user-centric (scientist-centric) world of personal research environments has on libraries:
We don’t come to the library, it comes to us.
We don’t use just one library or one source.
We don’t use just one tool!
Library services embedded in our toolkits, workbenches, browsers, authoring tools.
I find the closing scenario (slide 67) somewhat contrived:
Prior to leaving home Paul, a Manchester graduate student, syncs his iPhone with the latest papers, delivered overnight by the library via a news syndication feed. On the bus he reviews the stream, selecting a paper close to his interest in HIV-1 proteases. The data shows apparent anomalies with his own work, and the method, an automated script, looks suspect. Being on-line he notices that a colleague in Madrid has also discovered the same paper through a blog discussion and they Instant Message, annotating the results together. By the time the bus stops he has recomputed the results, proven the anomaly, made a rebuttal in the form of a pubcast to the Journal Editor, sent it to the journal and annotated the article with a comment and the pubcast. [Based on an original idea by Phil Bourne]
If nothing else, it is missing any reference to Twitter (see the MarsPhoenix Twitter feed for example) and Second Life! :-). That said, there is no doubt that the times they are a'changing.
My advice? You'd better start swimming or you'll sink like a stone :-)
This is a 30 minute slidecast (using 130 slides), based on a seminar I gave to Eduserv staff yesterday lunchtime. It tries to cover a broad sweep of history from library cataloguing, thru the Dublin Core, Web search engines, IEEE LOM, the Semantic Web, arXiv, institutional repositories and more.
It's not comprehensive - so it will probably be easy to pick holes in if you so choose - but how could it be in 30 minutes?!
The focus is ultimately on why Eduserv should be interested in 'metadata' (and surrounding areas), to a certain extent trying to justify why the Foundation continues to have a significant interest in this area. To be honest, it's probably weakest in its conclusions about whether, or why, Eduserv should retain that interest in the context of the charitable services that we might offer to the higher education community.
Nonetheless, I hope it is of interest (and value) to people. I'd be interested to know what you think.
As an aside, I found that the Slideshare slidecast editing facility was mostly pretty good (this is the first time I've used it), but that it seemed to struggle a little with the very large number of slides and the quickness of some of the transitions.
The Guardian's Tech Weekly podcast from Wednesday this week contains a brief but interesting interview with Cory Doctorow (about 21 minutes into the podcast if you want to jump straight to it). In it he talks about his 3 key reasons for adopting open licences for his books. Speaking about the work he produces he says:
Firstly, artistically it doesn't seem like a plausible 21st century piece of art if it is not intended to be copied. There's something anachronistic in doing otherwise - "it's like making horse shoes or something".
Secondly, morally we are not going to be able to stop people copying and remixing work anyway, and our attempts at doing so to date have resulted in horrible things happening like spying on people, kicking them off the Internet, or suing old ladies or very young people for all their money. Further, like most of us, he was an avid copier when he was part of the "time rich, cash poor" demographic - "I never would have had a single romantic episode if it wasn't for the mix tape". If he was 17 again he'd be copying and remixing stuff, so it seems hypocritical to try to stop it happening to his own stuff.
Finally, financially the fundamental problem isn't piracy, it's obscurity. The people who don't buy his books do so because they've never heard of them, not because the books are openly available online.
He then goes on to talk about his desire to give practical help to those people who "get" the open access argument but need help in making it happen effectively. And towards the end he touches on the issue of people illegally selling CC-licenced Flickr images on eBay that I blogged about a while back.
He is referring to the fact that "some people who had put photos on Flickr under a Creative Commons non-commercial licence found that they were being sold on eBay by someone who was claiming the rights to them".
Apparently not - and, judging by the responses to the article, using words like 'theft' and 'steal' clearly rub some people up the wrong way... "The photographs in question simply are not being stolen. They're being copied. No thieves in existence there, but copiers. Illegal copiers I'm..."
I have to confess that when I used to talk to my own children about illegally downloading music on the Web I tended to use the analogy of shoplifting, on the basis that each time they downloaded a track they would be denying a shop (and the artist) a sale. So what was their typical reaction? Basically, they thought I was completely mad (and they probably weren't wrong - either in the specific or general case!).
As ParkyDR says in a comment on the article: "Nothing has been taken, the original owner still has the photo, in this case even copying it was ok (CC licenced), the license was broken when the photo was sold".
Well, yes... but it still feels a lot like theft in many ways? As Nickminers says, "if you make money from a photo that was taken by somebody else, you have effectively stolen the money that the copyright holder should have earned from the sale".
Dvdhldn suggests using the phrase "copyright theft" as a compromise between "theft" and "illegal copying" which sounds reasonable to me. Whatever... the key point here is that I don't think many people would disagree that selling someone else's CC-BY-NC images without permission is wrong. The issue is only with what words we should use to label the activity.
So here's a less clear cut scenario... Brian Kelly tweeted the other day about a new competitor service to Slideshare called AuthorSTREAM. The new service looks interesting and offers some functionality not currently present in Slideshare, though I have to say I feel slightly uncomfortable about how far the new service has gone to make itself look and feel like the original.
But the service itself isn't the issue. Brian also noticed that someone had taken copies of a large number of old UKOLN Powerpoint presentations and uploaded them to the AuthorSTREAM site. I took a look for my own presentations and sure enough, a few of those uploaded were mine.
Hmmm... that's a little annoying. Or is it? No, perhaps not - there's no attempt at passing these presentations off as being by someone else, so perhaps it is just good visibility. On the other hand, I know of at least one case where the continued availability of old, technically out of date, material on the Web does more harm than good and I'd prefer to be in control of when I publish my own crap, thank you very much.
So, it's not clear cut by any means...
I also noticed that one of my more recent presentations has been made available, uploaded by someone called 'Breezy' and labeled on AuthorSTREAM as andy powell presentation, though the original on Slideshare is called The Repository Roadmap - are we heading in the right direction? This is more frustrating in a way. The presentation is already available on the Web in a very accessible form and someone else uploading it to a different service just waters down the Google juice of the original. That's downright unhelpful, at least from my perspective. If Breezy had asked, I'd have said no and asked him or her to link to the original.
Now, I must stress that Breezy has done nothing legally wrong here. The original presentation is made available under a CC-BY licence (at least that was what I intended, though I've just noticed that in fact, on this occasion, I forgot to add a CC licence until just now!). So in some sense, I am explicitly encouraging Breezy to do what he or she has done through my use of open licences.
But supposing Breezy had taken all of my presentations from Slideshare and replicated them all on AuthorSTREAM. Would that have been OK? Again, according to the individual licence on each presentation Breezy would have done nothing wrong - at least, not legally. But morally... that seems like a different kettle of fish? At least from my point of view.
It's frustrating because what I really want is a licence that says, "you can take this content, unbundle it, and use the parts to create a new derivative work but you can't simply copy the whole work and republish it on the Web unchanged" and more fundamentally, "you can do stuff with the individual resources that I make available but you can't take everything I've ever created and make it all available at a new location on the Web wholesale".
The bottom line is that there's a difference between making a new, derivative work and simply copying stuff.
Enough said... at the end of the day Creative Commons licences are the best we've got for making content openly available on the Web and in those few cases where things go a bit wrong I can either learn to live with it or try to resolve the situation with a simple email.
For the record... this is the presentation I gave at the Talis Xiphos meeting last week, though to be honest, with around 1000 Slideshare views in the first couple of days (presumably thanks to a blog entry by Lorcan Dempsey and it being 'featured' by the Slideshare team) I guess that most people who want to see it will have done so already:
Some of my more recent presentations have followed the trend towards a more "picture-rich, text-poor" style of presentation slides. For this presentation, I went back towards a more text-centric approach - largely because that makes the presentation much more useful to those people who only get to 'see' it on Slideshare and it leads to a more useful slideshow transcript (as generated automatically by Slideshare).
As always, I had good intentions around turning it into a slidecast but it hasn't happened yet, and may never happen to be honest. If it does, you'll be the first to know ;-) ...
After I'd finished the talk on the day there was some time for Q&A. Carsten Ulrich (one of the other speakers) asked the opening question, saying something along the lines of, "Thanks for the presentation - I didn't understand a word you were saying until slide 11". Well, it got a good laugh :-). But the point was a serious one... Carsten admitted that he had never really understood the point of services like arXiv until I said it was about "making content available on the Web".
OK, it's a sample of one... but this endorses the point I was making in the early part of the talk - that the language we use around repositories simply does not make sense to ordinary people and that we need to try harder to speak their language.
An article in Tuesday's Education Guardian, Teach online to compete, British universities told, caught my eye - not least because it appears to say very little about teaching online. Rather, it talks about making course materials available online, which is, after all, very different. To be fair, Carol Comer, academic development advisor (eLearning) at the University of Chester, does make this point towards the end of the article.
The report on which the story is based is "a paper for the latest edition of ppr, the publication of influential thinktank the Institute for Public Policy Research". I'm not sure if the paper is currently finished - it doesn't really look finished to be honest - the fonts seem to be all over the shop but perhaps I'm being too picky. Or perhaps the Guardian have got sight of it a little early?
The report suggests that the UK should:
establish a centralised online hub of diverse British open courseware offerings at www.ocw.ac.uk, presented in easily-readable formats and accessible to teachers, students and citizens alike
establish the right and subsequent capacity for non-students and non-graduates to take the same exam as do face-to-face students, through the provision of open access exam sessions
pass an Open Access Act through Parliament, establishing a new class of Open degree, achieved solely using open courseware
conduct a high-profile public information campaign, promoting the opportunities afforded open courseware and open access examinations and degrees, targeted at adult learners, excluded minorities and students at pre-university age
OK, I confess that I found the report quite long and I didn't quite get to the end (err, make that beyond halfway). I'm as big a fan of open access as the next person, probably more so, so I don't have a problem with the suggestion that we should be making more courseware openly available. I'm just not convinced that anyone could get themselves up to degree level simply by downloading / reading / watching / listening to a load of open access courseware - no matter how good it is. The report makes reference to MIT's OpenCourseware and the OU's OpenLearn initiatives. Call me a cynic, but I've always suspected that MIT makes its courseware available online, not for the greater good of humanity, but so that more students will enrol at MIT. OK, I'm adopting an intentionally extreme position here and I'm sure people at MIT do have the best of intentions - but I think it is also the case that they don't see the giving away of courseware as in any way harmful to their current business models. The OU's OpenLearn initiative (treated somewhat unfairly by the parts of the report I read) is slightly different in any case since the OU is by definition a distance-based institution - or so it seems to me.
So, I should probably stop at this point, having not read the report in full. If you think I've been very unfair when you read the report yourself, let me know by way of a comment.
This is good news, though one is tempted to wonder why it has taken so long! I've argued for a while now that using a relatively closed licensing model and forcing registration before use would more or less stop the service in its tracks.
Through the development of JorumOpen, lecturers and teachers will be able to share materials under the Creative Commons licence framework: this makes sharing easier, granting users greater rights for use and re-use of online content and easier to understand. Importantly, it does not require prior registration. As a result availability is global as well as across UK universities and colleges. JorumOpen will run alongside a 'members only' facility, JorumEducationUK, that will support sharing of material just within the UK educational sector; this will be available only to registered users and contributors, as is currently the case.
Is the addition of JorumOpen enough to turn the service around? I'm not sure to be honest. It might be, though I'm not fully convinced that the notion of learning objects, as relatively complex packages of other objects, is compelling and/or simple enough to really succeed. Can something like Jorum really take on the likes of Slideshare, Flickr and YouTube?
I spent a large part of the week before last (Tuesday, Wednesday & Friday) at the Open Repositories 2008 conference at the University of Southampton.
There were somewhere around 400 delegates there, I think, which I guess is an indicator of the considerable current level of interest around the R-word. Interestingly, if I recall conference chair Les Carr's introductory summary of stats correctly, nearly a quarter of these had described themselves as "developers", so the repository sphere has become a locus for debate around technical issues, as well as the strategic, policy and organisational aspects. The JISC Common Repository Interfaces Group (CRIG) had a visible presence at the conference, thanks to the efforts of David Flanders and his comrades, centred largely around the "Repository Challenge" competition (won by Dave Tarrant, Ben O’Steen and Tim Brody with their "Mining with ORE" entry).
The higher than anticipated number of people did make for some rather crowded sessions at times. There was a long queue for registration, though that was compensated for by the fact that I came away from that process with exactly two small pieces of paper: a name badge inside an envelope on which were printed the login details for the wireless network. (With hindsight, I could probably have done with a one page schedule of what was on in which location - there probably was one which I missed picking up!) Conference bags (in a rather neat "vertical" style which my fashion-spotting companions reliably informed me was a "man bag") were available, but optional. (I was almost tempted, as I do sport such an accessory at weekends, and it was black rather than dayglo orange, but decided to resist on the grounds that there was a high probability of it ending up in the hotel wastepaper bin as I packed up to leave.) Nul points, however, to those advertisers who thought it was a good idea to litter every desktop surface in the crowded lecture theatre with their glossy propaganda, with the result that a good proportion of it ended up on the floor as (newly manbagged-up) delegates squeezed their way to their seats.
The opening keynote was by Peter Murray-Rust of the Unilever Centre for Molecular Informatics, University of Cambridge. Despite some technical glitches to contend with, which must have been quite daunting in the circumstances (Peter has posted a quick note on his view of the experience: "I have no idea what I said" :-)), Peter delivered a somewhat "non-linear" but always engaging and entertaining overview of the role of repositories for scientific data. He noted the very real problem that while ever increasing quantities of data are being generated, very little of it is being successfully captured, stored and made accessible to others. Peter emphasised that any attempt to capture this data effectively must fit in with the existing working practices of scientists, and must be perceived as supporting the primary aims of the scientist, rather than introducing new tasks which might be regarded as tangential to those aims. And the practices of those scientists may, in at least some areas of scientific research, be highly "locally focused", i.e. the scientists see their "allegiances" as primarily to a small team with whom data is shared - at least in the first instance - an approach categorised as "long tail science" (a term attributed to Peter's colleague Jim Downing). Peter supported his discussion with examples drawn from several different e-Chemistry projects and initiatives, including the impressive OSCAR-3 text mining software, which extracts descriptions of chemical compounds from documents.
Most of the remainder of the Tuesday and Wednesday I spent in paper sessions. The one I enjoyed most was probably the presentation by Jane Hunter from the University of Queensland on the work of the HarvANA project on a distributed approach to annotation and tagging of resources from the Picture Australia collection (in the first instance at least - at the end, Jane whipped through a series of examples of applying the same techniques to other resources). Jane covered a model for annotation and tagging based on the W3C Annotea model, a technical architecture for gathering and merging distributed annotations/taggings (using OAI-PMH to harvest from targets at quite short time intervals, though those intervals could be extended if preferred/required), browser-based plug-in tools to perform annotation/tagging, and also touched on the relationships between tagging and formally-defined ontologies. The HarvANA retrieval system currently uses an ontology to enhance tag-based retrieval - "ontology-based or ontology-directed folksonomy" - but the tags provided could also contribute to the development/refinement of that ontology, "folksonomy-directed ontology". Although it was in many ways a repository-centric approach and Jane focused on the use of existing, long-established technologies, she also succeeded in placing repositories firmly in the context of the Web: as systems which enable us to expose collections of resources (and collections of descriptions of those resources), which then enter the Web of relationships with other resources managed and exposed by other systems - here, the collections of annotations exposed by the Annotea servers, but potentially other collections too.
At Wednesday lunch time, (once I managed to find the room!) I contributed to a short "birds of a feather" session co-ordinated by Rosemary Russell of UKOLN and Julie Allinson of the University of York on behalf of the Dublin Core Scholarly Communications Community. We focused mainly on the Scholarly Works Application Profile and its adoption of a FRBR-based model, and talked around the extension of that approach to other resource types which is under consideration in a number of sibling projects currently being funded by JISC. (Rather frustratingly for me, this meeting clashed with another BoF session on Linked Data which I would really have liked to attend!)
I should also mention the tremendously entertaining presentation by Johan Bollen of the Los Alamos National Laboratory on the research into usage metrics carried out by the MESUR project. Yes, I know, "tremendously entertaining" and "usage statistics" aren't the sort of phrases I expect to see used in close proximity either. Johan's base premise was, I think, that seeking to illustrate impact through blunt "popularity" measures was inadequate, and he drew a distinction between citation - the resources which people announce in public that they have read - and usage - the actual resources they have downloaded. Based on a huge dataset of usage statistics provided by a range of popular publishers and aggregators, he explored a variety of other metrics, comparing the (surprisingly similar) rankings of journals obtained via several of these metrics with the rankings provided by the citation-based Thomson impact factor. I'm not remotely qualified to comment on the appropriateness of Johan's choice of algorithms, but the fact that Johan kept a large audience engaged at the end of a very long day was a tribute to his skill as a presenter. (Though I'd still take issue with the Britney (popular but insubstantial?)/Big Star (low-selling but highly influential/lauded by the cognoscenti) opposition: nothing by Big Star can compare with the strutting majesty of "Toxic". No, not even "September Gurls".)
All in all - give or take a few technical hiccups - it was a successful conference, I think (and thanks to Les and his team for their hard work) - perhaps more so in terms of the "networking" that took place around the formal sessions, and the general "buzz" there seemed to be around the place, than because of any ground-breaking presentations.
And yet, and yet... at the end of the week I did come away from some of the sessions with my niggling misgivings about the "repository-centric" nature of much of the activity I heard described slightly reinforced. Yes, I know: what did I expect to hear at a conference called "Open Repositories"?! :-) But I did feel an awful lot of the emphasis was on how "repository systems" communicate with each other (or how some other app communicates with one repository system and then with another repository system), e.g. how can I "get something out" of your repository system and "put it into" my repository system, and so on. It seems to me that - at the technical level at least - we need to focus less on seeing repository systems as "specific" and "different" from other Web applications, and focus more on commonalities. Rather than concentrating on repository interfaces we should ensure that repository systems implement the uniform interface defined by the RESTful use of the HTTP protocol. And then we can shift our focus to our data, and to:
- the models or ontologies (like FRBR and the CIDOC Conceptual Reference Model, or even basic one-object-is-made-available-in-multiple-formats models) which condition/determine the sets of resources we expose on the Web, seeing the use of those models as choices we make rather than something "technologically determined" ("that's just what insert-name-of-repository-software-app-of-choice does");
- the practical implementation of formalisms like RDF, which underpin the structure of our representations describing instances of the entities defined by those models, through the adoption of conventions such as those advocated by the Linked Data community.
In this world, the focus shifts to "Open (Managed) Collections" (or even "Open Linked Collections"), collections of documents, datasets, images, of whatever resources we choose to model and expose to the world. And as a consumer of those resources I (and, perhaps more to the point, my client applications) really don't need to know whether the system that manages and exposes those collections is a "repository" or a "content management system" or something else (or if the provider changes that system from one day to the next): they apply the same principles to interactions with those resources as they do to any other set of resources on the Web.
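To make that consumer's-eye view concrete: a client built against the uniform interface knows nothing about repositories, only HTTP and content negotiation. A minimal sketch, with a hypothetical URI; any system exposing its collections this way would serve it equally well.

```python
# A client that treats a "repository" like any other set of Web resources.
import urllib.request

def build_request(uri, accept):
    """An ordinary GET, stating which representation we would prefer."""
    return urllib.request.Request(uri, headers={"Accept": accept})

def fetch(uri, accept="application/rdf+xml"):
    """Retrieve whichever representation the server negotiates for us."""
    with urllib.request.urlopen(build_request(uri, accept)) as response:
        return response.headers.get("Content-Type"), response.read()

# e.g. fetch("http://collections.example.org/items/42", "text/html")
```

The point is precisely that there is nothing repository-specific here: swap the backend from a repository to a content management system and the client neither knows nor cares.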
In the tradition of ’slow food’ we have decided to do a slow release of
content with an initial 200 historic images of Sydney and surrounds
available through the Commons on Flickr
and a promise of another 50 new fresh images each week! These initial
images are drawn from the Tyrrell Collection. Representing some of the
most significant examples of early Australian photography, the Tyrrell
Collection is a series of glass plate negatives by Charles Kerry
(1857-1928) and Henry King (1855-1923), two of Sydney’s principal
photographic studios at the time.
Looking at the announcement text, I am slightly worried about the licences under which the resulting digitised resources will be made available. Yes, I know I bang on about this all the time but we seem to have a well ingrained habit in this country (the UK more so than the US I think) of publicly funding digitisation projects which result in resources being freely available on the Web, but not being open. I, for one, would feel reassured if such things were made more explicit.
Now, the word open is used in multiple ways, so I should explain. I'm using it here as in open content (from Wikipedia):
[Open content is] any kind of creative work
published in a format that explicitly allows copying and modifying of
its information by anyone, not exclusively by a closed organization,
firm or individual.
This usually implies the use of an explicit open content licence, such as those provided by Creative Commons. Free content, on the other hand, is typically available only for viewing by the end-user, with copyright and/or other restrictions limiting other usage to 'personal educational' use at best.
Based on the minimal information provided about the five projects, only one explicitly mentions the use of Creative Commons, one mentions the development of open source software and one talks about results being freely available (though as mentioned above, being free and being open are two different things).
Why does this matter? Well, it seems to me that whenever possible (and I accept that there may be situations in which it is not possible) publicly funded digitisation of our cultural heritage should result in resources that can be re-purposed freely by other people. That means, for example, that any lecturer or teacher who wants to take the digitised cultural heritage resource and build it into a learning object in their VLE, or an exhibit in Second Life, or whatever, can do so freely, without needing to contact the content provider.
Open content is what makes the Web truly mashable, and we should look to the cultural heritage sector for our richest and most valued mashable content. Free content is not sufficient.
There is probably a useful debate to be had around whether the cultural resources produced by publicly funded digitisation should be able to be re-used in commercial activities as well as non-profit ones. My personal view is that anything that adds value is fair game, including commercial activities, but I accept that there are other views on this issue. Whatever, re-use for non-profit purposes is an absolute minimum.
To conclude... I really hope that I'm wasting blog space here, and that the conditions of funding in this case mandated that the resulting resources be made open rather than just free. And further, that such a condition is already (or rapidly becomes) the norm for publicly funded digitisation of our cultural heritage everywhere. I'm keeping my fingers crossed.
The BL have made a digitised copy of the Magna Carta available on the Web:
Magna Carta is one of the most celebrated documents in history. Examine the British Library's copy close-up, translate it into English, hear what our curator says about it, and explore a timeline.
So says the introductory blurb.
Well... if it's so "celebrated" and important can someone please explain why the digitised version has been hidden behind a Shockwave viewer that makes it pretty much impossible to do anything other than browse it on the BL's Web site? Yes, there is a simple version, which does not require a browser plugin, but the copyright statement and complete lack of CC licence (or anything remotely like it) makes it clear that re-use wasn't high on the BL's agenda.
Shame on them.
Come on BL, you can spend our money better than this!
I spent last week in Melbourne, Australia at the VALA 2008 Conference - my first trip over to Australia and one that I thoroughly enjoyed. Many thanks to all those locals and non-locals that made me feel so welcome.
I was there, first and foremost, to deliver the opening keynote, using it as a useful opportunity to think and speak about repositories (useful to me at least - you'll have to ask others that were present as to whether it was useful for anyone else).
It strikes me that repositories are of interest not just to those librarians in the academic sector who have direct responsibility for the development and delivery of repository services. Rather, they represent a microcosm of the wider library landscape - a useful case study in the way the Web is evolving, particularly as manifest through Web 2.0 and social networking, and what impact those changes have on the future of libraries, their spaces and their services.
My keynote attempted to touch on many of the issues in this area - issues around the future of metadata standards and library cataloguing practice, issues around ownership, authority and responsibility, issues around the impact of user-generated content, issues around Web 2.0, the Web architecture and the Semantic Web, and issues around individual vs. institutional vs. national vs. international approaches to service provision.
In speaking first I allowed myself the luxury of being a little provocative and, as far as I can tell from subsequent discussion, that approach was well received. Almost inevitably, I was probably a little too technical for some of the audience. I'm a techie at heart and a firm believer that it is not possible to form a coherent strategic view in this area without having a good understanding of the underlying technology. But perhaps I am also a little too keen to inflict my world-view on others. My apologies to anyone who felt lost or confused.
I can sum up my talk in three fairly simple bullet points:
Firstly, that our current preoccupation with the building and filling of 'repositories' (particularly 'institutional repositories') rather than the act of surfacing scholarly material on the Web means that we are focusing on the means rather than the end (open access). Worse, we are doing so using language that is not intuitive to the very scholars whose practice we want to influence.
Secondly, that our focus on the 'institution' as the home of repository services is not aligned with the social networks used by scholars, meaning that we will find it very difficult to build tools that are compelling to those people we want to use them. As a result, we resort to mandates and other forms of coercion in recognition that we have not, so far, built services that people actually want to use. We have promoted the needs of institutions over the needs of individuals. Instead, we need to focus on building and/or using global scholarly social networks based on global repository services. Somewhat oddly, ArXiv (a social repository that predates the Web let alone Web 2.0) provides us with a good model, especially when combined with features from more recent Web 2.0 services such as Slideshare.
Finally, that the 'service oriented' approaches that we have tended to adopt in standards like the OAI-PMH, SRW/SRU and OpenURL sit uncomfortably with the 'resource oriented' approach of the Web architecture and the Semantic Web. We need to recognise the importance of REST as an architectural style and adopt a 'resource oriented' approach at the technical level when building services.
I'm pretty sure that this last point caused some confusion and is something that Pete or I need to return to in future blog entries. Suffice to say at this point that adopting a 'resource oriented' approach at the technical level does not mean that one is not interested in 'services' at the business or function level.
[Image: artwork outside the State Library of Victoria]
In addition to the travels that Andy mentioned, we've also been grappling with the disruption caused by a relocation to a different office, so I seem to have accumulated a number of half-written posts which I'll try to find the time to get out this week.
For now, a brief pointer to a nice post by Roo Reynolds in which he compares the character and functionality of the UK government's Hansard Web site (which provides access to the "official", "edited verbatim report of proceedings" in the two houses of the UK Parliament) and two independent sites, TheyWorkForYou.com and The Public Whip, which take advantage of the availability of that data to provide more "social" functionality around the same information:
While the text is the same, the simple addition of some additional markup, links and photos brings it to life. The addition of user comments turns the whole thing into a social application, allowing us to discuss what our MPs and Lords are shouting across their respective aisles at each other every day.
In addition, Roo highlights the importance of underpinning such applications with an entity-/object-based approach - what I would probably call a resource-oriented approach:
Social software designers talk about the 'atoms', (or objects, or entities) of an application. For example, YouTube’s atoms include videos (of course) but also comments, playlists and users. Flickr’s atoms include photos, comments, users, groups and notes. TheyWorkForYou’s atoms are speeches and comments. Don’t get the impression that ’speech’ necessarily means a long speech. It could be a question, an interruption, an answer or a statement. Sometimes even standing up to speak is enough to get an entry in Hansard.
In his discussion of The Public Whip, Roo emphasises that such entities include people and also 'abstract resources' such as 'divisions' and 'policies'. I guess I might add that such entities aren't necessarily 'atomic' in the traditional sense of that word, indicating something 'indivisible': a collection or list of other entities/resources can also be an entity/resource in its own right, and indeed such entities are visible in those services.
But it's a good post, highlighting very simply and clearly the value of open data and what the "social" dimension can bring to an application.
Note that this report, and the survey on which it is based, only reflects those individuals that participated (107 respondents in all), and does not purport to represent the entire sector. That said, it mildly surprises me that about half of those completing the survey hadn't heard of Creative Commons or Creative Archive licences. It also struck me as interesting that only about half the respondents have "an in-house legal department or designated person that deals with copyright issues" and that a similar proportion do not have "a copyright policy publicly stated on its website".
I've argued before that it is too hard to re-use cultural heritage content in the UK for anything other than personal educational use (particularly in comparison with the US). Moving towards making copyright and licensing terms explicit would be a big step in the right direction.
The Net's most adored lawyer brings together John Philip Sousa, celestial copyrights, and the "ASCAP cartel" to build a case for creative freedom. He pins down the key shortcomings of our dusty, pre-digital intellectual property laws, and reveals how bad laws beget bad code. Then, in an homage to cutting-edge artistry, he throws in some of the most hilarious remixes you've ever seen.
This presentation works on a number of levels - it is thought-provoking, inspirational and very funny and is given using a presentational style that makes it a joy to watch. Well worth the 30 minutes or so that it will take to view it.
Meanwhile, over on the Guardian Unlimited Technology blog, Cory Doctorow pokes fun at the National Portrait Gallery, Warhol is turning in his grave, by highlighting the irony of putting on an exhibition of pop art, an art movement that to a large extent celebrated "nicking the work of others, without permission, and transforming it to
make statements and evoke emotions never countenanced by the original
creators", in an environment adorned with copyright-induced restrictions.
Does this show - paid for with public money, with some works that are
themselves owned by public institutions - seek to inspire us to become
21st century pop artists, armed with cameraphones, websites and mixers,
or is it supposed to inform us that our chance has passed and we'd best
settle for a life as information serfs who can't even make free use of
what our eyes see and our ears hear?
Peter Suber has an interesting article in the current SPARC Open Access Newsletter, issue #114, in which he discusses an idea originally put forward by Mark Rowse (previously CEO of Ingenta) for how current toll access journals can become open access journals by 'flipping' their consortia subscriptions for readers to consortia subscriptions for authors.
Peter's analysis starts from some rather simplistic assumptions about the penetration of consortia subscription models in the US but quickly moves to firmer ground, assessing both the likely viability of 'flipped' business models and some of the potential benefits such an approach might bring to readers, authors, institutions, publishers and research in general.
I don't know how new these ideas will be to those of you steeped in the political discussions around open access, but I found it an interesting read - one made better by its acknowledgment that the sustainability of publisher services is an important consideration in the move towards OA.
Copying of individual articles is governed by international
copyright law. Users may print off or make single copies of web pages
for personal use. Users may also save web pages other than individual
articles electronically for personal use. Electronic dissemination or
mailing of articles is not permitted, without prior permission from the
Conference of European National Librarians and/or the National
Seems a shame. Surely some of the material found through the TEL portal could be made available on a more open basis?
As someone that would like to build experimental virtual exhibitions of European cultural heritage materials in Second Life, I'm scuppered at the first hurdle - I can't easily work out what is available for re-use. Worse in fact - it looks like nothing is available for re-use!
As I've noted before, the US seems way ahead of us in terms of making digitised cultural heritage material openly available.
I mention this partly because I helped set up the technical infrastructure for the journal using Open Journal Systems, an open source journal management and
publishing system developed by the
Public Knowledge Project, while I was still at UKOLN - so I have a certain fondness for it.
Odd though, for a journal that is only ever (as far as I know) intended to be published online, to offer the articles using PDF rather than HTML. Doing so prevents any use of lightweight 'semantic' markup within the articles, such as microformats, and tends to make re-use of the content less easy.
In short, choosing to use PDF rather than HTML tends to make the content less open than it otherwise could be. That feels wrong to me, especially for an open access journal! One could just about justify this approach for a journal destined to be published both on paper and online (though even in that case I think it would be wrong) but surely not for an online-only 'open' publication?
If you have responsibility in this area, please consider filling in the survey. Cultural heritage organisations include museums, galleries, libraries,
and archives, as well as radio and television broadcasters, and film
and video organisations. Even if you do not fall into one of these
groups, but conduct cultural heritage activities, you are invited to
take the survey.
We anticipate that completing it will take less than 10 minutes. By completing the survey you will have a chance to win one of three iPod Shuffles, pre-filled with Creative Commons licensed material.
Prior to this addition the only scheme available was Dublin Core, which
as a metadata schema for describing article content is woefully
inadequate. (Dublin Core, of course, was never designed to handle the
complexity of the description of an average article.)
I think the reference here to "Dublin Core" is really to the specific "DC application profile" (or description set profile, as we are starting to refer to these things) commonly known as "Simple DC", i.e. the use of (only) the 15 properties of the Dublin Core Metadata Element Set with literal values, for which the oai_dc XML format defined by the OAI-PMH spec provides a serialisation. On that basis, I'd be inclined to agree that the Simple DC profile is not the tool for the task at hand: the Simple DC profile is intended to support simple, general descriptions of a wide range of resources, and it doesn't in itself offer the "expressiveness" that may be required to support all the requirements of individual communities, or more detailed description specific to particular resource types.
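To illustrate why Simple DC feels "flat": an oai_dc description is just the fifteen DCMES elements with literal values, so it is trivial to produce, but there is nowhere to hang anything more structured (a citation broken into journal, volume and pages, say). A sketch, not DCMI or DOAJ code.

```python
# Serialise a simple description as oai_dc XML: 15 elements, literal values.
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
OAI_DC = "http://www.openarchives.org/OAI/2.0/oai_dc/"

def simple_dc(description):
    """Serialise {DCMES element: literal value or list of values} as oai_dc."""
    ET.register_namespace("dc", DC)
    ET.register_namespace("oai_dc", OAI_DC)
    root = ET.Element("{%s}dc" % OAI_DC)
    for element, values in description.items():
        for value in values if isinstance(values, list) else [values]:
            ET.SubElement(root, "{%s}%s" % (DC, element)).text = value
    return ET.tostring(root, encoding="unicode")

record = simple_dc({"title": "An article", "creator": ["A. Author", "B. Author"]})
```

Everything ends up as a flat bag of strings, which is exactly the limitation that richer profiles like the ePrints DCAP are designed to get past.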
However, the framework provided by the DCMI Abstract Model provides the sort of extensibility which enables communities to develop other profiles to meet those requirements for richer, more specific descriptions.
I guess DCMI still has its work cut out to try to convey the message that "Dublin Core" doesn't begin and end with the DCMES.
But perhaps more specifically pertinent to the topic of the DOAJ format is the fact that the work carried out last year on the ePrints DC Application Profile, led by Andy and Julie Allinson of UKOLN, applied exactly this approach for the area of scholarly works, including journal articles. From the outset, the initiative recognised that the Simple DC profile was insufficient to meet the requirements which had been articulated, and shifted their focus to the development of a new profile, based on applying a subset of the FRBR entity-relational model to the "eprint" domain.
I haven't yet compared the DOAJ format and the ePrints DCAP closely enough to say whether the latter would support the representation of all the information represented by the former. I guess it's quite likely that the two initiatives were simply not aware of each other's efforts. Or it may be that the DOAJ folks felt that the ePrints DCAP was more complex than they needed for the task at hand.
But it does seem a pity that we seem to have ended up with two specs, developed at almost the same time, and applying to pretty much the same "space", leaving implementers harvesting data from multiple providers with the probability of needing to work across both.
(Hmmm, it occurs to me that a quick spot of GRDDL-ing might make that less painful than it appears... Watch this space.)
The OA way of the present and future is for researchers to deposit their articles in their own Institutional Repositories.
Is this the one true OA way? I'm not convinced. Let's focus on what is important, the 'open' and the 'access' - and let the way of the future determine itself based on what actually helps to achieve those aims.
Nature Precedings is a place for researchers to share pre-publication
research, unpublished manuscripts, presentations, posters, white
papers, technical papers, supplementary findings, and other scientific
documents. Submissions are screened by our professional curation team
for relevance and quality, but are not subjected to peer review. We
welcome high-quality contributions from biology, medicine (except
clinical trials), chemistry and the earth sciences.
Interesting. As one might expect, blog reaction is mixed... for example, the positive reception by David Weinberger draws some negative comment from those on the institutional repository side of the fence, who argue that repositories (despite the fact that they are largely empty!) already do all of this.
The announcement of Precedings echoes almost exactly the point I was trying to make in my talk at the JISC Repositories Conference and in subsequent posts - we need to stop thinking institutionally and instead develop or use naturally compelling services, such as Precedings, that position researchers directly in a globally social context.
Of course, it remains to be seen whether Nature have got Precedings right, but I think it is an interesting development that deserves close attention as it grows.
Yes, we can acknowledge our failure to put services in place that people find intuitively compelling to use by trying to force their use thru institutional or national mandates. But wouldn't it be nicer to build services that people actually came to willingly?
Stevan Harnad, in his response to Sally, notes that:
But if "compulsion" is indeed the right word for mandating self-archiving, I wonder whether Sally was ever curious about why publication itself had to be mandated by researchers' institutions and funders ("publish or perish"), despite its substantial benefits to researchers?
I don't consider myself a real researcher [tm] so I probably shouldn't comment but I've always assumed that "publish or perish" resulted at least as much from social pressure as from policy pressure. Self-archiving should be the same - it should be the expected norm because it is the obvious and intuitive thing for researchers to do to gain impact.
I'm pleased to say that it has got 'real' and now offers the ability to download the PPT or PDF file for each presentation, as well as making the slides available thru the embedded display facility. Nice.
Note that you have to manually enable this feature for any existing presentations in Slideshare - but doing so doesn't mean uploading the presentation again, just selecting a new tick box on the 'edit' page.
Tony Hirst asks "why doesn't JISC fund the equivalent of Scribd for the academic community?" in a post on the OUseful blog, to which one is tempted to ask, "why would they when such things already exist out on the Web?".
Of course, in reality there are good reasons why they might, partly because of the specific requirements of scholarly documents (as opposed to just any old documents) and partly because of assurances about persistence of services, quality assurance, and so on.
I'm minded to ask a different question. One that I've asked before on a number of occasions, not least in the context of the current ORE project, which is "why don't scholarly repositories look more like Scribd?". Why do we continue to develop and use digital library specific solutions, rather than simply making sure that our repositories integrate tightly with the main fabric of the Web (read Web 2.0)?
What does that mean? Essentially it means assigning 'http' URIs to everything of interest, using the HTTP protocol and content negotiation to serve appropriate representations of that stuff, using sitemaps to steer crawlers to the important information, and using JSON to surface stuff flexibly in other places.
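The sitemap step, for instance, takes only a few lines: list the URIs of the items a "repository" (or any other system) exposes in a sitemaps.org urlset, and ordinary Web crawlers will find them. A sketch with hypothetical URIs.

```python
# Build a sitemap.xml from the item URIs a system exposes on the Web.
import xml.etree.ElementTree as ET

SITEMAP = "http://www.sitemaps.org/schemas/sitemap/0.9"

def sitemap(uris):
    """Render a list of item URIs as a sitemaps.org urlset document."""
    ET.register_namespace("", SITEMAP)
    urlset = ET.Element("{%s}urlset" % SITEMAP)
    for uri in uris:
        url = ET.SubElement(urlset, "{%s}url" % SITEMAP)
        ET.SubElement(url, "{%s}loc" % SITEMAP).text = uri
    return ET.tostring(urlset, encoding="unicode")

doc = sitemap(["http://collections.example.org/items/1",
               "http://collections.example.org/items/2"])
```

Nothing here is digital-library specific, which is rather the point: the same file works for a repository, a content management system, or a photo-sharing site.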
By the way, Tony also asks whether there is any sort of cross-search of UK repositories available, to which the answer is that JISC are funding Intute to develop such a thing (a development of the previous ePrints UK project I think). And there are the global equivalents such as OAIster.
A blog about the Web, cloud infrastructure, linked data, big data, open access, digital libraries, metadata, learning, research, government, online identity, access management and anything else that takes our fancy by Pete Johnston and Andy Powell.