July 09, 2009

e-Framework - time to stop polishing guys!

The e-Framework for Education and Research has announced a couple of new documents, the e-Framework Rationale and the e-Framework Technical Model, and has invited the community to comment on them.

In looking around the e-Framework website I stumbled on a definition for the 'Read' Service Genre. Don't know what a Service Genre is? Join the club... but for the record, they are defined as follows:

Service Genres describe generic capabilities expressed in terms of their behaviours, without prescribing how to make them operational.

The definition of Read runs to 9 screens' worth of fairly dense text in my browser window, summarised as:

Retrieve a known business object from a collection.

I'm sorry... but how is this much text of any value to anyone? What is being achieved here? There is public money (from several countries) being spent on this (I have no way of knowing how much) with very, very little return on investment. I can't remember how long the e-Framework activity has been going on but it must be of the order of 5 years or so? Where are the success stories? What things have happened that wouldn't have happened without it?

When you raise these kinds of questions, as I did on Twitter, the natural response is, "please take the time to comment on our documents and tell us what is wrong". The trouble is, when something is so obviously broken, it's hard to justify taking time to comment on it. Or as I said on Twitter:

i'm sorry to be so blunt - i appreciate this is people's baby - but you're asking the community to help polish a 5 year old turd

it's time to kick the turd into the gutter and move on

(For those of you that think I'm being overly rude here, the use of this expression is reasonably common in IT circles!)

Of course, one is then asked to justify why the e-Framework is a 'turd' :-(.

For me, the lack of any concrete value speaks for itself. There comes a time when you just have to bite the bullet and admit that nothing is being achieved.  Trying to explain why something is broken isn't necessary - it just is! The JISC don't even refer to the e-Framework in their own ITTs anymore (presumably because they have given up trying to get projects to navigate the maze of complex terminology in order to contribute the odd Service Usage Model (SUM) or two). It doesn't matter... there are very few Service Usage Models anyway, and even fewer Service Expressions. In fact, as far as I can tell the e-Framework consists only of a half-formed collection of unusable 'service' descriptions.

So, how come this thing still has any life left in it?

July 02, 2009

Investigating the "Scott Cantor is a member of the IEEE problem"

The UK Access Management Federation and other similar initiatives worldwide provide a SAML-based single sign-on solution for access to online resources for the education and research community.  Typically, a user must sign-on to their home institution, using their local username and password, before being granted access to a remote online resource.  In the main, this prevents the user from having to remember a separate username and password for each online resource that they wish to access.  However, there is a perceived problem that some users have several affiliations (their university, their employer, the NHS, their professional body, etc.), each of which may grant access to a different set of online resources, and that, currently, online services are not able to make seamless decisions about which resources a given user is entitled to access because they lack knowledge about these multiple affiliations.

We have recently funded Simon McLeish at LSE to undertake an investigation into this area, commonly known as the Scott Cantor is a member of the IEEE problem. (Scott Cantor is lead developer of the Shibboleth software and an editor of the SAML 2.0 specification).  This investigation will try to discover the extent of this problem in UK HE - who is affected, how serious stakeholders perceive it to be, and what is expected from a solution - in order to inform future work in this area.

More information about this study can be found thru the project's Wiki.  As usual, the final report will be made openly available to the community under a Creative Commons licence.

July 01, 2009

RESTful Design Patterns, httpRange-14 & Linked Data

Stefan Tilkov recently announced the availability of the video of a presentation he gave a few months ago on design patterns (& anti-patterns) for REST. I recommend having a look at it, as it covers a lot of ground and has lots of useful examples, and I find his presentational style strikes a nice balance of technical detail and reflection. If you haven't got time to listen, the slides are also available in PDF (though I do think hearing the audio clarifies quite a lot of the content).

One of the questions that this presentation (and other similar ones) planted at the back of my mind is that of how some of the patterns presented might be impacted by the W3C TAG's httpRange-14 resolution and the Cool URIs conventions for distinguishing between what it calls "real world objects" and "Web documents", some of which describe those "real world objects". The Cool URIs document focuses on the implications of this distinction on the use of the HTTP protocol to request representations of resources, using the GET method, but does not touch on the question of whether/how it affects the use of HTTP methods other than GET.

In the early part of his presentation, Stefan introduces the notion of "representation" and the idea that a single resource may have multiple representations. Some of the resources referred to in his examples, like "customers" (slide 16 in the PDF; slide 16 in the video presentation), when seen from the perspective of the Cool URIs document, fall, I think, into the category of "real world objects" - things which may be described (by distinct resources) but are not themselves represented on the Web. So, following the Cool URIs guidelines, the URI of a customer would be a "hash URI" (URI with fragment id) or a URI for which the response to an HTTP GET request is a 303 redirect to the (distinct) URI of a document describing the customer.
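To make the "hash URI" option concrete, here is a minimal sketch in Python. The fragment part of a URI is not sent to the server, so a client dereferencing a hash URI actually requests the document URI. The customer URI below is made up for illustration.

```python
from urllib.parse import urldefrag

# Hypothetical hash URI identifying a customer (a "real world object").
customer_uri = "http://example.org/docs/customers/123#customer"

# A client strips the fragment before making the HTTP request,
# so what it retrieves is the document that describes the customer.
document_uri, fragment = urldefrag(customer_uri)

print(document_uri)  # http://example.org/docs/customers/123
print(fragment)      # customer
```

With the 303 approach, the same separation is achieved by the server redirecting rather than by the client dropping the fragment.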

But what about non-"read-only" interactions, and using methods other than GET? The third "design pattern" in the presentation is one for "resource creation" (slide 55 in the PDF; slide 98 in the video presentation). Here a client POSTs a representation of a resource to a "collection resource" (slide 50 in the PDF; slide 93 in the video presentation). The example of a "collection resource" used is a collection of customers, with the implication, I think, that the corresponding "resource creation" example would involve the posting of a representation of a customer, and the server responding 201 with a new URI for the customer.

I think (but I'm not sure, so please do correct me!) that the implication of the httpRange-14 resolution is that in this example, the "collection resource", the resource to which a POST is submitted, would be a collection of "customer descriptions", and the thing posted would be a representation of a customer description for the new customer, and the URI returned for the newly created resource would be the URI of a new customer description. And a GET for the URI of the description would return a representation which included the URI of the new customer.

[Diagram: Restcool]

(In the diagram above, http://example.org/customers/123 is the URI of a customer; http://example.org/docs/customers/123 is the URI of a document describing that customer.)

And, finally, a GET for the URI of the customer (assuming it isn't a "hash URI") would - following the Cool URIs conventions - return a 303 redirect to the URI of the description.
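The whole interaction can be sketched as a toy in-memory model in Python. This is only my reading of how the pattern and the Cool URIs conventions might fit together, not anything taken from Stefan's slides: the class, the "about" key and the choice to start numbering at 123 are assumptions made for illustration, and no real HTTP is involved.

```python
from itertools import count

BASE = "http://example.org"


class CustomerService:
    """Toy model: responses are (status, headers, body) tuples."""

    def __init__(self):
        self._descriptions = {}   # description URI -> stored representation
        self._ids = count(123)    # start at 123 to match the URIs used above

    def post(self, uri, representation):
        # POST is accepted only on the collection of customer *descriptions*.
        if uri != f"{BASE}/docs/customers":
            return 405, {}, None
        cid = next(self._ids)
        doc_uri = f"{BASE}/docs/customers/{cid}"
        # The stored description includes the URI of the customer it is about.
        self._descriptions[doc_uri] = dict(
            representation, about=f"{BASE}/customers/{cid}"
        )
        return 201, {"Location": doc_uri}, None

    def get(self, uri):
        if uri in self._descriptions:
            return 200, {}, self._descriptions[uri]
        # A customer URI names a "real world object": redirect to its description.
        doc_uri = uri.replace(f"{BASE}/customers/", f"{BASE}/docs/customers/", 1)
        if doc_uri in self._descriptions:
            return 303, {"Location": doc_uri}, None
        return 404, {}, None


service = CustomerService()
status, headers, _ = service.post(f"{BASE}/docs/customers", {"name": "Example Customer"})
doc_uri = headers["Location"]
print(status, doc_uri)                        # 201 and the URI of the new description
print(service.get(doc_uri))                   # 200 with a body naming the customer URI
print(service.get(f"{BASE}/customers/123"))   # 303 pointing at the description
```

Note that in this sketch nothing is ever POSTed to, or stored under, the customer URI itself; it only ever appears as a value inside a description and as the source of a redirect.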

There is some discussion of this in a short post by Richard Cyganiak, and I think the comments there bear out what I'm suggesting here, i.e. that POST/PUT/DELETE are applied to "Web documents" and not to "real-world objects".

The comment by Leo Sauermann on that post refers to the use of a SPARQL endpoint for updates - the SPARQL Update specification certainly addresses this area. It talks in terms of adding/deleting triples to/from a graph, and adding/deleting graphs to/from a "graph store". I think the "adding a graph to a graph store" case is pretty close to the requirement that is being addressed by the "post representation to Collection Resource" pattern. But I admit I struggle slightly to reconcile the SPARQL Update approach with Stefan's design pattern - and indeed, he highlights the "endpoint" notion, with different methods embedded in the content of the representation, as part of one of his "anti-patterns", their presence typically being an indicator that an architecture is not really RESTful.

I should emphasise that I'm trying to avoid seeming to adopt a "purist" position here: I recognise that "RESTfulness" is a choice rather than an absolute requirement. However, interest in the RESTful use of HTTP has grown considerably in recent years (to the extent that some developers seem keen to apply the label "RESTful", regardless of whether their application meets the design constraints specified by the architectural style or not). And now the "linked data" approach - which of course makes use of the httpRange-14 conventions - also seems to be gathering momentum, not least following the announcement by the UK government that Tim Berners-Lee would be advising them on opening up government data (and his issuing of a new note in his Design Issues series focussed explicitly on government data). It seems to me it would be helpful to be clear about how/where these two approaches intersect, and how/where they diverge (if indeed they do!). Purely from a personal perspective, I would like to be clearer in my own mind about whether/how the sort of patterns recommended by Stefan apply in the post-httpRange-14/linked data world.

June 29, 2009

Making the UK Federation usable

About a year ago, in Bye, bye Athens... hello UK Federation, I questioned the rather grand claims being made about the then new UK Access Management Federation for Education and Research, notably that it would deliver "improved services to users", and wondered what the reality would be like.

I think we're still waiting to find out to be honest but there doesn't yet seem to be much evidence that anything has really improved over what we had before - certainly not in terms of usability for the end-user!

Last week I attended a meeting set up by the JISC-funded Service Provider Interface Study project, looking specifically at usability issues within the federation as things currently stand.

Firstly... hats off to both the project team and JISC for being so open about the issues. The meeting was a real eye-opener (for me at least), not only in that it demonstrated just how poor usability is across all the players that make up the federation, but also in the realisation that, actually, most service provider access control is done via IP address checking rather than by SAML-based authentication, in part because the usability issues are so great. For most users, SAML only comes into play when they are off-site (at least according to the service providers present at the meeting). Note: I appreciate that this was also the case under the old Athens system... I mention it here only because it seems to me that the continued use of IP address checking hasn't been widely acknowledged in the way the federation is generally presented, so it came as something of a surprise (to me at least).

Usability problems hit almost every aspect of the Federation as it is currently deployed, from the point that the end-user is initially asked to sign-on right thru to the ways in which service provider services are personalised (or not). Overall usability is made worse because the end-to-end experience is distributed across several players - the service provider, the identity provider, and (optionally) a 'where are you from?' service - each of which can, and does, make different decisions about naming and design. The result is a confusing experience for the end-user: poor usability of the individual components in the system is compounded by a lack of consistency between the different parts, leading to a situation where it must be near impossible, for example, to write user-support documentation (i.e. help pages) covering everything in a comprehensive form. Even trivial issues, such as what 'sign-on' is called and where it is positioned on the page, are handled differently by different players.

It seems to me that privacy and security issues have driven much of the thinking behind the Federation in its current form. At one point I asked the meeting whether anyone had actually asked real end-users whether they would like to have the option of sharing more information between the identity provider and the service provider in order to enjoy a more seamless and usable experience overall (even if it theoretically compromised their privacy in some way)? I'm not sure I got a clear answer... but it is hard not to draw the conclusion that the Federation has been designed by people with a primary interest in the technology rather than the user experience. A bit like the 'good old days' when we let the techies have full control over firewall policies, disregarding the fact that people actually had jobs that they needed to get done.

I'm sorry if all this seems very blunt but the current deployments are so un-friendly that something has got to be done - otherwise we might as well just bite the bullet and go back to having separate login accounts for every service we access (something that most people are perfectly accustomed to these days anyway!).

So... I want to focus on two, related, aspects of usability for the remainder of this post: naming the authentication process and discovering the identity provider.

Firstly... what do you call the process by which you gain access to a service provider in a SAML-based world? How are things 'branded' if you like? This is a non-trivial question to answer but a great example of how largely technical considerations (like the need for federations) have been allowed to get in the way of user-oriented usability issues. It's also something that the OpenID crowd have got cracked. That's not to say that there aren't other problems with OpenID - there are - but at least they have a single global brand (and associated logo) which makes it easy for any user, anywhere in the world to realise when they are being asked to sign-in using their OpenID.

What's the equivalent brand in the SAML world? There isn't one. Nobody pushes the use of a "SAML sign-on" (quite rightly in my opinion) since it would be meaningless to people. Shibboleth, as I've argued before, names a particular bit of software rather than an approach, and so is inappropriate. Some service providers in the UK still use 'Athens' (because it's what end-users are used to!) - again, clearly wrong in a SAML world. That leaves branding at the level of the federation... but who on earth wants to refer to their "UK Access Management Federation login" - what a horrible mouthful that is. And remember... most service providers offer their services globally, so if we brand things at the federation level then service providers have to start asking their users which federation they are part of - something that I suspect most of us neither know nor care!

That brings us on to my second usability issue. In a SAML-based world, service providers have to direct the end-user back to their institution in order that they can sign-in using their institutional username and password before being redirected back to the target service. This is typically done using a 'where are you from' service, either stand-alone on the network or embedded into the service provider website. Typically, this involves the end-user selecting from a pull-down list of identity providers (there are over 500 in the UK Federation currently), optionally preceded by a pull-down list of possible federations.  This is horrible.

I'd like to propose a new rule of thumb for the design of user-interfaces in a SAML world... if we ever have to explicitly ask the end-user to choose from a list which federation they are part of, then we have a totally borked approach and we need to do something different. This seems obvious to me - yet it is exactly the direction we are heading in right now :-( .

I'd actually go much further and say that if we ever have to explicitly ask the end-user to tell us which institution they are a member of just so they can sign-in to something, then we have similarly broken the system - but I appreciate that is a more difficult part of the process to remove given the current technology. We can, however, make the selection of the institution rather easier than scrolling thru a list of 500 (or 1000, or 5000) names. How about looking at the way TheTrainLine let you select a station name? How about using the jQuery Auto-Complete function to narrow down the list of available names as the end-user begins to type? Here's a demo of just that. Much more intuitive than a pull-down list.  (Thanks to my colleague, Mike Edwards, for the sample code to build this, based on the JISC "what do we do?" page.)
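For anyone wondering what "narrowing down the list" amounts to, here is a rough sketch of the filtering logic in Python. It is not the code behind the demo (which uses jQuery), and the function name and sample institution list are invented for illustration; a real list would presumably be drawn from federation metadata.

```python
def suggest(typed, names, limit=10):
    """Return up to `limit` names containing every word fragment typed so far."""
    fragments = typed.casefold().split()
    if not fragments:
        return []
    matches = [n for n in names if all(f in n.casefold() for f in fragments)]
    # Names that start with the first typed fragment come first, then A-Z.
    matches.sort(
        key=lambda n: (not n.casefold().startswith(fragments[0]), n.casefold())
    )
    return matches[:limit]


# Hypothetical sample of identity provider names.
IDENTITY_PROVIDERS = [
    "University of Bath",
    "University of Bristol",
    "Bath Spa University",
    "London School of Economics",
    "University of Oxford",
]

print(suggest("bath", IDENTITY_PROVIDERS))
print(suggest("uni b", IDENTITY_PROVIDERS))
```

Matching on fragments anywhere in the name, rather than only at the start, matters here because so many institution names begin with "University of".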

It'll be interesting to see what recommendations the Service Provider Interface Study project comes up with.  Here are mine.  Let's stop thinking in terms of asking the end-user what federation they belong to and think instead of questions they are likely to know the answers to.  What is the name of the institution you belong to? What's the URL of your institutional website? What country are you in?  Let's make the machines do the hard work of sorting out which federation is relevant.  In short, let's start building user-interfaces, no... scrub that... let's start designing the system as a whole such that usability and the needs of the end-user are put first rather than being tacked on as an after-thought!
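As a very rough illustration of "making the machines do the hard work", the Python sketch below maps something the end-user does know (the URL of their institutional website) to a federation. The registry contents, the domains and the second federation name are all made up; I'm assuming such a table could be built from metadata that federations already publish.

```python
from urllib.parse import urlsplit

# Hypothetical registry: institutional domain -> (institution, federation).
REGISTRY = {
    "example.ac.uk": ("Example University (UK)", "UK Access Management Federation"),
    "example.edu": ("Example University (US)", "Example US Federation"),
}


def find_federation(website_url):
    """Return (institution, federation) for a website URL, or None if unknown."""
    if "://" not in website_url:
        website_url = "//" + website_url   # let urlsplit treat bare host names as hosts
    host = urlsplit(website_url).hostname or ""
    labels = host.split(".")
    # Try the full host name first, then progressively shorter parent domains.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in REGISTRY:
            return REGISTRY[candidate]
    return None


print(find_federation("http://www.example.ac.uk/library"))
print(find_federation("www.example.edu"))
print(find_federation("http://unknown.example.org/"))
```

The end-user never sees the word "federation" at all; it is simply a value the system looks up.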

Finally... I'm surprised that publishers, and other service providers, aren't driving this much more forcibly than they appear to have done to date.  There was a strong feeling in the meeting that things have got much worse (in usability terms) since the demise of Athens - yet the publishers present seemed rather defeatist about what they could do about it.  Every time the usability of the system breaks, a service provider somewhere stands to lose a customer and while they are not typically paying for content individually it ultimately all adds up.  Publishers should be pressing the UK Federation (and all other federations) for a system that works end-to-end, not just because of their own self-interests, but because of the interests of their primary user-base.  I also think that they should be working much more closely together to bring greater consistency to the way that SAML-based sign-on is presented to the end-user.

June 25, 2009

Twitter for idiots

I'm just back from giving a 30 minute "Twitter for idiots" tutorial for one of our senior management team here at Eduserv.  Note that the title isn't intended to be offensive - in fact, he chose it - but it certainly sums up the level of what I had to say.  It reminds me that yesterday I tweeted rather negatively about the fact that CILIP are offering a Twitter for Librarians training course:

good grief... do #cilip really run a twitter course? - http://tinyurl.com/mxabo3 - speaks volumes methinks

Phil Bradley, who is running the course, quite rightly came back at me with a challenge to explain what, and who, it "speaks volumes" about.

So... two things. Firstly, it was an off the cuff remark - essentially a joke - but like all such things I guess there is a serious point behind it. The idea of running a half-day course to teach people how to tweet just struck me as funny! It's an anachronism. In that sense, it says something about both the library community and CILIP I guess. Paying to sit in a room in order to find out how to create "a good, rounded and effective Twitter profile", for example, smacks of a '1980s-style mainframe user-support application training programme' mentality that just doesn't sit comfortably with the way the Web works today. IMHO.

That doesn't mean that there aren't learning needs and opportunities around our use of Twitter by the way, I think there probably are, but I also think that people have to get Twitter before even thinking about such things and I'm not totally sure that you can teach people to get Twitter? People get Twitter by using it.

Secondly (and very much related to the last point), there is a visitors vs. residents issue here (to borrow David White's categorisation of online users). Twitter is a tool for residents. It's about people being immersed. It's about people "living a percentage of their life online". When visitors get hold of Twitter they see it as a tool to get a job done when the need arises - to push out an occasional marketing message for example. This is when things have the potential to go badly wrong (as seen recently with Habitat's use of Twitter). Again, the real issue here is whether you can teach/train visitors to become residents.

Note that I am not using the resident vs visitor divide in a judgemental way here. I'm happy to accept that the world is split into two types of people (no, not those who divide the world into two types of people and those who don't!) and I'm happy to accept that both approaches to the world are perfectly valid. But they are different approaches and I don't know how often people cross from one to the other, nor whether such changes come as the result of attending a course or workshop?

June 24, 2009

The lifecycle of a URI

Via a post to the W3C Linked Open Data mailing list recently, I came across a short document by David Booth on The URI Lifecycle in Semantic Web Architecture.

It's particularly interesting, I think, because it highlights that different agents have varying relationships to, or play various roles with respect to, a URI, and those different roles bring with them varying responsibilities for maintaining the stability of the URI as an identifier. And it is the collective action of these different parties that serves to preserve that stability (or not).

The foundations of the principles articulated here are, of course, those presented in the W3C's Architecture of the World Wide Web. But David's document amplifies these base guidelines by emphasising both the social and the temporal dimensions of URI "minting", use (both by authors/writers of data and consumers/readers of data), and, potentially, obsoleting.

As the introduction emphasises, the lifecycle of a URI is quite distinct from that of the resource identified by that URI:

a URI that denotes the Greek philosopher Plato may be minted long after Plato has died. Similarly, one could mint a URI to denote one's first great-great-grandson even though such a child has not been conceived yet.

For David, a key part of the "minting" stage is the publication of what he calls a URI declaration to be accessible via the "follow-your-nose" mechanisms of the Web and the Cool URIs conventions. It is this which forms the basis of a "social expectation that the URI will be used in a way that is consistent with its declaration"; it "anchors the URI's meaning". (More precisely, the document refers to the "core assertions" of such a declaration.)

An author using/citing that URI in their data is then responsible for using that URI in ways which are consistent with the owner's URI declaration. And a consumer of that data should make an interpretation of the URI consistent with the declaration. However, there is a temporal aspect to these actions: a URI declaration may be changed in the period between the point an author cites a URI and the point at which a consumer reads/processes that data, and in that case the author's implied commitment is to the declaration at the time their statement was written, and the consumer may also choose to select that specific version of the declaration over the most recent one.
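The temporal aspect can be sketched in a few lines of Python. This is only an illustration of the idea of choosing the declaration that was in force when a statement was written; the history, dates and function name are hypothetical and nothing here comes from David's document.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical history of one URI's declaration, oldest first:
# (date published, core assertions).
DECLARATION_HISTORY = [
    (date(2005, 1, 1), "version 1 of the declaration"),
    (date(2008, 6, 1), "version 2 of the declaration"),
]


def declaration_as_of(history, when):
    """Return the declaration in force on `when`, or None if none existed yet."""
    published_dates = [published for published, _ in history]
    index = bisect_right(published_dates, when)
    return history[index - 1][1] if index else None


# An author wrote their statement in 2007; a consumer reads it in 2009.
print(declaration_as_of(DECLARATION_HISTORY, date(2007, 3, 1)))  # version 1 of the declaration
print(declaration_as_of(DECLARATION_HISTORY, date(2009, 7, 1)))  # version 2 of the declaration
```

A consumer wanting to honour the author's implied commitment would use the first result rather than the second, which of course assumes that earlier versions of the declaration remain available somewhere.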

In the discussion of the document on the public-lod mailing list, there's some analysis of what happens when actors do not meet such expectations or discharge these responsibilities, and indeed to what extent the existence of these expectations and responsibilities leaves room for flexibility. Dan Brickley describes the case of the "semantic drift" of a FOAF property called foaf:schoolHomepage, where the URI owners' initial intent was that this referred to the home pages of the institutions UK residents know as "schools" (i.e. the institutions you typically attend up to the age of 16 or 18), but which authors from outside the UK interpreted as having a broader scope (one in which the notion of "schools" included, e.g., universities) and deployed in that way in their data. As a consequence, the "URI declaration" was updated to reflect the community's interpretation.

There is a tension here between "nailing down" meaning and allowing for the sort of "evolution" that takes place in human languages, and the scope for accommodating this sort of variability was, I think, perhaps the main point of contention in the discussion. In conclusion, David emphasises:

The point of a URI declaration is not to forbid semantic variability, it is to permit the bounds of that variability to be easily prescribed and determined.

All in all, it's a clear, thoughtful document which addresses several complexities in our management of URI stability over time in a social Web.

June 23, 2009

Virtual World Watch publishes new Snapshot report

Yesterday, John Kirriemuir announced the publication by the Virtual World Watch project of a new issue of the "snapshot" survey reports he has been collating covering the use of virtual worlds in UK higher and further educational institutions.

In his introductory section, John highlights a couple of points:

  • In terms of subject areas, the health and medical science sector appears to be developing a high profile in terms of its use of virtual worlds. I've noticed this from my own fairly cursory tracking of activity via mailing lists and weblogs. I was slightly surprised that some of this functionality (simulations etc) isn't covered by existing software applications, but there seems to be a gap which - in some cases at least - is being addressed through the use of virtual worlds.
  • Although some technical challenges remain, in comparison with previous surveys, reports of technical obstacles to the use of virtual worlds software are diminishing. John attributes this to the dual influence of growing institutional support in some cases and unsupported individuals abandoning their efforts in others. My own occasional experience of using Second Life (which John notes remains "the virtual world of choice" in UK universities and colleges) has been that the platform seems vastly more stable than it was a couple of years ago when John embarked on these surveys - though ironically last weekend saw one of the most widespread and prolonged disruptions that I can recall in a long time.

As a footnote, I'd highlight John's point that for the next survey he is placing more emphasis on gathering information in-world, both in Second Life and in other virtual worlds. It'll be interesting to see how well this works out, as I have to admit I find the in-world discovery and communication tools somewhat limited, and I find myself relying heavily on Web-based sources (weblogs, microblogging services, Flickr, YouTube etc) to find resources of interest (and get rather frustrated when I come across interesting in-world resources that aren't promoted well on the Web!).

Anyway, as with previous installments, the report provides a large amount of detail and insights into what UK educators are doing in virtual worlds and what they are saying about their experiences.

June 22, 2009

Influence, connections and outputs

Martin Weller wrote an interesting blog post on Friday, Connections versus Outputs - interesting in the sense that I strongly disagreed with it - that discussed a system for assessing an individual's "prominence in the online community of their particular topic" by measuring their influence, betweenness and hubness (essentially their 'connectedness' to others in that community). Martin had used the system to assess the prominence of people and organisations working in the area of 'distance learning', suggesting that it might form a useful basis for further work looking at metrics for the new forms of scholarly communication that are enabled by the social Web. The algorithm adopted by the system was not available for discussion so one was left reacting to the results it generated.

I reacted somewhat negatively, largely on the basis that the system ranked Brian Kelly's UK Web Focus blog 6th most influential in that particular subject area. This is not a criticism of Brian (who is clearly influential in other areas), but the fact remains that Brian's blog contains only three posts where the phrase 'distance learning' appears, two of which are in comments left by other people and one of which is in a guest post - hardly indicative of someone who is highly influential in that particular subject area?

In passing, I note that Brian has now also commented on this and Martin has written a follow-up post.

Why does Brian's blog appear in the list? Probably because he is very well connected to people who do write about distance learning. Unfortunately, that connectedness is not sufficient, on its own, to draw conclusions about his level of influence on that particular topic, so the whole process breaks down very quickly.
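A toy example in Python shows how easily this can happen. To be clear, this is not the algorithm used by the system Martin describes (which wasn't available); it is a deliberately crude measure of connectedness (a count of inbound links) applied to invented data.

```python
from collections import Counter

# Hypothetical data: (source, target) links between four blogs,
# and the number of posts each has written on the topic in question.
links = [
    ("A", "hub"), ("B", "hub"), ("C", "hub"),
    ("A", "B"), ("C", "A"),
]
posts_on_topic = {"A": 40, "B": 25, "C": 30, "hub": 0}

# Connectedness here is simply the number of inbound links.
inbound = Counter(target for _, target in links)

by_connectedness = sorted(posts_on_topic, key=lambda name: -inbound[name])
by_output = sorted(posts_on_topic, key=lambda name: -posts_on_topic[name])

print(by_connectedness)  # ['hub', 'A', 'B', 'C']
print(by_output)         # ['A', 'C', 'B', 'hub']
```

The blog that everyone writing on the topic links to comes top on connectedness and bottom on output, which is more or less the situation described above.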

My concern is that if we present these kinds of rather poor metrics in any way seriously in counterpoint to more traditional (though still flawed) metrics like the REF we will ultimately do harm in trying to move forward any discussion around the assessment of scholarly communication in the age of the social Web.

To cut a long story short (you can see my fuller comments on the original post) I ended by suggesting that if we really want to develop "some sensible measure of scholarly impact on the social Web" then we have to step back and consider three questions:

  • what do we want to measure?
  • what can we measure?
  • how can we bring these two things close enough together to create something useful?

To try and answer my own questions I'll start with the first. I suggest that we want to try and measure two aspects of 'impact':

  • the credibility of what an individual has to say on a topic,
  • and the engagement of an individual within their 'subject' community and their ability to expose their work to particular audiences.

These two are clearly related, at least in the sense that someone's level of engagement in a community (their connectedness if you like) clearly increases the exposure of their work but is also indicative of the credibility they have within that community.

Having said that, my gut feeling is that credibility, at least for the purposes of scholarly communication, can only really be measured by some kind of a peer-review (i.e. human) process. Of course, on the Web, we are now very used to inferring credibility based on the weighted number of inbound links that a resource receives, not least in the form of Google's PageRank algorithm. This works well enough for mainstream Web searching but I wouldn't want it used, at least not at any trivial level, to assess scholarly credibility or impact. Why not? Well a few things immediately spring to mind...

Firstly, a link is typically just a link at the moment, whether it's a hyperlink between two resources or the link between people in a social network. The link carries no additional semantics. If paper A critiques paper B then we don't want the link between them to result in paper B being measured as having more credibility/impact than it otherwise would have done had the critique not been written.  (This is also true of traditional citations between journal articles of course, except that peer review mechanisms stop (most of) the real dross from ever seeing the light of day.  On the Web, everything is there to be cited.)
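The effect is easy to demonstrate with a simplified, textbook-style version of the PageRank iteration in Python. This is a sketch under my own assumptions (a damping factor of 0.85, pages with no outbound links sharing their rank with everyone, three invented papers); it is certainly not Google's actual implementation.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank over a dict {page: [pages it links to]}."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page in pages:
            # Pages with no outbound links spread their rank over every page.
            targets = links.get(page) or list(pages)
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical papers: C cites B. In the second graph A also links to B,
# but only in order to criticise it. The algorithm cannot tell the difference.
without_critique = {"A": [], "B": [], "C": ["B"]}
with_critique = {"A": ["B"], "B": [], "C": ["B"]}

print(pagerank(with_critique)["B"] > pagerank(without_critique)["B"])  # True
```

Paper B's score goes up when the critique is published, simply because a link is a link.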

Secondly, if we just consider blogging for a moment, the way a blog is written will have a big impact on how people react to it and that, in turn, might affect how we measure it. Blogs written in a more 'tabloid' style for example might well result in more commenting or inbound links than those written in a more academic style. We presumably don't want to end up measuring scholarly impact as though we were measuring newspaper circulation?

Thirdly, any metrics that we choose to use in the future will ultimately influence the way that scholarly communication happens. Take blog comments for example. A comment is typically not a first class Web object - comments don't have URIs for example. One can therefore make the argument that writing a comment on someone else's blog post is less easily measurable than writing a new blog post that cites the original. One might therefore expect to see less commenting and more blog-post writing (under a given set of metrics). While this isn't necessarily a bad thing, it seems to me that our behaviour should be driven by what works best for 'scholarly communication' not by what can be most easily 'measured'.

As I said in my first comment on Martin's post, "connectedness is cheap". On that basis, we have to be very careful before using any metrics that are wholly or largely based on measures of connectedness. The point is that the things we can measure easily (to return to the second part of my question above) are likely to be highly spammable (i.e. they can be gamed, either intentionally or by accident). Yes, OK... all measures are spammable, but some are more spammable than others! If we want to start assessing academics in terms of their engagement and output as part of the social Web then I think we need to start by answering my questions above rather than by showcasing rather poor examples of what can be automated now, except as a way of saying, "look, this is hard"!

The rise of green?

I attended the Terena Networking Conference 2009 in Malaga a couple of weeks ago where several of the keynote talks focused on the environment, global warming, the impact that data centres and ICT more generally have on this, and the potential for cloud-based solutions to help.  The talks were all really interesting actually, though I must confess I was slightly confused as to why they appeared so heavily in that particular conference. I particularly liked Bill St. Arnaud's suggestion that facilities powered by sources of renewable energy (wind, wave or solar for example) will be subject to periods of non-availability, meaning that network routing architectures will have to be devised to move compute and storage resources around the network dynamically in response.

We've just announced our new Data Centre facility in Swindon and I initially commented (internally) that I felt the environmental statement to be a little weak. My interest is partly environmental (I want the organisation I work for to be as environmentally neutral as possible) and partly business-related (if the ICT green agenda is on the rise then one can reasonably expect that HEI business decisions around outsourcing will increasingly be made on the back of it). On that basis, I want Eduserv's messaging on environmental issues to be as transparent as possible. (I think this is true for any concerned individual working for any organisation - I'm not picking on my employer here).  It is worth noting that we now have an internal 'green team' with a remit to consider environmental issues across Eduserv as a whole.

Based on a completely trivial and unscientific sample of 4 'data centre'-related organisations in the UK - Edina, Eduserv, Mimas and ULCC - I make the following, largely unhelpful, observations...

  • It's marginally easier to find an accessibility statement than it is to find an environmental statement (not surprising I guess) though ULCC's Green Statement is quite prominent,
  • it's not easy for Joe Average (i.e. me!) to work out what the hell it all means in any practical sense.

On balance, and despite my somewhat negative comment above about its weakness, the fact that we are making any kind of statement about our impact on the environment is a step in the right direction.

June 19, 2009

Repositories and linked data

Last week there was a message from Steve Hitchcock on the UK jisc-repositories@jiscmail.ac.uk mailing list noting Tim Berners-Lee's comments that "giving people access to the data 'will be paradise'". In response, I made the following suggestion:

If you are going to mention TBL on this list then I guess that you really have to think about how well repositories play in a Web of linked data?

My thoughts... not very well currently!

Linked data has 4 principles:

  • Use URIs as names for things
  • Use HTTP URIs so that people can look up those names.
  • When someone looks up a URI, provide useful information.
  • Include links to other URIs, so that they can discover more things.

Of these, repositories probably do OK at 1 and 2 (though, as I’ve argued before, one might question the coolness of some of the http URIs in use and, I think, the use of cool URIs is implicit in 2).

3, at least according to TBL, really means “provide RDF” (or RDFa embedded into HTML I guess), something that I presume very few repositories do?

Given lack of 3, I guess that 4 is hard to achieve. Even if one was to ignore the lack of RDF or RDFa, the fact that content is typically served as PDF or MS formats probably means that links to other things are reasonably well hidden?

It’d be interesting (academically at least), and probably non-trivial, to think about what a linked data repository would look like? OAI-ORE is a helpful step in the right direction in this regard.
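As a rough illustration of principles 3 and 4, the following is a minimal sketch of the kind of RDF a repository might return when someone looks up an item's URI. It emits N-Triples using Dublin Core terms. Every URI under example.org, the function name `describe_item` and the choice of properties are assumptions made for illustration, not a description of what any existing repository platform does.

```python
DCTERMS = "http://purl.org/dc/terms/"


def escape_literal(text):
    """Escape backslashes and double quotes for an N-Triples string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def describe_item(item_uri, title, creator_uri, cited_uris):
    """Return N-Triples describing one repository item.

    The title gives "useful information" about the item (principle 3);
    the creator and references triples link to other URIs (principle 4).
    """
    lines = [
        f'<{item_uri}> <{DCTERMS}title> "{escape_literal(title)}" .',
        f"<{item_uri}> <{DCTERMS}creator> <{creator_uri}> .",
    ]
    for cited in cited_uris:
        lines.append(f"<{item_uri}> <{DCTERMS}references> <{cited}> .")
    return "\n".join(lines)


# Hypothetical item, author and cited paper.
description = describe_item(
    "http://repository.example.org/id/eprint/123",
    'A paper about "cool" URIs',
    "http://repository.example.org/id/person/42",
    ["http://other.example.org/id/eprint/7"],
)
print(description)
```

The point of the sketch is that the links to the author and to the cited work are explicit in the data, rather than buried inside a PDF.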

In response, various people noted that there is work in this area: Mark Diggory on work at DSpace, Sally Rumsey (off-list) on the Oxford University Research Archive and parallel data repository (DataBank), and Les Carr on the new JISC dotAC Rapid Innovation project. And I'm sure there is other stuff as well.

In his response, Mark Diggory said:

So the question of "coolness" of URI tends to come in second to ease of implementation and separation of services (concerns) in a repository. Should "Coolness" really be that important? We are trying to work on this issue in DSpace 2.0 as well.

I don't get the comment about "separation of services". Coolness of URIs is about persistence. It's about our long term ability to retain the knowledge that a particular URI identifies a particular thing and to interact with the URI in order to obtain a representation of it. How coolness is implemented is not important, so long as it doesn't impact on our long term ability to meet those two aims.

Les Carr also noted the issues around a repository minting URIs "for things it has no authority over (e.g. people's identities) or no knowledge about (e.g. external authors' identities)" suggesting that the "approach of dotAC is to make the repository provide URIs for everything that we consider significant and to allow an external service to worry about mapping our URIs to "official" URIs from various "authorities"". An interesting area.
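One way to picture the split Les describes is as a simple lookup held by the external service, turned into `owl:sameAs` statements. This is only a sketch under my own assumptions: I don't know that dotAC uses `owl:sameAs`, and the URIs and the function name `same_as_triples` below are hypothetical.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def same_as_triples(mapping):
    """Turn a {local URI: official URI} mapping into N-Triples statements."""
    return [
        f"<{local}> <{OWL_SAME_AS}> <{official}> ."
        for local, official in sorted(mapping.items())
    ]


# Hypothetical mapping maintained outside the repository.
author_mapping = {
    "http://repository.example.org/id/person/42":
        "http://authority.example.net/people/jane-smith",
}

for triple in same_as_triples(author_mapping):
    print(triple)
```

The repository keeps minting its own URIs for everything it considers significant, and the question of which "authority" URI each one corresponds to stays somebody else's problem.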

As I noted above, I think that the work on OAI-ORE is an important step in helping to bring repositories into the world of linked data. That said, there was some interesting discussion on Twitter during the recent OAI6 conference about the value of ORE's aggregation model, given that distinct domains will need to layer their own (different) domain models onto those aggregations in order to do anything useful. My personal take on this is that it probably is useful to have abstracted out the aggregation model, but that the hypothesis that primitive aggregation is useful, despite every domain needing its own richer data, is still to be tested and, indeed, that we need to see whether the way the ORE model gets applied in the field turns out to be sensible and useful.
