
July 31, 2009

Care in the community

I was at UKOLN's Institutional Web Management Workshop 2009 event at the University of Essex earlier this week to run a workshop session with Ed Barker and Simon Bradley (of SIRC) entitled Care in the community... how do you manage your Web content? The session, and the workshop more generally for that matter, went pretty well I think. We used our 90 minutes for a mix of presentation, a whirlwind tour by Simon of the major findings of the Investigation into the management of website content in higher education institutions that they've been undertaking on our behalf, and group discussion.

For the discussion groups we split people randomly into 3 groups to discuss a range of propositions based loosely on the findings of the investigation. The groups were asked to consider each proposition and to either agree with it or to offer an alternative version. They were then asked to write down 3 consequences (issues, actions or conclusions) that arose from their agreed proposition.

Sixteen propositions were available, inside sealed envelopes labelled with one of five broad topic areas:

  • The Web Team
  • Institutional Issues
  • CMS
  • End Users
  • The Future

Of the available propositions, 13 were discussed by the groups in the time available. Note that the propositions were chosen to stimulate discussion. They do not necessarily represent the views of Eduserv or SIRC. Perhaps more importantly, they should not be taken as a direct representation of the findings of the study.

The outputs from the group discussions are now available on Google Docs. The report of the investigation itself will be published on Thursday 6th August.

July 29, 2009

Enhanced PURL server available

A while ago, I posted about plans to revamp the PURL server software to (amongst other things) introduce support for a range of HTTP response codes. This enables PURLs to identify a wider range of resources than "Web documents", using the interaction patterns specified in the W3C TAG's httpRange-14 resolution and the Cool URIs for the Semantic Web document.

Lorcan Dempsey posted on Twitter yesterday to indicate that OCLC have deployed the new software, developed by Zepheira, on the OCLC purl.org service. I've only had time for a quick poke around so far, and I need to read the documentation more carefully to understand all the new features, but it looks like it does the job quite nicely.

This should mean that existing PURL owners who have used PURLs to identify things other than "Web documents" - like DCMI, who use PURLs such as http://purl.org/dc/terms/title to identify their "metadata terms" (conceptual resources) - should be able to adjust the appropriate entries on the purl.org server so that interactions follow those guidelines. It also offers a new option for those who wish to set up such redirects but don't have suitable levels of access to configure their own HTTP server to perform them.
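As a sketch of what the new response-code support enables (this is my illustration, not the actual purl.org or Zepheira implementation; the PURL entries and target URLs are invented):

```python
# Sketch (not the real purl.org code): how a PURL server might map
# PURL types to HTTP responses. Entries below are hypothetical.

PURLS = {
    # a "Web document": a plain redirect to its current location
    "/docs/report": ("document", "http://example.org/report.html"),
    # a non-document ("conceptual") resource, per httpRange-14:
    # answer 303 See Other, pointing at a document *about* it
    "/dc/terms/title": ("concept", "http://example.org/about/title.html"),
}

def resolve(path):
    """Return (status code, Location header) for a GET on a PURL path."""
    purl_type, target = PURLS[path]
    if purl_type == "concept":
        return (303, target)   # See Other: the URI names a concept
    return (302, target)       # Found: the URI names a document

print(resolve("/dc/terms/title"))
```

The point is simply that the owner of each PURL can now choose the response code, rather than being limited to a single redirect behaviour.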

July 23, 2009

Give me an R

My rather outspoken outburst against the e-Framework a while back resulted in a mixed response, most of it in private replies for various reasons. It was largely supportive, though some suggested that my negative tone was not conducive to encouraging debate. For what it's worth, I agree with this latter point - I probably shouldn't have used the language that I did - but sometimes you just end up in a situation where you feel like letting off steam. As a result, I didn't achieve any public debate about the value of the e-Framework and my negativity remains. I would genuinely like to be set straight on this because I want to understand better what it is for and what its benefits have been (or will be).

Since writing that post I have been thinking, on and off, about why I feel so negative about it. In the main I think it comes down to a fundamental change in architectural thinking over the last few years. The e-Framework emerged at a time when 'service oriented' approaches (notably the WS-* stack) were all the rage. The same can be said, to a certain extent, about the JISC Information Environment. (The JISC IE actually predated the rise of 'SOA' terminology but, like most digital library initiatives, it was certainly 'service oriented' in nature - I can clearly remember people using phrases like, "we don't expose the resource, we expose services on the resource".) I think that explains some of my discomfort when I look back at that work now.

Perhaps SOA is still all the rage in 'enterprise' thinking, which is where the e-Framework seems to be most heavily focused - I don't know (1). It's not a world in which I dabble. But in the wider Web world it seems to me that all the interesting architectural thinking (at the technical level) is happening around REST and Linked Data at the moment. In short, the architectural focus has shifted from the 'service' to the 'resource'.

So, it's just about fashion then? No, not really - it's about the way architectural thinking has evolved over the last 10 years or so. Ultimately, it seems more useful to think in resource-centric ways than in service-centric ways. Note that I'm not arguing that service oriented approaches have no role to play in our thinking at a business level - clearly they do, even in a resource oriented world. But I would suggest that if, at the outset, you adopt the technical architectural perspective that the world is service oriented, then it is very hard to think in resource oriented terms later on, and ultimately that is harmful to our use of the Web.

I think there are other problems with the e-Framework - the confusing terminology, the amount of effort that has gone into describing the e-Framework itself (as opposed to describing the things that are really of interest), the lack of obvious benefits and impact, and the heavyweight nature of the resulting ‘service’ descriptions – but it is this architectural difference that lies at the heart of my problem with it.

1) For what it's worth, I don't see why a resource oriented approach (at the technical level) shouldn't be adopted inside the enterprise as well as outside.

July 22, 2009

What's a tweet worth?

One of the successful aspects of Twitter is its API and the healthy third-party 'value-add' application environment that has grown up around it. This environment has seen the development not just of new clients but of all sorts of weird and wonderful, serious and trivial, applications for enhancing your Twitter experience.

In the good old days, third-party applications gave you the option of tweeting your followers about how wonderful you thought their shiny new application was.  The use of such an option was typically left to your discretion and no incentives were given to encourage you to do so - other than that you thought the information might be useful to those around you.  Such an approach kind of worked when we all had relatively low numbers of followers and there were relatively few apps.

More recently I've noticed a new 'business model' emerging on Twitter which can be summed up as, "spam all your followers with a single tweet about us and we'll reward you in some way".  The rewards vary but might include a free entry into a prize draw, or money off the full subscription rate for the application in question.

Unfortunately, in a twitterverse where lots of people follow lots of other people, every person's "single tweet" quickly turns into a "deluge of tweets" for those people who follow a reasonably large number of other twitterers.

One recent example of this (in my Twitter stream at least) was people's use of the #moonfruit hashtag in order to enter into a prize draw for a Macbook, leading to a collective series of tweets that quickly became very annoying.

More recently I've noticed a similar thing, though so far much less widespread, arising from the BackupMyTweets service. This service is somewhat more interesting than the Moonfruit example. For a start, it isn't as mainstream as Moonfruit (I don't suppose that most people give two hoots about whether their tweets are backed up or not!) and therefore hasn't given rise to the same level of problem. Conversely, being more academic in nature notionally gives people a more credible reason to tweet about it.

The trouble is... BackupMyTweets is offering one year's free subscription to their service if you send one tweet about them (and they offer a facility to make doing so very easy, with a stock set of phrases about how useful they are). One year's free service is worth US$10, so that's quite an incentive.

However, users of this facility (and others like it) need to remember that the real cost of tweeting about it (even if that tweet is intended genuinely) lies in the trust people place in their future tweets.  If I know that someone is willing to tweet about how good something is just because they are getting paid to do so, what does that tell me about their future recommendations?

Does one such tweet have any impact on someone's credibility?  No, of course not.  But if there is a trend towards this kind of thing (as I suspect there is) then it will become more of an issue.  This is particularly true where it is a more 'corporate' Twitter account sending the tweet (as, for example, the Institutional Web Management Workshop Twitter account did yesterday).  People follow such accounts on the basis that they want to keep up to date with an event or organisation - they don't want to see them used to send spam about other people's tools and services.

Assuming this trend continues I guess we'll soon start to see the addition of a 'block more tweets like this' button in Twitter clients, followed (presumably) by some kind of Twitter equivalent of an RBL (a real-time blackhole list)? Maybe I'm making a mountain out of a molehill here, though people probably thought the same in the early days of email spam. Remember, the only reason these kinds of approaches work is because we so easily fall into the trap of using them. The problem is ours and can be fixed by us.

July 21, 2009

Linked data vs. Web of data vs. ...

On Friday I asked what I thought would be a pretty straightforward question on Twitter:

is there an agreed name for an approach that adopts the 4 principles of #linkeddata minus the phrase, "using the standards (RDF, SPARQL)" ??

Turns out not to be so straightforward, at least in the eyes of some of my Twitter followers. For example, Paul Miller responded with:

@andypowe11 well, personally, I'd argue that Linked Data does NOT require that phrase. But I know others disagree... ;-)


@andypowe11 I'd argue that the important bit is "provide useful information." ;-)

Paul has since written up his views more thoughtfully in his blog, Does Linked Data need RDF?, a post that has generated some interesting responses.

I have to say I disagree with Paul on this, not in the sense that I disagree with his focus on "provide useful information", but in the sense that I think it's too late to re-appropriate the "Linked Data" label to mean anything other than "use http URIs and the RDF model".

To back this up I'd go straight to the horse's mouth, Tim Berners-Lee, who gave us his personal view way back in 2006 in his 'design issues' document on Linked Data. This gave us the 4 key principles of Linked Data that are still widely quoted today:

  1. Use URIs as names for things.
  2. Use HTTP URIs so that people can look up those names.
  3. When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL).
  4. Include links to other URIs, so that they can discover more things.

Whilst I admit that there is some wriggle room in the interpretation of the 3rd point - does his use of "RDF, SPARQL" suggest these as possible standards or is the implication intended to be much stronger? - more recent documents indicate that the RDF model is mandated. For example, in Putting Government Data online Tim Berners-Lee says (referring to Linked Data):

The essential message is that whatever data format people want the data in, and whatever format they give it to you in, you use the RDF model as the interconnection bus. That's because RDF connects better than any other model.

So, for me, Linked Data implies use of the RDF model - full stop. If you put data on the Web in other forms, using RSS 2.0 for example, then you are not doing Linked Data, you're doing something else! (Addendum: I note that Ian Davis makes this point rather better in The Linked Data Brand).
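Tim Berners-Lee's "interconnection bus" point can be illustrated even with hand-rolled triples. In this sketch the RDF model is reduced to plain (subject, predicate, object) tuples - no libraries, and all URIs and data are invented - just to show why data arriving in different formats merges naturally once it is expressed as statements about shared subject URIs:

```python
# Sketch of the "interconnection bus": data arriving in any source
# format is mapped to subject-predicate-object triples (the RDF model,
# hand-rolled here as Python tuples). All URIs below are illustrative.

def csv_row_to_triples(row):
    """Map a (hypothetical) CSV row to triples about one subject URI."""
    subject = "http://example.org/id/" + row["id"]
    return {
        (subject, "http://purl.org/dc/terms/title", row["title"]),
        (subject, "http://purl.org/dc/terms/creator", row["creator"]),
    }

def json_record_to_triples(rec):
    """Map a (hypothetical) JSON record to triples about the same subject."""
    subject = "http://example.org/id/" + rec["identifier"]
    return {(subject, "http://purl.org/dc/terms/date", rec["published"])}

# Two sources, one graph: statements about the same subject merge freely.
graph = csv_row_to_triples({"id": "123", "title": "On names", "creator": "Andy"})
graph |= json_record_to_triples({"identifier": "123", "published": "2009-07-20"})
print(len(graph))  # 3 statements, all hanging off the same subject URI
```

Whatever format the data came in, once it is triples it connects - which is the "connects better than any other model" claim in a nutshell.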

Which brings me back to my original question - "what do you call a Linked Data-like approach that doesn't use RDF?" - because, in some circumstances, adhering to a slightly modified form of the 4 principles, namely:

  1. use URIs as names for things
  2. use HTTP URIs so that people can look up those names
  3. when someone looks up a URI, provide useful information
  4. include links to other URIs, so that they can discover more things

might well be a perfectly reasonable and useful thing to do. As purists, we can argue about whether it is as good as 'real' Linked Data but sometimes you've just got to get on and do whatever you can.
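To make that concrete, here is a minimal sketch of what the modified principles might look like in practice, using plain JSON rather than RDF. The "web" is simulated as an in-memory dictionary, and all URIs and data are invented:

```python
import json

# A sketch of "Linked Data minus RDF": HTTP URIs that resolve to useful
# JSON documents, each of which links on to further URIs. The dataset
# is invented, and a dict stands in for the Web.

WEB = {
    "http://example.org/person/alice": json.dumps({
        "name": "Alice",
        "links": ["http://example.org/org/eduserv"],
    }),
    "http://example.org/org/eduserv": json.dumps({
        "name": "Eduserv",
        "links": [],
    }),
}

def look_up(uri):
    """Principle 3: looking up a URI provides useful information."""
    return json.loads(WEB[uri])

def crawl(start):
    """Principle 4: follow links to discover more things."""
    seen, todo = set(), [start]
    while todo:
        uri = todo.pop()
        if uri in seen:
            continue
        seen.add(uri)
        todo.extend(look_up(uri)["links"])
    return seen

print(sorted(crawl("http://example.org/person/alice")))
```

No triples anywhere, yet the follow-your-nose property of the Web is preserved - which is exactly the "perfectly reasonable and useful thing" the modified principles describe.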

A couple of people suggested that the phrase 'Web of data' might capture what I want. Possibly... though, looking at Tom Coates' Native to a Web of Data presentation, it's clear that his 10 principles go further than the 4 above. Maybe that doesn't matter? Others suggested "hypermedia", "RESTful information systems" or "RESTful HTTP", none of which strikes me as quite right.

I therefore remain somewhat confused. I quite like Bill de hÓra's post on "links in content", Snowflake APIs, but, again, I'm not sure it gets us closer to an agreed label?

In a comment on a post by Michael Hausenblas, What else?, Dan Brickley says:

I have no problem whatsoever with non-RDF forms of data in “the data Web”. This is natural, normal and healthy. Statistical information, geographic information, data-annotated SVG images, audio samples, JSON feeds, Atom, whatever.

We don’t need all this to be in RDF. Often it’ll be nice to have extracts and summaries in RDF, and we can get that via GRDDL or other methods. And we’ll also have metadata about that data, again in RDF; using SKOS for indicating subject areas, FOAF++ for provenance, etc.

The non-RDF bits of the data Web are – roughly – going to be the leaves on the tree. The bit that links it all together will be, as you say, the typed links, loose structuring and so on that come with RDF. This is also roughly analogous to the HTML Web: you find JPEGs, WAVs, flash files and so on linked in from the HTML Web, but the thing that hangs it all together isn’t flash or audio files, it’s the linky extensible format: HTML. For data, we’ll see more RDF than HTML (or RDFa bridging the two). But we needn’t panic if people put non-RDF data up online…. it’s still better than nothing. And as the LOD scene has shown, it can often easily be processed and republished by others. People worry too much! :)

Count me in as a worrier then!

I ask because, as a not-for-profit provider of hosting and Web development solutions to the UK public sector, Eduserv needs to start thinking about the implications of Tim Berners-Lee's appointment as an advisor to the UK government on 'open data' issues for the kinds of solutions we provide. Clearly, Linked Data is going to feature heavily in this space but I fully expect that lots of stuff will also happen outside the RDF fold. It's important for us to understand this landscape and the impact it might have on future services.

July 20, 2009

On names

There was a brief exchange of messages on the jisc-repositories mailing list a couple of weeks ago concerning the naming of authors in institutional repositories. When I say naming, I really mean identifying, because a name, as in a string of characters, doesn't guarantee any kind of uniqueness - even locally, let alone globally.

The thread started from a question about how to deal with the situation where one author writes under multiple names (is that a common scenario in academic writing?) but moved on to a more general discussion about how one might assign identifiers to people.

I quite liked Les Carr's suggestion:

Surely the appropriate way to go forward is for repositories to start by locally choosing a scheme for identifying individuals (I suggest coining a URI that is grounded in some aspect of the institution's processes). If we can export consistently referenced individuals, then global services can worry about "equivalence mechanisms" to collect together all the various forms of reference that.

This is the approach taken by the Resist Knowledgebase, which is the foundation for the (just started) dotAC JISC Rapid Innovation project.

(Note: I'm assuming that when Les wrote 'URI' he really meant 'http URI').

Two other pieces of current work seem relevant and were mentioned in the discussion: firstly, the JISC-funded Names project, which is working on a pilot Names Authority Service; secondly, the RLG Networking Names report. I might be misunderstanding the nature of these bits of work but both seem to me to be advocating rather centralised, registry-like, approaches. For example, both talk about centrally assigning identifiers to people.

As an aside, I'm constantly amazed by how many digital library initiatives end up looking and feeling like registries. It seems to be the DL way... metadata registries, metadata schema registries, service registries, collection registries. You name it and someone in a digital library will have built a registry for it.

My favoured view is that the Web is the registry. Assign identifiers at source, then aggregate appropriately if you need to work across stuff (as Les suggests above). The <sameAs> service is a nice example of this:

The Web of Data has many equivalent URIs. This service helps you to find co-references between different data sets.

As Hugh Glaser says in a discussion about the service:

Our strong view is that the solution to the problem of having all these URIs is not to generate another one. And I would say that with services of this type around, there is no reason.
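The "equivalence mechanisms" Les mentions, and the co-reference bundles a sameAs-style service maintains, can be sketched as a union-find structure over pairwise sameAs assertions. This is just an illustration of the aggregation idea - the URIs are invented and this is not how sameas.org is actually implemented:

```python
# Sketch: URIs are coined locally by each repository; a co-reference
# service then groups those asserted to identify the same person.
# Implemented here as union-find with path compression. URIs invented.

parent = {}

def find(u):
    """Return the canonical representative of u's equivalence class."""
    parent.setdefault(u, u)
    while parent[u] != u:
        parent[u] = parent[parent[u]]   # path compression
        u = parent[u]
    return u

def same_as(a, b):
    """Record an assertion that URIs a and b identify the same thing."""
    parent[find(a)] = find(b)

same_as("http://repo-a.ac.uk/id/les-carr", "http://repo-b.ac.uk/people/carr-l")
same_as("http://repo-b.ac.uk/people/carr-l", "http://dotac.example/id/lc")

# all three locally-coined URIs now fall into one equivalence class
bundle = {u for u in parent if find(u) == find("http://dotac.example/id/lc")}
print(len(bundle))  # 3
```

The key property is Hugh's point: no new identifier is minted - the service only records which existing ones co-refer.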

In thinking about some of the issues here I had cause to go back and re-read a really interesting interview by Martin Fenner with Geoffrey Bilder of CrossRef (from earlier this year). Regular readers will know that I'm not the world's biggest fan of the DOI (on which CrossRef is based), partly for technical reasons and partly on governance grounds, but let's set that aside for the moment. In describing CrossRef's "Contributor ID" project, Geoff makes the point that:

... “distributed” begets “centralized”. For every distributed service created, we’ve then had to create a centralized service to make it useable again (ICANN, Google, Pirate Bay, CrossRef, DOAJ, ticTocs, WorldCat, etc.). This gets us back to square one and makes me think the real issue is - how do you make the centralized system that eventually emerges accountable?

I think this is a fair point, but I also think there is a very significant architectural difference between a centralised service that aggregates identifiers and other information from a distributed base of services (in order to provide some useful centralised function, for example) and a centralised service that assigns identifiers which it then pushes out into the wider landscape. It seems to me that only the former makes sense in the context of the Web.

July 09, 2009

e-Framework - time to stop polishing, guys!

The e-Framework for Education and Research has announced a couple of new documents, the e-Framework Rationale and the e-Framework Technical Model, and has invited the community to comment on them.

In looking around the e-Framework website I stumbled on a definition for the 'Read' Service Genre. Don't know what a Service Genre is? Join the club... but for the record, they are defined as follows:

Service Genres describe generic capabilities expressed in terms of their behaviours, without prescribing how to make them operational.

The definition of Read runs to nine screens' worth of fairly dense text in my browser window, summarised as:

Retrieve a known business object from a collection.

I'm sorry... but how is this much text of any value to anyone? What is being achieved here? There is public money (from several countries) being spent on this (I have no way of knowing how much) with very, very little return on investment. I can't remember how long the e-Framework activity has been going on but it must be of the order of 5 years or so? Where are the success stories? What things have happened that wouldn't have happened without it?

When you raise these kinds of questions, as I did on Twitter, the natural response is, "please take the time to comment on our documents and tell us what is wrong". The trouble is, when something is so obviously broken, it's hard to justify taking time to comment on it. Or as I said on Twitter:

i'm sorry to be so blunt - i appreciate this is people's baby - but you're asking the community to help polish a 5 year old turd

it's time to kick the turd into the gutter and move on

(For those of you that think I'm being overly rude here, the use of this expression is reasonably common in IT circles!)

Of course, one is then asked to justify why the e-Framework is a 'turd' :-(.

For me, the lack of any concrete value speaks for itself. There comes a time when you just have to bite the bullet and admit that nothing is being achieved.  Trying to explain why something is broken isn't necessary - it just is! The JISC don't even refer to the e-Framework in their own ITTs anymore (presumably because they have given up trying to get projects to navigate the maze of complex terminology in order to contribute the odd Service Usage Model (SUM) or two). It doesn't matter... there are very few Service Usage Models anyway, and even fewer Service Expressions. In fact, as far as I can tell the e-Framework consists only of a half-formed collection of unusable 'service' descriptions.

So, how come this thing still has any life left in it?

July 02, 2009

Investigating the "Scott Cantor is a member of the IEEE" problem

The UK Access Management Federation and other similar initiatives worldwide provide a SAML-based single sign-on solution for access to online resources for the education and research community. Typically, a user must sign on to their home institution, using their local username and password, before being granted access to a remote online resource. In the main, this prevents the user from having to remember a separate username and password for each online resource that they wish to access. However, there is a perceived problem: some users have several affiliations (their university, their employer, the NHS, their professional body, etc.), each of which may grant access to a different set of online resources, and online services are currently not able to make seamless decisions about which resources a given user is entitled to access because they lack knowledge about these multiple affiliations.

We have recently funded Simon McLeish at LSE to undertake an investigation into this area, commonly known as the "Scott Cantor is a member of the IEEE" problem. (Scott Cantor is lead developer of the Shibboleth software and an editor of the SAML 2.0 specification.) This investigation will try to discover the extent of this problem in UK HE - who is affected, how serious stakeholders perceive it to be, and what is expected from a solution - in order to inform future work in this area.

More information about this study can be found through the project's wiki. As usual, the final report will be made openly available to the community under a Creative Commons licence.

July 01, 2009

RESTful Design Patterns, httpRange-14 & Linked Data

Stefan Tilkov recently announced the availability of the video of a presentation he gave a few months ago on design patterns (& anti-patterns) for REST. I recommend having a look at it, as it covers a lot of ground and has lots of useful examples, and I find his presentational style strikes a nice balance of technical detail and reflection. If you haven't got time to listen, the slides are also available in PDF (though I do think hearing the audio clarifies quite a lot of the content).

One of the questions that this presentation (and others like it) planted at the back of my mind is how some of the patterns presented might be affected by the W3C TAG's httpRange-14 resolution and the Cool URIs conventions for distinguishing between what it calls "real world objects" and "Web documents", some of which describe those "real world objects". The Cool URIs document focuses on the implications of this distinction for the use of the HTTP protocol to request representations of resources, using the GET method, but does not touch on the question of whether/how it affects the use of HTTP methods other than GET.

In the early part of his presentation, Stefan introduces the notion of "representation" and the idea that a single resource may have multiple representations. Some of the resources referred to in his examples, like "customers" (slide 16 in the PDF; slide 16 in the video presentation), when seen from the perspective of the Cool URIs document, fall, I think, into the category of "real world objects" - things which may be described (by distinct resources) but are not themselves represented on the Web. So, following the Cool URIs guidelines, the URI of a customer would be a "hash URI" (URI with fragment id) or a URI for which the response to an HTTP GET request is a 303 redirect to the (distinct) URI of a document describing the customer.

But what about non-"read-only" interactions, and using methods other than GET? The third "design pattern" in the presentation is one for "resource creation" (slide 55 in the PDF; slide 98 in the video presentation). Here a client POSTs a representation of a resource to a "collection resource" (slide 50 in the PDF; slide 93 in the video presentation). The example of a "collection resource" used is a collection of customers, with the implication, I think, that the corresponding "resource creation" example would involve the posting of a representation of a customer, and the server responding 201 with a new URI for the customer.

I think (but I'm not sure, so please do correct me!) that the implication of the httpRange-14 resolution is that, in this example, the "collection resource" - the resource to which a POST is submitted - would be a collection of "customer descriptions"; the thing posted would be a representation of a customer description for the new customer; and the URI returned for the newly created resource would be the URI of a new customer description. A GET for the URI of the description would then return a representation which included the URI of the new customer.


(In the diagram above, http://example.org/customers/123 is the URI of a customer; http://example.org/docs/customers/123 is the URI of a document describing that customer.)

And, finally, a GET for the URI of the customer (assuming it isn't a "hash URI") would - following the Cool URIs conventions - return a 303 redirect to the URI of the description.
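My reading of that whole flow can be sketched as a toy server-side implementation. This is my interpretation, not Stefan's code or any real API; all URIs and the id scheme are invented:

```python
# Sketch of resource creation after httpRange-14: the client POSTs a
# customer *description* to the collection and gets back a description
# URI (201 Created); a GET on the customer URI itself answers
# 303 See Other, redirecting to that description. URIs are invented.

descriptions = {}   # description URI -> stored representation
next_id = [100]     # crude id counter, boxed so the function can mutate it

def post_to_collection(representation):
    """POST to the collection of customer descriptions -> (201, doc URI)."""
    next_id[0] += 1
    doc_uri = "http://example.org/docs/customers/%d" % next_id[0]
    customer_uri = "http://example.org/customers/%d" % next_id[0]
    descriptions[doc_uri] = dict(representation, describes=customer_uri)
    return (201, doc_uri)

def get(uri):
    """GET: 200 for a description ("Web document"), 303 for the customer."""
    if uri in descriptions:                       # a "Web document"
        return (200, descriptions[uri])
    doc_uri = uri.replace("/customers/", "/docs/customers/")
    if doc_uri in descriptions:                   # a "real world object"
        return (303, doc_uri)                     # See Other -> its description
    return (404, None)

status, doc_uri = post_to_collection({"name": "Acme Ltd"})
print(status, doc_uri)
print(get("http://example.org/customers/101"))   # 303 to the description
```

Note that POST is only ever applied to the collection of descriptions, never to the customer itself - which is the distinction I'm suggesting httpRange-14 implies.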

There is some discussion of this in a short post by Richard Cyganiak, and I think the comments there bear out what I'm suggesting here, i.e. that POST/PUT/DELETE are applied to "Web documents" and not to "real-world objects".

The comment by Leo Sauermann on that post refers to the use of a SPARQL endpoint for updates - the SPARQL Update specification certainly addresses this area. It talks in terms of adding/deleting triples to/from a graph, and adding/deleting graphs to/from a "graph store". I think the "adding a graph to a graph store" case is pretty close to the requirement that is being addressed by the "post representation to Collection Resource" pattern. But I admit I struggle slightly to reconcile the SPARQL Update approach with Stefan's design pattern - and indeed, he highlights the "endpoint" notion, with different methods embedded in the content of the representation, as part of one of his "anti-patterns", their presence typically being an indicator that an architecture is not really RESTful.

I should emphasise that I'm trying to avoid seeming to adopt a "purist" position here: I recognise that "RESTfulness" is a choice rather than an absolute requirement. However, interest in the RESTful use of HTTP has grown considerably in recent years (to the extent that some developers seem keen to apply the label "RESTful", regardless of whether their application meets the design constraints specified by the architectural style or not). And now the "linked data" approach - which of course makes use of the httpRange-14 conventions - also seems to be gathering momentum, not least following the announcement by the UK government that Tim Berners-Lee would be advising them on opening up government data (and his issuing of a new note in his Design Issues series focussed explicitly on government data). It seems to me it would be helpful to be clear about how/where these two approaches intersect, and how/where they diverge (if indeed they do!). Purely from a personal perspective, I would like to be clearer in my own mind about whether/how the sort of patterns recommended by Stefan apply in the post-httpRange-14/linked data world.


