
May 22, 2009

Google Rich Snippets

As ever, I'm slow off the mark with this, but last week's big news within the Web metadata and Semantic Web communities was the announcement by Google of a feature they are calling Rich Snippets, which provides support for the parsing of structured data within HTML pages - based on a selection of microformats and on RDFa using a specified RDF vocabulary - and the surfacing of that data in Google search result sets. In the first instance, at least, only a selected set of sources are being indexed, with the hope of extending participation soon (see the discussion in the O'Reilly interview with Othar Hansson and Guha.)

A number of commentators, including Ian Davis, Tom Scott, and Jeni Tennison have pointed out that Google's support for RDFa, at least as currently described, is somewhat partial, and its reliance on a centralised Google-owned URIspace for terms is at odds with RDF's support for the distributed creation of vocabularies - and indeed in coining that Google vocabulary, Google appears to have ignored the existence of some already widely deployed vocabularies.

And of course, Yahoo was ahead of the game here with their (complete) support for RDFa in their SearchMonkey platform (which I mentioned here).

Nevertheless, it's hard to disagree with Timothy O'Brien's recognition of the huge power that Google wields in this space:

Google is certainly not the first search engine to support RDFa and Microformats, but it certainly has the most influence on the search market. With 72% of the search market, Google has the influence to make people pay attention to RDFa and Microformats.

Or, to put it another way, we may be approaching a period in which, to quote Dries Buytaert of the Drupal project, "structured data is the new search engine optimization" - with, I might add, both the pros and cons that come with that particular emphasis!

One of the challenges to an approach based on consuming structured data from the open Web is, of course, that of dealing with inaccuracies, spammers and "gamers" - see, for example, Cory Doctorow's "metacrap" piece, from back in 2001. But as Jeni Tennison notes towards the end of her post, having Google in a position where they have an interest in tackling this problem must be a good thing for the data web community more widely:

They will now have a stake in answering the difficult questions around trust, confidence, accuracy and time-sensitivity of semantic information.

Google's announcement is also one of the topics discussed in the newly released Semantic Web Gang podcast from Talis, and in that conversation - which is well worth a listen as it covers many of the issues I've mentioned here and more besides - Tom Tague from Thomson Reuters highlights another potential outcome when he expresses optimism that the interest in embedded metadata generated by the Google initiative will also provide an impetus for the development of other tools to consume that data, such as browser plug-ins.

Thinking about activities that I have some involvement in, I think the use of RDFa in general is an area that should be on the radar of the "repositories" community in their efforts to improve access to the outputs of scholarly research.

It's also an area that I think the Dublin Core Metadata Initiative should be engaging with. Embedding metadata in HTML pages with the intent of facilitating the discovery of those pages using search engines was probably one of the primary motivating cases, at least in the early days of work on DC, though of course there has historically been little support from the global search engines for the approach, in large part because of the sort of problems identified by Doctorow. The current DCMI recommendation for doing this makes use of an HTML metadata profile (associated with a GRDDL namespace transformation). While on the one hand, RDFa is "just another syntax for RDF", it might be useful for DCMI to produce a short document illustrating the use of RDFa (and perhaps to consider the use of RDFa in its own documents). Of course, as far as the use of DCMI's own RDF vocabularies in data exposed to Google is concerned, it remains to be seen whether support for RDF vocabularies other than Google's own will be introduced. (Having said that, it's also worth noting that one of the strengths of RDFa is that the attribute-based syntax is fairly amenable to the use of multiple vocabularies in combination.)
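To illustrate the point about combining vocabularies, here is a minimal Python sketch that pulls `property` attributes out of an HTML fragment and expands their prefixes. The sample markup, the example.org URI and the literal values are invented for illustration, and the collector is deliberately not a conforming RDFa processor (it ignores subjects, `rel`/`rev`, datatypes and prefix scoping).

```python
from html.parser import HTMLParser

# Hypothetical page fragment mixing Dublin Core and FOAF terms.
SAMPLE = """
<div xmlns:dc="http://purl.org/dc/terms/"
     xmlns:foaf="http://xmlns.com/foaf/0.1/"
     about="http://example.org/papers/1">
  <h1 property="dc:title">An example paper</h1>
  <span property="foaf:name">A. N. Author</span>
</div>
"""


class PropertyCollector(HTMLParser):
    """Collect (expanded property URI, literal) pairs from property attributes."""

    def __init__(self):
        super().__init__()
        self.prefixes = {}   # prefix -> namespace URI, from xmlns:* attributes
        self.pairs = []      # collected (property URI, literal) pairs
        self._open = None    # [uri, text] for the element currently being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        for name, value in attrs.items():
            if name.startswith("xmlns:"):
                self.prefixes[name[len("xmlns:"):]] = value
        curie = attrs.get("property")
        if curie and ":" in curie:
            prefix, local = curie.split(":", 1)
            if prefix in self.prefixes:
                uri = self.prefixes[prefix] + local
                if "content" in attrs:
                    # An explicit content attribute overrides the element text.
                    self.pairs.append((uri, attrs["content"]))
                else:
                    self._open = [uri, ""]

    def handle_data(self, data):
        if self._open is not None:
            self._open[1] += data

    def handle_endtag(self, tag):
        if self._open is not None:
            self.pairs.append((self._open[0], self._open[1].strip()))
            self._open = None


def extract_properties(html):
    """Return the (property URI, literal) pairs found in an HTML string."""
    collector = PropertyCollector()
    collector.feed(html)
    collector.close()
    return collector.pairs


for uri, value in extract_properties(SAMPLE):
    print(uri, "=", value)
```

Because each property carries its own prefix, terms from two vocabularies sit side by side in the same markup without any special handling.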

Finally, I think this is an area which Eduserv should be tracking carefully with regard to its relevance to the services it provides to the education sector and to the wider public sector in the UK: it seems notable that, as I mentioned a few weeks ago, some of the early deployments of RDFa have been within UK government services.

May 18, 2009

SharePoint study

We're commissioning a study looking at the uptake and use of Microsoft SharePoint by Higher Education Institutions and currently have an ITT available.

This is an unusual study for us - in the sense that it focuses on an individual product - a fact that hasn't gone unnoticed either internally or in the community. When we announced the study on the WEBSITE-INFO-MGT@JISCMAIL.AC.UK mailing list, David Newman (Queen's University Management School, Belfast) responded with:

What a remarkably narrow research scope. It would be interesting to find out what groupware HEI institutions are using to support particular functions (co-ordinating international research projects, helping students work together in group projects, joint report editing, keeping track of expenses, ...). But just one product from one supplier?

I think David is right to raise this as an issue but there are reasons why we've done things in the way that we have and I think those reasons are worth sharing. Here's a copy of my response to David:

Hi David,
Firstly, I agree with you that this looks to be a rather narrowly scoped piece of work. It is the kind of study that we haven't funded to date and it's something that we didn't fund without a certain amount of internal angst! On that basis, I think it is worth me trying to explain where we are coming from with it.

You should note that this study comes out of our new Research Programme rather than the previous Eduserv Foundation (which has now been wrapped up, except in the sense that we are continuing to support projects that we previously funded under the Foundation). Our previously announced ITT for a study looking at the way Web content is managed by HEIs (currently being undertaken by SIRC) came from the same place.

The change from a Foundation to a Research Programme brought with it a subtle, but significant, change of emphasis. Eduserv is a non-profit IT services company. We have a charitable mission to "realise the benefits of ICT for learners and researchers", something we believe we do most effectively thru the services we deliver, e.g. those provided for the education community (particularly HE). Because of that, we felt we would get better 'value' from our research funding (more bang-per-buck if you like) if we tried to align it more closely with the kinds of services we offer. That is what we are trying to do thru the new Research Programme.

Our services to HE currently include OpenAthens and Chest, though we have a desire to improve our Web hosting/development offer within the sector as well (something we currently sell primarily into the public sector). For info... we are also in the final stages of developing a new data centre in Swindon and we hope to use that as the basis for new services to the HE sector in the future.

As a service provider, we sense a significant (and growing) interest in the use of MS SharePoint as the basis for the provision of a fairly wide range of solutions. This is particularly true in the public sector, where we also operate, but also in HE (for example, the HEA are just in the process of initiating a SharePoint project). Please note, I'm not saying this is necessarily a good thing - my personal view is that it is not (though my personal view on all this is largely irrelevant!).

We tried to broaden the scope of the ITT in line with the kind of "groupware" suggestion you make below [above] but ultimately we felt that in doing so it was hard to capture the breadth of things that people are trying to do in SharePoint without ending up with something quite fuzzy and unfocused. On that basis, we reluctantly narrowed in on a specific technology - something we are not used to doing.

Let me be quite clear. We are not looking for a study that says MS SharePoint is the answer to everything (or indeed anything). Nor, that it is the answer to nothing. We are looking to understand what people in HE are doing with SharePoint, what they think works well, what they think is broken, why they have considered but rejected it and so on.

In that sense, it is a piece of market research... pure and simple. However, we believed (perhaps wrongly?) that the community would also be interested in this topic, which is why the findings of the work will be made openly available under a CC licence. The intention is to help both us and the community make better long term deployment decisions and, rightly or wrongly, we felt that decisions about one particular piece of software, i.e. SharePoint, were a significant enough part of that in this particular case to make the study worthwhile.

Hope that helps?

Note, I'm very happy to continue to hear if people think we have gone badly wrong on this because it will help us to spend our money more wisely (i.e. more effectively for the benefit of both us and the community) in the future.



Symposium live-streaming and social media

We are providing a live video stream from our symposium again this year, giving people who have not registered to attend in person a chance to watch all the talks and discussion and to contribute their own thoughts and questions via Twitter and a live chat facility (this year based on ScribbleLive).

Our streaming partner for this year is Switch New Media and we are looking forward to working with them on the day.  Some of you will probably be familiar with them because they provided streaming from this year's JISC Conference and the JISC Libraries of the Future event in Oxford.

If you plan on watching all or part of the stream, please sign up for the event’s social network so that we (and others) know who you are.  The social network has an option to indicate whether you are attending the symposium in person or remotely.

Also, for anyone tweeting, blogging or sharing other material about the event, remember that the event tag is ‘esym09’ (‘#esym09’ on Twitter).  If you want to follow the event on Twitter, you can do so using the Twitter search facility.

May 15, 2009

URIs and protocol dependence

In responding to my recent post, The Nature of OAI, identifiers and linked data, Herbert Van de Sompel says:

There is a use case for both variants: the HTTP variant is the one to use in a Web context (for obvious reasons; so totally agreed with both Andy and PeteJ on this), and the doi: variant is technology independent (and publishers want that, I think, if only to print it on paper).

(My emphasis added).

I'm afraid to say that I could not disagree more with Herbert on this. There is no technological dependence on HTTP by http URIs. [Editorial note: in the first comment below Herbert complains that I have mis-represented him here and I am happy to concede that I have and apologise for it.  By "technology independent" Herbert meant "independent of URIs", not "independent of HTTP".  I stand by my more general assertion in the remainder of this post that a mis-understanding around the relationship between http URIs and HTTP (the protocol) led the digital library community into a situation where it felt the need to invent alternative approaches to identification of digital content and, further, that those alternative approaches are both harmful to the Web and harmful to digital libraries.  I think those mis-understandings are still widely held in the digital library community and I disagree with those people who continue to promote relatively new non-http forms of URIs for 'scholarly' content (by which I primarily mean info URIs and doi URIs) as though their use was good practice.  On that basis, this blog post may represent a disagreement between Herbert and me but it may not.  See the comment thread for further discussion.  Note also that my reference to Stu Weibel below is intended to indicate only what he said at the time, not his current views (about which I know nothing).]  As I said to Marco Streefkerk in the same thread:

there is nothing in the 'http' prefix of the http URI that says, "this must be dereferenced using HTTP". In that sense, there is no single 'service' permanently associated with the http URI - rather, there happens to be a current, and very helpful, default de-referencing mechanism available.

At the point that HTTP dies, which it surely will at some point, people will build alternative de-referencing mechanisms (which might be based on Handle, or LDAP, or whatever replaces HTTP). The reason they will have to build something else is that the weight of deployed http URIs will demand it.

That's the reasoning behind my, "the only good long term identifier, is a good short term identifier" mantra.

The mis-understanding that there is a dependence between the http URI and HTTP (the protocol) is endemic in the digital library community and has been the cause of who knows how many wasted person-years, inventing and deploying unnecessary, indeed unhelpful, URI schemes like 'oai', 'doi' and 'info'. Not only does this waste time and effort but it distances the digital library community from the mainstream Web - something that we cannot afford to happen.

As the Architecture of the World Wide Web, Volume One (section 3.1) says:

Many URI schemes define a default interaction protocol for attempting access to the identified resource. That interaction protocol is often the basis for allocating identifiers within that scheme, just as "http" URIs are defined in terms of TCP-based HTTP servers. However, this does not imply that all interaction with such resources is limited to the default interaction protocol.

This has been re-iterated numerous times in discussion, not least by Roy Fielding:

"Really, it has far more to do with a basic misunderstanding of web architecture, namely that you have to use HTTP to get a representation of an "http" named resource. I don't think there is any simple solution to that misbelief aside from just explaining it over and over again."

"However, once named, HTTP is no more inherent in "http" name resolution than access to the U.S. Treasury is to SSN name resolution."

"The "http" resolution mechanism is not, and never has been, dependent on DNS. The authority component can contain anything properly encoded within the defined URI syntax. It is only when an http identifier is dereferenced on a local host that a decision needs to be made regarding which global name resolution protocol should be used to find the IP address of the corresponding authority. It is a configuration choice."

The (draft) URI Schemes and Web Protocols Tag Finding makes similar statements (e.g. section 4.1):

A server MAY offer representations or operations on a resource using any protocol, regardless of URI scheme. For example, a server might choose to respond to HTTP GET requests for an ftp resource. Of course, this is possible only for protocols that allow references to URIs in the given scheme.
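The idea that the dereferencing mechanism is a configuration choice, rather than something fixed by the URI scheme, can be sketched in a few lines of Python. Everything named here (the `Dereferencer` class, the two stand-in mechanisms and the example.org URI) is hypothetical and makes no network calls. It is only meant to show the same http URI being resolved by two different mechanisms.

```python
from urllib.parse import urlsplit


class Dereferencer:
    """Maps a URI scheme to whichever lookup mechanism is currently configured."""

    def __init__(self):
        self._mechanisms = {}

    def configure(self, scheme, mechanism):
        # Replacing a mechanism does not touch any deployed identifiers.
        self._mechanisms[scheme.lower()] = mechanism

    def dereference(self, uri):
        scheme = urlsplit(uri).scheme.lower()
        if scheme not in self._mechanisms:
            raise LookupError(f"no dereferencing mechanism configured for {scheme!r}")
        return self._mechanisms[scheme](uri)


def via_http(uri):
    # Stand-in for today's default: an HTTP GET against the authority.
    return f"HTTP GET {uri}"


def via_local_archive(uri):
    # Stand-in for some future or alternative mechanism: a plain table lookup.
    archive = {"http://example.org/doc/1": "archived copy of document 1"}
    return archive[uri]


resolver = Dereferencer()
identifier = "http://example.org/doc/1"

resolver.configure("http", via_http)
print(resolver.dereference(identifier))

# Swap the mechanism; the identifier itself is unchanged.
resolver.configure("http", via_local_archive)
print(resolver.dereference(identifier))
```

The identifier stays the same across both calls; only the configured mechanism behind the "http" scheme changes.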

I know that some people don't like this interpretation of http URIs, claiming that W3C (and presumably others) have changed their thinking over time. I remember Stu Weibel starting a presentation about URIs at a DCC meeting in Glasgow with the phrase:

Can you spell revisionist?

I disagree with this view. The world evolves. Our thinking evolves. This is a good thing isn't it? It's only not a good thing if we refuse to acknowledge the new because we are too wedded to the old.

May 14, 2009

The role of universities in a Web 2.0 world?

Brian Kelly, writing about the Higher Education in a Web 2.0 World report, ends by referring to the recommendation to "explore issues and practice in the development of new business models that exploit Web 2.0 technologies" (Area 3: Infrastructure), suggesting that it has to do with "best practices for institutional engagement (or not) with Web 2.0". I don't know what the report intended by this statement but, to me at least, it seems like business models are a pretty fundamental issue... potentially much more fundamental than Brian's interpretation.

I noted a similar issue in the CILIP2 discussions of a few weeks ago. Asking "how should CILIP use Web 2.0 to engage with its members?" ignores the more fundamental question, "what is the role of an organisation like CILIP in a Web 2.0 world?". It's a bit like asking an independent high-street bookshop to think about how it uses Web 2.0 to engage with its customers, ignoring the fact that Amazon might well have just trashed its business model entirely!

Luckily for universities there isn't (yet?) the equivalent of an Amazon in the HE sector so I accept that the situation isn't quite the same. Indeed, there are strong hints in the report that aspects of the traditional university, face to face tutor time for example, are well liked by their customers (I know many people hate the term 'customers' but it strikes me that is increasingly what the modern HE student has become). Nonetheless, I think that particular recommendation would be better interpreted as having more to do with "what is the role for universities in a Web 2.0 world?" than with "how do universities best use Web 2.0 to enhance their current practice?".

Or, to put it a different way, if Web 2.0 changes everything, I see no reason why that doesn't apply as much to professional bodies and universities as it does to high street bookshops.

May 13, 2009

Identity in a Web 2.0 world?

In the flurry of Twitter comments about the Higher Education in a Web 2.0 World report yesterday I noticed the following tweet from Nicole Harris at the JISC:

#clex09 disappointed by lack of attention to identity issues in the report-despite it being included in the definition IDM hardly mentioned

I have to say that I share Nicole's disappointment.  Having now read thru the whole report I can find little reference to identity or identity management.  Identity doesn't appear in the index, nor in the list of critical issues.

This seems very odd to me.  The management of identity (in both a technology sense and a political/social sense) is one of the key aspects of the way that the social Web has evolved, witness the growth of OpenID, OAuth, Google OpenSocial and Friend Connect, Facebook Connect and the rest.  If the social Web is destined to have a growing influence on teaching and learning (and research) in HE then we have to understand what impact that has in terms of identity management.

There are two aspects to this.  I touched on the first yesterday, which is that understanding identity forms a critical part of digital literacy.  It therefore worries me that the report seems to focus more heavily on information literacy, a significantly narrower topic.  The second has to do with technology.

Let me give you a starter for 10... identity in a Web 2.0 world is not institution-centric, as manifest in the current UK Federation, nor is it based on the currently deployed education-specific identity and access management technologies.  Identity in a Web 2.0 world is user-centric - that means the user is in control.

Now... I should note two things.  Firstly, that Nicole and I might well have parted company in terms of our thinking at this point but I won't try to speak on her behalf and I don't know what lay behind her tweet yesterday.  Secondly, that user-centric might mean OpenID, but it might mean something else.  The important point is that learners (and staff) will come into institutions with an existing identity, they will increasingly expect to use that identity while they are there (particularly in their use of services 'outside' the institution) and that they will continue using it after they have left.  As a community, we therefore have to understand what impact that has on our provision of services and the way we support learning and research.  It's a shame that the report seems to have missed this point.

May 12, 2009

HE in a Web 2.0 world?

The Higher Education in a Web 2.0 World report, which is being launched in London this evening, crossed my horizon this morning and I ended up twittering about it on and off for most of the day.

Firstly, I should confess that I've only had a chance to read the report summary, not the full thing, so if my comments below are out of line, I apologise in advance.

It strikes me that the report has a rather unhelpful title because it doesn't seem to me to be about "higher education" per se.  Rather, it is about teaching and learning in HE. For example, there's nothing in it about research practice as far as I can tell. Nor is it really about "Web 2.0" (whatever that means!).  It is about the social Web and the impact that social software might have on the way learning happens in HE.

The trouble with using the phrase "Web 2.0" in the title is that it is confusing, as evidenced by the Guardian's coverage of the report which talks, in part, about universities outsourcing email to Google.  Hello... email is about as old skool as it gets in terms of social software and completely orthogonal to the main thrust of the report itself.

And, while I'm at it, I have another beef with the Guardian's coverage.  Why, oh why, does the mainstream media insist on making stupid blanket statements about the youth of today and their use of social media?  Here are two examples from the start of the article:

The "Google generation" of today's students has grown up in a digital world. Most are completely au fait with the microblogging site Twitter...

Modern students are happy to share...

I don't actually believe either statement and would like to see some evidence backing them up.  Students might well be happy to share their music?  They might well be happy to share their photos on Facebook?  Does that make them happy to share their coursework?  In some cases, possibly...  but in the main? I doubt it.

I'm nervous about this kind of thing only because it reminds me of the early days of HE's interest in Second Life, where people were justifying their in-world activities with arguments like, "we need to be in SL because that's where the kids are", a statement that wasn't true then, and isn't true now :-(

Anyway, I digress... despite the naff title, I found the report's recommendations to be reasonably sensible. I have a nagging doubt that the main focus is on social software as a means to engender greater student/tutor engagement and/or as a pedagogic tool whereas I would prefer to see more emphasis on the institution as platform, enabling student to student collaboration and then dealing with the consequences.  In short, I want the focus to be on learning rather than teaching I suppose.  However, perhaps that is my mis-reading of the summary.

I also note that the report doesn't seem to use the words "digital literacy" (at least, not in the summary), instead using "information literacy" and "web awareness" separately. I think this is a missed opportunity to help focus some thinking and effort on digital literacy. I'm not arguing that information literacy is not important... but I also think that digital literacy skills, understanding the issues around online identity and the long term consequences of material in social networks for example, are also very important and I'm not sure that comes out of this report clearly enough.

Anyway, enough for now... the report (or at least the summary) seems to me to be well worth reading.

May 08, 2009

The Nature of OAI, identifiers and linked data

In a post on Nascent, Nature's blog on web technology and science, Tony Hammond writes that Nature now offer an OAI-PMH interface to articles from over 150 titles dating back to 1869.

Good stuff.

Records are available in two flavours - simple Dublin Core (as mandated by the protocol) and Prism Aggregator Message (PAM), a format that Nature also use to enhance their RSS feeds.  (Thanks to Scott Wilson and TicTocs for the Jopml listing).

Taking a quick look at their simple DC records (example) and their PAM records (example) I can't help but think that they've made a mistake in placing a doi: URI rather than an http: URI in the dc:identifier field.

Why does this matter?

Imagine you are a common-or-garden OAI aggregator.  You visit the Nature OAI-PMH interface and you request some records.  You don't understand the PAM format so you ask for simple DC.  So far, so good.  You harvest the requested records.  Wanting to present a clickable link to your end-users, you look to the dc:identifier field only to find a doi: URI:


If you understand the doi: URI scheme you are fine because you'll know how to convert it to something useful:


But if not, you are scuppered!  You'll just have to present the doi: URI to the end-user and let them work it out for themselves :-(

Much better for Nature to put the http: URI form in dc:identifier.  That way, any software that doesn't understand DOIs can simply present the http: URI as a clickable link (just like any other URL).  Any software that does understand DOIs, and that desperately wants to work with the doi: URI form, can do the conversion for itself trivially.
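The conversion an aggregator would otherwise have to do for itself might look something like the Python sketch below. The resolver prefix `http://dx.doi.org/` is an assumption on my part rather than something taken from the records, the DOI shown is made up, and the function names are hypothetical. No percent-encoding of unusual characters is attempted.

```python
import re

# Assumption: DOIs are resolved through the dx.doi.org proxy.
RESOLVER = "http://dx.doi.org/"

# Matches the two non-HTTP forms: "doi:10.x/y" and "info:doi/10.x/y".
_DOI_FORMS = re.compile(r"^(?:doi:|info:doi/)(10\.\S+)$", re.IGNORECASE)


def to_http_uri(identifier):
    """Return an http: URI for a DOI-style identifier, or None if unrecognised."""
    identifier = identifier.strip()
    if identifier.lower().startswith(("http://", "https://")):
        return identifier  # already usable as a link
    match = _DOI_FORMS.match(identifier)
    if match:
        return RESOLVER + match.group(1)
    return None


def clickable_link(dc_identifiers):
    """Pick the first dc:identifier value that can be shown as a clickable link."""
    for value in dc_identifiers:
        uri = to_http_uri(value)
        if uri is not None:
            return uri
    return None


# Made-up DOI, used only to show the three forms converging on one URI.
for form in ("doi:10.1234/example.5678",
             "info:doi/10.1234/example.5678",
             "http://dx.doi.org/10.1234/example.5678"):
    print(form, "->", to_http_uri(form))
```

An aggregator that has never heard of DOIs has no such function, of course, which is exactly why the http: form is the safer thing to publish.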

Of course, Nature could simply repeat the dc:identifier field and offer both the http: URI form and the doi: URI form side-by-side.  Unfortunately, this would run counter to the W3C recommendation not to mint multiple URIs for the same resource (section 2.3.1 of the Architecture of the World Wide Web):

A URI owner SHOULD NOT associate arbitrarily different URIs with the same resource.

On balance I see no value (indeed, I see some harm) in surfacing the non-HTTP forms of DOI:




both of which appear in the PAM record (somewhat redundantly?).

The http: URI form


is sufficient.  There is no technical reason why it should be perceived as a second-class form of the identifier (e.g. on persistence grounds).

I'm not suggesting that Nature gives up its use of DOIs - far from it.  Just that they present a single, useful and usable variant of each DOI, i.e. the http: URI form, whenever they surface them on the Web, rather than provide a mix of the three different forms currently in use.

This would be very much in line with recommended good practice for linked data:

  • Use URIs as names for things.
  • Use HTTP URIs so that people can look up those names.
  • When someone looks up a URI, provide useful information.
  • Include links to other URIs, so that they can discover more things.



eFoundations is powered by TypePad