January 10, 2012

Introducing Bricolage

My last post here (two months ago - yikes, must do better...) was an appeal to anyone who might be interested in my making a contribution to a project under the JISC call 16/11: JISC Digital infrastructure programme to get in touch. I'm pleased to say that Jasper Tredgold of ILRT at the University of Bristol contacted me about a proposal he was putting together for a project called Bricolage, with the prospect of my doing some consultancy. The proposal was to work with the University Library's Special Collections department and the University Geology Museum to make available metadata for two collections - the archive of Penguin Ltd. and the specimen collection of the Geology Museum - as Linked Open Data.

And just before I went off for the Christmas break, Jasper let me know that the proposal had been accepted and the project was being funded. I'm very pleased to have another opportunity to carry on applying some of what I've learned in the other JISC-funded projects I've contributed to recently, and also to explore some new categories of data. It's also nice to be working with a local university - I worked on a few projects with folks from ILRT during my time at UKOLN, and from a selfish perspective I look forward to project meetings which involve a twenty-minute walk up the hill for me rather than a 7am start and a three or four hour train journey!

The project will start in February and run through to July. I'm sure there'll be a project blog once we get going and I'll add a URI here when it is available.

November 04, 2011

JISC Digital Infrastructure programme call

JISC currently has a call, 16/11: JISC Digital infrastructure programme, open for project proposals in a number of "strands"/areas, including the following:

  • Resource Discovery: "This programme area supports the implementation of the resource discovery taskforce vision by funding higher education libraries archives and museums to make open metadata about their collections available in a sustainable way. The aim of this programme area is to develop more flexible, efficient and effective ways to support resource discovery and to make essential resources more visible and usable for research and learning."

This strand advances the work of the UK Discovery initiative, and is similar to the "Infrastructure for Resource Discovery" strand of the JISC 15/10 call under which the SALDA project (in which I worked with the University of Sussex Library on the Mass Observation Archive data) was funded. There is funding for up to ten projects of between £25,000 and £75,000 per project in this strand

First, I should say this is a great opportunity to explore this area of work and I think we're fortunate that JISC is able to fund this sort of activity. A few particular things I noticed about the current call:

  • a priority for "tools and techniques that can be used by other institutions"
  • a focus on unique resources/collections not duplicated elsewhere
  • should build on lessons of earlier projects, but must avoid duplication/reinvention
  • a particular mention of "exploring the value of integrating structured data into webpages using microformats, microdata, RDFa and similar technologies" as an area in scope
  • an emphasis on sharing the experience/lessons learned: "The lessons learned by projects funded under this call are expected to be as important as the open metadata produced. All projects should build sharing of lessons into their plans. All project reporting will be managed by a project blog. Bidders should commit to sharing the lessons they learn via a blog"

Re that last point, as I've said before, one of the things I most enjoyed about the SALDA and LOCAH projects was the sense that we were interested in sharing the ideas as well as getting the data out there.

I'm conscious the clock is ticking towards the submission deadline, and I should have posted this earlier, but if anyone reading is considering a proposal and thinks that I could make a useful contribution, I'd be interested to hear from you. My particular areas of experience/interest are around Linked Data, and are probably best reflected by the posts I made on the LOCAH and SALDA blogs, i.e. data modelling, URI pattern design, identification/selection of useful RDF vocabularies, identification of potential relationships with things described in other datasets, construction of queries using SPARQL, etc. I do have some familiarity with RDFa, rather less with microdata and microformats. I'm not a software developer, but I can do a little bit of XSLT (and maybe enough PHP to be dangerous hack together rather flakey basic demonstrators). And I'm not a technical architect, but I did get some experience of working with triple stores in those recent projects.

My recent work has been mainly with archival metadata, and I'd be particularly interested in further work which complements that. I'm conscious of the constraint in the call of not repeating earlier work, so I don't think "reapplying" the sort of EAD to RDF work I did with LOCAH and SALDA would fit the bill. (I'd love to do something around the event/narrative/storytelling angle that I wrote about recently here, for example.) Having said that, I certainly don't mean to limit myself to archival data. Anyway, if you think I might be able to help, please do get in touch (pete.johnston@eduserv.org.uk).

October 21, 2011

Two UK government consultations related to open data

This is just a very quick note to highlight that there are two UK government consultations in the area of open data currently in progress and due to close very shortly - next week on 27 October 2011:

  • Making Open Data Real, from the Cabinet Office, on the Transparency and Open Data Strategy, and "establishing a culture of openness and transparency in public services".
  • A Consultation on Data Policy for a Public Data Corporation, from BIS, on the role of the planned Public Data Corporation and "key aspects of data policy – charging, licensing and regulation of public sector information produced by the PDC for re-use – that will determine how a PDC can deliver against all its objectives".

Below a few pointers to notes and comments I've seen around and about recently via Twitter:

Related to the former consultation is a very interesting report by Kieron O'Hara from the University of Southampton, published by the Cabinet Office as Transparent Government, not Transparent Citizens on the the issues for privacy raised by the government‘s transparency programme, and on reconciling the desire for openness from government with the privacy of individuals, which makes the argument that "privacy is a necessary condition for a successful transparency programme".

October 05, 2011

Storytelling, archives and linked data

Yesterday on Twitter I saw Owen Stephens (@ostephens) post a reference to a presentation titled "Every Story has a Beginning", by Tim Sherratt (@wragge), "digital historian, web developer and cultural data hacker" from Canberra, Australia.

The presentation content is available here, and the text of the talk is here. I think you really need to read the text in one window and click through the presentation in another. I found it fascinating, and pretty inspiring, from several perspectives.

First, I enjoyed the form of the presentation itself. The content is built up incrementally on the screen, with an engaging element of "dynamism" but kept simple enough to avoid the sort of vertiginous barrage that seems to characterise the few Prezi presentations I've witnessed. And perhaps most important of all, the presentation itself is very much "a thing of the Web": many of the images are hyperlinked through to the "live" resources pictured, providing not only a record of "provenance" for the examples, but a direct gateway into the data sources themselves, allowing people to explore the broader context of those individual records or fragments or visualisations.

Second, it provides some compelling examples of how digitised historical materials and data extracted or derived from them can be brought together in new combinations and used to uncover and (re)tell stories - and stories not just of the "famous", the influential and powerful, but of ordinary people whose life events were captured in historical records of various forms. (Aside: Kate Bagnall has some thoughtful posts looking at some of the ethical issues of making people who were "invisible" "visible").

Finally, what really intrigued me from the technical perspective was that - if I understand correctly - the presentation is being driven by a set of RDF data. (Tim said on Twitter he'd post more explaining some of the mechanics of what he has done, and I admit I'm jumping the gun somewhat in writing this post, so I apologise for any misinterpretations.) In his presentation, Tim says:

What we need is a data framework that sits beneath the text, identifying people, dates and places, and defining relationships between them and our documentary sources. A framework that computers could understand and interpret, so that if they saw something they knew was a placename they could head off and look for other people associated with that place. Instead of just presenting our research we’d be creating a whole series of points of connection, discovery and aggregation.

Sounds a bit far-fetched? Well it’s not. We have it already — it’s called the Semantic Web.

The Semantic Web exposes the structures that are implicit in our web pages and our texts in ways that computers can understand. The Linked Data movement takes the basic ideas of the Semantic Web and turns them into a collaborative activity. You share vocabularies, so that other people (and computers) know when you’re talking about the same sorts of things. You share identifiers, so that other people (and computers) know that you’re talking about a specific person, place, object or whatever.

Linked Data is Storytelling 101 for computers. It doesn’t have the full richness, complexity and nuance that we invest in our narratives, but it does at least help computers to fit all the bits together in meaningful ways. And if we talk nice to them, then they can apply their newly-acquired interpretative skills to the things that they’re already good at — like searching, aggregating, or generating the sorts of big pictures that enable us to explore the contexts of our stories.

So, if we look at the RDF data for Tim's presentation, it includes "descriptions" of many different "things", including people, like Alexander Kelley, the subject of his first "story" (to save space, I've skipped the prefix declarations in these snippets but I hope they convey the sense of the data):

story:kelley a foaf1:Person ;
     bio:death story:kelley_death ;
     bio:event
         story:kelley_cremation,
         story:kelley_discharge,
         story:kelley_enlistment,
         story:kelley_reunion,
         story:kelley_wounded_1,
         story:kelley_wounded_2 ;
     foaf1:familyName "Kelley"@en-US ;
     foaf1:givenName "Alexander"@en-US ;
     foaf1:isPrimaryTopicOf story:kelley_moa ;
     foaf1:name "Alexander Dewar Kelley"@en-US ;
     foaf1:page 
       <http://discontents.com.au/shoebox/every-story-has-a-beginning> . 

There is data about events in his life:

story:kelley_discharge a bio:Event ;
     rdfs:label 
       "Discharged from the Australian Imperial Force."@en-US ;
     dc:date "1918-11-22"@en-US . 

story:kelley_enlistment a bio:Event ;
     rdfs:label 
       "Enlistment in the Australian Imperial Force for 
        service in the First World War."@en-US ;
     dc:date "1916-01-22"@en-US . 
     
story:kelley_ww1_service a bio:Interval ;
     bio:concludingEvent story:kelley_discharge ;
     bio:initiatingEvent story:kelley_enlistment ;
     foaf1:isPrimaryTopicOf story:kelley_ww1_record . 

and about the archival materials that record/describe those events:

story:kelley_ww1_record a locah:ArchivalResource ;
     locah:accessProvidedBy 
       <http://dbpedia.org/resource/National_Archives_of_Australia> ;
     dc:identifier "B2455, KELLEY ALEXANDER DEWAR"@en-US ;
     bibo:uri 
       "http://www.aa.gov.au/cgi-bin/Search?O=I&Number=7336927"@en-US . 

The presentation itself, the conference at which it was presented, various projects and researchers mentioned - all of these are also "things" described in the data.

I'd be interested in hearing more about how this data was created, the extent to which it was possible to extract the description of people, events, archival resources etc directly from existing data sources and the extent to which it was necessary to "hand craft" parts of it.

But I get very excited when I think about the potential in this sort of area if (when!?) we do have the data for historical records available as linked data (and available under open licences that support its free use).

Imagine having a "story building tool" which enables a "narrator" to visit a linked data page provided by the National Archives of Australia or the Archives Hub or one of the other projects Tim refers to, and to select and "intelligently clip" a chunk of data which you can then arrange into the "story" you are constructing - in much the way that bookmarklets for tools like Tumblr and Posterous enable you to do for text and images now. That "clipped chunk of data" could include a description of a person and some of their life events and metadata about digitised archival resources, including URIs of images - as in Tim's examples. You might follow pointers to other datasets from which additional data could be pulled. You might supplement the "clipped" data with your own commentary. Then imagine doing the same with data from the BBC describing a TV programme or radio broadcast related to the same person or events, or with data from a research repository describing papers about the person or events. The tool could generate some "provenance data" for each "chunk" saying "this subgraph was part of that graph over there, which was created by agency A on date D" in much the way that the blogging bookmarklets provide backlinks to their sources.

And the same components might be reorganised, or recombined with others, to tell different stories, or variants of the same story.

Now, yes, I'm sure there are some thorny issues to grapple with here, and coming up with an interface that balances usability and the required degree of control may be a challenge - so maybe I'm getting carried way with my enthusiasm, but it doesn't seem to be an entirely unreasonable proposition.

I think it's important here that, as Tim emphasises towards the end of his text, it is the human "narrator", not an algorithm, who decides on the structure of the story and selects its components and brings them into (possibly new) relationships with each other.

I'm aware that there's other work in this area of narratives and stories, particularly from some of the people at the BBC, but I admit I haven't been able to keep up with it in detail. See e.g. Tristan Ferne on "The Mythology Engine" and Michael Smethurst's thoughtful "Storytellin'".

For me, Tim's very concrete examples made the potential of these approaches seem very real. They suggest a vision of Linked Data not as one more institutional "output", but as a basis for individual and collective creativity and empowerment, for the (re)telling of stories that have been at least partly concealed - stories which may even challenge the "dominant" stories told by the powerful. It seems all too infrequent these days that I come across something that reminds me why I bothered getting interested in metadata and data on the Web in the first place: Tim's presentation was one of those things.

October 03, 2011

Virtual World Watch taking submissions for new Snapshot Report

John Kirriemuir has put out a new call for contributions to a tenth Virtual World Watch "snapshot report" on the use of virtual worlds in education in the UK and, this time, in Ireland too. His deadline for submissions is November 14 2011.

The activity is no longer funded under the Eduserv Research Programme, but John has obtained "a small amount of independent funding to carry out another snapshot over the remainder of the year", and Andy and I continue to be members of an informal "advisory board" for the activity (which means, err, we get the occasional email from John which prods us into writing blog posts like this one!)

Part of John's plan is to try to draw attention to the resulting report (and to contributors' work covered in it) by "pushing" it to various agencies, including:

  • UK funding bodies who fund virtual world in education activities
  • Journalists who specialise in technology in education news
  • Relevant government and civil service departments
  • The owners/developers of key virtual worlds
  • Major research groups (worldwide) involved in virtual world in education research

Previous reports are available here

UMF Cloud Pilot update

As a quick update on where we are with work on the UMF Cloud Pilot, our emerging cloud offer for UK HE, here are the slides that I spoke to at the All Hands Meeting (AHM 2011) in York last week.

I wrote a longer trip report from the conference over on our UMF Cloud PIlot blog (which is where I do most of my blogging these days) so I won't repeat it here. It was a good conference, though more 'hands-on' than 'strategic', and much smaller, than I remember.

As to the UMF Cloud PIlot, the key things to note at this stage are that we'll be opening up the infrastructure to the UMF SaaS projects shortly, we'll make announcements about our pricing in a few weeks time, and the vCloud service will be more openly available from the end of the year (or early next year) - with an OpenStack Compute offer coming on stream beyond that.

September 19, 2011

Things & their conceptualisations: SKOS, foaf:focus & modelling choices

I thought I'd try to write up some thoughts around an issue which I've come across in a few different contexts recently, and which as a shorthand I sometimes think of as "the foaf:focus question". It was prompted mainly by:

  • my work on modelling the Archives Hub data during the LOCAH project, and looking at datasets to which we wanted to make links, such as VIAF;
  • looking at the data model for the recent Linked Open BNB dataset from the British Library, and how some Dublin Core properties were being used, and some email discussions I had around that;
  • a recent message by Dan Brickley to the foaf-dev mailing list, explaining how the design of some FOAF properties was conditioned by the context at the time, and reflecting on how that context had changed, and what the implications of those changes might be.

Rather by chance on Friday evening, just as I was about to try to tie up what had become a rather long and rambling post, I noticed a conversation on Twitter, initiated by John Goodwin (@gothwin), which I think addressed the broader issue which I'd been circling around without quite addressing it: the use of different "modelling styles" and the issues which arise as a result when we try to link or merge data.

After much chopping and changing, the post adopts a very roughly "chronological" approach. The initial parts cover areas and activities that I wasn't directly involved in at the time, so I am providing my own retrospective interpretation based on my reading of the sources rather than a first-hand account, and I apologise in advance for any omissions or misrepresentations.

FOAF and "interests"

The FOAF Project was an initiative launched by a community of Semantic Web enthusisasts back in 2000, which explored the use of the - then newly emerging - Resource Description Framework specification to express information about individuals, their contact details and their interests and projects - the sort of information that was typically presented on a "personal home page" - and also some of the practical considerations in providing and consuming such information on the Web as RDF. The principle "deliverable" of the project is the Friend of a Friend (FOAF) RDF vocabulary, which continues to evolve and is now very widely used.

As Dan Brickley notes in his recent post, when looking at some of the FOAF properties from the perspective of 2011, their design may seem slightly "unwieldy" to the newcomer, and this is in part because their design was shaped by the context of how the Web was being used at the time of their creation, perhaps nine or ten years ago. At that point, as Dan notes, while there were URIs available for Web documents, and a growing recognition of the importance of maintaining the stability of those URIs, the use of http URIs to identify things other than documents was much less widely adopted, and we often lacked stable URIs for those things of other types that we wanted to "talk about".

One of the use cases covered by FOAF is to express the "interests" of an individual - where that "interest" might be an idea, a place, a person, an event or series of events, anything at all. To work around the issue of the availability of a URI of that thing, FOAF adopted a convention of "indirection" in some of its early properties. So, for example, the foaf:interest property expresses a relation, not between agent and idea/place/person (etc), but between agent and document: it says "this agent is interested in the topic of this page - the thing the page is 'about'" - where that might be anything at all. Using this convention, the topic itself is not explicitly referenced, so the question of its URI does not arise.

So, for example, to express an interest in the Napoleonic Wars, one might make use of the URI of the Wikipedia page 'about' that thing, and say:

@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix person: <http://example.org/id/person/> .

person:fred 
        foaf:interest
          <http://en.wikipedia.org/wiki/Napoleonic_Wars> .

A second property, foaf:topic_interest, does allow for the expression of that "direct" relationship, linking the agent to the thing of interest - which again might be anything at all. (I'm not sure whether thse two properties were created at the same time or whether one preceded the other). Even in the absence of URIs for concepts and people and places, RDF allows for the use of "blank nodes" to refer to such things.

@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix person: <http://example.org/id/person/> .

person:fred 
        foaf:topic_interest [ rdfs:label "Napoleonic Wars"@en ] .

However, a blank node is limited in its scope as an identifier to the graph within which it is found: if Fred provides a blank node for the notion of the Napoleonic Wars in his graph and Freda provides a blank node for an interest in her graph, I can't tell (from that information alone) whether those two nodes are references to the same thing or to two different things (e.g. the historical event and a book about the event). Again, historically, one solution to this problem was to introduce the URI of a document, together with some inferencing based on OWL:

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix person: <http://example.org/id/person/> .

person:fred 
        foaf:topic_interest 
          [ rdfs:label "Napoleonic Wars"@en ;
            foaf:isPrimaryTopicOf 
              <http://en.wikipedia.org/wiki/Napoleonic_Wars> ] .

According to the FOAF documentation for foaf:isPrimaryTopicOf:

The isPrimaryTopicOf property is inverse functional: for any document that is the value of this property, there is at most one thing in the world that is the primary topic of that document. This is useful, as it allows for data merging

i.e. if Freda's graph says:

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix person: <http://example.org/id/person/> .

person:freda 
        foaf:topic_interest 
          [ rdfs:label "Napoleonic Wars"@en ;
            foaf:isPrimaryTopicOf 
              <http://en.wikipedia.org/wiki/Napoleonic_Wars> ] .

then my application can conclude that they are indeed both interested in the same thing - though that does depend on that application having some "built-in knowledge" of the OWL inferencing rules (or access to another service which does).

http URIs for Things

The "httpRange-14 resolution" by the W3C Technical Architecture Group, the publication of the W3C Note on Cool URIs for the Semantic Web, the adoption of the principles of Linked Data and the emergence of a large number of datasets based on those principles has, of course, changed the landscape considerably, and the use of http URIs to identify things other than documents has become commonplace - even if there remain concerns about the practical challenges of implementing of some of the recommended techniques.

So, now DBpedia assigns a distinct http URI for the thing the Wikipedia page http://en.wikipedia.org/wiki/Napoleonic_Wars "is about", and provides a description of that thing in which it says (amongst other things):

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

<http://dbpedia.org/resource/Napoleonic_Wars> 
        rdfs:label "Napoleonic Wars"@en ;
        a <http://dbpedia.org/ontology/Event> ,
          <http://dbpedia.org/ontology/MilitaryConflict> ,
          <http://umbel.org/umbel/rc/Event> ,
          <http://umbel.org/umbel/rc/ConflictEvent> ;
        foaf:page 
          <http://en.wikipedia.org/wiki/Napoleonic_Wars> .

i.e. that thing, the topic of the page, is an event, a military conflict etc.

We could substitute this new DBpedia URI for the blank nodes in our foaf:topic_interest data above:

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix person: <http://example.org/id/person/> .

person:fred 
        foaf:topic_interest 
          <http://dbpedia.org/resource/Napoleonic_Wars> .

<http://dbpedia.org/resource/Napoleonic_Wars>
        rdfs:label "Napoleonic Wars"@en ;
        foaf:isPrimaryTopicOf 
          <http://en.wikipedia.org/wiki/Napoleonic_Wars> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix person: <http://example.org/id/person/> .

person:freda 
        foaf:topic_interest
          <http://dbpedia.org/resource/Napoleonic_Wars> .

<http://dbpedia.org/resource/Napoleonic_Wars>
        rdfs:label "Napoleonic Wars"@en ;
        foaf:isPrimaryTopicOf 
          <http://en.wikipedia.org/wiki/Napoleonic_Wars> .

When those two graphs are merged, the use of the common URI now makes it trivial to determine that Fred and Freda share the same interest.

Concept Schemes, SKOS and Document Metadata

The other factor Dan mentions in his message was the emergence of of the Simple Knowledge Organisation System (SKOS) RDF vocabulary, which after a long evolution became a W3C Recommendation in 2009.

SKOS is designed to provide an RDF representation of the various flavours of "knowledge organisation systems" and "controlled vocabularies" which information managers have traditionally used to organise information about various resources (books in libraries, objects in museums etc etc etc).

The core class in SKOS is that of the concept (skos:Concept). Each concept can be labelled with one or more names; documented with notes of various types; grouped into collections; related to other concepts through relationships such as "broader"/"narrower"/"related"; or mapped to other concepts in other collections.

The Library of Congress has published several library thesauri/classification schemes/controlled vocabularies as SKOS RDF data, including the Library of Congress Subject Headings, which includes a concept named "Napoleonic Wars, 1800-1815" with the URI http://id.loc.gov/authorities/subjects/sh85089767 (this is a subset of the actual data provided):

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix lcsh: <http://id.loc.gov/authorities/subjects/> .

lcsh:sh85089767
        a skos:Concept ;
        rdfs:label "Napoleonic Wars, 1800-1815"@en ;
        skos:prefLabel "Napoleonic Wars, 1800-1815"@en ;
        skos:altLabel "Napoleonic Wars, 1800-1814"@en ;
        skos:broader lcsh:sh85045703 ;
        skos:narrower lcsh:sh85144863 ;
        skos:inScheme 
          <http://id.loc.gov/authorities/subjects> .

A metadata creator coming from a bibliographic background and providing Dublin Core-based metadata for the Wikipedia page http://en.wikipedia.org/wiki/Napoleonic_Wars might well use this concept URI to provide the "subject" of that page:

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix lcsh: <http://id.loc.gov/authorities/subjects/> .

<http://en.wikipedia.org/wiki/Napoleonic_Wars>
        a foaf:Document ;
        rdfs:label "Napoleonic Wars"@en ;
        dcterms:subject lcsh:sh85089767 .

And indeed the concept URI could (I think) also be used with the foaf:topic or foaf:primaryTopic properties:

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix lcsh: <http://id.loc.gov/authorities/subjects/> .

<http://en.wikipedia.org/wiki/Napoleonic_Wars>
        a foaf:Document ;
        rdfs:label "Napoleonic Wars"@en ;
        foaf:topic lcsh:sh85089767 ;
        foaf:primaryTopic lcsh:sh85089767 .

Note that all three of these properties (dcterms:subject, foaf:topic, and foaf:primaryTopic) are defined in such a way that they are not limited to being used with concepts. We've seen this above where the DBpedia data includes:

@prefix foaf: <http://xmlns.com/foaf/0.1/> .

<http://dbpedia.org/resource/Napoleonic_Wars> 
        foaf:page 
          <http://en.wikipedia.org/wiki/Napoleonic_Wars> .

which (since foaf:page and foaf:topic are inverse properties) implies:

@prefix foaf: <http://xmlns.com/foaf/0.1/> .

<http://en.wikipedia.org/wiki/Napoleonic_Wars> 
        foaf:topic 
          <http://dbpedia.org/resource/Napoleonic_Wars> .

It is also true for the dcterms:subject property. Although traditionally the Dublin Core community has highlighted the use of formal classification schemes like LCSH, and it may well be true that the dcterms:subject property is often used to link things to concepts, it is not limited to taking concepts as values, and one could also link the document to the event using dcterms:subject:

@prefix dcterms: <http://purl.org/dc/terms/> .

<http://en.wikipedia.org/wiki/Napoleonic_Wars> 
        dcterms:subject 
          <http://dbpedia.org/resource/Napoleonic_Wars> .

I should acknowledge here that some in the Dublin Core community might disagree with my last example above, and argue that the values of dcterms:subject should be concepts. I think my position is backed up by the current DCMI documentation, and particularly by the fact that when they assigned ranges to the DCMI Terms properties in 2008, the DCMI Usage Board did not specify a range for the dcterms:subject property i.e. the intention is that dcterms:subject may link to a resource of any type.

I also note in passing that when DCMI created the DCMI Abstract Model in its attempt to reflect the "classical view" of Dublin Core (perhaps best expressed in Tom Baker's "A Grammar of Dublin Core") in an RDF-based model, the notion of a "Vocabulary Encoding Scheme" was defined as a set of things of any type, not specifically as a set of concepts.

Things and their Conceptualisations: foaf:focus

The SKOS approach and the SKOS Concept class introduce a new sort of "indirection" from our "things-in-the-world". As Dan puts it in a message to the W3C public-esw-thes list:

a SKOS "butterflies" concept is a social and technological artifact designed to help interconnect descriptions of butterflies, documents (and data) about butterflies, and people with interest or expertise relating to butterflies. I'm quite consciously avoiding saying what a "butterflies" concept in SKOS "refers to", because theories of reference are hard to choose between. Instead, I prefer to talk about why we bother building SKOS and what we hope can be achieved by it.

So, although both the DBpedia URI http://dbpedia.org/resource/Napoleonic_Wars) and the Library of Congress URI http://id.loc.gov/authorities/subjects/sh85089767) may be used in the triple patterns shown above, those two URIs identify two different resources - and both of them are distinct from the Wikipedia page which we cited in the examples back at the very start of this post.

i.e. we now have three separate URIs identifying three separate resources:

  1. a Wikipedia page, a document, created and modified by Wikipedia contributors between 2002 and the present identified by the URI http://en.wikipedia.org/wiki/Napoleonic_Wars
  2. the Napoleonic Wars as event taking place between 1800 and 1815, something with a duration in time, which occurred in physical locations, and in which human beings participated, identified by the DBpedia URI http://dbpedia.org/resource/Napoleonic_Wars
  3. a "conceptualisation of" the Napoleonic Wars, "a social and technological artifact designed to help interconnect", an "abstraction" created by the authors of LCSH editors for the purposes of classifying works; it has "semantic" relationships to other concepts, and is identified by the Library of Congress URI http://id.loc.gov/authorities/subjects/sh85089767

As we've seen, properties like dcterms:subject, foaf:topic/foaf:page, foaf:primaryTopic/foaf:isPrimaryTopicOf provide the vocabulary to express the relationships between the first and second of these resources, and between the first and third. But what about the relationship between the second and third, between the "thing in the world" and its conceptualisation in a classification scheme? Or to make the issue concrete, what happens if, in their "interest" graphs, Fred cites the DBpedia event URI and Frida cites the LCSH concept URI? How do we establish that their interests are indeed related? Can a publisher of an SKOS concept scheme indicate a relationship between a concept and a "conceptualised thing" (person, place, event etc)?

Dan provides a rather neat diagram illustrating the issue, using the example of Ronald Reagan. The arcs Dan labels "it" represent this "missing" (at the time the diagram was drawn) relationship type/property.

The resolution was to create a new property in the FOAF vocabulary, called foaf:focus. (See this page on the FOAF Project Wiki) for some of the discussion of its name).

The FOAF vocabulary specification says of the property:

The focus property relates a conceptualisation of something to the thing itself. Specifically, it is designed for use with W3C's SKOS vocabulary, to help indicate specific individual things (typically people, places, artifacts) that are mentioned in different SKOS schemes (eg. thesauri).

W3C SKOS is based around collections of linked 'concepts', which indicate topics, subject areas and categories. In SKOS, properties of a skos:Concept are properties of the conceptualization (see 2005 discussion for details); for example administrative and record-keeping metadata. Two schemes might have an entry for the same individual; the foaf:focus property can be used to indicate the thing in they world that they both focus on. Many SKOS concepts don't work this way; broad topical areas and subject categories don't typically correspond to some particular entity. However, in cases when they do, it is useful to link both subject-oriented and thing-oriented information via foaf:focus.

It's worth emphasising the point made in the penultimate sentence: not all concepts "have a focus"; some concepts are "just concepts" (poetry, slavery, conscientious objection, anarchism etc etc etc).

Dan summarises how he sees the new property being used in a message to the W3C public-esw-thes list:

The addition of foaf:topic is intended as a modest and pragmatic bridge between SKOS-based descriptions of topics, and other more entity-centric RDF descriptions. When a SKOS Concept stands for a person or agent, FOAF and its extensions are directly applicable; however we expect foaf:focus to also be used with places, events and other identifiable entities that are covered both by SKOS vocabularies as well as by factual datasets like wikipedia/dbpedia and Freebase.

A single "thing in the world" may be "the focus of" multiple concepts: e.g. several different library classification schemes may include concepts for the Napoleonic Wars or Ronald Reagan or Paris. Even within a single scheme, it may be that there are multiple concepts each reflecting different facets or aspects of a single entity.

VIAF

This aspect of the relationship between conceptualisation and "thing in the world" is illustrated in VIAF, the Virtual International Authority File, a service provided by OCLC. VIAF aggregates library "authority records" from multiple library "name authority files" maintained mainly by national libraries. Each record provides a "preferred form" of the name of a person or corporate entity, and multiple "alternate forms" - though that preferred form may vary from one file to the next. VIAF analyses and collates the aggregated data to establish which records refer to the same person or corporate entity, and presents the results as Linked Data.

Jeff Young of OCLC summarises the VIAF model in a post on the Outgoing blog. The post actually describes the transition between an earlier, slightly more complex model and the current model. For the purposes of this discussion, the thing to look at is the "Example (After)" graphic at the top right of the post (direct link)

Consider Dan's example of Ronald Reagan, identified by the VIAF URI http://viaf.org/viaf/76321889. The RDF description provided shows that there are eleven concepts linked to the person resource by a foaf:focus link. Each of those concepts (I think!) corresponds to a record in an authority file harvested by VIAF. Each concept has a preferred label (the preferred form of the name in that authority file) and may have a number of alternate labels. An abridged version of the VIAF description is below:

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> . 

<http://viaf.org/viaf/76321889> a foaf:Person ;
        foaf:name "Reagan, Ronald" ,
                  "Reagan, Ronald W." , 
                  "Reagan, Ronald, 1911-2004" , 
                  "Reagan, Ronald W. 1911-2004" , 
                  "Reagan, Ronald Wilson 1911-2004" , 
                  "Reagan, Elvis 1911-2004" ;
# plus various other names!
        owl:sameAs 
          <http://d-nb.info/gnd/118598724> ,
          <http://dbpedia.org/resource/Ronald_Reagan> , 
          <http://libris.kb.se/resource/auth/237204> , 
          <http://www.idref.fr/027091775/id> .

<http://viaf.org/viaf/sourceID/BNE%7CXX1025345#skos:Concept>
        a skos:Concept ;
        skos:prefLabel "Reagan, Ronald, 1911-2004" ;
        skos:altLabel "Reagan, Elvis 1911-2004" , 
                      "Reagan, Ronald W. 1911-2004", 
                      "Reagan, Ronald Wilson 1911-2004" ;
        skos:inScheme 
          <http://viaf.org/authorityScheme/BNE> ;
        foaf:focus 
          <http://viaf.org/viaf/76321889> .

<http://viaf.org/viaf/sourceID/BNF%7C11921304#skos:Concept>
        a skos:Concept ;
        skos:prefLabel "Reagan, Ronald, 1911-2004" ;
        skos:altLabel "Reagan, Ronald Wilson 1911-2004" ;
        skos:inScheme 
          <http://viaf.org/authorityScheme/BNF> ;
        foaf:focus 
          <http://viaf.org/viaf/76321889> .

# plus nine other concepts

(There really is an alternate label of "Reagan, Elvis 1911-2004" in the actual data!)

Fig1

Note that the owl:sameAs links here are between the VIAF person resource and person resources in external datasets.

LOCAH and index terms

My own first engagement with foaf:focus came during the LOCAH project. In deciding how to represent the content of the Archives Hub EAD documents as RDF, we had to decide how to model the use of "index terms" provided using the EAD <controlaccess> element. Within the Hub EAD documents, those index terms are names of one of the following categories of resource:

  • Concepts
  • Persons
  • Families
  • Organisations
  • Places
  • Genres or Forms
  • Functions

The names are sometimes (but not always) drawn from some sort of "controlled list", which is also named in the data. In other cases, they are constructed using some specified set of rules, again named in the data.

For some of these categories (Concepts, Genres/Forms, Functions), the "thing" named is simply a concept, an abstraction; for others (Persons, Families, Organisations and Places), there is a second "non-abstract" entity "out there in the world". And for this second case, we chose to represent the two distinct things, each with their own distinct URI, linked by a triple using the foaf:focus property.

The LOCAH data model is illustrated in the diagram in this post. The Concept entity type is in the lower centre; directly below are four boxes for the related "conceptualised" types (Person, Family, Organisation and Place), each linked from the Concept by the foaf:focus property.

As in the VIAF case, for a single person/family/organisation/place, there may be multiple distinct "conceptualisations", reflecting the fact that different data providers have referred to the same "thing in the world" by citing entries from different "authority files". The nature of the process by which the LOCAH RDF data is generated - the EAD documents are processed on a "document by document" basis - means that in this case, multiple URIs for the person are generated, and the "reconciliation" of these URIs as co-references to a single entity is performed as a subsequent step.

The British Library Linked Data

The British Library recently announced the release of Linked Open BNB, a new Linked Data dataset covering a subset of the British National Bibliography. The approach taken is described in a post by Richard Wallis of Talis, who worked with the BL as consultants in preparing the dataset.

The data model for the BNB data shows quite extensive use of the Concept-foaf:focus-Thing pattern.

For the "subjects" of Bibliographic Resources, a dcterms:subject link is made to a skos:Concept, reflecting an authority file or classification scheme entry, and which is in turn linked using foaf:focus to a Person, Family, Organisation or Place. In other cases, the Bibliographic Resource is linked directly to the "thing-in-the-world" and a corresponding Concept is also provided, linking to the "thing-in-the world" using foaf:focus. This is the case for languages, for persons as creators or contributors to the bibliographic resource, and for the Dublin Core "spatial coverage property".

So for the example http://bnb.data.bl.uk/id/resource/009436036 (again, this is a very stripped-down version of the actual data):

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix wgs84_pos: <http://www.w3.org/2003/01/geo/wgs84_pos#> .

<http://bnb.data.bl.uk/id/resource/009436036>
        a dcterms:BibliographicResource ;
        dcterms:creator 
          <http://bnb.data.bl.uk/id/person/KingGRD%28GeoffreyRD%29> ; 
        dcterms:subject
          <http://bnb.data.bl.uk/id/concept/place/lcsh/Aden%28Yemen%29> ;
        dcterms:spatial
          <http://bnb.data.bl.uk/id/place/Aden%28Yemen%29> ;
        dcterms:language 
          <http://lexvo.org/id/iso639-3/eng> .

<http://bnb.data.bl.uk/id/concept/place/lcsh/Aden%28Yemen%29> 
        a skos:Concept ;
        foaf:focus 
          <http://bnb.data.bl.uk/id/place/Aden%28Yemen%29> .

<http://bnb.data.bl.uk/id/place/Aden%28Yemen%29>        
        a dcterms:Location, wgs84_pos:SpatialThing .

In this case, the subject-concept and the spatial coverage-place happen to be linked to each other, but the point I wanted to illustrate was that the object of the dcterms:subject triple is the URI of a concept, and is the subject of an "outgoing" foaf:focus link to a "thing", but the object of the dcterms:spatial triple is the URI of a location/place, and is the object of an "incoming" foaf:focus link from a concept.

The two cases are perhaps best illustrated using the graph representation:

Fig2

Authority Files, Concepts, foaf:focus and Dublin Core

As discussed above, the dcterms:subject property is defined in such a way that it can be used to link to a thing of any type, although there may be a preference amongst some implementers to use dcterms:subject to link only to concepts.

For the other four properties I highlighted in the BL model, DCMI specifies an rdfs:range for the properties:

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dcterms: <http://purl.org/dc/terms/> .

dcterms:creator rdfs:range dcterms:Agent .
dcterms:contributor rdfs:range dcterms:Agent .
dcterms:spatial rdfs:range dcterms:Location .
dcterms:language rdfs:range dcterms:LinguisticSystem .

Those three classes (dcterms:Agent, dcterms:LinguisticSystem, dcterms:Location) are described using RDFS, and although there is no formal statement that they are disjoint from the class skos:Concept, I think their human-readable descriptions, taken together with those of the properties, carry a fairly strong suggestion that instances of these classes are the "things in the world" rather than their conceptualisations - certainly for the first three cases at least. A dcterms:Agent is "A resource that acts or has the power to act", which a concept can not; a dcterms:Location is "A spatial region or named place", which again seems distinct from a concept.

The case of dcterms:LinguisticSystem, "A system of signs, symbols, sounds, gestures, or rules used in communication", seems a bit less clear, as this is a "conceptual thing" but I think one can argue that the actual linguistic system practiced by a community of language speakers is a distinct concept from the "conceptualisation" created within a classification scheme.

And this is, I think, reflected in the patterns used in the BL data model.

As I noted earlier, the Library of Congress has published SKOS representations of a number of controlled vocabularies, and these include:

In each case, the entries/members of the vocabularies are modelled as instances of skos:Concept. Following the argument I just constructed above, then, to use these vocabularies with the dcterms:spatial and dcterms:languageproperties, strictly speaking, one should adopt the patterns used in the BL model, where the concept URI is not the direct object, but is linked to a thing (location, linguistic system) URI by a foaf:focus link.

Finally, I'd draw attention to lexvo.org, which also provides Linked Data representations for ISO639-3 languages and ISO 3166-1 / UN M.49 geographical regions. In contrast to the Library of Congress SKOS representations, lexvo.org models "the things themselves", the languages and geographical regions, e.g. (again, a subset of the actual data)

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix lvont: <http://lexvo.org/ontology#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .

<http://lexvo.org/id/iso639-3/spa>
        a lvont:Language ;
        rdfs:label "Spanish"@en ;
        lvont:usedIn 
          <http://lexvo.org/id/iso3166/ES> ;
        owl:sameAs 
          <http://dbpedia.org/resource/Spanish_language> .
        
<http://lexvo.org/id/iso3166/ES>
        a lvont:GeographicRegion ;
        rdfs:label "Spain"@en ;
        lvont:memberOf 
          <http://lexvo.org/id/un_m49/039> ;
        owl:sameAs 
          <http://sws.geonames.org/2510769> .
        
<http://lexvo.org/id/un_m49/039>
        a lvont:GeographicRegion ;
        rdfs:label "Southern Europe"@en ;
        lvont:hasMember 
          <http://lexvo.org/id/iso3166/ES> ;
        lvont:memberOf 
          <http://lexvo.org/id/un_m49/150> .
 
<http://lexvo.org/id/un_m49/150>
        a lvont:GeographicRegion ;
        rdfs:label "Europe"@en ;
        lvont:hasMember 
          <http://lexvo.org/id/un_m49/039> ;
        lvont:memberOf 
          <http://lexvo.org/id/un_m49/001> .

In the BL data one finds lexvo.org language URIs used as objects of dcterms:language (which we do in the LOCAH data too).

The lexvo.org language URIs are the subjects of properties such as lvont:usedIn linking language to place where it is used or spoken and of owl:sameAs triples linking to language URIs in other datasets. And the geographic region URIs are the subjects of properties such as lvont:memberOf linking one region to another region of which it is part and of owl:sameAs triples linking to place URIs in other datasets.

Compare this with the relationships between the SKOS-based "conceptualisations of languages" and "conceptualisations of geographic in the Library of Congress dataset (again, a subset of the actual data):

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .

<http://id.loc.gov/vocabulary/iso639-2/spa>
        a skos:Concept ;
        skos:prefLabel "Spanish | Castilian"@en ;
        skos:altLabel "Spanish"@en ,
                      "Castilian"@en ;
        skos:note "Bibliographic Code"@en ;
        skos:exactMatch 
          <http://id.loc.gov/vocabulary/languages/spa> ,
          <http://id.loc.gov/vocabulary/iso639-1/es> ;
        skos:inScheme 
          <http://id.loc.gov/vocabulary/iso639-2> .
                        
<http://id.loc.gov/vocabulary/countries/sp>
        a skos:Concept ;
        skos:prefLabel "Spain"@en ;
        skos:altLabel "Balearic Islands"@en ,
                      "Canary Islands"@en ;
        skos:exactMatch 
          <http://id.loc.gov/vocabulary/geographicAreas/e-sp> ;
        skos:broadMatch 
          <http://id.loc.gov/vocabulary/geographicAreas/e> ;
        skos:inScheme 
          <http://id.loc.gov/vocabulary/countries> .
        
<http://id.loc.gov/vocabulary/geographicAreas/e-sp>
        a skos:Concept ;        
        skos:prefLabel "Spain"@en ;
        skos:exactMatch 
          <http://id.loc.gov/vocabulary/countries/sp> ;
        skos:broader 
          <http://id.loc.gov/vocabulary/geographicAreas/e> ;
        skos:inScheme 
          <http://id.loc.gov/vocabulary/geographicAreas> .

<http://id.loc.gov/vocabulary/geographicAreas/e>
        a skos:Concept ;        
        skos:prefLabel "Europe"@en ;
        skos:narrower 
          <http://id.loc.gov/vocabulary/geographicAreas/e-sp> ;
        skos:narrowMatch 
          <http://id.loc.gov/vocabulary/countries/sp> ;
        skos:inScheme 
          <http://id.loc.gov/vocabulary/geographicAreas> .
        

Here the types of relationships involved are the SKOS "semantic relations" and "mapping properties" between concepts: e.g. Spain-as-concept has-broader-concept Europe-as-concept, and so on.

Conclusions

If anyone has read this far, I can imagine it is not without some rolling of eyes at the pedantic distinctions I seem to be unpicking!

Given my understanding of SKOS, FOAF and Dublin Core, I do think the designers of the BL data have done a sterling job in trying to "get it right", carefully observing the way terms have been described by their owners and seeking to use those terms in ways that are consistent with those descriptions.

At the same time, I admit I can well imagine that to many Dublin Core implementers who see the Dublin Core properties as providing a relatively simple approach, this will seem over-complicated.

And I rather expect that we will see uses of the dcterms:language and dcterms:spatial properties which simply link directly to the concept:

@prefix dcterms: <http://purl.org/dc/terms/> .

<http://example.org/doc/1234>
        a dcterms:BibliographicResource ;
        dcterms:spatial 
          <http://id.loc.gov/vocabulary/countries/sp> ;
        dcterms:language 
          <http://id.loc.gov/vocabulary/iso639-2/spa> .

I think this is particularly likely for the case of dcterms:spatial as it is perceived as "similar" to dcterms:subject - amongst other things, it covers "aboutness" in the case where the topic is a place - particularly since, as in the BL example above, the concept URI may be used with dcterms:subject in the very same graph.

Returning to the recent post by Dan which in part prompted me to start this post, he makes a similar point, focusing on the FOAF properties with which I introduced the post. He recognises that "in some contexts having careful names for all these inter-relations is very important" but suggests

We should consider making foaf:interest vaguer, such that any of those three are acceptable. Publishers aren't going to go looking up obscure distinctions between foaf:interest and foaf:topic_interest and that other pattern using SKOS ... they just want to mark that this is a relationship between someone and a URI characterising one of their interests.

So I suggest we become more tolerant about the URIs acceptable as values of foaf:interest

(See the message in full for Dans examples.)

Part of me is tempted to suggest that similar reasoning might be applied to the Dublin Core Terms properties i.e. that consideration might be given to "relaxing" the range of dcterms:spatial to allow for the use of either the place or the concept, to resolve the dilemna that, with the current design, the patterns are different for dcterms:subject and dcterms:spatial. But where do we stop? Do we also relax dcterms:language? I really don't know. And I think I'd be quite uneasy making the suggestion for the "agent" properties.

The fundamental issue here is that the thesaurus-based approach introduces a layer of abstraction - Dan's "social and technological artifact[s] designed to help interconnect descriptions" - and that is reflected in SKOS's notion of the skos:Concept. Outside the "knowledge organisation" community, however, many RDFS/OWL modellers "model the world directly": they consider the language as system used by a community of speakers (as modelled by lexvo.org), the place as thing in space (as modelled by Geonames), and so on. Constructs such as foaf:focus help us bridge the two approaches - but not without some complexity.

On Twitter on Friday evening, I noticed a (slightly provocative!) comment on Twitter from John Goodwin (@gothwin) which I think is related:

coming to the conclusion that #SKOS is being waaay over used #linkeddata

It attracted some sympathetic responses, e.g. from Rob Styles (@mmmmmmrob) 1, 2:

@gothwin @juansequeda if you're not publishing an existing subject heading scheme on the cheap then SKOS is the wrong tool.

@johnlsheridan @juansequeda @gothwin Paris is not "narrower than" France, it's "capital of". That's the problem.

which I think echoes my examples above of Spain and Europe.

And from Leigh Dodds (@ldodds): 1, 2:

@gothwin that's all it was designed for: converting one specific class of datasets into RDF. It's just been mistaken for a modelling tool

@juansequeda @gothwin @kendall only use #skos if you're CONVERTING a taxonomy. Otherwise, just model the domain

These comments struck home with me, and I think I may have made exactly this sort of mistake in some other work I've been doing recently, and I need to revisit it, to check whether what I've done is really necessary or useful. As John suggests, I think sometimes I reach for SKOS as a tool for something that has a "controlled list" feel to it without thinking hard enough whether it is really the appropriate tool for the task at hand.

Having said that, I do also think we will need to manage the "mixed" cases - particularly in datasets coming from communities such as libraries where the use of thesauri is commonplace, and a growing number are available as SKOS - where we end up needing the sort of bridge which foaf:focus provides, so some of this complexity may be unavoidable.

In any case, I think it's an area where some guidance/best practice notes - with lots of examples! - would be very helpful.

August 15, 2011

Two ends and one start

The end of July formally marked the end of two projects I've been contributing to recently:

There are still some things to finish for LOCAH, particularly the publication of the Copac data.

A few (not terribly original or profound) thoughts, prompted mainly by my experience of working with the Archives Hub data:

  • Sometimes data is "clean" and consistent and "normalised", but more often than not it has content in inconsistent forms or is incomplete: it is "messy". Data aggregated from multiple sources over a long period of time is probably going to be messier. (With an XML format like EAD, with what I think of as its somewhat "hybrid" document/data character, the potential for messiness increases).
  • Doing things with data requires some sort of processing by software, and while there are off-the-shelf apps and various configurable tools and frameworks that can provide some functions, some element of software development is usually required.
  • It may be relatively easy to identify in advance the major tasks where developer effort is required, and to plan for that, but sometimes there are additional tasks which it's more difficult to anticipate; rather they emerge as you attempt some of the other processes (and I think messy data probably makes that more likely).
  • Even with developers on board, development work has to be specified, and that is a task in itself, and one that can be quite complex and time-consuming (all the more so if you find yourself trying to think through and describe what to do with a range of different inputs from a messy data source).

It's worth emphassing that most of the above is not specific to generating Linked Data: it applies to data in all sorts of formats, and it applies to all sorts of tasks, whether you're building a plain old Web site or exposing RSS feeds or creating some other Web application to do something with the data.

Sort of related to all of the above, I was reminded that my limited programming skills often leave me in the position where I can identify what needs doing but I'm not able to "make it happen", and that is something I'd like to try to change. I can "get by" in XSLT, and I can do a bit of PHP, but I'd like to get to the point where I can do more of the simpler data manipulation tasks and/or pull together simple presentation prototypes.

I've enjoyed working on both the projects. I'm pleased that we've managed to deliver some Linked Data datasets, and I'm particularly pleased to have been able to contribute to doing this for archival description data, as it's a field in which I worked quite a long time ago.

Both projects gave me opportunities to learn by actually doing stuff, rather than by reading about it or prodding at other people's data. And perhaps more than anything, I've enjoyed the experience of working together as a group. We had a lot of discussion and exchange of ideas, and I'd like to think that this process was valuable in itself. In an early post on the SALDA blog, Karen Watson noted:

it is perhaps not the end result that is most important but the journey to get there. What we hope to achieve with SALDA is skills and knowledge to make our catalogues Linked Data and use those skills and that knowledge to inform decisions about whether it would be beneficial for make all our data Linked Data.

Of course it wasn't all plain sailing, and particularly near the start there were probably times when we ran up against differences of perception and terminology. Jane Stevenson has written about some of these issues from her point of view on the LOCAH blog (e.g. here, here and here). As the projects progressed, I think we moved closer towards more of a shared understanding - and I think that is a valuable "output" in its own right, even if it is one which it may be rather hard to "measure".

So, a big thank you to everyone I worked with on both projects.

Looking forward, I'm very pleased to be able to say that Jane prepared a proposal for some further work with the Archives Hub data, under JISC's Capital Funded Service Enhancements initiative, and that bid has been successful, and I'll be contributing some work to that project as a consultant. The project is called "Linking Lives" and is focused on providing interfaces for researchers to explore the data (as well as making any enhancements/extensions to the data and the "back-end" processes that are required to enable that). More on that work to come once we get moving.

Finally, as I'm sure many of you are aware, JISC recently issued some new calls for projects, including call 13/11 for projects "to develop services that aim to make content from museums, libraries and archives more visible and to develop innovative ways to reuse those resources". If there are any institutions out there who are considering proposals and think that I could make a useful contribution - despite my lamenting my limitations above, I hope my posts on the LOCAH and SALDA blogs give an idea of the sorts of areas I can contribute to! - , please do get in touch.

June 24, 2011

Our HE Cloud Pilot - the real work begins

I'm spending an increasing amount of my time thinking about/talking about/in meetings about "the cloud" - well, specifically the HE Cloud Pilot that we are now putting in place as part of the JISC UMF Shared Services and the Cloud Programme.

This is both intimidating and exciting...

Intimidating because a lot of this stuff is new to me and, if I'm completely honest, the hardware side of things really doesn't do it for me. Luckily, we have some pretty good people now assigned to designing and building this stuff and I'm impressed at the progress being made very quickly. But there are times, as happened in an architectural design meeting this afternoon, where the discussion gets too much for me and I have to leave (about networking in this particular case). I figure that if I've sat in a meeting for 30 minutes and not understood a single word, and therefore couldn't summarise for someone else what that meeting was about, then I'm probably not being useful :-).

Exciting, because I think we have the potential to do something really useful and innovative here. That's not to say that success is guaranteed and I think that there are very real challenges for us in terms of building something that is of value to the education community and that has a sustainable business model associated with it. We spent an afternoon earlier this week (in the form of a premortem) brainstorming a pretty long list of everything that had gone wrong with the project from the perspective of 12 months hence - technology, usability, resourcing, business models and so on - and then thinking about the kinds of actions we wished we'd have taken to mitigate against them. It was a very useful exercise.

In delivering the infrastructure we'll start by targetting the various SaaS projects that are now being funded by the JISC (we've already made contact with most of them). But we'll also keep one eye on a much wider application of the infrastructure, to meet adminstrative and academic compute and storage requirements of institutions and individual researchers more generally because, long term, that is where our success lies.

Note that 'cloud' is a little narrow here since, on the compute side, our offer will include physical servers (though I think we'll try and discourage this to a certain extent), virtualisation in the form of VMware vCloud as well as true cloud in the form of OpenStack Compute.

I want to try and put something public in place where we can document the design decisions we are taking as we take them - partly as a sounding board for the community. I'm not sure where yet, though I suspect it won't be here. If you have an interest in what we are doing, keep an eye out.

May 25, 2011

Virtualisation and the cloud - the Eduserv Symposium 2011, a brief review

"I know what you're thinking... but don't worry, this is a talk on cloud computing and being lost is normal".

So started the Eduserv Symposium 2011 two weeks ago, with an opening keynote by Simon Wardley (Leading Edge Forum) that was anything but confusing. In fact, I heard several people comment after his talk that it was the best overview of cloud computing that they'd seen and that they couldn't wait for the video to be made available (below) so that they could show it to their senior management team as a way of highlighting the business and technological changes that are driving, and being driven by, the cloud.

Simon's opening talk was followed by a series of talks - Chris Cobb (Roehampton University), Kenji Takeda (University of Southampton), Phil Richards (Loughborough University) and Terry Harmer (Belfast e-Science Centre) - which provided an institutional perspective, both strategic and practical, on the ways in which shared services, virtualisation and the cloud might impact on administrative and research computing in UK universities and colleges.

And, in the middle of these, there were a short series of lightning (10 minute) talks by Rachel Bruce (JISC), Kevin Ashley (DCC), Dan Perry (JANET (UK)) and Matt Johnson (Eduserv) covering some aspects of what the JISC UMF Programme hopes to achieve over the next 12 months or so.

Rounding off the day was a great closing keynote by Armando Fox (UC Berkeley) (below), providing a US academic perspective that looked in some detail at the thinking around 'cloud' being done within the department of Electrical Engineering and Computer Science at UC Berkeley.

All the videos and presentation slides from the day are now available online.

The day was a significant challenge to put together. We wanted something that would try and cover the breadth of activity happening in the cloud space currently, particularly as it relates to the UK education community. We wanted something that would introduce people to what is likely to be happening over the next 12 months or so as part of the UMF Programme. And we wanted something that would challenge our thinking (both 'our' as in the education community and 'our' as in Eduserv) around the development and use of the cloud. Did we succeed? Yes, I think so... and, overall, I'm very happy with the way the day panned out. I'm particularly grateful to all our speakers.

The talks threw down some pretty significant challenges for those of us, like Eduserv, who are interested in building cloud services targeted at the education sector. I went in to the day with a question: is the education community a consumer of infrastructural services in the cloud or can it also be a sustainable provider? (By 'sustainable' I mean something other than 'funded centrally by public money'). Do I now know the answer? No. But I am significantly more aware of the challenges and issues that lie behind that question. This is something that I will return to in a future post.

In lieu of that deeper discussion, I have three rather more superficial take-home messages from the day. Firstly, that adopting the cloud (i.e. moving to commodity computing) is at least as much about changes to management structure, market competition and disruption as it is about technology (though I must admit that I don't quite understand how this might play out in, say, higher education). Secondly, that the adoption of cloud infrastructure should not be seen primarily as a way of saving money. Rather it is a way of enabling innovation and allowing things to be done that were not possible before. And thirdly, that the sustainability issues (for educational cloud providers) are at least as much about the ability to keep up with a rapidly changing and highly innovative environment as they are about price.

For us, the next 12 months look really interesting. As part of the UMF Programme we are receiving funding to build some pilot infrastructure for use by HE (some of which will be true 'cloud' infrastructure). This initial funding has, to a certain extent, reduced the risks for us in getting invloved in this space but we'll still have to work very hard to create a sustainable service in the longer term. Whatever else the UMF Programme achieves over the next 12 months or so, what I hope our involvement can do is to help build a better understanding of some of the issues and challenges laid out at the symposium.

As to my, as yet, unanswered question about whether an organisation like Eduserv can be a sustainable provider of cloud infrastructure... ask me again in 12 months :-)

About

Search

Loading
eFoundations is powered by TypePad