December 21, 2009

Scanning horizons for the Semantic Web in higher education

The week before last I attended a couple of meetings looking at different aspects of the use of Semantic Web technologies in the education sector.

On the Wednesday, I was invited to a workshop of the JISC-funded ResearchRevealed project at ILRT in Bristol. From the project weblog:

ResearchRevealed [...] has the core aim of demonstrating a fine-grained, access controlled, view layer application for research, built over a content integration repository layer. This will be tested at the University of Bristol and we aim to disseminate open source software and findings of generic applicability to other institutions.

ResearchRevealed will enhance ways in which a range of user stakeholder groups can gain up-to-date, accurate integrated views of research information and thus use existing institutional, UK and potentially global research information to better effect.

I'm not formally part of the project, but Nikki Rogers of ILRT mentioned it to me at the recent VoCamp Bristol meeting, and I expressed a general interest in what they were doing; they were also looking for some concrete input on the use of Dublin Core vocabularies in some of their candidate approaches.

This was the third in a series of small workshops, attended by representatives of the project from Bristol, Oxford and Southampton, and the aim was to make progress on defining a "core Research ontology". The morning session circled mainly around usage scenarios (support for the REF (and other "impact" assessment exercises), building and sustaining cross-institutional collaboration etc), and the (somewhat blurred) boundaries between cross-institutional requirements and institution-specific ones; what data might be aggregated, what might be best "linked to"; and the costs/benefits of rich query interfaces (e.g. SPARQL endpoints) v simpler literal- or URI-based lookups. In the afternoon, Nick Gibbins from the University of Southampton walked through a draft mapping of the CERIF standard to RDF developed by the dotAC project. This focused attention somewhat and led to some - to me - interesting technical discussions about variant ways of expressing information with differing degrees of precision/flexibility. I had to leave before the end of the meeting, but I hope to be able to continue to follow the project's progress, and contribute where I can.

A long train journey later, the following day I was at a meeting in Glasgow organised by the CETIS Semantic Technologies Working Group to discuss the report produced by the recent JISC-funded Semtech project, and to try to identify potential areas for further work in that area by CETIS and/or JISC. Sheila MacNeill from CETIS liveblogged proceedings here. Thanassis Tiropanis from the University of Southampton presented the project report, with a focus on its "roadmap for semantic technology adoption". The report argues that, in the past, the adoption of semantic technologies may have been hindered by a tendency towards a "top-down" approach requiring the widespread agreement on ontologies; in contrast the "linked data" approach encourages more of a "bottom-up" style in which data is first made available as RDF, and then later application-specific or community-wide ontologies are developed to enable more complex reasoning across the base data (which may involve mapping that initial data to those ontologies as they emerge). While I think there's a slight risk of overstating the distinction - in my experience many "linked data" initiatives do seem to demonstrate a good deal of thinking about the choice of RDF vocabularies and compatibility with other datasets - and I guess I see rather more of a continuum, it's probably a useful basis for planning. The report recommends a graduated approach which focusses initially on the development of this "linked data field" - in particular where there are some "low-hanging fruit" cases of data already made available in human-readable form which could relatively easily be made available in RDF, especially using RDFa.

One of the issues I was slightly uneasy with in the Glasgow meeting was that occasionally there were mentions of delivering "interoperability" (or "data interoperability") without really saying what was meant by that - and I say this as someone who used to have the I-word in my job title ;-) I feel we probably need to be clearer, and more precise, about what different "semantic technologies" (for want of a better expression) enable. What does the use of RDF provide that, say, XML typically doesn't? What does, e.g., RDF Schema add to that picture? What about convergence on shared vocabularies? And so on. Of course, the learners, teachers, researchers and administrators using the systems don't need to grapple with this, but it seems to me such aspects do need to be conveyed to the designers and developers, and perhaps more importantly - as Andy highlighted in his report of related discussions at the CETIS conference - to those who plan and prioritise and fund such development activity. (As an aside, I think this is also something of an omission in the current version of the DCMI document on "Interoperability Levels": it tells me what characterises each level, and how I can test for whether an application meets the requirements of the level, but it doesn't really tell me what functionality each level provides/enables, or why I should consider level n+1 rather than level n.)

Rather by chance, I came across a recent presentation by Richard Cyganiak to the Vienna Linked Data Camp, which I think addresses some similar questions, albeit from a slightly different starting point: Richard asks the questions, "So, if we have linked data sources, what's stopping the development of great apps? What else do we need?", and highlights various dimensions of "heterogeneity" which may exist across linked data sources (use of identifiers, differences in modelling, differences in RDF vocabularies used, differences in data quality, differences in licensing, and so on).

Finally, I noticed that last Friday, Paul Miller (who was also at the CETIS meeting) announced the availability of a draft of a "Horizon Scan" report on "Linked Data" which he has been working on for JISC, as part of the background for a JISC call for projects in this area some time early in 2010. It's a relatively short document (hurrah for short reports!) but I've only had time for a quick skim through. It aims for some practical recommendations, ranging from general guidance on URI creation and the use of RDFa to more specific actions on particular resources/datasets. And here I must reiterate what Paul says in his post - it's a draft on which he is seeking comments, not the final report, and none of those recommendations have yet been endorsed by JISC. (If you have comments on the document, I suggest that you submit them to Paul (contact details here or comment on his post) rather than commenting on this post.)

In short, it's encouraging to see the active interest in this area growing within the HE sector. On reading Paul's draft document, I was struck by the difference between the atmosphere now (both at the Semtech meeting, and more widely) and what Paul describes as the "muted" conclusions of Brian Matthews' 2005 survey report on Semantic Web Technologies for JISC Techwatch. Of course, many of the challenges that Andy mentioned in his report of the CETIS conference session remain to be addressed, but I do sense that there is a momentum here - an excitement, even - which I'm not sure existed even eighteen months ago. It remains to be seen whether and how that enthusiasm translates into applications of benefit to the educational community, but I look forward to seeing how the upcoming JISC call, and the projects it funds, contribute to these developments.

December 08, 2009

UK government’s public data principles

The UK government has put down some pretty firm markers for open data in its recent document, Putting the Frontline First: smarter government. The section entitled Radically opening up data and promoting transparency sets out the agenda as follows:

  1. Public data will be published in reusable, machine-readable form
  2. Public data will be available and easy to find through a single easy to use online access point (http://www.data.gov.uk/)
  3. Public data will be published using open standards and following the recommendations of the World Wide Web Consortium
  4. Any 'raw' dataset will be represented in linked data form
  5. More public data will be released under an open licence which enables free reuse, including commercial reuse
  6. Data underlying the Government's own websites will be published in reusable form for others to use
  7. Personal, classified, commercially sensitive and third-party data will continue to be protected.

(Bullet point numbers added by me.)

I'm assuming that "linked data" in point 4 actually means "Linked Data", given reference to W3C recommendations in point 3.

There's also a slight tension between points 4 and 5, if only because the use of the phrase, "more public data will be released under an open licence", in point 5 implies that some of the linked data made available as a result of point 4 will be released under a closed licence.  One can argue about whether that breaks the 'rules' of Linked Data but it seems to me that it certainly runs counter to the spirit of both Linked Data and what the government says it is trying to do here?

That's a pretty minor point though and, overall, this is a welcome set of principles.

Linked Data, of course, implies URIs and good practice suggests Cool URIs as the basic underlying principle of everything that will be built here.  This applies to all government content on the Web, not just to the data being exposed thru this particular initiative.  One of the most common forms of uncool URI to be found on the Web in government circles is the technology-specific .aspx suffix... hey, I work for an organisation that has historically provided the technology to mint a great deal of these (though I think we do a better job now).  It's worth noting, for example, that the two URIs that I use above to cite the Putting the Frontline First document both end in .aspx - ironic huh?

I'm not suggesting that cool URIs are easy, but there are some easy wins and getting the message across about not embedding technology into URIs is one of the easier ones... or so it seems to me anyway.
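
The "easy win" of spotting technology-specific suffixes can be sketched in a few lines. This is only a rough illustration: the suffix list and the example.gov.uk URIs are assumptions made up for the example, not taken from any real guidance.

```python
from urllib.parse import urlparse
import posixpath

# Hypothetical, non-exhaustive list of technology-specific suffixes.
TECH_SUFFIXES = {".aspx", ".asp", ".php", ".jsp", ".cgi"}


def technology_suffix(uri):
    """Return the technology-specific suffix found in the URI path, or None."""
    path = urlparse(uri).path              # ignores any query string or fragment
    _, ext = posixpath.splitext(path)
    ext = ext.lower()
    return ext if ext in TECH_SUFFIXES else None


# Made-up URIs, used only to show the two outcomes.
for uri in ("http://example.gov.uk/reports/frontline.aspx",
            "http://example.gov.uk/reports/frontline"):
    print(uri, "->", technology_suffix(uri) or "no technology suffix found")
```

A check like this only catches one kind of uncool URI, of course; it says nothing about whether the URI will persist.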

December 03, 2009

On being niche

I spoke briefly yesterday at a pre-IDCC workshop organised by REPRISE.  I'd been asked to talk about Open, social and linked information environments, which resulted in a re-hash of the talk I gave in Trento a while back.

My talk didn't go too well to be honest, partly because I was on last and we were over-running so I felt a little rushed but more because I'd cut the previous set of slides down from 119 to 6 (4 really!) - don't bother looking at the slides, they are just images - which meant that I struggled to deliver a very coherent message.  I looked at the most significant environmental changes that have occurred since we first started thinking about the JISC IE almost 10 years ago.  The resulting points were largely the same as those I have made previously (listen to the Trento presentation) but with a slightly preservation-related angle:

  • the rise of social networks and the read/write Web, and a growth in resident-like behaviour, means that 'digital identity' and the identification of people have become more obviously important and will remain an important component of provenance information for preservation purposes into the future;
  • Linked Data (and the URI-based resource-oriented approach that goes with it) is conspicuous by its absence in much of our current digital library thinking;
  • scholarly communication is increasingly diffusing across formal and informal services both inside and outside our institutional boundaries (think blogging, Twitter or Google Wave for example) and this has significant implications for preservation strategies.

That's what I thought I was arguing anyway!

I also touched on issues around the growth of the 'open access' agenda, though looking at it now I'm not sure why because that feels like a somewhat orthogonal issue.

Anyway... the middle bullet has to do with being mainstream vs. being niche.  (The previous speaker, who gave an interesting talk about MyExperiment and its use of Linked Data, made a similar point).  I'm not sure one can really describe Linked Data as being mainstream yet, but one of the things I like about the Web Architecture and REST in particular is that they describe architectural approaches that have proven to be hugely successful, i.e. they describe the Web.  Linked data, it seems to me, builds on these in very helpful ways.  I said that digital library developments often prove to be too niche - that they don't have mainstream impact.  Another way of putting that is that digital library activities don't spend enough time looking at what is going on in the wider environment.  In other contexts, I've argued that "the only good long-term identifier, is a good short-term identifier" and I wonder if that principle can and should be applied more widely.  If you are doing things on a Web-scale, then the whole Web has an interest in solving any problems - be that around preservation or anything else.  If you invent a technical solution that only touches on scholarly communication (for example) who is going to care about it in 50 or 100 years - answer, not all that many people.

It worries me, for example, when I see an architectural diagram (as was shown yesterday) which has channels labelled 'OAI-PMH', 'XML' and 'the Web'!

After my talk, Chris Rusbridge asked me if we should just get rid of the JISC IE architecture diagram.  I responded that I am happy to do so (though I quipped that I'd like there to be an archival copy somewhere).  But on the train home I couldn't help but wonder if that misses the point.  The diagram is neither here nor there, it's the "service-oriented, we can build it all", mentality that it encapsulates that is the real problem.

Let's throw that out along with the diagram.

December 01, 2009

On "Creating Linked Data"

In the age of Twitter, short, "hey, this is cool" blog posts providing quick pointers have rather fallen out of fashion, but I thought this material was worth drawing attention to here. Jeni Tennison, who is contributing to the current work with Linked Data in UK government, has embarked on a short series of tutorial-style posts called "Creating Linked Data", in which she explains the steps typically involved in reformulating existing data as linked data, and discusses some of the issues arising.

Her "use case" is the scenario in which some data is currently available in CSV format, but I think much of the discussion could equally be applied to the case where the provider is making data available for the first time. The opening post in the sequence ("Analysing and Modelling") provides a nice example of working through the sort of "things or strings?" questions which we've tried to highlight in the context of designing DC Application Profiles. And as Jeni emphasises, this always involves design choices:

It’s worth noting that this is a design process rather than a discovery process. There is no inherent model in any set of data; I can guarantee you that someone else will break down a given set of data in a different way from you. That means you have to make decisions along the way.

And further on in the piece, she rationalises her choices for this example in terms of what those choices enable (e.g. "whenever there’s a set of enumerated values it’s a good idea to consider turning them into things, because to do so enables you to associate extra information about them").
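
The "turn enumerated values into things" idea can be illustrated with a minimal sketch. The base URI, the status values and the `comment` field are all hypothetical, invented for the example rather than taken from Jeni's posts.

```python
import re

# Hypothetical URI space for the enumerated values.
BASE = "http://example.org/id/status/"


def slug(label):
    """Lower-case a label and replace runs of non-alphanumerics with '-'."""
    return re.sub(r"[^a-z0-9]+", "-", label.lower()).strip("-")


def mint_things(values, base=BASE):
    """Map each distinct enumerated string to a 'thing' with a URI and a label."""
    things = {}
    for value in values:
        if value not in things:
            things[value] = {"uri": base + slug(value), "label": value}
    return things


# Made-up column of enumerated strings, as might appear in a CSV file.
things = mint_things(["Open", "Closed", "Open", "Under review"])

# Once the value is a thing, extra information can be attached to it.
things["Under review"]["comment"] = "Responses are being analysed."
```

As a string, "Under review" can only be compared with other strings; as a thing with its own URI, it can carry a label, a comment, or links to other resources.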

The post on URI design offers some tips, not only on designing new URIs but also on using existing URIs where appropriate: I admit I tend to forget about useful resources like placetime.com, "a URI space containing URIs that represent places and times" (which also provides redirects to descriptions in various formats).

On a related note, the post on choosing/coining properties, classes and datatypes includes a pointer to the OWL Time ontology. This is something I was aware of, but only looked at in any detail relatively recently. At first glance it can seem rather complex; Ian Davis has a summary graphic which I found helpful in trying to get my head round the core concepts of the ontology.

It seems to me these sorts of very common areas, like time data, are those around which some shared practice will emerge, and articles like these, by "hands-on" practitioners, are important contributions to that process.

November 20, 2009

COI guidance on use of RDFa

Via a post from Mark Birbeck, I notice that the UK Central Office for Information has published some guidelines called Structuring information on the Web for re-usability which include some guidance on the use of RDFa to provide specific types of information, about government consultations and about job vacancies.

This is exciting news as, as far as I know, this is the first formal document from UK central government to provide this sort of quite detailed, resource-type-specific guidance with recommendations on the use of particular RDF vocabularies - guidance of the sort I think will be an essential component in the effective deployment of RDFa and the Linked Data approach. It's also the sort of thing that is of considerable interest to Eduserv, as a developer of Web sites for several government agencies. The document builds directly on the work Mark has been doing in this area, which I mentioned a while ago.

As Mark notes in his post, the document is unequivocal in its expression of the government's commitment to the Linked Data approach:

Government is committed to making its public information and data as widely available as possible. The best way to make structured information available online is to publish it as Linked Data. Linked Data makes the information easier to cut and combine in ways that are relevant to citizens.

Before the announcement of these guidelines, I recently had a look at the "argot" for consultations ("argot" is Mark's term for a specification of how a set of terms from multiple RDF vocabularies is used to meet some application requirement; as I noted in that earlier post, I think it is similar to what DCMI calls an "application profile"), and I had intended to submit some comments. I fear it is now somewhat late in the day for me to be doing this, but the release of this document has prompted me to write them up here. My comments are concerned primarily with the section titled "Putting consultations into Linked Data".

The guidelines (correctly, I think) establish a clear distinction between the consultation on the one hand and the Web page describing the consultation on the other by (in paragraphs 30/31) introducing a fragment identifier for the URI of the consultation (via the about="#this" attribute). The consultation itself is also modelled as a document, an instance of the class foaf:Document, which in turn "has as parts" the actual document(s) on which comment is being sought, and for which a reply can be sent to some agent.
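
The effect of the fragment identifier can be shown with a small sketch, assuming a made-up page address: the relative reference "#this" resolves to a URI that is distinct from the URI of the page describing the consultation.

```python
from urllib.parse import urljoin, urldefrag

# Hypothetical address of the Web page describing a consultation.
page_uri = "http://example.gov.uk/consultations/2009/sample"

# about="#this" is a relative reference, resolved against the page URI.
consultation_uri = urljoin(page_uri, "#this")

# Stripping the fragment recovers the URI of the describing page.
document_uri, fragment = urldefrag(consultation_uri)

print(consultation_uri)   # identifies the consultation
print(document_uri)       # identifies the page that describes it
```

The two URIs differ only by the fragment, but that is enough to let statements about the consultation be kept apart from statements about the page.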

I confess that my initial "instinctive" reaction to this was that this seemed a slightly odd choice, as a "consultation" seemed to me to be more akin to an event or a process, taking place during an interval of time, which had as "inputs" to that process a set of documents on which comments were sought, and (typically at least) resulted in the generation of some other document as a "response". And indeed the page describing the Consultation argot introduces the concept as follows (emphasis added):

A consultation is a process whereby Government departments request comments from interested parties, so as to help the department make better decisions. A consultation will usually be focused on a particular topic, and have an accompanying publication that sets the context, and outlines particular questions on which feedback is requested. Other information will include a definite start and end date during which feedback can be submitted and contact details for the person to submit feedback to.

I admit I find it difficult to square this with the notion of a "document". And I think a "consultation-as-event" (described by a Web page) could probably be modelled quite neatly using the Event Ontology or the similar LODE ontology (with some specialisation of classes and properties if required).

Anyway, I appreciate that aspect may be something of a "design choice". So for the remainder of the comments here, I'll stick to the actual approach described by the guidelines (consultation as document).

The RDF properties recommended for the description of the consultation are drawn mainly from Dublin Core vocabularies, and more specifically from the "DC Terms" vocabulary.

The first point to note is that, as Andy noted recently, DCMI recently made some fairly substantive changes to the DC Terms vocabulary, as a result of which the majority of the properties are now the subject of rdfs:range assertions, which indicate whether the value of the property is a literal or a non-literal resource. The guidelines recommend the use of the publisher (paragraphs 32-37), language (paragraphs 38-39), and audience (paragraph 46) properties, all with literal values, e.g.

<span property="dc:publisher" content="Ministry of Justice"></span>

But according to the term descriptions provided by DCMI, the ranges of these properties are the classes dcterms:Agent, dcterms:LinguisticSystem and dcterms:AgentClass respectively. So I think that would require the use of an XHTML-RDFa construct something like the following, introducing a blank node (or a URI for the resource if one is available):

<div rel="dc:publisher"><span property="foaf:name" content="Ministry of Justice"></span></div>
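
The mismatch described above, a literal value used with a property whose range is a non-literal class, is the kind of thing a simple check could flag. The following is only a sketch: the three ranges are those mentioned in this post, written with a `dcterms:` prefix, and the function name is invented for illustration.

```python
# Ranges as described above for three DC Terms properties.
NON_LITERAL_RANGES = {
    "dcterms:publisher": "dcterms:Agent",
    "dcterms:language": "dcterms:LinguisticSystem",
    "dcterms:audience": "dcterms:AgentClass",
}


def check_statement(prop, value_is_literal):
    """Return a warning if a literal is used where a non-literal is expected."""
    expected = NON_LITERAL_RANGES.get(prop)
    if expected is not None and value_is_literal:
        return f"{prop}: literal value given, but the range is {expected}"
    return None   # no problem detected (or the property is not in the table)


# The first pattern (literal value) is flagged; the second (blank node) is not.
print(check_statement("dcterms:publisher", value_is_literal=True))
print(check_statement("dcterms:publisher", value_is_literal=False))
```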

Second, I wasn't sure about the recommendation for the use of the dcterms:source property (paragraph 37). This is used to "indicate the source of the consultation". For the case where this is a distinct resource (i.e. distinct from the consultation and this Web page describing it), this seems OK, but the guidelines also offer the option of referring to the current document (i.e. the Web page) as the source of the consultation:

<span rel="dc:source" resource=""></span>

DCMI's definition of the property is "A related resource from which the described resource is derived", but it seems to me the Web page is acting as a description of the consultation-as-document, rather than a source of it.

Third, the guidelines recommend the use of some of the DCMI date properties (paragraph 42):

  • dcterms:issued for the publication date of the consultation
  • dcterms:available for the start date for receiving comments
  • dcterms:valid for the closing date ("a date through which the consultation is 'valid'")

I think the use of dcterms:valid here is potentially confusing. DCMI's definition is "Date (often a range) of validity of a resource", so on this basis I think the implication of the recommended usage is that the consultation is "valid" only on that date, which is not what is intended. The recommendations for dcterms:issued and dcterms:available are probably OK - though I do think the event-based approach might have helped make the distinction between dates related to documents and dates related to consultation-as-process rather clearer!
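
Since DCMI's definition mentions a range, one option (not the one the guidelines recommend) would be to state the whole interval rather than just the closing date. The sketch below assumes the DCMI Period encoding, in which start and end components are separated by semicolons; the dates are invented.

```python
from datetime import date


def dcmi_period(start, end):
    """Format an interval as a DCMI Period string with W3C-DTF dates."""
    return f"start={start.isoformat()}; end={end.isoformat()}; scheme=W3C-DTF;"


# Hypothetical opening and closing dates for a consultation.
period = dcmi_period(date(2009, 11, 2), date(2010, 1, 29))
print(period)
```

A value like this reads as "valid from the start date through the end date", which is closer to what the closing date is meant to convey.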

Oh dear, this must read like an awful lot of pedantic nitpicking on my part, but my intent is to try to ensure that widely used vocabularies like those provided by DCMI are used as consistently as possible. As I said at the start I'm very pleased to see this sort of very practical guidance appearing (and I apologise to Mark for not submitting my comments earlier!)

November 16, 2009

The future has arrived


About 99% of the way thru Bill Thompson's closing keynote to the CETIS 2009 Conference last week I tweeted:

great technology panorama painted by @billt in closing talk at #cetis09

And it was a great panorama - broad, interesting and entertainingly delivered. It was a good performance and I am hugely in awe of people who can give this kind of presentation. However, what the talk didn't do was move from the "this is where technology has come from, this is where it is now and this is where it is going" kind of stuff to the "and this is what it means for education in the future". Which was a shame because in questioning after his talk Thompson did make some suggestions about the future of print news media (not surprising for someone now at the BBC) and I wanted to hear similar views about the future of teaching, learning and research.

As Oleg Liber pointed out in his question after the talk, universities, and the whole education system around them, are lumbering beasts that will be very slow to change in the face of anything. On that basis, whilst the claims that (for example) we can now just about store a bit on an atom (meaning that we can potentially store a digital version of all human output on something the weight of a human body), that we can pretty much wire things directly into the human retina, and that Africa will one day overtake 'digital' Britain in the broadband stakes are interesting individual propositions in their own right, there comes a "so what?" moment where one is left wondering what it actually all means.

As an aside, and on a more personal note, I suggest that my daughter's experience of university (she started at Sheffield Hallam in September) is not actually going to be very different to my own, 30-odd years ago. Lectures don't seem to have changed much. Project work doesn't seem to have changed much. Going out drinking doesn't seem to have changed much. She did meet all her 'hall' flat-mates via Facebook before she arrived in Sheffield I suppose :-) - something I never had the opportunity to do (actually, I never even got a place in hall). There is a big difference in how it is all paid for of course but the interesting question is how different university will be for her children. If the truth is, "not much", then I'm not sure why we are all bothering.

At one point, just after the bit about storing a digital version of all human output I think, Thompson did throw out the question, "...and what does that mean for copyright law?". He didn't give us an answer. Well, I don't know either to be honest... though it doesn't change the fact that creative people need to be rewarded in some way for their endeavours I guess. But the real point here is that the panorama of technological change that Thompson painted for us, interesting as it was, begs some serious thinking about what the future holds.  Maybe Thompson was right to lay out the panorama and leave the serious thinking to us?

He was surprisingly positive about Linked Data, suggesting that the time is now right for this to have a significant impact.  I won't disagree because I've been making the same point myself in various fora, though I tend not to shout it too loudly because I know that the Semantic Web has a history of not quite making it.  Indeed, the two parallel sessions that I attended during the conference, University API and the Giant Global Graph, both focused quite heavily on the kinds of resources that universities are sitting on (courses, people/expertise, research data, publications, physical facilities, events and so on) that might usefully be exposed to others in some kind of 'open' fashion.  And much of the debate, particularly in the second session (about which there are now some notes), was around whether Linked Data (i.e. RDF) is the best way to do this - a debate that we've also seen played out recently on the uk-government-data-developers Google Group.

The three primary issues seemed to be:

  • Why should we (universities) invest time and money exposing our data in the hope that people will do something useful/interesting/of value with it when we have many other competing demands on our limited resources?
  • Why should we take the trouble to expose RDF when it's arguably easier for both the owner and the consumer of the data to expose something simpler like a CSV file?
  • Why can't the same ends be achieved by offering one or more services (i.e. a set of one or more APIs) rather than the raw data itself?
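
To make the CSV v RDF question a little more concrete, here is a rough sketch of the same row expressed both ways. The CSV content and the example.org base URI are assumptions invented for the example; only the DC Terms namespace is real.

```python
import csv
import io

# Made-up CSV data of the kind a university might already hold.
CSV_TEXT = """id,title
c1,Introduction to Metadata
"""

BASE = "http://example.org/id/course/"    # hypothetical URI space
DCT = "http://purl.org/dc/terms/"


def literal(text):
    """Quote a string as an N-Triples literal, escaping special characters."""
    escaped = (text.replace("\\", "\\\\").replace('"', '\\"')
                   .replace("\n", "\\n").replace("\r", "\\r"))
    return f'"{escaped}"'


def rows_to_ntriples(csv_text):
    """Turn each CSV row into N-Triples lines about a URI-identified resource."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = f"<{BASE}{row['id']}>"
        lines.append(f"{subject} <{DCT}identifier> {literal(row['id'])} .")
        lines.append(f"{subject} <{DCT}title> {literal(row['title'])} .")
    return lines


for line in rows_to_ntriples(CSV_TEXT):
    print(line)
```

The conversion itself is trivial. What the RDF form adds is that both the row (the course) and the columns (the properties) get globally scoped identifiers that other datasets can refer to, and whether that is worth the extra effort is exactly the question being debated.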

In the ensuing debate about the why and the how, there was a strong undercurrent of, "two years ago SOA was all the rage, now Linked Data is all the rage... this is just a fashion thing and in two years time there'll be something else".  I'm not sure that we (or at least I) have a well honed argument against this view but, for me at least, it lies somewhere in the fit with resource-orientation, with the way the Web works, with REST, and with the Web Architecture.

On the issue of the length of time it is taking for the Semantic Web to have any kind of mainstream impact, Ian Davis has an interesting post, Make or Break for the Semantic Web?, arguing that this is not unusual for standards track work:

Technology, especially standards track work, takes years to cross the chasm from early adopters (the technology enthusiasts and visionaries) to the early majority (the pragmatists). And when I say years, I mean years. Take CSS for example. I’d characterise CSS as having crossed the chasm and it’s being used by the early majority and making inroads into the late majority. I don’t think anyone would seriously argue that CSS is not here to stay.

According to this semi-official history of CSS the first proposal was in 1994, about 13 years ago. The first version that was recognisably the CSS we use today was CSS1, issued by the W3C in December 1996. This was followed by CSS2 in 1998, the year that also saw the founding of the Web Standards Project. CSS 2.1 is still under development, along with portions of CSS3.

Paul Walk has also written an interesting post, Linked, Open, Semantic?, in which he argues that our discussions around the Semantic Web and Linked Data tend to mix up three memes (open data, linked data and semantics) in rather unhelpful ways. I tend to agree, though I worry that Paul's proposed distinction between Linked Data and the Semantic Web is actually rather fuzzier than we may like.

On balance, I feel a little uncomfortable that I am not able to offer a better argument against the kinds of anti-Linked Data views expressed above. I think I understand the issues (or at least some of them) pretty well but I don't have them to hand in a kind of "this is why Linked Data is the right way forward" 'elevator pitch'.

Something to work on I guess!

[Image: a slide from Bill Thompson's closing keynote to the CETIS 2009 Conference]

October 19, 2009

Helpful Dublin Core RDF usage patterns

In the beginning [*] there was the HTML meta element and we used to write things like:

<meta name="DC.Creator" content="Andy Powell">
<meta name="DC.Subject" content="something, something else, something else again">
<meta name="DC.Date.Available" scheme="W3CDTF" content="2009-10-19">
<meta name="DC.Rights" content="Open Database License (ODbL) v1.0">

Then came RDF and a variety of 'syntax' guidance from DCMI and we started writing:

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns:dcterms="http://purl.org/dc/terms/"
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="http://example.net/something">
    <dc:creator>Andy Powell</dc:creator>
    <dcterms:available>2009-10-19</dcterms:available>
    <dc:subject>something</dc:subject>
    <dc:subject>something else</dc:subject>
    <dc:subject>something else again</dc:subject>
    <dc:rights>Open Database License (ODbL) v1.0</dc:rights>
  </rdf:Description>
</rdf:RDF>

Then came the decision to add 15 new properties to the DC terms namespace which reflected the original 15 DC elements but which added a liberal smattering of domains and ranges.  So, now we write:

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
  xmlns:dcterms="http://purl.org/dc/terms/"
  xmlns:foaf="http://xmlns.com/foaf/0.1/"
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="http://example.net/something">
    <dcterms:creator>
      <dcterms:Agent>
        <rdf:value>Andy Powell</rdf:value>
        <foaf:name>Andy Powell</foaf:name>
      </dcterms:Agent>
    </dcterms:creator>
    <dcterms:available
      rdf:datatype="http://purl.org/dc/terms/W3CDTF">2009-10-19</dcterms:available>
    <dcterms:subject>
      <rdf:Description>
        <rdf:value>something</rdf:value>
      </rdf:Description>
    </dcterms:subject>
    <dcterms:subject>
      <rdf:Description>
        <rdf:value>something else</rdf:value>
      </rdf:Description>
    </dcterms:subject>
    <dcterms:subject>
      <rdf:Description>
        <rdf:value>something else again</rdf:value>
      </rdf:Description>
    </dcterms:subject>
    <dcterms:rights
      rdf:resource="http://opendatacommons.org/licenses/odbl/1.0/" />
  </rdf:Description>
</rdf:RDF>

Despite the added verbosity and rather heavy use of blank nodes in the latter, I think there are good reasons why moving towards this kind of DC usage pattern is a 'good thing'.  In particular, this form allows the same usage pattern to indicate a subject term by URI or literal (or both - see addendum below) meaning that software developers only need to code to a single pattern. It also allows for the use of multiple literals (e.g. in different languages) attached to a single value resource.
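To illustrate the "single pattern" point, here is a minimal sketch in Python of how a consumer might read dcterms:subject values regardless of whether the value is given by URI, by literal, or both. It assumes the RDF/XML is shaped like the examples in this post; it is not a general RDF parser, and the function name read_subjects and the SAMPLE data are made up for illustration.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DCTERMS = "http://purl.org/dc/terms/"

# Hypothetical sample: one subject known only by a literal,
# one known by URI as well as by a literal.
SAMPLE = """<rdf:RDF
  xmlns:dcterms="http://purl.org/dc/terms/"
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="http://example.net/something">
    <dcterms:subject>
      <rdf:Description>
        <rdf:value>something</rdf:value>
      </rdf:Description>
    </dcterms:subject>
    <dcterms:subject>
      <rdf:Description rdf:about="http://id.loc.gov/authorities/sh85101653#concept">
        <rdf:value>Physics</rdf:value>
      </rdf:Description>
    </dcterms:subject>
  </rdf:Description>
</rdf:RDF>"""


def read_subjects(rdf_xml):
    """Return one (uri_or_None, [literals]) pair per dcterms:subject element."""
    root = ET.fromstring(rdf_xml)
    subjects = []
    for prop in root.iter("{%s}subject" % DCTERMS):
        # Value given by reference: <dcterms:subject rdf:resource="..."/>
        uri = prop.get("{%s}resource" % RDF)
        labels = []
        # Value given as a nested blank, typed or URI-named node
        for node in prop:
            uri = node.get("{%s}about" % RDF, uri)
            labels.extend(v.text for v in node.findall("{%s}value" % RDF))
        subjects.append((uri, labels))
    return subjects


for uri, labels in read_subjects(SAMPLE):
    print(uri, labels)
```

The same loop handles all three cases, which is the practical benefit of the more verbose form: the calling code never has to guess whether a subject is a bare string or a resource.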

The trouble is, a lot of existing usage falls somewhere between the first two forms shown here.  I've seen examples of both coming up in discussions/blog posts about both open government data and open educational resources in recent days.

So here are some useful rules of thumb around DC RDF usage patterns:

  • DC properties never, ever, start with an upper-case letter (i.e. dcterms:Creator simply does not exist).
  • DC properties never, ever, contain a full-stop character (i.e. dcterms:date.available does not exist either).
  • If something can be named by its URI rather than a literal (e.g. the ODbL licence in the above examples) do so using rdf:resource.
  • Always check the range of properties before use.  If the range is anything other than a literal (as is the case with both dcterms:subject and dcterms:creator, for example) and you don't know the URI of the value, use a blank or typed node to indicate the value and rdf:value to indicate the value string.
  • Do not provide lists of separate keywords as a single dc:subject value.  Repeat the property multiple times, as necessary.
  • Syntax encoding schemes, W3CDTF in this case, are indicated using rdf:datatype.
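
A couple of these rules are mechanical enough to check in code. The following is a rough Python sketch, not an official DCMI tool: the function names are invented for this example, and the keyword splitter assumes that commas and semicolons are the only separators in use.

```python
import re


def check_property_name(qname):
    """Return a list of problems with a DC property name such as 'dcterms:creator'."""
    prefix, sep, local = qname.partition(":")
    if not sep:
        # No prefix given, so treat the whole string as the local name
        local = qname
    problems = []
    if local[:1].isupper():
        problems.append("starts with an upper-case letter")
    if "." in local:
        problems.append("contains a full stop")
    return problems


def split_keywords(value):
    """Split a comma- or semicolon-separated string into separate subject values."""
    return [part.strip() for part in re.split(r"[,;]", value) if part.strip()]


print(check_property_name("dcterms:Creator"))
print(check_property_name("dcterms:date.available"))
print(split_keywords("something, something else, something else again"))
```

Each string returned by split_keywords would then become its own repeated subject property rather than one long literal.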

See the Expressing Dublin Core metadata using the Resource Description Framework (RDF) DCMI Recommendation for more examples and guidance.

[*] The beginning of Dublin Core metadata obviously! :-)

Addendum

As Bruce notes in the comments below, the dcterms:subject pattern that I describe above applies in those situations where you do not know the URI of the subject term. In cases where you do know the URI (as is the case with LCSH for example), the pattern becomes:

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
  xmlns:dcterms="http://purl.org/dc/terms/"
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="http://example.net/something">
    <dcterms:subject>
      <rdf:Description rdf:about="http://id.loc.gov/authorities/sh85101653#concept">
        <rdf:value>Physics</rdf:value>
      </rdf:Description>
    </dcterms:subject>
  </rdf:Description>
</rdf:RDF>

October 14, 2009

Open, social and linked - what do current Web trends tell us about the future of digital libraries?

About a month ago I travelled to Trento in Italy to speak at a Workshop on Advanced Technologies for Digital Libraries organised by the EU-funded CACOA project.

My talk was entitled "Open, social and linked - what do current Web trends tell us about the future of digital libraries?" and I've been holding off blogging about it or sharing my slides because I was hoping to create a slidecast of them. Well... I finally got round to it and here is the result:

Like any 'live' talk, there are bits where I don't get my point across quite as I would have liked but I've left things exactly as they came out when I recorded it. I particularly like my use of "these are all very bog standard... err... standards"! :-)

Towards the end, I refer to David White's 'visitors vs. residents' stuff, about which I note he has just published a video. Nice one.

Anyway... the talk captures a number of threads that I've been thinking and speaking about for the last while. I hope it is of interest.

September 22, 2009

VoCamp Bristol

At the end of the week before last, I spent a couple of days (well, a day and a half as I left early on Friday) at the VoCamp Bristol meeting, at ILRT, at the University of Bristol.

To quote the VoCamp wiki:

VoCamp is a series of informal events where people can spend some dedicated time creating lightweight vocabularies/ontologies for the Semantic Web/Web of Data. The emphasis of the events is not on creating the perfect ontology in a particular domain, but on creating vocabs that are good enough for people to start using for publishing data on the Web.

I admit that I went into the event slightly unprepared, as I didn't have any firm ideas about any specific vocabulary I wanted to work on, but happy to join in with anyone who was working on anything of interest. Some of the outputs of the various groups are listed on the wiki page.

As well as work on specific vocabularies, the opening discussions highlighted an interest in a small set of more general issues, which included the expression of "structural constraints" and "validation"; broader questions of collecting and interpreting vocabulary usage; representing RDF data using JSON; and the features available in OWL 2. Friday morning was set aside for those topics, which meant I had an opportunity to talk a little bit about the work being done within the Dublin Core Metadata Initiative on "Description Set Profiles", which I've mentioned in some recent posts here. I did hastily knock up a few slides, mainly as an aide memoire to make sure I mentioned various bits and pieces:

There was a useful discussion around various different approaches for representing such patterns of constraints at the level of the RDF graph, either based on query patterns, or on the use of OWL (with a "closed-world" assumption that the "world" in question is the graph at hand). Some of the new features in OWL 2, such as capabilities for expressing restrictions on datatypes seem to make it quite an attractive candidate for this sort of task.
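
To make the "closed-world" idea concrete, here is a small Python sketch of checking occurrence constraints against a graph while treating that graph as the whole "world". It is only an illustration of the general approach: it is neither the DSP model nor OWL, and the triples, constraint table and function name are all hypothetical.

```python
from collections import Counter

# Hypothetical graph: a set of (subject, predicate, object) triples.
GRAPH = {
    ("http://example.net/something", "http://purl.org/dc/terms/title", "A title"),
    ("http://example.net/something", "http://purl.org/dc/terms/subject", "_:b1"),
    ("http://example.net/other", "http://purl.org/dc/terms/subject", "_:b2"),
}

# Hypothetical constraints: predicate -> (min occurrences, max occurrences or None).
CONSTRAINTS = {
    "http://purl.org/dc/terms/title": (1, 1),
    "http://purl.org/dc/terms/subject": (0, None),
}


def validate(graph, constraints):
    """Return (subject, predicate, count) for every violated occurrence constraint."""
    counts = Counter((s, p) for s, p, _ in graph)
    errors = []
    for s in sorted({s for s, _, _ in graph}):
        for p, (lo, hi) in sorted(constraints.items()):
            n = counts[(s, p)]  # a missing statement simply counts as zero
            if n < lo or (hi is not None and n > hi):
                errors.append((s, p, n))
    return errors


for error in validate(GRAPH, CONSTRAINTS):
    print(error)
```

Under an open-world reading the missing title on the second resource would not be an error, merely something not yet stated; counting it as zero is exactly the closed-world assumption.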

I was asked about whether we had considered the use of OWL in the DCMI context. IIRC, we decided against it mostly because we wanted an approach that built explicitly on the description model of the DCMI Abstract Model (i.e. talked in terms of "descriptions" and "statements" and patterns of use of those particular constructs), though I think the "open-world" considerations were also an issue (See this piece for a discussion of some of the "gotchas" that can arise).

Having said that, it would seem a good idea to explore to what extent the constraint types permitted by the DSP model might be mapped into other form(s) of expressing constraints which might be adopted.

All in all, it was a very enjoyable couple of days: a fairly low-key, thoughtful, gentle sort of gathering - no "pitches", no prizes, no "dragons" in their "dens", or other cod-"bizniz" memes :-) - just an opportunity for people to chat and work together on topics that interested them. Thank you to Tom & Damian & Libby for doing the organisation (and introducing me to a very nice Chinese restaurant in Bristol on the Thursday night!)

September 16, 2009

Edinburgh publish guidance on research data management

The University of Edinburgh has published some local guidance about the way that research data should be managed, Research data management guidance, covering How to manage research data and Data sharing and preservation, as well as detailing local training, support and advice options.

One assumes that this kind of thing will become much more common at universities over the next few years.

Having had a very quick look, it feels like the material is more descriptive than prescriptive - which isn't meant as a negative comment, it just reflects the current state of play. The section on Data documentation & metadata for example, gives advice as simple as:

Have you created a "readme.txt" file to describe the contents of files in a folder? Such a simple act can be invaluable at a later date.

but also provides a link to the UK Data Archive's guidance on Data Documentation and Metadata, which at first sight appears hugely complex. I'm not sure what your average researcher will make of it.

(In passing, I note that the UKDA seem to be promoting the use of the Data Documentation Initiative standard at what they call the 'catalogue' level, a standard that I've not come across before but one that appears to be rooted firmly outside the world of linked data, which is a shame.)

Similarly, the section on Methods for data sharing lists a wide range of possible options (from "posting on a University website" through to "depositing in a data repository") without being particularly prescriptive about which is better and why.

(As a second aside, I am continually amazed by this firm distinction in the repository world between 'posting on the website' and 'depositing in a repository' - from the perspective of the researcher, both can, and should, achieve the same aims, i.e. improved management, more chance of persistence and better exposure.)

As we have found with repositories of research publications, it seems to me that research data repositories (the Edinburgh DataShare in this case) need to hide much of this kind of complexity, and do most of the necessary legwork, in order to turn what appears to be a simple and obvious 'content management' workflow (from the point of view of the individual researcher) into a well managed, openly shared, long term resource for the community.

July 29, 2009

Enhanced PURL server available

A while ago, I posted about plans to revamp the PURL server software to (amongst other things) introduce support for a range of HTTP response codes. This enables the use of PURLs for the identification of a wider range of resources than "Web documents" using the interaction patterns specified by current guidelines provided by the W3C in the TAG's httpRange-14 resolution and the Cool URIs for the Semantic Web document.

Lorcan Dempsey posted on Twitter yesterday to indicate that OCLC have deployed the new software, developed by Zepheira, on the OCLC purl.org service. Although I've only had time for a quick poke around, and I need to read the documentation more carefully to understand all the new features, it looks like it does the job quite nicely.

This should mean that existing PURL owners who have used PURLs to identify things other than "Web documents" - like DCMI, who use PURLs like http://purl.org/dc/terms/title to identify their "metadata terms", "conceptual" resources - should be able to adjust the appropriate entries on the purl.org server so that interactions follow those guidelines. It also offers a new option for those who wish to set up such redirects but perhaps don't have suitable levels of access to configure their own HTTP server to perform those redirects.
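
As a rough illustration of the interaction pattern, here is a Python sketch of a lookup table that maps a PURL path to an HTTP status code and a target. It does not reflect the actual purl.org software or its configuration: the table, the target URLs and the resolve function are all hypothetical. The point is simply that a PURL for a "conceptual" resource can answer with a 303 See Other pointing at a document about that resource, while a PURL for an ordinary Web document can keep using a 302.

```python
# Hypothetical PURL table: path -> (HTTP status code, target URL).
PURLS = {
    # A "conceptual" resource: redirect with 303 to a document describing it
    "/dc/terms/title": (303, "http://example.org/docs/dcterms#title"),
    # An ordinary Web document: a plain 302 redirect
    "/docs/guide": (302, "http://example.org/guide.html"),
}


def resolve(path, table=PURLS):
    """Return (status, headers) for a request to the given PURL path."""
    entry = table.get(path)
    if entry is None:
        return 404, {}
    status, target = entry
    return status, {"Location": target}


print(resolve("/dc/terms/title"))
print(resolve("/no/such/purl"))
```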

July 21, 2009

Linked data vs. Web of data vs. ...

On Friday I asked what I thought would be a pretty straight-forward question on Twitter:

is there an agreed name for an approach that adopts the 4 principles of #linkeddata minus the phrase, "using the standards (RDF, SPARQL)" ??

Turns out not to be so straight-forward, at least in the eyes of some of my Twitter followers. For example, Paul Miller responded with:

@andypowe11 well, personally, I'd argue that Linked Data does NOT require that phrase. But I know others disagree... ;-)

and

@andypowe11 I'd argue that the important bit is "provide useful information." ;-)

Paul has since written up his views more thoughtfully in his blog, Does Linked Data need RDF?, a post that has generated some interesting responses.

I have to say I disagree with Paul on this, not in the sense that I disagree with his focus on "provide useful information", but in the sense that I think it's too late to re-appropriate the "Linked Data" label to mean anything other than "use http URIs and the RDF model".

To back this up I'd go straight to the horse's mouth, Tim Berners-Lee, who gave us his personal view way back in 2006 with his 'design issues' document on Linked Data. This gave us the 4 key principles of Linked Data that are still widely quoted today:

  1. Use URIs as names for things.
  2. Use HTTP URIs so that people can look up those names.
  3. When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL).
  4. Include links to other URIs, so that they can discover more things.

Whilst I admit that there is some wriggle room in the interpretation of the 3rd point - does his use of "RDF, SPARQL" suggest these as possible standards or is the implication intended to be much stronger? - more recent documents indicate that the RDF model is mandated. For example, in Putting Government Data online Tim Berners-Lee says (referring to Linked Data):

The essential message is that whatever data format people want the data in, and whatever format they give it to you in, you use the RDF model as the interconnection bus. That's because RDF connects better than any other model.

So, for me, Linked Data implies use of the RDF model - full stop. If you put data on the Web in other forms, using RSS 2.0 for example, then you are not doing Linked Data, you're doing something else! (Addendum: I note that Ian Davis makes this point rather better in The Linked Data Brand).

Which brings me back to my original question - "what do you call a Linked Data-like approach that doesn't use RDF?" - because, in some circumstances, adhering to a slightly modified form of the 4 principles, namely:

  1. use URIs as names for things
  2. use HTTP URIs so that people can look up those names
  3. when someone looks up a URI, provide useful information
  4. include links to other URIs, so that they can discover more things

might well be a perfectly reasonable and useful thing to do. As purists, we can argue about whether it is as good as 'real' Linked Data but sometimes you've just got to get on and do whatever you can.
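
For what it's worth, here is a minimal Python sketch of what the modified principles might look like without RDF: things named by HTTP URIs, a lookup that returns something useful (plain JSON here), and links to other URIs in the response. Everything in it is hypothetical, including the base URI, the records and the "links" key, and it is not meant to suggest any agreed format.

```python
import json

BASE = "http://example.org"  # hypothetical base URI

# Hypothetical records, keyed by URI path.
RECORDS = {
    "/things/a": {"label": "Thing A", "links": {"related": BASE + "/things/b"}},
    "/things/b": {"label": "Thing B", "links": {}},
}


def lookup(path):
    """Return a JSON description for the URI at this path, or None if unknown."""
    record = RECORDS.get(path)
    if record is None:
        return None
    doc = {"id": BASE + path}  # principles 1 and 2: an HTTP URI as the name
    doc.update(record)         # principle 3: useful information; 4: links to other URIs
    return json.dumps(doc, sort_keys=True)


print(lookup("/things/a"))
```

A client can follow the "related" link and look up the next URI in the same way, but without a shared model such as RDF each publisher's keys have to be understood separately.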

A couple of people suggested that the phrase 'Web of data' might capture what I want. Possibly... though looking at Tom Coates' Native to a Web of Data presentation it's clear that his 10 principles go further than the 4 above.  Maybe that doesn't matter? Others suggested "hypermedia" or "RESTful information systems" or "RESTful HTTP", none of which strikes me as quite right.

I therefore remain somewhat confused. I quite like Bill de hÓra's post on "links in content", Snowflake APIs, but, again, I'm not sure it gets us closer to an agreed label?

In a comment on a post by Michael Hausenblas, What else?, Dan Brickley says:

I have no problem whatsoever with non-RDF forms of data in “the data Web”. This is natural, normal and healthy. Statistical information, geographic information, data-annotated SVG images, audio samples, JSON feeds, Atom, whatever.

We don’t need all this to be in RDF. Often it’ll be nice to have extracts and summaries in RDF, and we can get that via GRDDL or other methods. And we’ll also have metadata about that data, again in RDF; using SKOS for indicating subject areas, FOAF++ for provenance, etc.

The non-RDF bits of the data Web are – roughly – going to be the leaves on the tree. The bit that links it all together will be, as you say, the typed links, loose structuring and so on that come with RDF. This is also roughly analogous to the HTML Web: you find JPEGs, WAVs, flash files and so on linked in from the HTML Web, but the thing that hangs it all together isn’t flash or audio files, it’s the linky extensible format: HTML. For data, we’ll see more RDF than HTML (or RDFa bridging the two). But we needn’t panic if people put non-RDF data up online…. it’s still better than nothing. And as the LOD scene has shown, it can often easily be processed and republished by others. People worry too much! :)

Count me in as a worrier then!

I ask because, as a not-for-profit provider of hosting and Web development solutions to the UK public sector, Eduserv needs to start thinking about what Tim Berners-Lee's appointment as an advisor to the UK government on 'open data' issues means for the kinds of solutions we provide.  Clearly, Linked Data is going to feature heavily in this space but I fully expect that lots of stuff will also happen outside the RDF fold.  It's important for us to understand this landscape and the impact it might have on future services.
