December 21, 2009

Scanning horizons for the Semantic Web in higher education

The week before last I attended a couple of meetings looking at different aspects of the use of Semantic Web technologies in the education sector.

On the Wednesday, I was invited to a workshop of the JISC-funded ResearchRevealed project at ILRT in Bristol. From the project weblog:

ResearchRevealed [...] has the core aim of demonstrating a fine-grained, access controlled, view layer application for research, built over a content integration repository layer. This will be tested at the University of Bristol and we aim to disseminate open source software and findings of generic applicability to other institutions.

ResearchRevealed will enhance ways in which a range of user stakeholder groups can gain up-to-date, accurate integrated views of research information and thus use existing institutional, UK and potentially global research information to better effect.

I'm not formally part of the project, but Nikki Rogers of ILRT mentioned it to me at the recent VoCamp Bristol meeting, and I expressed a general interest in what they were doing; they were also looking for some concrete input on the use of Dublin Core vocabularies in some of their candidate approaches.

This was the third in a series of small workshops, attended by representatives of the project from Bristol, Oxford and Southampton, and the aim was to make progress on defining a "core Research ontology". The morning session circled mainly around usage scenarios (support for the REF (and other "impact" assessment exercises), building and sustaining cross-institutional collaboration etc), and the (somewhat blurred) boundaries between cross-institutional requirements and institution-specific ones; what data might be aggregated, what might be best "linked to"; and the costs/benefits of rich query interfaces (e.g. SPARQL endpoints) v simpler literal- or URI-based lookups. In the afternoon, Nick Gibbins from the University of Southampton walked through a draft mapping of the CERIF standard to RDF developed by the dotAC project. This focused attention somewhat and led to some - to me - interesting technical discussions about variant ways of expressing information with differing degrees of precision/flexibility. I had to leave before the end of the meeting, but I hope to be able to continue to follow the project's progress, and contribute where I can.

A long train journey later, the following day I was at a meeting in Glasgow organised by the CETIS Semantic Technologies Working Group to discuss the report produced by the recent JISC-funded Semtech project, and to try to identify potential areas for further work in that area by CETIS and/or JISC. Sheila MacNeill from CETIS liveblogged proceedings here. Thanassis Tiropanis from the University of Southampton presented the project report, with a focus on its "roadmap for semantic technology adoption". The report argues that, in the past, the adoption of semantic technologies may have been hindered by a tendency towards a "top-down" approach requiring the widespread agreement on ontologies; in contrast the "linked data" approach encourages more of a "bottom-up" style in which data is first made available as RDF, and then later application-specific or community-wide ontologies are developed to enable more complex reasoning across the base data (which may involve mapping that initial data to those ontologies as they emerge). While I think there's a slight risk of overstating the distinction - in my experience many "linked data" initiatives do seem to demonstrate a good deal of thinking about the choice of RDF vocabularies and compatibility with other datasets - and I guess I see rather more of a continuum, it's probably a useful basis for planning. The report recommends a graduated approach which focusses initially on the development of this "linked data field" - in particular where there are some "low-hanging fruit" cases of data already made available in human-readable form which could relatively easily be made available in RDF, especially using RDFa.

One of the issues I was slightly uneasy with in the Glasgow meeting was that occasionally there were mentions of delivering "interoperability" (or "data interoperability") without really saying what was meant by that - and I say this as someone who used to have the I-word in my job title ;-) I feel we probably need to be clearer, and more precise, about what different "semantic technologies" (for want of a better expression) enable. What does the use of RDF provide that, say, XML typically doesn't? What does, e.g., RDF Schema add to that picture? What about convergence on shared vocabularies? And so on. Of course, the learners, teachers, researchers and administrators using the systems don't need to grapple with this, but it seems to me such aspects do need to be conveyed to the designers and developers, and perhaps more importantly - as Andy highlighted in his report of related discussions at the CETIS conference - to those who plan and prioritise and fund such development activity. (As an aside, I this is also something of an omission in the current version of the DCMI document on "Interoperability Levels": it tells me what characterises each level, and how I can test for whether an application meets the requirements of the level, but it doesn't really tell me what functionality each level provides/enables, or why I should consider level n+1 rather than level n.)

Rather by chance, I came across a recent presentation by Richard Cyganiak to the Vienna Linked Data Camp, which I think addresses some similar questions, albeit from a slightly different starting point: Richard asks the questions, "So, if we have linked data sources, what's stopping the development of great apps? What else do we need?", and highlights various dimensions of "heterogeneity" which may exist across linked data sources (use of identifiers, differences in modelling, differences in RDF vocabularies used, differences in data quality, differences in licensing, and so on).

Finally, I noticed that last Friday, Paul Miller (who was also at the CETIS meeting) announced the availability of a draft of a "Horizon Scan" report on "Linked Data" which he has been working on for JISC, as part of the background for a JISC call for projects in this area some time early in 2010. It's a relatively short document (hurrah for short reports!) but I've only had time for a quick skim through. It aims for some practical recommendations, ranging from general guidance on URI creation and the use of RDFa to more specific actions on particular resources/datasets. And here I must reiterate what Paul says in his post - it's a draft on which he is seeking comments, not the final report, and none of those recommendations have yet been endorsed by JISC. (If you have comments on the document, I suggest that you submit them to Paul (contact details here or comment on his post) rather than commenting on this post.)

In short, it's encouraging to see the active interest in this area growing within the HE sector. On reading Paul's draft document, I was struck by the difference between the atmosphere now (both at the Semtech meeting, and more widely) and what Paul describes as the "muted" conclusions of Brian Matthews' 2005 survey report on Semantic Web Technologies for JISC Techwatch. Of course, many of the challenges that Andy mentioned in his report of the CETIS conference session remain to be addressed, but I do sense that there is a momentum here - an excitement, even - which I'm not sure existed even eighteen months ago. It remains to be seen whether and how that enthusiasm translates into applications of benefit to the educational community, but I look forward to seeing how the upcoming JISC call, and the projects it funds, contribute to these developments.

December 03, 2009

On being niche

I spoke briefly yesterday at a pre-IDCC workshop organised by REPRISE.  I'd been asked to talk about Open, social and linked information environments, which resulted in a re-hash of the talk I gave in Trento a while back.

My talk didn't go too well to be honest, partly because I was on last and we were over-running so I felt a little rushed but more because I'd cut the previous set of slides down from 119 to 6 (4 really!) - don't bother looking at the slides, they are just images - which meant that I struggled to deliver a very coherent message.  I looked at the most significant environmental changes that have occurred since we first started thinking about the JISC IE almost 10 years ago.  The resulting points were largely the same as those I have made previously (listen to the Trento presentation) but with a slightly preservation-related angle:

  • the rise of social networks and the read/write Web, and a growth in resident-like behaviour, means that 'digital identity' and the identification of people have become more obviously important and will remain an important component of provenance information for preservation purposes into the future;
  • Linked Data (and the URI-based resource-oriented approach that goes with it) is conspicuous by its absence in much of our current digital library thinking;
  • scholarly communication is increasingly diffusing across formal and informal services both inside and outside our institutional boundaries (think blogging, Twitter or Google Wave for example) and this has significant implications for preservation strategies.

That's what I thought I was arguing anyway!

I also touched on issues around the growth of the 'open access' agenda, though looking at it now I'm not sure why because that feels like a somewhat orthogonal issue.

Anyway... the middle bullet has to do with being mainstream vs. being niche.  (The previous speaker, who gave an interesting talk about MyExperiment and its use of Linked Data, made a similar point).  I'm not sure one can really describe Linked Data as being mainstream yet, but one of the things I like about the Web Architecture and REST in particular is that they describe architectural approaches that haven proven to be hugely successful, i.e. they describe the Web.  Linked data, it seems to me, builds on these in very helpful ways.  I said that digital library developments often prove to be too niche - that they don't have mainstream impact.  Another way of putting that is that digital library activities don't spend enough time looking at what is going on in the wider environment.  In other contexts, I've argued that "the only good long-term identifier, is a good short-term identifier" and I wonder if that principle can and should be applied more widely.  If you are doing things on a Web-scale, then the whole Web has an interest in solving any problems - be that around preservation or anything else.  If you invent a technical solution that only touches on scholarly communication (for example) who is going to care about it in 50 or 100 years - answer, not all that many people.

It worries me, for example, when I see an architectural diagram (as was shown yesterday) which has channels labelled 'OAI-PMH', XML' and 'the Web'!

After my talk, Chris Rusbridge asked me if we should just get rid of the JISC IE architecture diagram.  I responded that I am happy to do so (though I quipped that I'd like there to be an archival copy somewhere).  But on the train home I couldn't help but wonder if that misses the point.  The diagram is neither here nor there, it's the "service-oriented, we can build it all", mentality that it encapsulates that is the real problem.

Let's throw that out along with the diagram.

December 01, 2009

On "Creating Linked Data"

In the age of Twitter, short, "hey, this is cool" blog posts providing quick pointers have rather fallen out of fashion, but I thought this material was worth drawing attention to here. Jeni Tennison, who is contributing to the current work with Linked Data in UK government, has embarked on a short series of tutorial-style posts called "Creating Linked Data", in which she explains the steps typically involved in reformulating existing data as linked data, and discusses some of the issues arising.

Her "use case" is the scenario in which some data is currently available in CSV format, but I think much of the discussion could equally be applied to the case where the provider is making data available for the first time. The opening post on the sequence ("Analysing and Modelling") provides a nice example of working through the sort of "things or strings?" questions which we've tried to highlight in the context of designing DC Application Profiles. And as Jeni emphasises, this always involves design choices:

It’s worth noting that this is a design process rather than a discovery process. There is no inherent model in any set of data; I can guarantee you that someone else will break down a given set of data in a different way from you. That means you have to make decisions along the way.

And further on in the piece, she rationalises her choices for this example in terms of what those choices enable (e.g. "whenever there’s a set of enumerated values it’s a good idea to consider turning them into things, because to do so enables you to associate extra information about them").

The post on URI design offers some tips, not only on designing new URIs but also on using existing URIs where appropriate: I admit I tend to forget about useful resources like placetime.com "a URI space containing URIs that represent places and times" (and provides redirects to descriptions in various formats).

On a related note, the post on choosing/coining properties, classes and datatypes includes a pointer to the OWL Time ontology. This is something I was aware of, but only looked at in any detail relatively recently. At first glance it can seem rather complex; Ian Davis has a summary graphic which I found helpful in trying to get my head round the core concepts of the ontology.

It seems to me these sort of very common areas like time data are those around which some shared practice will emerge, and articles like these, by "hands-on" practitioners, are important contributions to that process.

November 20, 2009

COI guidance on use of RDFa

Via a post from Mark Birbeck, I notice that the UK Central Office for Information has published some guidelines called Structuring information on the Web for re-usability which include some guidance on the use of RDFa to provide specific types of information, about government consultations and about job vacancies.

This is exciting news as, as far as I know, this is the first formal document from UK central government to provide this sort of quite detailed, resource-type-specific guidance with recommendations on the use of particular RDF vocabularies - guidance of the sort I think will be an essential component in the effective deployment of RDFa and the Linked Data approach. It's also the sort of thing that is of considerable interest to Eduserv, as a developer of Web sites for several government agencies. The document builds directly on the work Mark has been doing in this area, which I mentioned a while ago.

As Mark notes in his post, the document is unequivocal in its expression of the government's commitment to the Linked Data approach:

Government is committed to making its public information and data as widely available as possible. The best way to make structured information available online is to publish it as Linked Data. Linked Data makes the information easier to cut and combine in ways that are relevant to citizens.

Before the announcement of these guidelines, I recently had a look at the "argot" for consultations - "argot" is Mark's term for a specification of how a set of terms from multiple RDF vocabularies is used to meet some application requirement; as I noted in that earlier post, I think it is similar to what DCMI calls an "application profile" - , and I had intended to submit some comments. I fear it is now somewhat late in the day for me to be doing this, but the release of this document has prompted me to write them up here. My comments are concerned primarily with the section titled "Putting consultations into Linked Data"

The guidelines (correctly, I think) establish a clear distinction between the consultation on the one hand and the Web page describing the consultation on the other by (in paragraphs 30/31) introducing a fragment identifier for the URI of the consultation (via the about="#this" attribute). The consultation itself is also modelled as a document, an instance of the class foaf:Document, which in turn "has as parts" the actual document(s) on which comment is being sought, and for which a reply can be sent to some agent.

I confess that my initial "instinctive" reaction to this was that this seemed a slightly odd choice, as a "consultation" seemed to me to be more akin to an event or a process, taking place during an interval of time, which had a as "inputs" to that process a set of documents on which comments were sought, and (typically at least) resulted in the generation of some other document as a "response". And indeed the page describing the Consultation argot introduces the concept as follows (emphasis added):

A consultation is a process whereby Government departments request comments from interested parties, so as to help the department make better decisions. A consultation will usually be focused on a particular topic, and have an accompanying publication that sets the context, and outlines particular questions on which feedback is requested. Other information will include a definite start and end date during which feedback can be submitted and contact details for the person to submit feedback to.

I admit I find it difficult to square this with the notion of a "document". And I think a "consultation-as-event" (described by a Web page) could probably be modelled quite neatly using the Event Ontology or the similar LODE ontology (with some specialisation of classes and properties if required).

Anyway, I appreciate that aspect may be something of a "design choice". So for the remainder of the comments here, I'll stick to the actual approach described by the guidelines (consultation as document).

The RDF properties recommended for the description of the consultation are drawn mainly from Dublin Core vocabularies, and more specifically from the "DC Terms" vocabulary.

The first point to note is that, as Andy noted recently, DCMI recently made some fairly substantive changes to the DC Terms vocabulary, as a result of which the majority of the properties are now the subject of rdfs:range assertions, which indicate whether the value of the property is a literal or a non-literal resource. The guidelines recommend the use of the publisher (paragraphs 32-37), language(paragraphs 38-39), and audience (paragraph 46) properties, all with literal values, e.g.

<span property="dc:publisher" content="Ministry of Justice"></span>

But according to the term descriptions provided by DCMI, the ranges of these properties are the classes dcterms:Agent, dcterms:LinguisticSystem and dcterms:AgentClass respectively. So I think that would require the use of an XHTML-RDFa construct something like the following, introducing a blank node (or a URI for the resource if one is available):

<div rel="dc:publisher"><span property="foaf:name" content="Ministry of Justice"></span></div>

Second, I wasn't sure about the recommendation for the use of the dcterms:source property (paragraph 37). This is used to "indicate the source of the consultation". For the case where this is a distinct resource (i.e. distinct from the consultation and this Web page describing it), this seems OK, but the guidelines also offer the option of referring to the current document (i.e. the Web page) as the source of the consultation:

<span rel="dc:source" resource=""></span>

DCMI's definition of the property is "A related resource from which the described resource is derived", but it seems to me the Web page is acting as a description of the consultation-as-document, rather than a source of it.

Third, the guidelines recommend the use of some of the DCMI date properties (paragraph 42):

  • dcterms:issued for the publication date of the consultation
  • dcterms:available for the start date for receiving comments
  • dcterms:valid for the closing date ("a date through which the consultation is 'valid'")

I think the use of dcterms:valid here is potentially confusing. DCMI's definition is "Date (often a range) of validity of a resource", so on this basis I think the implication of the recommended usage is that the consultation is "valid" only on that date, which is not what is intended. The recommendations for dcterms:issued and dcterms:available are probably OK - though I do think the event-based approach might have helped make the distinction between dates related to documents and dates related to consultation-as-process rather clearer!

Oh dear, this must read like an awful lot of pedantic nitpicking on my part, but my intent is to try to ensure that widely used vocabularies like those provided by DCMI are used as consistently as possible. As I said at the start I'm very pleased to see this sort of very practical guidance appearing (and I apologise to Mark for not submitting my comments earlier!)

November 16, 2009

The future has arrived

Cetis09

About 99% of the way thru Bill Thompson's closing keynote to the CETIS 2009 Conference last week I tweeted:

great technology panorama painted by @billt in closing talk at #cetis09

And it was a great panorama - broad, interesting and entertainingly delivered. It was a good performance and I am hugely in awe of people who can give this kind of presentation. However, what the talk didn't do was move from the "this is where technology has come from, this is where it is now and this is where it is going" kind of stuff to the "and this is what it means for education in the future". Which was a shame because in questioning after his talk Thompson did make some suggestions about the future of print news media (not surprising for someone now at the BBC) and I wanted to hear similar views about the future of teaching, learning and research.

As Oleg Liber pointed out in his question after the talk, universities, and the whole education system around them, are lumbering beasts that will be very slow to change in the face of anything. On that basis, whilst it is interesting to note that (for example) we can now just about store a bit on an atom (meaning that we can potentially store a digital version of all human output on something the weight of a human body), that we can pretty much wire things directly into the human retina, and that Africa will one-day overtake 'digital' Britain in the broadband stakes are interesting individual propositions in their own right, there comes a "so what?" moment where one is left wondering what it actually all means.

As an aside, and on a more personal note, I suggest that my daughter's experience of university (she started at Sheffield Hallam in September) is not actually going to be very different to my own, 30-odd years ago. Lectures don't seem to have changed much. Project work doesn't seem to have changed much. Going out drinking doesn't seem to have changed much. She did meet all her 'hall' flat-mates via Facebook before she arrived in Sheffield I suppose :-) - something I never had the opportunity to do (actually, I never even got a place in hall). There is a big difference in how it is all paid for of course but the interesting question is how different university will be for her children. If the truth is, "not much", then I'm not sure why we are all bothering.

At one point, just after the bit about storing a digital version of all human output I think, Thompson did throw out the question, "...and what does that mean for copyright law?". He didn't give us an answer. Well, I don't know either to be honest... though it doesn't change the fact that creative people need to be rewarded in some way for their endeavours I guess. But the real point here is that the panorama of technological change that Thompson painted for us, interesting as it was, begs some serious thinking about what the future holds.  Maybe Thompson was right to lay out the panorama and leave the serious thinking to us?

He was surprisingly positive about Linked Data, suggesting that the time is now right for this to have a significant impact.  I won't disagree because I've been making the same point myself in various fora, though I tend not to shout it too loudly because I know that the Semantic Web has a history of not quite making it.  Indeed, the two parallel sessions that I attended during the conference, University API and the Giant Global Graph both focused quite heavily on the kinds of resources that universities are sitting on (courses, people/expertise, research data, publications, physical facilities, events and so on) that might usefully be exposed to others in some kind of 'open' fashion.  And much of the debate, particularly in the second session (about which there are now some notes), was around whether Linked Data (i.e. RDF) is the best way to do this - a debate that we've also seen played out recently on the uk-government-data-developers Google Group.

The three primary issues seemed to be:

  • Why should we (universities) invest time and money exposing our data in the hope that people will do something useful/interesting/of value with it when we have many other competing demands on our limited resources?
  • Why should we take the trouble to expose RDF when it's arguably easier for both the owner and the consumer of the data to expose something simpler like a CSV file?
  • Why can't the same ends be achieved by offering one or more services (i.e. a set of one or more APIs) rather than the raw data itself?

In the ensuing debate about the why and the how, there was a strong undercurrent of, "two years ago SOA was all the rage, now Linked Data is all the rage... this is just a fashion thing and in two years time there'll be something else".  I'm not sure that we (or at least I) have a well honed argument against this view but, for me at least, it lies somewhere in the fit with resource-orientation, with the way the Web works, with REST, and with the Web Architecture.

On the issue of the length of time it is taking for the Semantic Web to have any kind of mainstream impact, Ian Davis has an interesting post, Make or Break for the Semantic Web?, arguing that this is not unusual for standards track work:

Technology, especially standards track work, takes years to cross the chasm from early adopters (the technology enthusiasts and visionaries) to the early majority (the pragmatists). And when I say years, I mean years. Take CSS for example. I’d characterise CSS as having crossed the chasm and it’s being used by the early majority and making inroads into the late majority. I don’t think anyone would seriously argue that CSS is not here to stay.

According to this semi-official history of CSS the first proposal was in 1994, about 13 years ago. The first version that was recognisably the CSS we use today was CSS1, issued by the W3C in December 1996. This was followed by CSS2 in 1998, the year that also saw the founding of the Web Standards Project. CSS 2.1 is still under development, along with portions of CSS3.

Paul Walk has also written an interesting post, Linked, Open, Semantic?, in which he argues that our discussions around the Semantic Web and Linked Data tend to mix up three memes (open data, linked data and semantics) in rather unhelpful ways. I tend to agree, though I worry that Paul's proposed distinction between Linked Data and the Semantic Web is actually rather fuzzier than we may like.

On balance, I feel a little uncomfortable that I am not able to offer a better argument against the kinds of anti-Linked Data views expressed above. I think I understand the issues (or at least some of them) pretty well but I don't have them to hand in a kind of this is why Linked Data is the right way forward 'elevator pitch'.

Something to work on I guess!

[Image: a slide from Bill Thompson's closing keynote to the CETIS 2009 Conference]

October 19, 2009

Helpful Dublin Core RDF usage patterns

In the beginning [*] there was the HTML meta element and we used to write things like:

<meta name="DC.Creator"content="Andy Powell">
<meta name="DC.Subject" content="something, something else, something else again">
<meta name="DC.Date.Available" scheme="W3CDTF" content="2009-10-19">
<meta name="DC.Rights" content="Open Database License (ODbL) v1.0">

Then came RDF and a variety of 'syntax' guidance from DCMI and we started writing:

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns:dcterms="http://purl.org/dc/terms/"
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="http://example.net/something">
    <dc:creator>Andy Powell</dc:creator>
    <dcterms:available>2009-10-19</dcterms:available>
    <dc:subject>something</dc:subject>
    <dc:subject>something else</dc:subject>
    <dc:subject>something else again</dc:subject>
    <dc:rights>Open Database License (ODbL) v1.0</dc:rights>
  </rdf:Description>
</rdf:RDF>

Then came the decision to add 15 new properties to the DC terms namespace which reflected the original 15 DC elements but which added a liberal smattering of domains and ranges.  So, now we write:

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
  xmlns:dcterms="http://purl.org/dc/terms/"
  xmlns:foaf="http://xmlns.com/foaf/0.1/"
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="http://example.net/something">
    <dcterms:creator>
      <dcterms:Agent>
        <rdf:value>Andy Powell</rdf:value>
        <foaf:name>Andy Powell</foaf:name>
      </dcterms:Agent>
    </dcterms:creator>
    <dcterms:available
rdf:datatype="http://purl.org/dc/terms/W3CDTF">2009-10-19</dcterms:available>
    <dcterms:subject>
      <rdf:Description>
        <rdf:value>something</rdf:value>
      </rdf:Description>
    </dcterms:subject>
    <dcterms:subject>
      <rdf:Description>
        <rdf:value>something else</rdf:value>
      </rdf:Description>
    </dcterms:subject>
    <dcterms:subject>
      <rdf:Description>
        <rdf:value>something else again</rdf:value>
      </rdf:Description>
    </dcterms:subject>
    <dcterms:rights
rdf:resource="http://opendatacommons.org/licenses/odbl/1.0/" />
  </rdf:Description>
</rdf:RDF>

Despite the added verbosity and rather heavy use of blank nodes in the latter, I think there are good reasons why moving towards this kind of DC usage pattern is a 'good thing'.  In particular, this form allows the same usage pattern to indicate a subject term by URI or literal (or both - see addendum below) meaning that software developers only need to code to a single pattern. It also allows for the use of multiple literals (e.g. in different languages) attached to a single value resource.

The trouble is, a lot of existing usage falls somewhere between the first two forms shown here.  I've seen examples of both coming up in discussions/blog posts about both open government data and open educational resources in recent days.

So here are some useful rules of thumb around DC RDF usage patterns:

  • DC properties never, ever, start with an upper-case letter (i.e. dcterms:Creator simply does not exist).
  • DC properties never, ever, contain a full-stop character (i.e. dcterms:date.available does not exist either).
  • If something can be named by its URI rather than a literal (e.g. the ODbL licence in the above examples) do so using rdf:resource.
  • Always check the range of properties before use.  If the range is anything other than a literal (as is the case with both dc:subject and dc:creator for example) and you don't know the URI of the value, use a blank or typed node to indicate the value and rdf:value to indicate the value string.
  • Do not provide lists of separate keywords as a single dc:subject value.  Repeat the property multiple times, as necessary.
  • Syntax encoding schemes, W3CDTF in this case, are indicated using rdf:datatype.

See the Expressing Dublin Core metadata using the Resource Description Framework (RDF) DCMI Recommendation for more examples and guidance.

[*] The beginning of Dublin Core metadata obviously! :-)

Addendum

As Bruce notes in the comments below, the dcterms:subject pattern that I describe above applies in those situations where you do not know the URI of the subject term. In cases where you do know the URI (as is the case with LCSH for example), the pattern becomes:

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
  xmlns:dcterms="http://purl.org/dc/terms/"
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="http://example.net/something">
    <dcterms:subject>
      <rdf:Description rdf:about="http://id.loc.gov/authorities/sh85101653#concept">
        <rdf:value>Physics</rdf:value>
      </rdf:Description>
    </dcterms:subject>
  </rdf:Description>
</rdf:RDF>

October 14, 2009

Open, social and linked - what do current Web trends tell us about the future of digital libraries?

About a month ago I travelled to Trento in Italy to speak at a Workshop on Advanced Technologies for Digital Libraries organised by the EU-funded CACOA project.

My talk was entitled "Open, social and linked - what do current Web trends tell us about the future of digital libraries?" and I've been holding off blogging about it or sharing my slides because I was hoping to create a slidecast of them. Well... I finally got round to it and here is the result:

Like any 'live' talk, there are bits where I don't get my point across quite as I would have liked but I've left things exactly as they came out when I recorded it. I particularly like my use of "these are all very bog standard... err... standards"! :-)

Towards the end, I refer to David White's 'visitors vs. residents' stuff, about which I note he has just published a video. Nice one.

Anyway... the talk captures a number of threads that I've been thinking and speaking about for the last while. I hope it is of interest.

October 12, 2009

Description Set Profiles for DC-2009

For the first time in a few years, I'm not attending the Dublin Core conference, which is taking place in Seoul, Korea this week. I've been actively trying to cut back on the long-haul travel, mainly to shrink my personal carbon footprint, which for several years was probably the size of a medium-sized Hebridean island. But also I have to admit that increasingly I've found it pretty exhausting to try to engage with a long meeting or deliver a presentation after a long flight, and on occasions I've found myself questioning whether it's the "best" use of my own energy reserves!

More broadly, I suppose I think we - the research community generally, not just the DC community - really have to think carefully about the sustainability of the "traditional" model of large international conferences with hundreds of people, some of them travelling thousands of miles to participate.

But that's a topic for another time, and of course, this week I will miss catching up with friends, practising the fine art of waving my hands around animatedly while also trying to maintain control of a full lunch plate, and hatching completely unrealistic plans over late-night beers. Good luck and safe travels to everyone who is attending the conference in Seoul.

Anyway, Liddy Nevile, who is chairing the Workshop Committee (and has over recent years made efforts to enable and encourage remote participation in the conference), invited Karen Coyle and me to contribute a pre-recorded presentation on Description Set Profiles. We only have a ten minute slot so it's necessarily a rather hasty sprint through the topic, but the results are below.

I think this is the first time I've recorded my own voice since I was in a language lab in an A-Level French class in 1990! To date, I've done my best to avoid anything resembling podcasting, or retrospectively adding audio to any of my presentation slide decks on Slideshare, mostly because I hate listening to the sound of my own voice, perhaps all the more so because I know I tend to "umm" and "err" quite a lot. And even if I write myself a "script" (which in most cases I don't) I seem to find it hard to resist the temptation to change bits and insert additional comments on the fly, and then I realise I'm altering the structure of a sentence and repeating things, and I "umm" and "err" even more... Argh. I'm sure I recorded every sentence of this in Audacity at least three times over! :-)

October 05, 2009

QNames, URIs and datatypes

I posted a version of this on the dc-date Jiscmail list recently, but I thought it was worth a quick posting here too, as I think it's a slightly confusing area and the differences in naming systems is something which I think has tripped up some DC implementers in the past.

The Library of Congress has developed a (very useful looking) XML schema datatype for dates, called the Extended Date Time Format (EDTF). It appears to address several of the issues which were discussed some time ago by the DCMI Date working group. It looks like the primary area of interest for the LoC is the use of the data type in XML and XML Schema, and the documentation for the new datatype illustrates its usage in this context.

However, as far as I can see, it doesn't provide a URI for the datatype, which is a prerequisite for referencing it in RDF. I think sometimes we assume that providing what the XML Namespaces spec calls an expanded name is sufficient, and/or that doing that automatically also provides a URI, but I'm not sure that is the case.

In XML Schema, a datatype is typically referred to using an XML Qualified Name (QName), which is essentially a "shorthand" for an "expanded name". An expanded name has two parts: a (possibly null-valued) "namespace name" and a "local name".

Take the case of the built-in datatypes defined as part of the XML Schema specification. In XML/XML Schema, these datatypes are referenced using these two part "expanded names", typically represented in XML documents as QNames e.g.

<count xsi:type="xsd:int">42</count>

where the prefix xsd is bound to the XML namespace name "http://www.w3.org/2001/XMLSchema". So the two-part expanded name of the XML Schema integer datatype is

( http://www.w3.org/2001/XMLSchema , int )

Note: no trailing "#" on that namespace name.

The key point here is that the expanded name system is a different naming system from the URI system. True, the namespace name component of the expanded name is a URI (if it isn't null), but the expanded name as a whole is not. And furthermore, there is no generalised mapping from expanded name to URI.

To refer to a datatype in RDF, I need a URI, and for the built-in datatypes provided, the XML Schema Datayping spec tells me explicitly how to construct such a URI:

Each built-in datatype in this specification (both *primitive* and *derived*) can be uniquely addressed via a URI Reference constructed as follows:

  1. the base URI is the URI of the XML Schema namespace
  2. the fragment identifier is the name of the datatype

For example, to address the int datatype, the URI is:

* http://www.w3.org/2001/XMLSchema#int

The spec tells me that for the names of this set of datatypes, to generate a URI from the expanded name, I use the local part as the fragment id - but some other mechanism might have been applied.

And this is different from, say, the way RDF/XML maps the QNames of some XML elements to URIs, where the mechanism is simple concatenation of "namespace name" and "local name"

(And, as an aside, I think this makes for a bit of "gotcha" for XML folk coming to RDF syntaxes like Turtle where datatype URIs can be represented as prefixed names, because the "namespace URI" you use in Turtle, where the simple concatenation mechanism is used, needs to include the trailing "#" (http://www.w3.org/2001/XMLSchema#), whereas in XML/XML Schema people are accustomed to using the "hash-less" XML namespace name (http://www.w3.org/2001/XMLSchema)).

But my understanding is that that the mapping defined by the XML Schema datatyping document is specific to the datatypes defined by that document ("Each built-in datatype in this specification... can be uniquely addressed..." (my emphasis)), and the spec is silent on URIs for other user-defined XML Schema datatypes.

So it seems to me that if a community defines a datatype, and they want to make it available for use in both XML/XML Schema and in RDF, they need to provide both an expanded name and a URI for the datatype.

The LoC documentation for the datatype has plenty of XML examples so I can deduce what the expanded name is:

(info:lc/xmlns/edtf-v1 , edt)

But that doesn't tell me what URI I should use to refer to the datatype.

I could make a guess that I should follow the same mapping convention as that provided for the XML Schema built-in datatypes, and decide that the URI to use is

info:lc/xmlns/edtf-v1#edt

But given that there is no global expanded name-URI mapping, and the XML Schema specs provide a mapping only for the names of the built-in types, I think I'd be on shaky ground if I did that, and really the owners of the datatype need to tell me explicitly what URI to use - as the XML Schema spec does for those built-in types.

Douglas Campbell responded to my message pointing to a W3C TAG finding which provides this advice:

If it would be valuable to directly address specific terms in a namespace, namespace owners SHOULD provide identifiers for them.

Ray Denenberg has responded with a number of suggestions: I'm less concerned about the exact form of the URI than that a URI is explicitly provided.

P.S. I didn't comment on the list on the choice of URI scheme, but of course I agree with Andy's suggestion that value would be gained from the use of the http URI scheme.... :-)

September 22, 2009

VoCamp Bristol

At the end of the week before last, I spent a couple of days (well, a day and a half as I left early on Friday) at the VoCamp Bristol meeting, at ILRT, at the University of Bristol.

To quote the VoCamp wiki:

VoCamp is a series of informal events where people can spend some dedicated time creating lightweight vocabularies/ontologies for the Semantic Web/Web of Data. The emphasis of the events is not on creating the perfect ontology in a particular domain, but on creating vocabs that are good enough for people to start using for publishing data on the Web.

I admit that I went into the event slightly unprepared, as I didn't have any firm ideas about any specific vocabulary I wanted to work on, but happy to join in with anyone who was working on anything of interest. Some of the outputs of the various groups are listed on the wiki page.

As well as work on specific vocabularies, the opening discussions highlighted an interest in a small set of more general issues, which included the expression of "structural constraints" and "validation"; broader questions of collecting and interpreting vocabulary usage; representing RDF data using JSON; and the features available in OWL 2. Friday morning was set aside for those topics, which meant I had an opportunity to talk a little bit about the work being done within the Dublin Core Metadata Initiative on "Description Set Profiles", which I've mentioned in some recent posts here. I did hastily knock up a few slides, mainly as an aide memoire to make sure I mentioned various bits and pieces:

There was a useful discussion around various different approaches for representing such patterns of constraints at the level of the RDF graph, either based on query patterns, or on the use of OWL (with a "closed-world" assumption that the "world" in question is the graph at hand). Some of the new features in OWL 2, such as capabilities for expressing restrictions on datatypes seem to make it quite an attractive candidate for this sort of task.

I was asked about whether we had considered the use of OWL in the DCMI context. IIRC, we decided against it mostly because we wanted an approach that built explicitly on the description model of the DCMI Abstract Model (i.e. talked in terms of "descriptions" and "statements" and patterns of use of those particular constructs), though I think the "open-world" considerations were also an issue (See this piece for a discussion of some of the "gotchas" that can arise).

Having said that, it would seem a good idea to explore to what extent the constraint types permitted by the DSP model might be mapped into other form(s) of expressing constraints which might be adopted.

All in all, it was a very enjoyable couple of days: a fairly low-key, thoughtful, gentle sort of gathering - no "pitches", no prizes, no "dragons" in their "dens", or other cod-"bizniz" memes :-) - just an opportunity for people to chat and work together on topics that interested them. Thank you to Tom & Damian & Libby for doing the organisation (and introducing me to a very nice Chinese restaurant in Bristol on the Thursday night!)

August 20, 2009

A socio-political analysis of the history of Dublin Core terms

No... this isn't one. Sorry about that! But reading Eric Hellman's recent blog post, Can Librarians Be Put Directly Onto the Semantic Web?, made me wonder if such a thing would be interesting (in a "how much metadata can you fit on the head of a pin" kind of way!).

Eric's post discusses the need for, or not, inverse properties in the Semantic Web and the necessary changes of mindset in moving from thinking about 'metadata' to thinking about 'ontologies':

In many respects, the most important question for the library world in examining semantic web technologies is whether librarians can successfully transform their expertise in working with metadata into expertise in working with ontologies or models of knowledge. Whereas traditional library metadata has always been focused on helping humans find and make use of information, semantic web ontologies are focused on helping machines find and make use of information. Traditional library metadata is meant to be seen and acted on by humans, and as such has always been an uncomfortable match with relational database technology. Semantic web ontologies, in contrast, are meant to make metadata meaningful and actionable for machines. An ontology is thus a sort of computer program, and the effort of making an RDF schema is the first step of telling a computer how to process a type of information.

I think there's probably some interesting theorising to be done about the history of the Dublin Core metadata properties, in particular about the way they have been named over the years and the way some have explicit inverse properties but others don't.

So, for example, dcterms:creator and dcterms:hasVersion use different naming styles ('creator' rather than 'hasCreator') and dcterms:hasVersion has an explicit inverse (dcterms:isVersionOf) whereas dcterms:creator does not (there is no dcterms:isCreatorOf).

Unfortunately, I don't recall much of the detail of why these changes in attitude to naming occured over the years. My suspicion is that it has something to do with the way our understanding of 'web' metadata has evolved over time.  Two things in particular I guess.  Firstly, the way in which there has been a gradual change from understanding properties as being 'attributes with string values' (very much the view when dc:creator was invented) to understanding properties as 'the meat between two resources in an RDF triple'.  And, secondly, a change in thinking first and foremost about 'card catalogues' and/or relational databases to thinking about triple stores (perhaps better characterised (as Eric did) as a transition between thinking about metadata as something that is viewed by humans to something that is acted upon by software).

I strongly suspect that both these changes in attitude are very much ongoing (at least in the DC community - possibly elsewhere?).

Note also the difference in naming between dcterms:valid and dcterms:dateCopyrighted (both of which are refinements of dcterms:date). The former emerged at a time when the prefered encoding syntaxes tended to prefix 'valid' with 'DC.date.' to give 'DC.date.valid' whereas the latter emerged at a time when properties where recognised as being stand-alone entities (i.e. after the emergence of Semantic Web thinking).

If nothing else, working with the Dublin Core community over the years has served as a very useful reminder about the challenges 'ordinary' (I don't mean that in any way negatively) people face in understanding what some 'geeks' might perceive to be very simple Semantic Web concepts.  I've lost track of the number of 'strings vs. things' type discussions I've been involved in!  And to an extent, part of the reason for developing the DCMI Abstract Model was to try to bridge the gap between a somewhat 'old-skool' (dare I say, 'traditional librarian'?) view of the world and the Semantic Web view of the world.  Of course, one can argue about whether we succeeded in that aim.

July 29, 2009

Enhanced PURL server available

A while ago, I posted about plans to revamp the PURL server software to (amongst other things) introduce support for a range of HTTP response codes. This enables the use of PURLs for the identification of a wider range of resources than "Web documents" using the interaction patterns specified by current guidelines provided by the W3C in the TAG's httpRange-14 resolution and the Cool URIs for the Semantic Web document.

Lorcan Dempsey posted on Twitter yesterday to indicate that OCLC have deployed the new software, developed by Zepheira, on the OCLC purl.org service. Although I've just had time for a quick poke around, and I need to read the documentation more carefully to understand all the new features, it looks like it does the job quite nicely.

This should mean that existing PURL owners who have used PURLs to identify things other than "Web documents" - like DCMI, who use PURLs like http://purl.org/dc/terms/title to identify their "metadata terms", "conceptual" resources - should be able to adjust the appropriate entries on the purl.org server so that interactions follow those guidelines. It also offers a new option for those who wish to set up such redirects but perhaps don't have suitable levels of access to configure their own HTTP server to perform those redirects.

July 21, 2009

Linked data vs. Web of data vs. ...

On Friday I asked what I thought would be a pretty straight-forward question on Twitter:

is there an agreed name for an approach that adopts the 4 principles of #linkeddata minus the phrase, "using the standards (RDF, SPARQL)" ??

Turns out not to be so straight-forward, at least in the eyes of some of my Twitter followers. For example, Paul Miller responded with:

@andypowe11 well, personally, I'd argue that Linked Data does NOT require that phrase. But I know others disagree... ;-)

and

@andypowe11 I'd argue that the important bit is "provide useful information." ;-)

Paul has since written up his views more thoughtfully in his blog, Does Linked Data need RDF?, a post that has generated some interesting responses.

I have to say I disagree with Paul on this, not in the sense that I disagree with his focus on "provide useful information", but in the sense that I think it's too late to re-appropriate the "Linked Data" label to mean anything other than "use http URIs and the RDF model".

To back this up I'd go straight to the horses mouth, Tim Berners-Lee, who gave us his personal view way back in 2006 with his 'design issues' document on Linked Data. This gave us the 4 key principles of Linked Data that are still widely quoted today:

  1. Use URIs as names for things.
  2. Use HTTP URIs so that people can look up those names.
  3. When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL).
  4. Include links to other URIs. so that they can discover more things.

Whilst I admit that there is some wriggle room in the interpretation of the 3rd point - does his use of "RDF, SPARQL" suggested these as possible standards or is the implication intended to be much stronger? - more recent documents indicate that the RDF model is mandated. For example, in Putting Government Data online Tim Berners-Lee says (refering to Linked Data):

The essential message is that whatever data format people want the data in, and whatever format they give it to you in, you use the RDF model as the interconnection bus. That's because RDF connects better than any other model.

So, for me, Linked Data implies use of the RDF model - full stop. If you put data on the Web in other forms, using RSS 2.0 for example, then you are not doing Linked Data, you're doing something else! (Addendum: I note that Ian Davis makes this point rather better in The Linked Data Brand).

Which brings me back to my original question - "what do you call a Linked Data-like approach that doesn't use RDF?" - because, in some circumstances, adhering to a slightly modified form of the 4 principles, namely:

  1. use URIs as names for things
  2. use HTTP URIs so that people can look up those names
  3. when someone looks up a URI, provide useful information
  4. include links to other URIs. so that they can discover more things

might well be a perfectly reasonable and useful thing to do. As purists, we can argue about whether it is as good as 'real' Linked Data but sometimes you've just got to get on and do whatever you can.

A couple of people suggested that the phrase 'Web of data' might capture what I want. Possibly... though looking at Tom Coates' Native to a Web of Data presentation it's clear that his 10 principles go further than the 4 above.  Maybe that doesn't matter? Others suggested "hypermedia" or "RESTful information systems" or "RESTful HTTP" none of which strikes me as quite right.

I therefore remain somewhat confused. I quite like Bill de hÓra's post on "links in content", Snowflake APIs, but, again, I'm not sure it gets us closer to an agreed label?

In a comment on a post by Michael Hausenblas, What else?, Dan Brickley says:

I have no problem whatsoever with non-RDF forms of data in “the data Web”. This is natural, normal and healthy. Stastical information, geographic information, data-annotated SVG images, audio samples, JSON feeds, Atom, whatever.

We don’t need all this to be in RDF. Often it’ll be nice to have extracts and summaries in RDF, and we can get that via GRDDL or other methods. And we’ll also have metadata about that data, again in RDF; using SKOS for indicating subject areas, FOAF++ for provenance, etc.

The non-RDF bits of the data Web are – roughly – going to be the leaves on the tree. The bit that links it all together will be, as you say, the typed links, loose structuring and so on that come with RDF. This is also roughly analagous to the HTML Web: you find JPEGs, WAVs, flash files and so on linked in from the HTML Web, but the thing that hangs it all together isn’t flash or audio files, it’s the linky extensible format: HTML. For data, we’ll see more RDF than HTML (or RDFa bridging the two). But we needn’t panic if people put non-RDF data up online…. it’s still better than nothing. And as the LOD scene has shown, it can often easily be processed and republished by others. People worry too much! :)

Count me in as a worrier then!

I ask because, as a not-for-profit provider of hosting and Web development solutions to the UK public sector, Eduserv needs to start thinking about the implications of Tim Berners-Lee's appointment as an advisor to the UK government on 'open data' issues on the kinds of solutions we provide.  Clearly, Linked Data is going to feature heavily in this space but I fully expect that lots of stuff will also happen outside the RDF fold.  It's important for us to understand this landscape and the impact it might have on future services.

April 24, 2009

More RDFa in UK government

It's quite exciting to see various initiatives within UK government starting to make use of Semantic Web technologies, and particularly of RDFa. At the recent OKCon conference, I heard Jeni Tennison talk about her work on using RDFa in the London Gazette. Yesterday, Mark Birbeck published a post outlining some of his work with the Central Office of Information.

The example Mark focuses on is that of a job vacancy, where RDFa is used to provide descriptions of various related resources: the vacancy, the job for which the vacancy is available, a person to contact, and so on. Mark provides an example of a little display app built on the Yahoo SearchMonkey platform which processes this data.

As a a footnote (a somewhat lengthy one, now that I've written it!), I'd just draw attention to Mark's description of developing what he calls an RDF "argot" for constructing such descriptions:

The first vocabularies -- or argots -- that I defined were for job vacancies, but in order to make the terminology usable in other situations, I broke out argots for replying to the vacancy, the specification of contact details, location information, and so on.

An argot doesn't necessarily involve the creation of new terms, and in fact most of the argots use terms from Dublin Core, FOAF and vCard. So although new terms have been created if they are needed, the main idea behind an argot is to collect together terms from various vocabularies that suit a particular purpose.

I was struck by some of the parallels between this and DCMI's descriptions of developing what it calls an "DC application profile" - with the caveat that DCMI typically talks in terms of the DCMI Abstract Model rather than directly of the RDF model. e.g. the Singapore Framework notes:

In a Dublin Core Application Profile, the terms referenced are, as one would expect, terms of the type described by the DCMI Abstract Model, i.e. a DCAP describes, for some class of metadata descriptions, which properties are referenced in statements and how the use of those properties may be constrained by, for example, specifying the use of vocabulary encoding schemes and syntax encoding schemes. The DC notion of the application profile imposes no limitations on whether those properties or encoding schemes are defined and managed by DCMI or by some other agency

And in the draft Guidelines for Dublin Core Application Profiles:

the entities in the domain model -- whether Book and Author, Manifestation and Copy, or just a generic Resource -- are types of things to be described in our metadata. The next step is to choose properties for describing these things. For example, a book has a title and author, and a person has a name; title, author, and name are properties.

The next step, then, is to scan available RDF vocabularies to see whether the properties needed already exist. DCMI Metadata Terms is a good source of properties for describing intellectual resources like documents and web pages; the "Friend of a Friend" vocabulary has useful properties for describing people. If the properties one needs are not already available, it is possible to declare one's own

And indeed the Job Vacancy argot which Mark points to would, I think, probably be fairly recognisable to those familiar with the DCAP notion: compare, for example, with the case of the Scholarly Works Application Profile. The differences are that (I think) an "argot" focuses on the description of a single resource type, and I don't think it goes as far as a formal description of structural constraints in quite the same way DCMI's Description Set Profile model does.

January 14, 2009

Resource List Management on the Semantic Web

Via a post by Ivan Herman of the W3C, I came across a W3C case study titled A Linked Open Data Resource List Management Tool for Undergraduate Students, based on work done between Talis and the University of Plymouth.

Andy and I visited Talis, well, I was going to say a few months ago, but it was probably the middle of last year, and Rob Styles, Chris Clarke & other Talisians talked to us a little bit then about this work, but at that point I don't think they had a live system to show.

This looks pretty neat stuff. It's an RDF application, based on the Talis Platform. They make use of a number of existing ontologies (SIOC, BIBO) and have designed a simple ontology for the Reading Lists themselves and also one for the organisational structure of an academic institution, the AIISO ontology - which I imagine may be of interest to other projects working in this area.

Intelligent "bookmarking" tools for adding items to lists use a variety of techniques to extract metadata from Web pages (in a similar way to the Zotero citation manager tool); the metadata is exposed as RDFa in XHTML representations of the lists, which makes it available to systems like Yahoo's SearchMonkey; other RDF formats are available via content negotiation (following the Linked Data/Cool URIs for the Semantic Web principles); and a SPARQL endpoint for the dataset is available (though I'm not sure whether this is public). The system also allows students to provide annotations, which are also stored as RDF data, but in a separate data store from the "primary" reading list data, allowing different access controls.

About

Recent Mentions on Twitter

eFoundations is powered by TypePad
Add to Technorati Favorites