November 17, 2012

Analysing the police and crime commissioner election data

UK readers will know that elections for police and crime commissioners (PCCs) took place earlier this week, the first time we have had such elections in England and Wales. They will also know that some combination of lack of knowledge, apathy and disagreement with the whole process led to our lowest collective voter turnout for a long time.

In the UK, election results are typically reported in terms of the percentage of votes cast. This makes reasonable sense in cases where the turnout is good; one can make the assumption that the votes cast are representative of the electorate as a whole. Where voter turnout is very low, as was the case this time, I think it makes far less sense. People have stayed away for a reason and I think it is more interesting to look at the mandate that newly elected commissioners have received in terms of the overall electorate.

I took a quick look at the PCC results data that is available from the Guardian data blog and came up with the following chart:

[Chart: 'Voted for winner', 'Voted for other' and 'Didn't vote' as percentages of the electorate in each police force area]
The result makes for pretty uncomfortable viewing, at least for anyone worried about how well democracy served England and Wales in this case.

The data itself comes in the form of a Google Spreadsheet containing a row for each police force and columns for the winning party and the turnout (as votes and as a percentage), followed by various columns for each of the candidates.

The data does not contain a column indicating the size of the electorate in each area, so I added a column called 'Potential vote' and reverse engineered it from the 'Turnout, votes' and 'Turnout, %' columns.

The data also does not contain a single column for each of the winners - instead, this data is spread across multiple columns representing each of the candidates. I created three new columns called 'Winner', 'Voted for winner' and 'Voted for winner as % of turnout' and manually copied the data across from the appropriate cells. (There is probably an automated way of doing this but I couldn't think of it and there aren't that many rows to deal with, so I did it one at a time).

From there, it is pretty straight-forward to populate columns called 'Voted for other' ('Turnout, votes' less 'Voted for winner') and 'Didn't vote' ('Potential vote' less 'Voted for winner' less 'Voted for other') and to turn these into corresponding percentages of the electorate.

Voila!
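For anyone wanting to reproduce the numbers, the arithmetic above is simple enough to script. The sketch below (Python, using pandas) assumes a CSV export with column names matching those described above; I actually did the work by hand in the spreadsheet, so treat this purely as an illustration of the calculation.

import pandas as pd

# Load the results data (file name and column names assumed to match those described above).
df = pd.read_csv("pcc-results.csv")

# Reverse-engineer the size of the electorate from the turnout figures.
df["Potential vote"] = df["Turnout, votes"] / (df["Turnout, %"] / 100)

# Split the electorate into the three groups used in the chart.
df["Voted for other"] = df["Turnout, votes"] - df["Voted for winner"]
df["Didn't vote"] = df["Potential vote"] - df["Voted for winner"] - df["Voted for other"]

# Express each group as a percentage of the electorate.
for col in ["Voted for winner", "Voted for other", "Didn't vote"]:
    df[col + ", % of electorate"] = 100 * df[col] / df["Potential vote"]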

As an aside, the data does not contain any information about spoiled ballot papers, of which there were allegedly a large number in this election. A revised version of the spreadsheet today contains columns for this data but they are currently unpopulated, so I assume that this information is coming. I think this would make a useful addition to this chart.

If you are interested, my data is available here.

May 18, 2012

Big Data - size doesn't matter, it's the way you use it that counts

...at least, that's what they tell me!

Here's my brief take on this year's Eduserv Symposium, Big Data, big deal?, which took place in London last Thursday and which was, by all the accounts I've seen and heard, a pretty good event.

The day included a mix of talks, from an expansive opening keynote by Rob Anderson to a great closing keynote by Anthony Joseph. Watching either, or both, of these talks will give you a very good introduction to big data. Between the two we had some specifics: Guy Coates and Simon Metson talking about their experiences of big data in genomics and physics respectively (though the latter also included some experiences of moving big data techniques between different academic disciplines); a view of the role of knowledge engineering and big data in bridging the medical research/healthcare provision divide by Anthony Brookes; a view of the potential role of big data in improving public services by Max Wind-Cowie; and three shorter talks immediately after lunch - Graham Prior talking about big data and curation, Devin Gafney talking about his 140Kit twitter-analytics project (which, coincidentally, is hosted on our infrastructure) and Simon Hodson talking about the JISC's big data activities.

All of the videos and slides from the day are available at the links above. Enjoy!

For my part, there were several take-home messages:

  • Firstly, that we shouldn’t get too hung up on the word ‘big’. Size is clearly one dimension of the big data challenge but of the three words most commonly associated with big data - volume, velocity and variety - it strikes me that volume is the least interesting and I think this was echoed by several of the talks on the day.
  • In particular, it strikes me there is some confusion between ‘big data’ and ‘data that happens to be big’ - again, I think we saw some of this in some of the talks. Whilst the big data label has helped to generate interest in this area, it seems to me that its use of the word 'big' is rather unhelpful in this respect. It also strikes me that the JISC community, in particular, has a history of being more interested in curating and managing data than in making use of it, whereas big data is more about the latter than the former.
  • As with most new innovations (though 'evolution' is probably a better word here) there is a temptation to focus on the technology and infrastructure that makes it work, particularly amongst a relatively technical audience. I am certainly guilty of this. In practice, it is the associated cultural change that is probably more important. Max Wind-Cowie’s talk, in particular, referred to the kinds of cultural inertia that need to be overcome in the public sector, on both the service provider side and the consumer side, before big data can really have an impact in terms of improving public services. Attitudes like, "how can a technology like big data possibly help me build a *closer* and more *personal* relationship with my clients?" or "why should I trust a provider of public services to know this much about me?" seem likely to be widespread. Though we didn't hear about it on the day, my gut feeling is that a similar set of issues would probably apply in education were we, for example, to move towards a situation where we make significant use of big data techniques to tailor learning experiences at an individual level. My only real regret about the event was that I didn't find someone to talk on this theme from an education perspective.
  • Several talks referred to the improvements in 'evidence-based' decision-making that big data can enable. For example, Rob Anderson talked about poor business decisions currently being based on poor data and Anthony Brookes discussed the role of knowledge engineering in improving the ability of those involved in front-line healthcare provision to take advantage of the most recent medical research. As Adam Cooper of CETIS argues in Analytics and Big Data - Reflections from the Teradata Universe Conference 2012, we need to find ways to ask questions that have efficiency or effectiveness implications and we need to look for opportunities to exploit near-real-time data if we are to see benefits in these areas.
  • I have previously raised the issue of possible confusion, especially in the government sector, between 'open data' and 'big data'. There was some discussion of this on the day. Max Wind-Cowie, in particular, argued that 'open data' is a helpful - indeed, a necessary - step in encouraging the public sector to move toward a more transparent use of public data. The focus is currently on the open data agenda but this will encourage an environment in which big data tools and techniques can flourish.
  • Finally, the issue that almost all speakers touched on to some extent was that of the need to grow the pool of people who can undertake data analytics. Whether we choose to refer to such people as data scientists, knowledge engineers or something else there is a need for us to grow the breadth and depth of the skills-base in this area and, clearly, universities have a critical role to play in this.

As I mentioned in my opening to the day, Eduserv's primary interest in Big Data is somewhat mundane (though not unimportant) and lies in the enabling resources that we can bring to the communities we serve (education, government, health and other charities), either in the form of cloud infrastructure on which big data tools can be run or in the form of data centre space within which physical kit dedicated to Big Data processing can be housed. We have plenty of both and plenty of bandwidth to JANET so if you are interested in working with us, please get in touch.

Overall, I found the day enlightening and challenging and I should end with a note of thanks to all our speakers who took the time to come along and share their thoughts and experiences.

[Photo: Eliot Hall, Eduserv]

April 02, 2012

Big data, big deal?

Some of you may have noticed that Eduserv's annual symposium is happening on May 10. Once again, we're at the Royal College of Physicians in London and this year we are looking at big data, appropriate really... since 2012 has been widely touted as being the year of big data.

Here's the blurb for our event:

Data volumes have been growing exponentially for a long while – so what’s new now? Is Big Data [1] just the latest hype from vendors chasing big contracts? Or does it indeed present wholly new challenges and critical new opportunities, and if so what are they?

The 2012 Symposium will investigate Big Data, uncovering what makes it different from what has gone before and considering the strategic issues it brings with it: both how to use it effectively and how to manage it.  It will look at what Big Data will mean across research, learning, and operations in HE, and at its implications in government, health, and the commercial sector, where large-scale data is driving the development of a whole new set of tools and techniques.

Through presentations and debate delegates will develop their understanding of both the likely demands and the potential benefits of data volumes that are growing disruptively fast in their organisation.

[1] Big Data is "data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn't fit the strictures of your database architectures. To gain value from this data, you must choose an alternative way to process it."  What is big data?  Edd Dumbill, O'Reilly Radar, Jan 2012

As usual, the event is free to attend and will be followed by a drinks reception.

You'll note that we refer to Edd Dumbill's What is big data? article in order to define what we mean by big data and I recommend reading this by way of an introduction for the day. The Wikipedia page for Big data provides a good level of background and some links for further reading. Finally, O'Reilly's follow-up publication, Planning for Big Data - A CIO's Handbook to the Changing Data Landscape is also worth a look (and is free to download as an e-book).

You'll also note that the defining characteristics of big data include not just 'size' (though that is certainly an important dimension) but also 'rate of creation and/or change', and 'structural coherence'. These are typically known as the three Vs - "volume (amount of data), velocity (speed of data in/out), and variety (range of data types, sources)". In looking around for speakers, my impression is that there is a strong emphasis on the first of these in people's general understanding about what big data means (which is not surprising given the name) and that in the government sector in particular there is potential confusion between 'big data' and 'open data' and/or 'linked data' which I think it would be helpful to unpick a little - big data might be both 'open' and 'linked' but isn't necessarily so.

So, what do we hope to get out of the day? As usual, it's primarily a 'bringing people up to speed' type of event. The focus will be on our charitable beneficiaries, i.e. organisations working in the area of 'public good' - education, government, health and the charity sector - though I suspect that the audience will be mainly from the first of these. The intention is for people to leave with a better understanding of why big data might be important to them and what impact it might have in both strategic and practical terms on the kinds of activities they undertake.

We have a range of speakers, providing perspectives from inside and outside of those sectors, both hands-on and more theoretical - this is one of the things we always try to do at our symposia. Our sessions include keynotes by Anthony D. Joseph (Chancellor's Associate Professor in Computer Science at University of California, Berkeley) and Rob Anderson (CTO EMEA, Isilon Storage Division at EMC) as well as talks by Professor Anthony J Brookes (Department of Genetics at the University of Leicester), Dr. Guy Coates (Informatics Systems Group at The Wellcome Trust Sanger Institute) and Max Wind-Cowie (Head of the Progressive Conservatism Project, Demos - author of The Data Dividend).

By the way... we still have a couple of speaking slots available and are particularly interested in getting a couple of short talks from people with practical experience of working with big data, either using Hadoop or something else. If you are interested in speaking for 15 minutes or so (or if you know of someone who might be) please get in touch. Thanks. Another area that I was hoping to find a speaker to talk about, but haven't been able to so far, is someone who is looking at the potential impact of big data on learning analytics, either at the level of a single institution or, more likely, at a national level. Again, if this is something you are aware of, please get in touch. Crowd-sourced speakers FTW! :-)

All in all, I'm confident that this will be an interesting and informative day and a good follow-up to last year's symposium on the cloud - I look forward to seeing you there.

October 21, 2011

Two UK government consultations related to open data

This is just a very quick note to highlight that there are two UK government consultations in the area of open data currently in progress and due to close very shortly - next week on 27 October 2011:

  • Making Open Data Real, from the Cabinet Office, on the Transparency and Open Data Strategy, and "establishing a culture of openness and transparency in public services".
  • A Consultation on Data Policy for a Public Data Corporation, from BIS, on the role of the planned Public Data Corporation and "key aspects of data policy – charging, licensing and regulation of public sector information produced by the PDC for re-use – that will determine how a PDC can deliver against all its objectives".

Below are a few pointers to notes and comments I've seen around and about recently via Twitter:

Related to the former consultation is a very interesting report by Kieron O'Hara from the University of Southampton, published by the Cabinet Office as Transparent Government, not Transparent Citizens. It looks at the issues for privacy raised by the government's transparency programme, and at reconciling the desire for openness from government with the privacy of individuals, making the argument that "privacy is a necessary condition for a successful transparency programme".

November 26, 2010

Digital by default

This week saw publication of Martha Lane Fox's report, and the associated government response, about the future of the UK government's web presence, Directgov 2010 and beyond: Revolution not evolution:

The report, and the Government’s initial response, argues for a Channel Shift that will increasingly see public services provided digitally ‘by default’.

Hardly an earth shattering opener! I'm stifling a yawn as I write.

Delving deeper, the report itself makes quite a lot of sense to me, though I'm not really in a position to comment on its practicality. What I did find odd was the wording of the key recommendations, which I found to be rather opaque and confused:

  1. Make Directgov the government front end for all departments’ transactional online services to citizens and businesses, with the teeth to mandate cross government solutions, set standards and force departments to improve citizens’ experience of key transactions.
  2. Make Directgov a wholesaler as well as the retail shop front for government services & content by mandating the development and opening up of Application Programme Interfaces (APIs) to third parties.
  3. Change the model of government online publishing, by putting  a new central team in Cabinet Office in absolute control of the overall user experience across all digital channels, commissioning all government online information from other departments.
  4. Appoint a new CEO for Digital in the Cabinet Office with absolute authority over the user experience across all government online services (websites and APIs) and the power to direct all government online spending.

As has been noted elsewhere, it took a comment by Tom Loosemore on a post by Steph Gray to clarify the real intent of recommendation 3:

The *last* thing that needs to happen is for all online publishing to be centralised into one humungous, inflexible, inefficient central team doing everything from nots to bolts from a bunker somewhere deep in Cabinet Office.

The review doesn’t recommend that. Trust me! It does, as you spotted, point towards a model which is closer to the BBC – a federated commissioning approach, where ‘commissioning’ is more akin to the hands-off commissioning of a TV series, rather than micro-commissioning as per a newspaper editor. Equally, it recommends consistent, high-quality shared UI / design / functionality/serving. Crucially, it recommends universal user metrics driving improvement (or removal) when content can be seen to be underperforming.

So recommendation 3 really appears to mean "federated, and relatively hands-off, commissioning for content that is served at a single domain name". If so, why not simply say that?

One thing I still don't get is why there is such a divide in thinking between "transactional online services" and "information" services. Given the commerce-oriented language used in recommendation 2, this seems somewhat odd. When I use Amazon, I don't go to one place to carry out a transaction and another place to find information about products and services - I just go to the Amazon website. So the starting point for recommendation 1 feels broken from the outset (to me). It is recovered, in part, by the explanation about the real intent of recommendation 3 (which simply draws everything together in one place) but why start from that point at all? It just strikes me as overly confusing.

As I say, this is not a criticism of what is being suggested - just of how it has been suggested.

September 28, 2010

An App Store for the Government?

I listened in to a G-Cloud web-cast organised by Intellect earlier this month, the primary intention of which was to provide an update on where things have got to. I use the term 'update' loosely because, with the election and change of government and what-not, there doesn't seem to have been a great deal of externally visible progress since the last time I heard someone speak about the G-Cloud. This is not surprising I guess.

The G-Cloud, you may recall, is an initiative of the UK government to build a cloud infrastructure for use across the UK public sector. It has three main strands of activity:

The last of these - the 'App Store for Government' (ASG) - strikes me as the hardest to get right. As far as I can tell, it's an idea that stems (at least superficially) from the success of the Apple App Store, though it's not yet clear whether an approach that works well for low-cost, personal apps running on mobile handsets is also going to work for the kinds of software applications found running across government. My worry is that, because of the difficulty, the ASG will distract from progress on the other two fronts, both of which strike me as very sensible and potentially able to save some of the tax-payer's hard-earned dosh.

App stores (the real ones I mean) work primarily because of their scale (global), the fact that people can use them to showcase their work and/or make money, their use of micro-payments, and their socialness. I'm not convinced that any of these factors will have a role to play in a government app store, so the nature of the beast is quite different. During the Q&A session at the end of the web-cast someone asked if government departments and/or local councils would be able to 'sell' their apps to other departments/councils via the ASG. The answer seemed to be that it was unlikely. If we aren't careful we'll end up with a simple registry of government software applications, possibly augmented by up-front negotiated special deals on pricing or whatever and a nod towards some level of social engagement (rating, for example), but where the incentives for taking part will be non-obvious to the very people we need to take part - the people who procure government software. It's the kind of thing that Becta used to do for the schools sector... oh, wait! :-(

For the ASG to work, we need to identify those factors that might motivate people to use it (other than an outright mandate) - as individuals, as departments and as government as a whole. I think this will be quite a tricky thing to get right. That's not to say that it isn't worth trying - it may well be. But I wonder if it would be better unbundled from the other strands of the G-Cloud concept, which strike me as being quite different.

Addendum: A G-Cloud Overview [PDF, dated August 2010] is available from the G-Digital Programme website:

G-Digital will establish a series of digital services that will cover a wide range of government’s expected digital needs and be available across the public sector. G-Digital will look to take advantage of new and emerging service and commercial models to deliver benefits to government.

September 03, 2010

Call for 'ideas' on UK government identity directions

The Register reports that the UK government is calling for ideas on future 'identity' directions, UK.gov fishes for ID ideas:

Directgov has asked IT suppliers to come up with new thinking on identity verification.

The team, which is now within the Cabinet Office, has issued a pre-tender notice published in the Official Journal of the European Union, saying that it wants feedback on potential requirements for the public sector on all aspects of identity verification and authentication. This is particularly relevant to online and telephone channels, and the notice says the services include the provision of related software and computer services.

The notice itself is somewhat hard to find online - I have no idea why that should be! - but a copy is available from the Sell2Wales website.

Oddly, to me at least - perhaps I'm just naive? - the notice doesn't use the word 'open' once, which is a little strange since one might assume that this would be treated as part of the wider 'open government' agenda, as it is in the US, where a similar call resulted in the OpenID Foundation putting together a nice set of resources on OpenID and Open Government. In particular, their Open Trust Frameworks for Open Government whitepaper is worth a look:

Open government is more than just publishing government proceedings and holding public meetings. The real goal is increased citizen participation, involvement, and direction of the governing process itself. This mirrors the evolution of “Web 2.0” on the Internet—the dramatic increase in user-generated content and interaction on websites. These same social networking, blogging, and messaging technologies have the potential to increase the flow of information between governments and citizenry—in both directions. However, this cannot come at the sacrifice of either security or privacy. Ensuring that citizen/government interactions are both easy and safe is the goal of a new branch of Internet technology that has grown very rapidly over the past few years.

July 29, 2010

legislation.gov.uk

I woke up this morning to find a very excited flurry of posts in my Twitter stream pointing to the launch by the UK National Archives of the legislation.gov.uk site, which provides access to all UK legislation, including revisions made over time. A post on the data.gov.uk blog provides some of the technical background and highlights the ways in which the data is made available in machine-processable forms. Full details are provided in the "Developer Zone" documents.

I don't for a second pretend to have absorbed all the detail of what is available, so I'll just highlight a couple of points.

First and foremost, this is being delivered with an eye firmly on the Linked Data principles. From the blog post I mentioned above:

For the web architecturally minded, there are three types of URI for legislation on legislation.gov.uk. These are identifier URIs, document URIs and representation URIs. Identifier URIs are of the form http://www.legislation.gov.uk/id/{type}/{year}/{number} and are used to denote the abstract concept of a piece of legislation - the notion of how it was, how it is and how it will be. These identifier URIs are designed to support the use of legislation as part of the web of Linked Data. Document URIs are for the document. Representation URIs are for the different types of possible rendition of the document, so htm, pdf or xml.

(Aside: I admit to a certain squeamishness about the notion of "representation URIs" and I kinda prefer to think in terms of URIs for Generic Documents and for Specific Documents, along the lines described by Tim Berners-Lee in his "Generic Resources" note, but that's a minor niggle of terminology on my part, and not at all a disagreement with the model.)
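To make the three-way split concrete, here is a minimal sketch in Python. Only the /id/{type}/{year}/{number} identifier pattern is quoted above; the document and representation URIs shown here follow the style of the data.rdf example mentioned in the next paragraph, so treat the exact paths as my illustration rather than a definitive specification.

import urllib.request

# Identifier URI for the abstract piece of legislation (pattern quoted above).
legislation_type, year, number = "ukpga", "2010", "24"
identifier_uri = f"http://www.legislation.gov.uk/id/{legislation_type}/{year}/{number}"

# Document URI (no /id/) and an RDF/XML representation URI - these follow the
# style of the data.rdf example below and are illustrative, not definitive.
document_uri = f"http://www.legislation.gov.uk/{legislation_type}/{year}/{number}"
representation_uri = document_uri + "/data.rdf"

# Dereferencing the representation URI should return RDF/XML describing the legislation.
with urllib.request.urlopen(representation_uri) as response:
    print(response.read()[:500])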

A second aspect I wanted to highlight (given some of my (now slightly distant) past interests) is that, on looking at the RDF data (e.g. http://www.legislation.gov.uk/ukpga/2010/24/contents/data.rdf), I noticed that it appears to make use of a FRBR-based model to deal with the challenge of representing the various flavours of "versioning" relationships.

I haven't had time to look in any detail at the implementation, other than to observe that the data can get quite complex - necessarily so - when dealing with a lot of whole-part and revision-of/variant-of/format-of relationships. (There was one aspect where I wondered if the FRBR concepts were being "stretched" somewhat, but I'm writing in haste and I may well be misreading/misinterpreting the data, so I'll save that question for another day.)

It's fascinating to see the FRBR approach being deployed as a practical solution to a concrete problem, outside of the library community in which it originated.

Pretty cool stuff, and congratulations to all involved in providing it. I look forward to seeing how the data is used.

February 26, 2010

The 2nd Linked Data London Meetup & trying to bridge a gap

On Wednesday I attended the 2nd London Linked Data Meetup, organised by Georgi Kobilarov and Silver Oliver and co-located with the JISC Dev8D 2010 event at ULU.

The morning session featured a series of presentations:

  • Tom Heath from Talis started the day with Linked Data: Implications and Applications. Tom introduced the RDF model, and planted the idea that the traditional "document" metaphor (and associated notions like the "desktop" and the "folder") was inappropriate and unnecessarily limiting in the context of Linked Data's Web of Things. Tom really just scratched the surface of this topic, I think, with a few examples of the sort of operations we might want to perform, and there was probably scope for a whole day of exploring it.
  • Tom Scott from the BBC on the Wildlife Finder, the ontology behind it, and some of the issues around generating and exposing the data. I had heard Tom speak before, about the BBC Programmes and Music sites, and again this time I found myself admiring the way he covered potentially quite complex issues very clearly and concisely. The BBC examples provide great illustrations of how linked data is not (or at least should not be) something "apart from" a "Web site", but rather is an integral part of it: they are realisations of the "your Web site is your API" maxim. The BBC's use of Wikipedia as a data source also led into some interesting discussion of trust and provenance, and dealing with the risk of, say, an editor of a Wikipedia page creating malicious content which was then surfaced on the BBC page. At the time of the presentation, the wildlife data was still delivered only in HTML, but Tom announced yesterday that the RDF data was now being exposed, in a similar style to that of the Programmes and Music sites.
  • John Sheridan and Jeni Tennison described their work on initiatives to open up UK government data. This was really something of a whirlwind (or maybe given the presenters' choice of Wild West metaphors, that should be a "twister") tour through a rapidly evolving landscape of current work, but I was impressed by the way they emphasised the practical and pragmatic nature of their approaches, from guidance on URI design through work on provenance, to the current work on a "Linked Data API" (on which more below)
  • Lin Clark of DERI gave a quick summary of support for RDFa in Drupal 7. It was necessarily a very rapid overview, but it was enough to make me make a mental note to try to find the time to explore Drupal 7 in more detail.
  • Georgi Kobilarov and Silver Oliver presented Uberblic, which provides a single integrated point of access to a set of data sources. One of the very cool features of Uberblic is that updates to the sources (e.g. a Wikipedia edit) are reflected in the aggregator in real time.

The morning closed with a panel session chaired by Paul Miller, involving Jeni Tennison, Tom Scott, Ian Davis (Talis) and Timo Hannay (Nature Publishing) which picked up many of the threads from the preceding sessions. My notes (and memories!) from this session seem a bit thin (in my defence, it was just before lunch and we'd covered a lot of ground...), but I do recall discussion of the trade-offs between URI readability and opacity, and the impact on managing persistence, which I think spilled out into quite a lot of discussion on Twitter. IIRC, this session also produced my favourite quote of the day, from Tom Scott, which was something along the lines of, "The idea that doing linked data is really hard is a myth".

Perhaps the most interesting (and timely/topical) session of the day was the workshop at the end of the afternoon by Jeni Tennison, Leigh Dodds and Dave Reynolds, in which they introduced a proposal for what they call a "Linked Data API".

This defines a configurable "middleware" layer that sits in front of a SPARQL endpoint to support the provision of RESTful access to the data, including not only the provision of descriptions of individual identified resources, but also selection and filtering based on simple URI patterns rather than on SPARQL, and the delivery of multiple output formats (including a serialisation of RDF in JSON - and the ability to generate HTML or XHTML). (It only addresses read access, not updates.)

This initiative emerged at least in part out of responses to the data.gov.uk work, and comments on the UK Government Data Developers Google Group and elsewhere by developers unfamiliar with RDF and related technologies. It seeks to try to address the problem that the provision of queries only through SPARQL requires the developers of applications to engage directly with the SPARQL query language, the RDF model and the possibly unfamiliar formats provided by SPARQL. At the same time, this approach also seeks to retain the "essence" of the RDF model in the data - and to provide clients with access to the underlying queries if required: it complements the SPARQL approach, rather than replaces it.

The configurability offers a considerable degree of flexibility in the interface that can be provided - without the requirement to create new application code. Leigh made the important point that the API layer might be provided by the publisher of the SPARQL endpoint, or it might be provided by a third party, acting as an intermediary/proxy to a remote SPARQL endpoint.
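To illustrate the general idea (and this is just my own toy sketch of the middleware pattern, not the proposed API specification itself), the following Python maps a simple RESTful path onto a SPARQL DESCRIBE query against a configurable endpoint, using the standard SPARQL protocol over HTTP. The endpoint and URI space are placeholders.

import urllib.parse
import urllib.request

SPARQL_ENDPOINT = "http://example.gov.uk/sparql"  # placeholder endpoint
BASE_URI = "http://example.gov.uk/id/"            # placeholder URI space

def describe(path, accept="text/turtle"):
    # Map a friendly path (e.g. "school/123") onto a SPARQL DESCRIBE query,
    # so the client never has to write SPARQL itself.
    resource_uri = BASE_URI + path
    query = "DESCRIBE <" + resource_uri + ">"
    url = SPARQL_ENDPOINT + "?" + urllib.parse.urlencode({"query": query})
    request = urllib.request.Request(url, headers={"Accept": accept})
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")

# The client asks for a resource by path; the middleware does the SPARQL.
print(describe("school/123"))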

IIRC, mentions were made of work in progress on implementations in Ruby, PHP and Java(?).

As a non-developer myself, I hope I haven't misrepresented any of the technical details in my attempt to summarise this. There was a lot of interest in this session at the meeting, and it seems to me this is potentially an important contribution to bridging the gap between the world of Linked Data and SPARQL on the one hand and Web developers on the other, both in terms of lowering immediate barriers to access and in terms of introducing SPARQL more gradually. There is now a Google Group for discussion of the API.

All in all it was an exciting if somewhat exhausting day. The sessions I attended were all pretty much full to capacity and generated a lot of discussion, and it generally felt like there is a great deal of excitement and optimism about what is becoming possible. The tools and infrastructure around linked data are still evolving, certainly, but I was particularly struck - through initiatives like the API project above - by the sense of willingness to respond to comments and criticisms and to try to "build bridges", and to do so in very real, practical ways.

February 02, 2010

Data.gov.uk, Creative Commons and the public domain

In a blog post at Creative Commons, UK moves towards opening government data, Jane Park notes that the UK Government have taken a significant step towards the use of Creative Commons licences by making the terms and conditions for the data.gov.uk website compatible with CC-BY 3.0:

In a step towards openness, the UK has opened up its data to be interoperable with the Attribution Only license (CC BY). The National Archives, a department responsible for “setting standards and supporting innovation in information and records management across the UK,” has realigned the terms and conditions of data.gov.uk to accommodate this shift. Data.gov.uk is “an online point of access for government-held non-personal data.” All content on the site is now available for reuse under CC BY. This step expresses the UK’s commitment to opening its data, as they work towards a Creative Commons model that is more open than their former Click-Use Licenses.

This feels like a very significant move - and one that I hadn't fully appreciated in the recent buzz around data.gov.uk.

Jane Park ends her piece by suggesting that "the UK as well as other governments move in the future towards even fuller openness and the preferred standard for open data via CC Zero". Indeed, I'm left wondering about the current move towards CC-BY in relation to the work undertaken a while back by Talis to develop the Open Data Commons Public Domain Dedication and Licence.

As Ian Davis of Talis says in Linked Data and the Public Domain:

In general factual data does not convey any copyrights, but it may be subject to other rights such as trade mark or, in many jurisdictions, database right. Because factual data is not usually subject to copyright, the standard Creative Commons licenses are not applicable: you can’t grant the exclusive right to copy the facts if that right isn’t yours to give. It also means you cannot add conditions such as share-alike.

He suggests instead that waivers (of which CC Zero and the Public Domain Dedication and License (PDDL) are examples) are a better approach:

Waivers, on the other hand, are a voluntary relinquishment of a right. If you waive your exclusive copyright over a work then you are explictly allowing other people to copy it and you will have no claim over their use of it in that way. It gives users of your work huge freedom and confidence that they will not be persued for license fees in the future.

Ian Davis' post gives detailed technical information about how such waivers can be used.

January 22, 2010

The right and left hands of open government data in the UK

As I'm sure everyone knows by now, the UK Government's data.gov.uk site was formally launched yesterday to a significant fanfare on Twitter and elsewhere.  There's not much I can add other than to note that I think this initiative is a very good thing and I hope that we can contribute more in the future than we have done to date.

[Edit: I note that the video of the presentation by Tim Berners-Lee and Nigel Shadbolt is now available.]

I'd like to highlight two blog posts that hurtled past in my Twitter stream yesterday.  The first, by Brian Hoadley, rightly reminds us that Open data is not a panacea – but it is a start:

In truth, I’ve been waiting for Joe Bloggs on the street to mention in passing – “Hey, just yesterday I did ‘x’ online” and have it be one of those new ‘Services’ that has been developed from the release of our data. (Note: A Joe Bloggs who is not related to Government or those who encircle Government. A real true independent Citizen.)

It may be a long wait.

The reality is that releasing the data is a small step in a long walk that will take many years to see any significant value. Sure there will be quick wins along the way – picking on MP’s expenses is easy. But to build something sustainable, some series of things that serve millions of people directly, will not happen overnight. And the reality, as Tom Loosemore pointed out at the London Data Store launch, it won’t be a sole developer who ultimately brings it to fruition.

The second, from the Daily Mash, is rather more flippant, New website to reveal exactly why Britain doesn't work:

Sir Tim said ordinary citizens will be able to use the data in conjunction with Ordnance Survey maps to show the exact location of road works that are completely unnecessary and are only being carried out so that some lazy, stupid bastard with a pension the size of Canada can use up his budget before the end of March.

The information could also be used to identify Britain's oldest pothole, how much business it has generated for its local garage and why in the name of holy buggering fuck it has never, ever been fixed.

And, while we are on the subject of maps and so on, today's posting to the Ernest Marples Blog, Postcode Petition Response — Our Reply, makes for an interesting read about the government's somewhat un-joined-up response to a petition to "encourage the Royal Mail to offer a free postcode database to non-profit and community websites":

The problem is that the licence was formed to suit industry. To suit people who resell PAF data, and who use it to save money and do business. And that’s fine — I have no problem with industry, commercialism or using public data to make a profit.

But this approach belongs to a different age. One where the only people who needed postcode data were insurance and fulfilment companies. Where postcode data was abstruse and obscure. We’re not in that age any more.

We’re now in an age where a motivated person with a laptop can use postcode data to improve people’s lives. Postcomm and the Royal Mail need to confront this and change the way that they do things. They may have shut us down, but if they try to sue everyone who’s scraping postcode data from Google, they’ll look very foolish indeed.

Finally — and perhaps most importantly — we need a consistent and effective push from the top. Number 10’s right hand needs to wake up and pay attention to the fantastic things the left hand’s doing.

Without that, we won’t get anywhere.

Hear, hear.

December 08, 2009

UK government’s public data principles

The UK government has put down some pretty firm markers for open data in its recent document, Putting the Frontline First: smarter government. The section entitled Radically opening up data and promoting transparency sets out the agenda as follows:

  1. Public data will be published in reusable, machine-readable form
  2. Public data will be available and easy to find through a single easy to use online access point (http://www.data.gov.uk/)
  3. Public data will be published using open standards and following the recommendations of the World Wide Web Consortium
  4. Any 'raw' dataset will be represented in linked data form
  5. More public data will be released under an open licence which enables free reuse, including commercial reuse
  6. Data underlying the Government's own websites will be published in reusable form for others to use
  7. Personal, classified, commercially sensitive and third-party data will continue to be protected.

(Bullet point numbers added by me.)

I'm assuming that "linked data" in point 4 actually means "Linked Data", given reference to W3C recommendations in point 3.

There's also a slight tension between points 4 and 5, if only because the use of the phrase, "more public data will be released under an open licence", in point 5 implies that some of the linked data made available as a result of point 4 will be released under a closed licence. One can argue about whether that breaks the 'rules' of Linked Data, but it seems to me that it certainly runs counter to the spirit of both Linked Data and what the government says it is trying to do here.

That's a pretty minor point though and, overall, this is a welcome set of principles.

Linked Data, of course, implies URIs and good practice suggests Cool URIs as the basic underlying principle of everything that will be built here. This applies to all government content on the Web, not just to the data being exposed through this particular initiative. One of the most common forms of uncool URI to be found on the Web in government circles is the technology-specific .aspx suffix... hey, I work for an organisation that has historically provided the technology to mint a great deal of these (though I think we do a better job now). It's worth noting, for example, that the two URIs that I use above to cite the Putting the Frontline First document both end in .aspx - ironic huh?

I'm not suggesting that cool URIs are easy, but there are some easy wins and getting the message across about not embedding technology into URIs is one of the easier ones... or so it seems to me anyway.

December 01, 2009

On "Creating Linked Data"

In the age of Twitter, short, "hey, this is cool" blog posts providing quick pointers have rather fallen out of fashion, but I thought this material was worth drawing attention to here. Jeni Tennison, who is contributing to the current work with Linked Data in UK government, has embarked on a short series of tutorial-style posts called "Creating Linked Data", in which she explains the steps typically involved in reformulating existing data as linked data, and discusses some of the issues arising.

Her "use case" is the scenario in which some data is currently available in CSV format, but I think much of the discussion could equally be applied to the case where the provider is making data available for the first time. The opening post on the sequence ("Analysing and Modelling") provides a nice example of working through the sort of "things or strings?" questions which we've tried to highlight in the context of designing DC Application Profiles. And as Jeni emphasises, this always involves design choices:

It’s worth noting that this is a design process rather than a discovery process. There is no inherent model in any set of data; I can guarantee you that someone else will break down a given set of data in a different way from you. That means you have to make decisions along the way.

And further on in the piece, she rationalises her choices for this example in terms of what those choices enable (e.g. "whenever there’s a set of enumerated values it’s a good idea to consider turning them into things, because to do so enables you to associate extra information about them").
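As a trivial illustration of that 'strings into things' step (my own example, not one of Jeni's, with made-up file, column and namespace names), here is a sketch using rdflib that takes an enumerated status value from each CSV row and turns it into a resource with its own URI, rather than leaving it as a bare literal.

import csv
from rdflib import Graph, Literal, Namespace, RDF, RDFS, URIRef

EX = Namespace("http://example.org/id/")       # placeholder URI space
SCHEMA = Namespace("http://example.org/def/")  # placeholder vocabulary

g = Graph()
with open("schools.csv") as f:                 # hypothetical input file
    for row in csv.DictReader(f):
        school = URIRef(EX["school/" + row["id"]])
        g.add((school, RDF.type, SCHEMA.School))
        g.add((school, RDFS.label, Literal(row["name"])))
        # 'Strings into things': the enumerated status value becomes a resource
        # in its own right, so extra information can later be attached to it.
        status = URIRef(EX["status/" + row["status"].lower()])
        g.add((school, SCHEMA.status, status))
        g.add((status, RDFS.label, Literal(row["status"])))

print(g.serialize(format="turtle"))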

The post on URI design offers some tips, not only on designing new URIs but also on using existing URIs where appropriate: I admit I tend to forget about useful resources like placetime.com, "a URI space containing URIs that represent places and times" (which provides redirects to descriptions in various formats).

On a related note, the post on choosing/coining properties, classes and datatypes includes a pointer to the OWL Time ontology. This is something I was aware of, but only looked at in any detail relatively recently. At first glance it can seem rather complex; Ian Davis has a summary graphic which I found helpful in trying to get my head round the core concepts of the ontology.

It seems to me that it is around these sorts of very common areas, like time data, that shared practice will emerge, and articles like these, by "hands-on" practitioners, are important contributions to that process.

November 20, 2009

COI guidance on use of RDFa

Via a post from Mark Birbeck, I notice that the UK Central Office for Information has published some guidelines called Structuring information on the Web for re-usability which include some guidance on the use of RDFa to provide specific types of information, about government consultations and about job vacancies.

This is exciting news: as far as I know, this is the first formal document from UK central government to provide this sort of quite detailed, resource-type-specific guidance with recommendations on the use of particular RDF vocabularies - guidance of the sort I think will be an essential component in the effective deployment of RDFa and the Linked Data approach. It's also the sort of thing that is of considerable interest to Eduserv, as a developer of Web sites for several government agencies. The document builds directly on the work Mark has been doing in this area, which I mentioned a while ago.

As Mark notes in his post, the document is unequivocal in its expression of the government's commitment to the Linked Data approach:

Government is committed to making its public information and data as widely available as possible. The best way to make structured information available online is to publish it as Linked Data. Linked Data makes the information easier to cut and combine in ways that are relevant to citizens.

Before the announcement of these guidelines, I recently had a look at the "argot" for consultations ("argot" is Mark's term for a specification of how a set of terms from multiple RDF vocabularies is used to meet some application requirement; as I noted in that earlier post, I think it is similar to what DCMI calls an "application profile"), and I had intended to submit some comments. I fear it is now somewhat late in the day for me to be doing this, but the release of this document has prompted me to write them up here. My comments are concerned primarily with the section titled "Putting consultations into Linked Data".

The guidelines (correctly, I think) establish a clear distinction between the consultation on the one hand and the Web page describing the consultation on the other by (in paragraphs 30/31) introducing a fragment identifier for the URI of the consultation (via the about="#this" attribute). The consultation itself is also modelled as a document, an instance of the class foaf:Document, which in turn "has as parts" the actual document(s) on which comment is being sought, and for which a reply can be sent to some agent.

I confess that my initial "instinctive" reaction to this was that this seemed a slightly odd choice, as a "consultation" seemed to me to be more akin to an event or a process, taking place during an interval of time, which had as "inputs" a set of documents on which comments were sought, and (typically at least) resulted in the generation of some other document as a "response". And indeed the page describing the Consultation argot introduces the concept as follows (emphasis added):

A consultation is a process whereby Government departments request comments from interested parties, so as to help the department make better decisions. A consultation will usually be focused on a particular topic, and have an accompanying publication that sets the context, and outlines particular questions on which feedback is requested. Other information will include a definite start and end date during which feedback can be submitted and contact details for the person to submit feedback to.

I admit I find it difficult to square this with the notion of a "document". And I think a "consultation-as-event" (described by a Web page) could probably be modelled quite neatly using the Event Ontology or the similar LODE ontology (with some specialisation of classes and properties if required).

Anyway, I appreciate that aspect may be something of a "design choice". So for the remainder of the comments here, I'll stick to the actual approach described by the guidelines (consultation as document).

The RDF properties recommended for the description of the consultation are drawn mainly from Dublin Core vocabularies, and more specifically from the "DC Terms" vocabulary.

The first point to note is that, as Andy noted recently, DCMI made some fairly substantive changes to the DC Terms vocabulary, as a result of which the majority of the properties are now the subject of rdfs:range assertions, which indicate whether the value of the property is a literal or a non-literal resource. The guidelines recommend the use of the publisher (paragraphs 32-37), language (paragraphs 38-39), and audience (paragraph 46) properties, all with literal values, e.g.

<span property="dc:publisher" content="Ministry of Justice"></span>

But according to the term descriptions provided by DCMI, the ranges of these properties are the classes dcterms:Agent, dcterms:LinguisticSystem and dcterms:AgentClass respectively. So I think that would require the use of an XHTML-RDFa construct something like the following, introducing a blank node (or a URI for the resource if one is available):

<div rel="dc:publisher"><span property="foaf:name" content="Ministry of Justice"></span></div>
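In terms of the underlying triples, the difference is perhaps easier to see if the intended graph is built directly. The following sketch (rdflib, with a made-up subject URI) produces the blank-node structure that the second snippet is aiming at: the publisher is an agent carrying a foaf:name, rather than a plain string.

from rdflib import BNode, Graph, Literal, Namespace, URIRef

DCTERMS = Namespace("http://purl.org/dc/terms/")
FOAF = Namespace("http://xmlns.com/foaf/0.1/")

g = Graph()
consultation = URIRef("http://example.gov.uk/consultation#this")  # made-up URI

# What the guidelines' snippet yields: the publisher as a plain string.
# g.add((consultation, DCTERMS.publisher, Literal("Ministry of Justice")))

# What the dcterms:Agent range implies: the publisher as an agent (here a
# blank node) whose name is given as a foaf:name literal.
publisher = BNode()
g.add((consultation, DCTERMS.publisher, publisher))
g.add((publisher, FOAF.name, Literal("Ministry of Justice")))

print(g.serialize(format="turtle"))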

Second, I wasn't sure about the recommendation for the use of the dcterms:source property (paragraph 37). This is used to "indicate the source of the consultation". For the case where this is a distinct resource (i.e. distinct from the consultation and this Web page describing it), this seems OK, but the guidelines also offer the option of referring to the current document (i.e. the Web page) as the source of the consultation:

<span rel="dc:source" resource=""></span>

DCMI's definition of the property is "A related resource from which the described resource is derived", but it seems to me the Web page is acting as a description of the consultation-as-document, rather than a source of it.

Third, the guidelines recommend the use of some of the DCMI date properties (paragraph 42):

  • dcterms:issued for the publication date of the consultation
  • dcterms:available for the start date for receiving comments
  • dcterms:valid for the closing date ("a date through which the consultation is 'valid'")

I think the use of dcterms:valid here is potentially confusing. DCMI's definition is "Date (often a range) of validity of a resource", so on this basis I think the implication of the recommended usage is that the consultation is "valid" only on that date, which is not what is intended. The recommendations for dcterms:issued and dcterms:available are probably OK - though I do think the event-based approach might have helped make the distinction between dates related to documents and dates related to consultation-as-process rather clearer!

Oh dear, this must read like an awful lot of pedantic nitpicking on my part, but my intent is to try to ensure that widely used vocabularies like those provided by DCMI are used as consistently as possible. As I said at the start I'm very pleased to see this sort of very practical guidance appearing (and I apologise to Mark for not submitting my comments earlier!)

April 24, 2009

More RDFa in UK government

It's quite exciting to see various initiatives within UK government starting to make use of Semantic Web technologies, and particularly of RDFa. At the recent OKCon conference, I heard Jeni Tennison talk about her work on using RDFa in the London Gazette. Yesterday, Mark Birbeck published a post outlining some of his work with the Central Office of Information.

The example Mark focuses on is that of a job vacancy, where RDFa is used to provide descriptions of various related resources: the vacancy, the job for which the vacancy is available, a person to contact, and so on. Mark provides an example of a little display app built on the Yahoo SearchMonkey platform which processes this data.

As a footnote (a somewhat lengthy one, now that I've written it!), I'd just draw attention to Mark's description of developing what he calls an RDF "argot" for constructing such descriptions:

The first vocabularies -- or argots -- that I defined were for job vacancies, but in order to make the terminology usable in other situations, I broke out argots for replying to the vacancy, the specification of contact details, location information, and so on.

An argot doesn't necessarily involve the creation of new terms, and in fact most of the argots use terms from Dublin Core, FOAF and vCard. So although new terms have been created if they are needed, the main idea behind an argot is to collect together terms from various vocabularies that suit a particular purpose.

I was struck by some of the parallels between this and DCMI's descriptions of developing what it calls an "DC application profile" - with the caveat that DCMI typically talks in terms of the DCMI Abstract Model rather than directly of the RDF model. e.g. the Singapore Framework notes:

In a Dublin Core Application Profile, the terms referenced are, as one would expect, terms of the type described by the DCMI Abstract Model, i.e. a DCAP describes, for some class of metadata descriptions, which properties are referenced in statements and how the use of those properties may be constrained by, for example, specifying the use of vocabulary encoding schemes and syntax encoding schemes. The DC notion of the application profile imposes no limitations on whether those properties or encoding schemes are defined and managed by DCMI or by some other agency

And in the draft Guidelines for Dublin Core Application Profiles:

the entities in the domain model -- whether Book and Author, Manifestation and Copy, or just a generic Resource -- are types of things to be described in our metadata. The next step is to choose properties for describing these things. For example, a book has a title and author, and a person has a name; title, author, and name are properties.

The next step, then, is to scan available RDF vocabularies to see whether the properties needed already exist. DCMI Metadata Terms is a good source of properties for describing intellectual resources like documents and web pages; the "Friend of a Friend" vocabulary has useful properties for describing people. If the properties one needs are not already available, it is possible to declare one's own

And indeed the Job Vacancy argot which Mark points to would, I think, probably be fairly recognisable to those familiar with the DCAP notion: compare, for example, with the case of the Scholarly Works Application Profile. The differences are that (I think) an "argot" focuses on the description of a single resource type, and that it doesn't go as far as a formal description of structural constraints in quite the same way that DCMI's Description Set Profile model does.
