Linked Data & Web Search Engines
I seem to have fallen into the habit of half-writing posts and then putting them to one side, either because I don't feel entirely happy with them or because I get diverted into other more pressing things. This is one of several that I seem to have accumulated over the last few weeks, and which I've resolved to try to get out there....
A few weekends ago I spotted a brief exchange on Twitter between Andy and our colleague Mike Ellis on the impact of exposing Linked Data on Google search ranking. Their conclusion seemed to be that the impact was minimal. I think I'd question this assessment, and here I'll try to explain why - though in the absence of empirical evidence, I admit this is largely speculation on my part, a "hunch", if you like. I admit I almost hesitate to write this post at all, as I am far from an expert in "search-engine optimisation", and, tbh, I have something of an instinctive reaction against a notion that a high Google search ranking is the "be all and end all" :-) But I recognise it is something that many content providers care about.
In this post, I'm not considering the ways search engines might use the presence of structured data in the documents they index to enhance result sets (or make that data available to developers to provide such enhancements); rather, I'm thinking about the potential impact of the Linked Data approach on ranking.
It is widely recognised that one of the significant factors in Google's page ranking algorithm is the weighting it attaches to the number of links made to the page in question from other pages ("incoming links"). Beyond that, the recommendations of the Google Webmaster guidelines seem to be largely "common sense" principles for providing well-formed X/HTML, enabling access to your document set for Google's crawlers, and not resorting to techniques that attempt to "game" the algorithm.
Let's go back to Berners-Lee's principles for Linked Data:
Use URIs as names for things
Use HTTP URIs so that people can look up those names.
When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL)
Include links to other URIs so that they can discover more things.
The How to Publish Linked Data on the Web and the W3C Note on Cool URIs for the Semantic Web elaborate on some of the mechanics of providing Linked Data. Both of these sources make the point that to "provide useful information" means to provide data both in RDF format(s) and in human-readable forms.
So following those guidelines typically means that "exposing Linked Data" results in exposing a whole lot of new Web documents "about" the things featured in the dataset, in both RDF/XML (or another RDF format) and in XHTML/HTML - and indeed the use of XHTML+RDFa could meet both requirements in a single format. So this immediately increases what Leigh Dodds of Talis rather neatly refers to as the "surface area" of my pages which are available for Google to crawl and index.
The second aspect which is significant is that, by definition, Linked Data is about making links: I make links between items described in my own dataset, but I also make ("outgoing") links between those items and items described in other linked datasets made available by other parties elsewhere. And (hopefully!), at least in time, other people exposing Linked Data make ("incoming") links to items in my datasets.
And in the X/HTML pages at least, those are the very same hyperlinks that Google crawls and counts when calculating its PageRank.
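To make the "incoming links" idea a little more concrete, here is a toy sketch in Python. It is emphatically not Google's algorithm, which weights links in ways this ignores; it only counts how many distinct pages link to each target. All of the example.org/.com/.net URIs and the function name are made up for illustration.

```python
from collections import Counter

# Hypothetical link graph: (source page, target page) pairs, as a crawler
# might extract them from the hyperlinks in X/HTML pages.
links = [
    ("http://example.org/news/1", "http://example.org/events/42"),
    ("http://example.com/blog/a", "http://example.org/events/42"),
    ("http://example.net/list", "http://example.org/events/42"),
    ("http://example.net/list", "http://example.org/events/42"),  # repeated link
    ("http://example.org/events/42", "http://example.com/venues/7"),
]


def incoming_link_counts(link_pairs):
    """Count the distinct source pages that link to each target page."""
    distinct_pairs = set(link_pairs)  # a repeated link from one page counts once
    return Counter(target for _source, target in distinct_pairs)


counts = incoming_link_counts(links)
for target, count in counts.most_common():
    print(count, target)
```

The page about the event gains a count both from pages on my own site and from pages published by other parties, which is the effect I'm suggesting Linked Data's emphasis on links might encourage.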
The key point, I think, is that my pages are available, not just to other "Linked Data applications", but also for other people to reference, bookmark and make links to just as they do any page on any Web site. This is one of the points I was trying to highlight in my last post when I mentioned the BBC's Linked Data work: the pages generated as part of those initiatives are fairly seamlessly integrated within the collection of documents that make up the BBC Web site. They do not appear as some sort of separate "data area", something just for "client apps that want data", somehow "different from" the news pages or the iPlayer pages; on the contrary, they are linked to by those other pages, and the X/HTML pages are given a "look and feel" that emphasises their membership of this larger aggregation. And human readers of the BBC Web site encounter those pages in the course of routinely navigating the site.
Of course the key to increasing the rank of my pages in Google is whether other people actually make those links to pages I expose, and it may well be that for much of the data surfaced so far, such links are relatively small in number. But the Linked Data approach, and its emphasis on the use of URIs and links, helps me do my bit to make sure my resources are things "of (or in) the Web".
So I'd argue that the Linked Data approach is potentially rather a good fit with what we know of the way Google indexes and ranks pages - precisely because both approaches seek to "work with the grain of the Web". I'd stick my neck out and say that having a page about my event (project, idea, whatever) provides a rather better basis for making that information findable in Google than exposing that description only as the content of a particular row in an Excel spreadsheet, where it is difficult to reference as an individual target resource and where it is (typically at least) not a source of links to other resources.
As I was writing this, I saw a new post appear from Michael Hausenblas, in which he attempts to categorise some common formats and services according to what he calls their "Link Factor" ("the degree of how 'much' they are in the Web"). And more recently, I noticed the appearance of a post titled 10 Reasons Why News Organizations Should Use 'Linked Data' which, in its first two points, highlights the importance of Linked Data's use of hyperlinks and URIs to SEO - and points to the fact that the BBC's Wildlife Finder pages do tend to appear prominently in Google result sets.
Before I get carried away, I should add a few qualifiers, and note some issues which I can imagine may have some negative impact. And I should emphasise this is just my thinking out loud here - I think more work is necessary to examine the actual impact, if any.
- Redirects: Many of the links in Linked Data are made between "things", rather than between the pages describing the things. And following the "Cool URIs" guidelines, these URIs would either be URIs with fragment identifiers ("hash URIs") or URIs for which an HTTP server responds with a 303 response providing the URI of a document describing the thing. For the first case, I think Google recognises these as links to the document with the URI obtained by stripping the fragment id; for the 303 case, I'm unsure about the impact of the use of the redirect on the ranking for the document which is the final target. (A related issue would be that some sources might cite the URI of the thing and other sources might cite the URI of the document describing the thing).
- Synonyms: As the Publishing Linked Data tutorial highlights, one of the characteristics of Linked Data is that it often makes use of URI aliases, multiple URIs each referring to the same resource. If some users bookmark/cite URI A and some users bookmark/cite URI B, then that would result in a lower link-based ranking for each of the two pages describing the thing than if all users bookmarked/cited a single URI. To some extent, this is just part of the nature of the Web, and it applies similarly outside the Linked Data context, but the tendency to generate an increasing number of aliases is something which generates continued discussion in the LD community. See, for example, the recent thread on "annotation" on the public-lod mailing list, generated in response to Leigh Dodds' and Ian Davis' recent Linked Data Patterns document (which I should add, from my very hasty skim reading so far, seems to provide an excellent combination of thoughtful discussion and clear practical suggestions).
- "Caching"/"Resurfacing": As we are seeing Linked Data being deployed, we are seeing data aggregated by various agencies and resurfaced on the Web using new URIs. Similarly to the previous point, this may lead to a case where two users cite different URIs, with a corresponding reduction in the number of incoming links to any single document. I also note that Google's guidelines include the admonition: "Don't create multiple pages, subdomains, or domains with substantially duplicate content", which does make me wonder whether such resurfaced content may have a negative impact on ranking.
- "Good XHTML": While links are important, they aren't the whole story, and attention still needs to be paid to ensuring that HTML pages generated by a Linked Data application follow the sort of general good practice for "good Web pages" described in the Google guidelines (provide well-structured XHTML, use title elements, use alt attributes, don't fill with irrelevant keywords, and so on).
- Sitemaps: This is probably just a special case of the previous point, but Google emphasises the importance of using sitemaps to provide entry points for its crawlers. Although I'm aware of the Semantic Sitemap extension, I'm not sure whether the use of sitemaps is widespread in Linked Data deployments - though it is the sort of thing I'd expect to see happen as Linked Data moves further out of the preserve of the "research project" and towards more widespread deployment.
- "Granularity": (I'm unsure whether this is a factor or not: I can imagine it might be, but it's probably not simple to assess exactly what the impact is.) How a provider decides to "chunk up" their descriptive data into documents might have an impact on the "density" of incoming links. If they expose a large number of documents each describing a single specific resource, does that result in each document receiving fewer incoming links than if they expose a smaller number of documents each describing several resources?
- Integration: Although above I highlighted the BBC as an example of Linked Data being well-integrated into a "traditional" Web site, and so made highly visible to users of that Web site, I suspect this may - at the moment at least - be the exception rather than the rule. However, as with the previous point, this is something I'd expect to become more common.
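The redirect, synonym and resurfacing qualifiers above all come down to the same question: when different people cite different URIs, do those citations end up counted against one document or several? The following Python sketch illustrates the arithmetic only. It assumes a best case in which a hash URI is treated as a link to the document obtained by stripping the fragment, and in which a 303 redirect is credited to its target document; whether any search engine actually does the latter is exactly what I'm unsure about. The URIs, the SEE_OTHER table and the document_uri function are all hypothetical.

```python
from collections import Counter
from urllib.parse import urldefrag

# Hypothetical record of 303 responses: URI of the "thing" -> URI of the
# document describing it. A real crawler would learn this by making requests.
SEE_OTHER = {
    "http://example.org/id/event/42": "http://example.org/doc/event/42",
}


def document_uri(cited_uri):
    """Map a cited URI to the document it would lead to (best-case assumption)."""
    without_fragment, _fragment = urldefrag(cited_uri)  # hash URI -> document
    return SEE_OTHER.get(without_fragment, without_fragment)  # follow a 303 if known


# Four people cite "the same" event using four different URIs.
cited = [
    "http://example.org/doc/event/42",       # the document URI itself
    "http://example.org/id/event/42",        # thing URI answered with a 303
    "http://example.org/doc/event/42#this",  # hash URI for the thing
    "http://aggregator.example.net/e/42",    # alias resurfaced by a third party
]

raw = Counter(cited)                                   # no consolidation at all
per_document = Counter(document_uri(uri) for uri in cited)

print(raw.most_common(1))
print(per_document.most_common())
```

With no consolidation each URI has a single citation. Under the best-case assumptions, three of the four are credited to one document, while the aggregator's alias still stands apart, because nothing in the sketch knows that it describes the same thing.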
Nevertheless, I still stand by my "hunch" that the LD approach is broadly "good" for ranking. I'm not claiming Linked Data is a panacea for search-engine optimisation, and I admit that some of what I'm suggesting here may be "more potential than actual". But I do believe the approach can make a positive contribution - and that is because both the Google ranking algorithm and Linked Data exploit the URI and the hyperlink: they "work with the grain of the Web".