
October 27, 2006

Pushing an OpenDOAR

The OpenDOAR directory of open access repositories has announced a new search service based on Google's Custom Search Engine facility.  Good stuff - though for me it raises several questions of policy and implementation...

Firstly, the announcement for the service states:

It is well known that a simple full-text search of the whole web will turn up thousands upon thousands of junk results, with the valuable nuggets of information often being lost in the sheer number of results.

Well, err... OK.  This is not an uncommon statement, but simply repeating it over and over again doesn't necessarily make it true!  It doesn't reflect my informal impressions of the way that Web searching works today.  So I thought I'd do a little experiment, to try and compare results from the new OpenDOAR search service with results from a bog standard Google search.

Note that this isn't in the least bit scientific and it only compares known item searching - which may not be all that helpful - but bear with me while I give you my results and try to draw some conclusions.  (Then you can say "what a waste of time"!).

So, what I did was to browse my way through the list of eprints.org repositories, selecting 10 research papers as randomly as I was able.  I then used the title of each paper to construct a known-item search against both the new OpenDOAR search interface and Google itself, noting down where the paper came in the search results and how many results there were overall.  Note that I counted the abstract page for the paper (as served by the repository in which the paper is held) as a hit.
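For the curious, the query construction was trivial: the title string, unquoted, as the search terms.  Here's a sketch of how the two kinds of search URL were built - note that the Custom Search Engine identifier below is a made-up placeholder, not OpenDOAR's real one:

```python
from urllib.parse import urlencode

def google_query(title: str) -> str:
    """Build a plain Google search URL from a paper title (unquoted terms)."""
    return "https://www.google.com/search?" + urlencode({"q": title})

def opendoar_query(title: str, cse_id: str) -> str:
    """Build a Google Custom Search Engine URL scoped to repositories.
    The cx identifier passed in is hypothetical."""
    return "https://www.google.com/cse?" + urlencode({"cx": cse_id, "q": title})

print(google_query("Recent work on French rural history"))
print(opendoar_query("Recent work on French rural history", "fake-cse-id"))
```

Because the title isn't quoted, each word becomes a separate search term - which is why some of the Google result sets reported below are so large.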

My results were as follows:

Recent work on French rural history
Google 1 (out of ~8,000,000 results)
OpenDOAR 1 (out of 115 results)

Towards absorbing outer boundaries in general relativity
Google 5 (out of 69,000) (http://arxiv.org/abs/gr-qc/0608051 copy number 1)
OpenDOAR not in first 10 results (out of 86) (http://arxiv.org/abs/gr-qc/0608051 copy number 1)

More to life than Google – a journey for PhD students
Google 3 (out of 219,000) (http://magpie.lboro.ac.uk/dspace/handle/2134/676 copy number 1)
OpenDOAR not in first 10 (http://magpie.lboro.ac.uk/dspace/handle/2134/676 copy number 1)

Google 1 (out of 23)
OpenDOAR 1 (out of 2)

Pulse rates in the songs of trilling field crickets (Orthoptera: Gryllidae: Gryllus).
Google 5 (out of 158) (http://tjwalker.ifas.ufl.edu/AE93p565.pdf copy number 1)
OpenDOAR 1 (out of 4)

Te mana o te reo me ngā tikanga: Power and politics of the language
Google 1 (out of 540)
OpenDOAR 1 (out of 8)

Surface Projection Method for Visualizing Volumetric Data
Google 1 (out of 730,000)
OpenDOAR 1 (out of 193)

The Social Construction of Economic Man: The Genesis, Spread, Impact and Institutionalisation of Economic Ideas
Google 1 (out of 10,200)
OpenDOAR 1 (out of 38)

The Adoption of Activity-Based Costing in Thailand
Google 1 (out of 30,100)
OpenDOAR 1 (out of 30)

Connecteurs de conséquence et portée sémantique
Google 3 (out of 18,800) (pweb.ens-lsh.fr/jjayez/clf19.pdf copy number 1)
OpenDOAR 1 (out of 34)

As I say, I'm not in the least bit proud of this experiment, but it only took 30 minutes or so to carry out, so if you tell me how flawed it is I won't get too upset.

What these results say to me is that, for known item searching at least, there is little evidence that Google is losing our research nuggets within large result sets.  Rather, Google pushes the nuggets to the top of the list.  In fact, in some cases at least, I suspect one could argue that the vanilla Google search surrounds those nuggets with valuable non-repository resources that are missing from the OpenDOAR repository-only search engine.

For me, this exercise raises three interesting questions:

  1. Are repositories successfully exposing the full-text of articles (the PDF file or whatever) to Google rather than (or as well as) the abstract page?  If not, then they should be.  I think there is some evidence from these results that some repositories are only exposing the abstract page, not the full-text.  For a full-text search engine, this is less than optimal.  My suspicion is that the way that Google uses the OAI-PMH to steer its Web crawling is actually working against us here and that we either need to work with Google to improve the way this works, or bite the bullet and ask repository software developers to support Google sitemaps in order to improve the way that Google indexes our repositories.
  2. Are we consistent in the way we create hypertext links between research papers in repositories?  If not, then we should be.  In the context of Google searches, linking is important because each link to a paper increases its Google-juice, which helps to push that paper towards the top of Google's search results.  Researchers currently have the option of linking either directly to the full-text (or to one of several full-texts) or to the abstract page.  This inconsistency splits the Google-juice between the paper and its abstract page - potentially pushing both further down the list of Google search results.  The situation is made worse by the use of OpenURLs, which do nothing for the Google-juice of the resource that they identify, in effect working against the way the Web works.  If we could agree on a consistent way of linking to materials in repositories, we would stand to improve the visibility of our high-quality research outputs in search engines like Google.
  3. What is the role of metadata in a full-text indexing world?  What the mini-experiment above and all my other experience says to me is that full-text indexing clearly works.  In terms of basic resource discovery, we're much better off exposing the full-text of research papers to search engines for indexing, than we are exposing metadata about those papers.  Is metadata therefore useless?  No.  We need metadata to support the delivery of other bibliographic services.  In particular we need metadata to capture those attributes that are useful for searching, ranking and linking but that can't reliably be derived from the resource itself.  I'm thinking here primarily of the status of the paper and of the relationships between the paper and other things - the relationships between papers and people and organisations, the relationships between different versions, between different translations, between different formats and between different copies of a paper.  These are the kinds of relationships that we have been trying to capture in our work on the DC Eprints Application Profile.  It is these relationships that are important in the metadata, much more so than the traditional description-and-keywords kind of metadata.
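To make the sitemap suggestion in point 1 concrete, here's a minimal sketch of generating a sitemap.xml for a repository.  The record URLs are invented for illustration; a real repository platform would pull them from its own database, and crucially would list the full-text URLs as well as the abstract pages:

```python
from xml.sax.saxutils import escape

def make_sitemap(urls):
    """Render a minimal sitemap.xml listing URLs for search engine crawlers."""
    entries = "\n".join(f"  <url><loc>{escape(u)}</loc></url>" for u in urls)
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        f"{entries}\n"
        "</urlset>"
    )

# Hypothetical repository items: list both the abstract page and the PDF,
# so that the full text - not just the metadata - gets crawled and indexed.
records = [
    "http://repository.example.org/123/",
    "http://repository.example.org/123/paper.pdf",
]
print(make_sitemap(records))
```

The point of the sketch is simply that the repository, not the crawler, decides what gets exposed - which is exactly the control we lose when indexing is steered only by OAI-PMH metadata harvesting.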

Overall, what I conclude from this (once again) is that it is not the act of depositing a paper in a repository that is important for open access, but the act of surfacing the paper on the Web - the repository is just a means to an end in that respect.  More fundamentally, I conclude that the way we configure, run and use repositories has to fit in with the way the Web works - not work against it or around it!  First and foremost, our 'resource discovery' efforts should centre on exposing the full text of research papers in repositories to search engines like Google and on developing Web-friendly and consistent approaches to creating hypertext links between research papers.




I very much agree. Digital library systems are too often off-Web silos, ... and blaming non-curated large scale search engines for mixing up good and bad stuff doesn't do end users any favours. OpenDOAR is rather handy though, I was looking for a way to harvest RDF from repositories, ... and it at least gives us a starting point :)

I think the issue may be "a simple full-text search." I assume that you searched for the title, in quotes. Given such a search, if the article has a title that's at all distinctive and the content's been exposed, the major web search engines should work just fine. As your experiment shows. (Sure, it's anecdotal, but I'd expect roughly similar results for a sample of a few hundred.)

I would assume that OpenDOAR would yield smaller and more distinctive result sets for keyword/"topical" searches based on full text or otherwise. Those searches--the "1.5 words" searches--are the ones where the big four increasingly return sets that could be considered junk.

I would also assume that OpenDOAR might yield result sets that are something more than the "1, 2, ... 999, many" results the web engines return.  (That is, there's no way to verify whether "1001 results" and "2,368,950 results" are actually different, since you'll never get past record 1,000 - and "approx. 10,000 results" frequently turns into 170 or so.)

But for known items with distinctive titles, I honestly can't see why a specialized engine should do better than a mainstream web search site.

Thanks for the comments Walt. Just to clarify, I used the title string as is for the search string - I didn't add quotes. So each word in the title became a separate search term. I guess that's why some of the result sets from Google are so large. (Apologies, I should have made this clear in the original posting).

I wonder why the experiments were carried out using the Google homepage instead of the more specialized scholar.google.com?

In answer to Rafael's question about why I used Google rather than Google Scholar. Essentially, I was trying to counter the argument made in the OpenDOAR press release that "Google hides nuggets in large result sets". I don't believe that to be true in general. Trying the experiment with Google Scholar would indeed have been interesting (perhaps I should have done a three-way test) but it wouldn't have helped to counter that argument.


A very interesting posting. (1) and (2) above are related, and in essence boil down to the requirement Andy puts forward at the end of (2): we need a consistent manner to link to materials in repositories. Trying to frame this requirement in terms of "are we linking to the splash-page location or to the PDF location" isn't good enough. For one, not all repositories may have user interfaces with splash-pages, but, more importantly, we are increasingly dealing with materials that are compound in nature. That is, they are aggregations of datastreams with both a variety of media types and a variety of intellectual content types including papers, datasets, simulations, software, etc. Note that simple objects - say eprints - also fit in this perspective as they typically consist of at least two datastreams: descriptive metadata and a full-text document.

Anyhow, now the question becomes, in this broader realm of diverse and compound scholarly objects, what is it we are going to link to?
In the Pathways research [see http://dx.doi.org/10.1045/october2006-vandesompel] we experimented with linking to surrogates for the materials through a repository's "obtain" service interface that, given an identifier of an object from a repository, returns a surrogate for it. A surrogate is a machine-readable representation (think some form of XML or XML/RDF) of the compound scholarly object that contains unambiguous pointers (URLs) to the constituent datastreams of the object as well as an indication of the semantics of those datastreams.

For example, in the case of an eprint, the surrogate would contain a URL pointing at the machine-readable metadata qualified with the semantic indication that the thing being pointed at is descriptive metadata, as well as a URL pointing at the full-text document, again qualified with the appropriate semantic indication.

Such an approach could go quite a way in addressing the issues (1) and (2) because those largely are about how robots (Google being the uber-bot) can make better use of materials in repositories. Interestingly, however, this approach brings up a question about humans in all of this. See, the hyperlink in a document is not only for machine consumption, it occasionally is also being followed by a human reader. Now, we don't want to present our reader with the XML or XML/RDF surrogate that is appealing to bots, do we?

So, this leads us to thinking about contextual resolution of that hyperlink, one way for a machine, another way for a human. I can think of several intriguing avenues of exploration, here. OpenURL, that is the NISO Framework with the ContextObject, not your link resolver OpenURL, is one of the things that come to mind. User-application "intelligence" another.
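The contextual-resolution idea above can be sketched as ordinary HTTP content negotiation on the Accept header - an XML surrogate for bots, a splash page for humans.  The media types, URLs and surrogate markup below are purely illustrative assumptions, not anything the Pathways work specifies:

```python
def resolve(accept_header: str) -> str:
    """Pick a representation of one (hypothetical) repository object based
    on the client's Accept header: a machine-readable surrogate for bots,
    HTML for human readers."""
    if "application/xml" in accept_header or "application/rdf+xml" in accept_header:
        # A machine asked: return the surrogate, which points at each
        # constituent datastream and says what role it plays.
        return (
            "<surrogate>"
            '<datastream role="metadata" href="http://repo.example.org/123/mods.xml"/>'
            '<datastream role="fulltext" href="http://repo.example.org/123/paper.pdf"/>'
            "</surrogate>"
        )
    # Default: a human reader gets the ordinary HTML splash page.
    return ('<html><body><a href="http://repo.example.org/123/paper.pdf">'
            "Full text</a></body></html>")

print(resolve("application/rdf+xml"))
print(resolve("text/html"))
```

The same hyperlink thus resolves differently depending on who follows it, which is roughly the "one way for a machine, another way for a human" behaviour described above.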




eFoundations is powered by TypePad