The OpenDOAR directory of open access repositories has announced a new search service based on Google's Custom Search Engine facility. Good stuff - though for me it raises several questions of policy and implementation...
Firstly, the announcement for the service states:
It is well known that a simple full-text search of the whole web will turn up thousands upon thousands of junk results, with the valuable nuggets of information often being lost in the sheer number of results.
Well, err... OK. This is not an uncommon statement, but simply repeating it over and over again doesn't necessarily make it true! It doesn't reflect my informal impressions of the way that Web searching works today. So I thought I'd run a little experiment, to compare results from the new OpenDOAR search service with results from a bog-standard Google search.
Note that this isn't in the least bit scientific and it only compares known-item searching - which may not be all that helpful - but bear with me while I give you my results and try to draw some conclusions. (Then you can say "what a waste of time!").
So, what I did was to browse my way through the list of eprints.org repositories, selecting 10 research papers as randomly as I was able. I then used the title of each paper to construct a known-item search against both the new OpenDOAR search interface and Google itself, noting where the paper came in the search results and how many results there were overall. Note that I counted the abstract page for the paper (as served by the repository in which the paper is held) as a hit.
My results were as follows:
| Paper | Google | OpenDOAR |
|-------|--------|----------|
| Recent work on French rural history | 1 (out of ~8,000,000 results) | 1 (out of 115 results) |
| Towards absorbing outer boundaries in general relativity | 5 (out of 69,000) (http://arxiv.org/abs/gr-qc/0608051, copy number 1) | not in first 10 results (out of 86) (http://arxiv.org/abs/gr-qc/0608051, copy number 1) |
| More to life than Google – a journey for PhD students | 3 (out of 219,000) (http://magpie.lboro.ac.uk/dspace/handle/2134/676, copy number 1) | not in first 10 (http://magpie.lboro.ac.uk/dspace/handle/2134/676, copy number 1) |
| Patterns of classroom interaction at different educational levels in the light of Flanders' interaction analysis | 1 (out of 23) | 1 (out of 2) |
| Pulse rates in the songs of trilling field crickets (Orthoptera: Gryllidae: Gryllus) | 5 (out of 158) (http://tjwalker.ifas.ufl.edu/AE93p565.pdf, copy number 1) | 1 (out of 4) |
| Te mana o te reo me ngā tikanga: Power and politics of the language | 1 (out of 540) | 1 (out of 8) |
| Surface Projection Method for Visualizing Volumetric Data | 1 (out of 730,000) | 1 (out of 193) |
| The Social Construction of Economic Man: The Genesis, Spread, Impact and Institutionalisation of Economic Ideas | 1 (out of 10,200) | 1 (out of 38) |
| The Adoption of Activity-Based Costing in Thailand | 1 (out of 30,100) | 1 (out of 30) |
| Connecteurs de conséquence et portée sémantique | 3 (out of 18,800) (pweb.ens-lsh.fr/jjayez/clf19.pdf, copy number 1) | 1 (out of 34) |
As I say, I'm not in the least bit proud of this experiment, but it only took 30 minutes or so to carry out, so if you tell me how flawed it is I won't get too upset.
What these results say to me is that, for known-item searching at least, there is little evidence that Google is losing our research nuggets within large result sets. What Google is doing is pushing the nuggets to the top of the list. In fact, in some cases at least, I suspect one could argue that the vanilla Google search is surrounding those nuggets with valuable non-repository resources that are missing from the repository-only OpenDOAR search.
For me, this exercise raises three interesting questions:
- Are repositories successfully exposing the full-text of articles (the PDF file or whatever) to Google rather than (or as well as) the abstract page? If not, then they should be. I think there is some evidence from these results that some repositories are only exposing the abstract page, not the full-text. For a full-text search engine, this is less than optimal. My suspicion is that the way that Google uses the OAI-PMH to steer its Web crawling is actually working against us here and that we either need to work with Google to improve the way this works, or bite the bullet and ask repository software developers to support Google sitemaps in order to improve the way that Google indexes our repositories.
- Are we consistent in the way we create hypertext links between research papers in repositories? If not, then we should be. In the context of Google searches, linking is important because each link to a paper increases its Google-juice, which helps to push that paper towards the top of Google's search results. Researchers currently have the option of linking either direct to the full-text (or one of several full-texts) or to the abstract page. This choice ultimately results in a lowering of the Google-juice assigned to both the paper and the abstract page - potentially pushing both further down the list of Google search results. The situation is made worse by the use of OpenURLs, which do nothing for the Google-juice of the resource that they identify, in effect working against the way the Web works. If we could agree on a consistent way of linking to materials in repositories, we would stand to improve the visibility of our high-quality research outputs in search engines like Google.
- What is the role of metadata in a full-text indexing world? What the mini-experiment above and all my other experience say to me is that full-text indexing clearly works. In terms of basic resource discovery, we're much better off exposing the full-text of research papers to search engines for indexing than we are exposing metadata about those papers. Is metadata therefore useless? No. We need metadata to support the delivery of other bibliographic services. In particular, we need metadata to capture those attributes that are useful for searching, ranking and linking but that can't reliably be derived from the resource itself. I'm thinking here primarily of the status of the paper and of the relationships between the paper and other things - the relationships between papers and people and organisations, the relationships between different versions, between different translations, between different formats and between different copies of a paper. These are the kinds of relationships that we have been trying to capture in our work on the DC Eprints Application Profile. It is these relationships that are important in the metadata, much more so than the traditional description-and-keywords kind of metadata.
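On the third question: the kind of relationship-centric metadata I have in mind might look something like the fragment below. This is a hedged illustration only - the identifiers and title are hypothetical, and the element names are drawn from the general DCMI Metadata Terms vocabulary rather than from the DC Eprints Application Profile itself - but it captures relationships (version, format, citation status) that no full-text indexer could reliably derive from the document alone.

```xml
<!-- Hypothetical record: relationship metadata that cannot be
     derived from the full text of the paper itself. -->
<metadata xmlns:dc="http://purl.org/dc/elements/1.1/"
          xmlns:dcterms="http://purl.org/dc/terms/">
  <dc:title>An Example Paper</dc:title>
  <!-- this deposit is a later version of an earlier record -->
  <dcterms:isVersionOf>http://repository.example.ac.uk/456/</dcterms:isVersionOf>
  <!-- the same content is also available as a PDF copy -->
  <dcterms:hasFormat>http://repository.example.ac.uk/456/paper.pdf</dcterms:hasFormat>
  <!-- where the published version formally appeared -->
  <dcterms:bibliographicCitation>Example Journal, vol. 1</dcterms:bibliographicCitation>
</metadata>
```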
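On the first question: one way to make sure Google sees both the abstract page and the full-text is for the repository to publish a sitemap listing every URL it wants crawled. The sketch below is a minimal illustration only - the repository URLs and file layout are hypothetical, and a real implementation would enumerate records from the repository's own database - but the output follows the sitemaps.org `urlset` schema that Google consumes.

```python
# Minimal sketch of generating a sitemap for a repository.
# The example.ac.uk URLs are hypothetical; a real repository would pull
# these from its record store (abstract page plus each full-text file).
from xml.sax.saxutils import escape

def build_sitemap(urls):
    """Return sitemap XML listing each URL for search-engine crawling."""
    entries = "\n".join(
        "  <url><loc>%s</loc></url>" % escape(u) for u in urls
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        "%s\n</urlset>" % entries
    )

# Hypothetical records: expose both the abstract page and the PDF,
# so the full-text gets indexed, not just the metadata splash page.
records = [
    ("http://repository.example.ac.uk/123/",
     "http://repository.example.ac.uk/123/paper.pdf"),
]
urls = [u for pair in records for u in pair]
print(build_sitemap(urls))
```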
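On the second question: the dilution effect of inconsistent linking can be illustrated with a toy PageRank calculation (power iteration over a tiny hypothetical graph - a sketch of the general idea, not of Google's actual algorithm). Ten citing pages that all link to one target give that target roughly twice the rank it gets when the links are split between an abstract page and a full-text URL.

```python
# Toy PageRank (power iteration) illustrating link dilution.
# All page names are hypothetical.

def pagerank(links, damping=0.85, iters=50):
    """links: dict node -> list of outbound targets. Returns node -> rank."""
    nodes = set(links) | {t for outs in links.values() for t in outs}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for n, outs in links.items():
            if outs:
                share = damping * rank[n] / len(outs)
                for t in outs:
                    new[t] += share
        rank = new
    return rank

citing = ["page%d" % i for i in range(10)]
# Graph A: everyone links to the abstract page.
consolidated = {p: ["abstract"] for p in citing}
# Graph B: half link to the abstract page, half to the full-text PDF.
split = {p: (["abstract"] if i < 5 else ["fulltext"])
         for i, p in enumerate(citing)}

r1 = pagerank(consolidated)
r2 = pagerank(split)
# Consistent linking concentrates the rank on one target.
assert r1["abstract"] > r2["abstract"]
print(r1["abstract"], r2["abstract"], r2["fulltext"])
```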
Overall, what I conclude from this (once again) is that it is not the act of depositing a paper in a repository that is important for open access, but the act of surfacing the paper on the Web - the repository is just a means to an end in that respect. More fundamentally, I conclude that the way we configure, run and use repositories has to fit in with the way the Web works - not work against it or around it! First and foremost, our 'resource discovery' efforts should centre on exposing the full text of research papers in repositories to search engines like Google, and on developing Web-friendly and consistent approaches to creating hypertext links between research papers.