Freedom, Google-juice and institutional mandates
[Note: This entry was originally posted on the 9th Feb 2009 but has been updated in light of comments.]
An interesting thread has emerged on the American Scientist Open Access Forum based on the assertion that in Germany "freedom of research forbids mandating on university level" (i.e. that a mandate to deposit all research papers in an institutional repository (IR) would not be possible legally). Now, I'm not familiar with the background to this assertion and I don't understand the legal basis on which it is made. But it did cause me to think about why there might be an issue related to academic freedom caused by IR deposit mandates by funders or other bodies.
In responding to the assertion, Bernard Rentier says:
No researcher would complain (and consider it an infringement upon his/ her academic freedom to publish) if we mandated them to deposit reprints at the local library. It would be just another duty like they have many others. It would not be terribly useful, needless to say, but it would not cause an uproar. Qualitatively, nothing changes. Quantitatively, readership explodes.
Quite right. Except that the Web isn't like a library so the analogy isn't a good one.
If we ignore the rarefied, and largely useless, world of resource discovery based on the OAI-PMH and instead consider the real world of full-text indexing, link analysis and, well... yes, Google, then there is a direct and negative impact of mandating a particular place of deposit. For every additional place that a research paper surfaces on the Web there is a likely reduction in the Google-juice associated with each instance, caused by an overall diffusion of inbound links.
So, for example, every researcher who would naturally choose to surface their paper on the Web in a location other than their IR (because they have a vibrant central (discipline-based) repository (CR) for example) but who is forced by mandate to deposit a second copy in their local IR will probably see a negative impact on the Google-juice associated with their chosen location.
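To make the diffusion argument concrete, here is a minimal sketch using a textbook power-iteration PageRank over a toy link graph. It is not Google's actual ranking, and the page names (citing_0 to citing_5, paper_cr, paper_ir) are made up for illustration. In the first graph all six citing pages link to the CR copy; in the second, half of them link to a mandated IR copy instead.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Textbook power-iteration PageRank over {page: [pages it links to]}."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages with no outbound links is shared among all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1 - damping) / n + damping * dangling / n for p in pages}
        for page, targets in links.items():
            for target in targets:
                new_rank[target] += damping * rank[page] / len(targets)
        rank = new_rank
    return rank


# Hypothetical pages: six citing pages and two copies of the same paper.
citing = [f"citing_{i}" for i in range(6)]

# Case 1: every citing page links to the copy in the central repository.
one_copy = {c: ["paper_cr"] for c in citing}
one_copy.update({"paper_cr": [], "paper_ir": []})

# Case 2: the inbound links are divided between the CR and IR copies.
two_copies = {c: ["paper_cr" if i < 3 else "paper_ir"] for i, c in enumerate(citing)}
two_copies.update({"paper_cr": [], "paper_ir": []})

one = pagerank(one_copy)
two = pagerank(two_copies)
print("CR copy, all links:  ", round(one["paper_cr"], 4))
print("CR copy, links split:", round(two["paper_cr"], 4))
```

In this toy model the CR copy scores lower once the links are divided, which is the effect described above. Whether a real search engine behaves this way depends on whether it recognises the two copies as the same work.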
Now, I wouldn't argue that this is an issue of academic freedom per se, and I agree with Bernard Rentier (earlier in his response) that the freedom to "decide where to publish is perfectly safe" (in the traditional academic sense of the word 'publish'). However, under any modern understanding of 'to publish' (i.e. one that includes 'making available on the Web') there is a compromise going on here.
The problem is that we continue to think about repositories as if they were 'part of a library', rather than as a 'true part of the fabric of the Web', a mindset that encourages us to try (and fail) to redefine the way the Web works (through the introduction of things like the OAI-PMH for example) and that leads us to write mandates that use words like 'deposit in a repository' (often without even defining what is meant by 'repository') rather than 'make openly available on the Web'.
In doing so I think we do ourselves, and the long term future of open access, a disservice.
Addendum (10 Feb 2009): In light of the comments so far (see below) I confess that I stand partially corrected. It is clear that Google is able to join together multiple copies of research papers. I'd love to know the heuristics they use to do this and I'd love to know how successful those heuristics are in the general case. Nonetheless, on the basis that they are doing it, and on the assumption that in doing so they also combine the Google-juice associated with each copy, I accept that my "dispersion of Google-juice" argument above is somewhat weakened.
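I don't know what Google actually does, but the general idea of joining copies and pooling their links can be sketched with a deliberately naive heuristic: treat two copies as the same paper if their titles match after normalisation, then add up the inbound links. The URLs, titles and link counts below are invented for illustration, and a real system would presumably use far more evidence than the title.

```python
import re
from collections import defaultdict


def normalise_title(title):
    """Lower-case the title, drop punctuation and collapse whitespace."""
    return " ".join(re.findall(r"[a-z0-9]+", title.lower()))


def combine_copies(copies):
    """Group (url, title, inbound_links) records by normalised title."""
    clusters = defaultdict(lambda: {"urls": [], "inbound_links": 0})
    for url, title, inbound_links in copies:
        cluster = clusters[normalise_title(title)]
        cluster["urls"].append(url)
        cluster["inbound_links"] += inbound_links
    return dict(clusters)


# Hypothetical records: two copies of one paper plus an unrelated paper.
copies = [
    ("https://repository.example.ac.uk/handle/123",
     "An Example Paper: Repositories and the Web", 4),
    ("https://subject-archive.example.org/abs/456",
     "An example paper - repositories and the Web.", 9),
    ("https://repository.example.ac.uk/handle/789",
     "A Different Paper Entirely", 2),
]

clusters = combine_copies(copies)
for key, cluster in clusters.items():
    print(key, cluster["urls"], cluster["inbound_links"])
```

If the combining step works, the two copies are credited with their pooled links rather than competing with each other. If it fails (a retitled preprint, say), the diffusion problem returns.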
There are other considerations however, not least the fact that the Web Architecture explicitly argues against URI aliases:
Good practice: Avoiding URI aliases
A URI owner SHOULD NOT associate arbitrarily different URIs with the same resource.
The reasons given align very closely to the ones I gave above, though couched in more generic language:
Although there are benefits (such as naming flexibility) to URI aliases, there are also costs. URI aliases are harmful when they divide the Web of related resources. A corollary of Metcalfe's Principle (the "network effect") is that the value of a given resource can be measured by the number and value of other resources in its network neighborhood, that is, the resources that link to it.
The problem with aliases is that if half of the neighborhood points to one URI for a given resource, and the other half points to a second, different URI for that same resource, the neighborhood is divided. Not only is the aliased resource undervalued because of this split, the entire neighborhood of resources loses value because of the missing second-order relationships that should have existed among the referring resources by virtue of their references to the aliased resource.
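The "missing second-order relationships" in that passage can be illustrated with a small sketch. It counts pairs of referring pages that can be related to each other because they link to the same target, first treating the two URIs as different resources and then with an alias map that says they are the same. The page names, URIs and the alias_of mapping are assumptions made up for this example.

```python
from itertools import combinations


def cocitation_pairs(links, alias_of=None):
    """Return the set of unordered pairs of pages that link to the same target.

    links is a list of (source, target) pairs; alias_of optionally maps an
    alias URI to the canonical URI of the same resource.
    """
    alias_of = alias_of or {}
    sources_by_target = {}
    for source, target in links:
        canonical = alias_of.get(target, target)
        sources_by_target.setdefault(canonical, set()).add(source)
    pairs = set()
    for sources in sources_by_target.values():
        pairs.update(frozenset(p) for p in combinations(sorted(sources), 2))
    return pairs


# Hypothetical neighbourhood: half the pages cite one URI, half cite the other.
links = [
    ("page_a", "http://cr.example.org/paper"),
    ("page_b", "http://cr.example.org/paper"),
    ("page_c", "http://ir.example.ac.uk/paper"),
    ("page_d", "http://ir.example.ac.uk/paper"),
]
aliases = {"http://ir.example.ac.uk/paper": "http://cr.example.org/paper"}

divided = cocitation_pairs(links)
joined = cocitation_pairs(links, aliases)
print(len(divided), "pairs when the URIs are treated as different resources")
print(len(joined), "pairs when the alias is known")
```

With four referring pages, the divided neighbourhood only relates pages within each half, while the joined one relates every pair. Those extra pairs are the relationships that are lost when nobody knows the two URIs identify the same paper.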
Now, I think that some of the discussions around linked data are pushing at the boundaries of this guidance, particularly in the area of non-information resources. Nonetheless, I think this is an area in which we have to tread carefully. I stand by my original statement that we do not treat scholarly papers as though they are part of the fabric of the Web - we do not link between them in the way we link between other Web pages. In almost all respects we treat them as bits of paper that happen to have been digitised and the culprits are PDF, the OAI-PMH, an over-emphasis on preservation and a collective lack of imagination about the potential transformative effect of the Web on scholarly communication. We are tampering at the edges and the result is a mess.
This is less of a burning issue if all places where a paper exists keep roughly-analogous download statistics.
Posted by: Dorothea Salo | February 09, 2009 at 01:51 PM
Dorothea, I'm inclined to disagree. This is not an issue about usage stats or traditional citation ranking. It's about PageRank, and what can be inferred from the links between Web pages, and the impact that can have on resource discovery.
If an IR deposit mandate ultimately means that my research paper is less discoverable than it otherwise would have been, how is that good for open access?
Posted by: AndyP | February 09, 2009 at 03:31 PM
"... ignore the rarefied, and largely useless, world of resource discovery based on the OAI-PMH..." says Andy Powell. Yes, yes, yes!!!
Sorry, couldn't resist. Won't do it again, promise (;-)
Posted by: Chris Rusbridge | February 09, 2009 at 04:39 PM
This piece has made me think about searchers' behaviour in relation to 'academic' papers and data.
First, it's worth remembering that to provide Open Access to traditional (non-Green) commercially published articles there will always need to be at least 2 different copies available on the Web in different collections: the publisher's subscription-only copy and an open access version. This raises issues as to how Google relates to publishers, and whether Google has agreements with publishers on ranking.
From a different perspective, thinking of user behaviour, if a searcher wants to find scholarly (peer reviewed) papers using Google then their search strategy might well be to enter quite a few very specific terms. In this case how significant is the Google-juice effect? Or to put this more generically, how does search strategy moderate the effects of Google-juice?
In any case (looking at their FAQ) Google Scholar does group different versions of publications, collecting all citations to all versions of a work, thereby improving the position of an article in search results. If a search is for an academic paper it makes sense to use Google Scholar, though I'm not sure how well this message gets across to students.
Having said all this, for various reasons I would support making OA articles available on the Web at a single locus (the institutional repository) and for other services to link to the IR copy.
Posted by: Rachel Heery | February 09, 2009 at 04:48 PM
More seriously, these kinds of citation-dilution issues do arise all the time. It was one of the concerns of publishers early in the repository movement, and I suspect is why the citation and pointers to the original source tend to be included in repository metadata (eg http://www.era.lib.ed.ac.uk/handle/1842/1476). That ERA entry undoubtedly slightly reduces the juice of the Ariadne article, but both the citation and pointers are there to compensate (slightly).
Taking that example, if you go to Google Scholar and search for "Rusbridge Excuse me", then you find 7 versions, some of which are quotes and two at least are the original Ariadne version and the ERA version.
I'm not saying there is no damage, but that there are compensations...
Posted by: Chris Rusbridge | February 09, 2009 at 04:49 PM
I think I missed something. When a paper of mine appears in several locations scholar.google just notes it as multiple copies of the same thing. In fact, I'm happy with that because when one repository goes down, you can find my paper in another. The main point about living in a googlised world is: just put it out there, no matter where, and make sure there's an incoming link from a visible page. No semantics required.
What completely baffles me is why any researcher would choose not to throw their papers on the web. If there's one case for an "eyeballs economy" it's academic papers. The value of a paper is determined by citations, which means you want every Joe Bloggs to download and read it.
One thing though, if someone says I *MUST* put something in some place, I will do my best to ignore them. Piss them off as much as I can, make them email me 15 times, claim their emails fell into my spam bin, then bitch about their impossible interface and refuse to use it. Childish, I know. But can't help myself.
Posted by: yish | February 10, 2009 at 12:14 AM
the reason we treat papers differently is that we rely on a structured process of quality control, namely peer review. That makes it necessary to fix a version of the text as a unit, which in turn refers to other fixed versions of texts.
as for URI aliasing, yeah, ok. so there's a logical argument against it. but try to post a sciencedirect link on twitter..
Posted by: yish | February 10, 2009 at 12:41 PM
@yish I don't understand why the peer-review process impacts on the way we surface the resulting papers on the Web? I take your point about scholarly communication relying on 'fixed' versions but what is it about PDF that makes it more fixed than HTML? In the digital space, the 'fixing' comes from human commitment and trust in the management process, not from the technology per se. One could argue that the 'repository' is a convenient and necessary evil in that it is a way of focusing institutional (or other) commitment so that the end user knows what they are dealing with. But even if that is the case (and I'm not totally sure that it is) I don't see why, in the long term, such a commitment can't be built up around interlinked XHTML documents?
Posted by: AndyP | February 10, 2009 at 01:27 PM
Hm. The idea that multiple copies dilute PageRank instead of aggregating more total linkage bears examination. Is there evidence? (I don't disbelieve you; it's an intuitive supposition. I'm just not entirely sure it's true.)
Has anybody studied repository PageRank? That's actually a really interesting and potentially fruitful question. How do we do SEO for repositories?
Anyway, another possibility among better-networked repositories is that given several OA copies, some repositories agree to defer to others, holding a copy of something and letting it be locally searchable/browsable (for the little that tends to be worth), but redirecting users (and Google) to other download copies. If the access repo falls down and goes boom, presumably another repo could take up the job of serving up copies.
Hard, but maybe feasible?
Posted by: Dorothea Salo | February 10, 2009 at 02:03 PM
Checked the local repo's PageRank: 6. Not superlatively wonderful, but not bad, considering. arXiv has an 8.
Darn. Now I want to research this. Anybody else in? Preferably somebody with more Google-fu/SEO-fu than I have?
Posted by: Dorothea Salo | February 10, 2009 at 02:06 PM
Andy,
I would like to reply to several of the issues you
raise in "Freedom, Google-juice and institutional mandates".
1. Multiple copies != URI aliases. Two URIs for the same
resource is not the same thing as two URIs for two different
resources, i.e. my copy and your copy. Our copies (or a 3rd
party) could assert owl:sameAs, ore:similarto or rdfs:seeAlso
and Google et al. could choose to believe or not (http://en.wikipedia.org/wiki/Trust,_but_Verify )
2. Multiple copies split PageRank, which hurts discoverability.
But having multiple copies helps discoverability if
one copy disappears or is encumbered. This is the premise behind
"Lazy Preservation" -- reconstructing lost websites from the copies
automatically created by SEs and the IA. See Frank McCown's
dissertation:
http://www.harding.edu/fmccown/pubs/lazy-preservation-dissertation.pdf
and prior pub, with multiple copies clustered together by Google
Scholar:
http://scholar.google.com/scholar?cluster=901867837757109707
It is also the premise of Martin Klein's ongoing PhD work, based
on the premise that things rarely actually disappear, they just
"move" to another URI. Here is a representative publication that
GS has yet to cluster with the Springer version:
http://www.cs.odu.edu/~mln/pubs/ecdl-2008/lex-sig-ecdl-2008.pdf
Ricardo Baeza-Yates has a related publication on this topic, showing
that web pages are generally just copy-n-pasted from other web pages:
http://scholar.google.com/scholar?cluster=867843863008632582
3. PageRank is good, but I don't get promoted on PageRank.
Whether I *should* is another discussion, but for the moment I
don't. That's too bad, b/c this paper with Joan Smith has been
discussed and linked to many times in the web (I'm linking to
a title query b/c Google significantly underreports inlinks)
http://www.google.com/search?q=Site+Design+Impact+on+Robots.+An+Examination+of+Search+Engine+Crawler+Behavior+at+Deep+and+Wide+Websites
But has gathered no citations according to GS:
http://scholar.google.com/scholar?cluster=5353450509634331249
In conventional terms, we've authored a "popular" but not "important"
paper (but Joan made some awesome animations ;-).
4. Usage domain is important to consider; Google Scholar is an
existence proof that domain is important. GS clusters what it
believes to be copies of the "same" publication (based on extracted
metadata and metadata from the publishers) and aggregates their
scores because that is what you do in the scholarly domain. Regular
Google looks for "sameness" in order to suppress duplication.
Furthermore, although scholarly pubs are part of the "fabric of
the web", all but the lucky few are in the long tail of web links.
For example, I've recently submitted a paper analyzing college
football results. If it is accepted and it becomes wildly popular,
it might get 100 citations. That would be hitting the big time.
100 links for a general query like "college football results" would
be nothing, and it would be buried on p. 312 of the search results
unless a general reader was unfortunate enough to stumble across the
"right" (or "wrong") combination of words to vault my paper to the
top.
5. Ultimately, the problem with institutional repositories is not
their existence, but with their ingest model. Instead of *requiring*
faculty to do anything (let's face it, if I took direction well I
would have a job other than professor), the IRs should be filled
by robots that crawl the web and/or search engines and stuff the
IR automagically (like GS, CiteSeer, and going waaaay back, UCSTRI).
Once critical mass is achieved, then the model can change. For
example, I'm concerned about the few results that GS gets *wrong*
only b/c they get so much of it right. If they got it mostly wrong,
no one would care (cf. Live Search Academic).
In computer science at least, the web has all but killed the concept
of departmental technical reports. Some people submit to arXiv,
but most just put (pr)eprints on their web site. After a few more
academic generations go by, departments/colleges/universities might
realize **they have no record of what they've published in the last
10+ years.** At that point, GS (or similar service) will either
add organizational grouping to its services, or there will be a
great interest in back-filling IRs and robots will be the only
viable option.
regards,
Michael
Posted by: Michael L. Nelson | February 12, 2009 at 08:14 PM