
May 15, 2009

URIs and protocol dependence

In responding to my recent post, The Nature of OAI, identifiers and linked data, Herbert Van de Sompel says:

There is a use case for both variants: the HTTP variant is the one to use in a Web context (for obvious reasons; so totally agreed with both Andy and PeteJ on this), and the doi: variant is technology independent (and publishers want that, I think, if only to print it on paper).

(My emphasis added).

I'm afraid to say that I could not disagree more with Herbert on this. There is no technological dependence on HTTP by http URIs. [Editorial note: in the first comment below Herbert complains that I have mis-represented him here and I am happy to concede that I have and apologise for it. By "technology independent" Herbert meant "independent of URIs", not "independent of HTTP". I stand by my more general assertion in the remainder of this post that a mis-understanding around the relationship between http URIs and HTTP (the protocol) led the digital library community into a situation where it felt the need to invent alternative approaches to identification of digital content and, further, that those alternative approaches are both harmful to the Web and harmful to digital libraries. I think those mis-understandings are still widely held in the digital library community and I disagree with those people who continue to promote relatively new non-http forms of URIs for 'scholarly' content (by which I primarily mean info URIs and doi URIs) as though their use was good practice. On that basis, this blog post may represent a disagreement between Herbert and me, but it may not. See the comment thread for further discussion. Note also that my reference to Stu Weibel below is intended to indicate only what he said at the time, not his current views (about which I know nothing).] As I said to Marco Streefkerk in the same thread:

there is nothing in the 'http' prefix of the http URI that says, "this must be dereferenced using HTTP". In that sense, there is no single 'service' permanently associated with the http URI - rather, there happens to be a current, and very helpful, default de-referencing mechanism available.

At the point that HTTP dies, which it surely will at some point, people will build alternative de-referencing mechanisms (which might be based on Handle, or LDAP, or whatever replaces HTTP). The reason they will have to build something else is that the weight of deployed http URIs will demand it.

That's the reasoning behind my "the only good long term identifier is a good short term identifier" mantra.
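To make that concrete, here is a minimal sketch (Python, with an invented example URI and a purely illustrative local archive) of treating an http URI as nothing more than a name, where the dereferencing mechanism is a configuration choice rather than something dictated by the 'http' prefix:

# Illustrative sketch only: an http URI is just a name; how we obtain a
# representation of the thing it names is a configuration choice, not
# something fixed by the "http:" prefix. The URI and file path are invented.

import urllib.request

# A local archive keyed by http URIs - no HTTP involved at all.
LOCAL_ARCHIVE = {
    "http://example.org/doc/123": "/var/archive/doc-123.html",  # hypothetical path
}

def dereference(uri):
    """Return a representation of the resource named by uri."""
    # 1. Try a purely local mechanism first (a disk archive, a CD-ROM, a cache...).
    path = LOCAL_ARCHIVE.get(uri)
    if path is not None:
        with open(path, "rb") as f:
            return f.read()
    # 2. Fall back to today's convenient default: dereference over HTTP.
    #    If HTTP ever dies, only this branch needs replacing - the URIs
    #    themselves (the names) stay exactly as they are.
    with urllib.request.urlopen(uri) as response:
        return response.read()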

The mis-understanding that there is a dependence between the http URI and HTTP (the protocol) is endemic in the digital library community and has been the cause of who knows how many wasted person-years spent inventing and deploying unnecessary, indeed unhelpful, URI schemes like 'oai', 'doi' and 'info'. Not only does this waste time and effort but it distances the digital library community from the mainstream Web - something that we cannot afford to happen.

As the Architecture of the World Wide Web, Volume One (section 3.1) says:

Many URI schemes define a default interaction protocol for attempting access to the identified resource. That interaction protocol is often the basis for allocating identifiers within that scheme, just as "http" URIs are defined in terms of TCP-based HTTP servers. However, this does not imply that all interaction with such resources is limited to the default interaction protocol.

This has been re-iterated numerous times in discussion, not least by Roy Fielding:

"Really, it has far more to do with a basic misunderstanding of web architecture, namely that you have to use HTTP to get a representation of an "http" named resource. I don't think there is any simple solution to that misbelief aside from just explaining it over and over again."
http://lists.w3.org/Archives/Public/www-tag/2008Jun/0078.html

"However, once named, HTTP is no more inherent in "http" name resolution than access to the U.S. Treasury is to SSN name resolution."
http://lists.w3.org/Archives/Public/www-tag/2008Aug/0012.html

"The "http" resolution mechanism is not, and never has been, dependent on DNS. The authority component can contain anything properly encoded within the defined URI syntax. It is only when an http identifier is dereferenced on a local host that a decision needs to be made regarding which global name resolution protocol should be used to find the IP address of the corresponding authority. It is a configuration choice."
http://lists.w3.org/Archives/Public/www-tag/2008Aug/0044.html
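Roy's last point - that the name resolution protocol used to find the authority is a configuration choice - can be illustrated with a small, purely hypothetical sketch (Python; the hosts table and address below are invented):

# Sketch: finding an IP address for the authority component of an http URI
# without touching DNS. The hosts table and address below are invented.

from urllib.parse import urlsplit

HOSTS_TABLE = {
    "example.org": "192.0.2.10",  # illustrative address from the TEST-NET-1 range
}

def resolve_authority(uri):
    """Map the authority of an http URI to an IP address via a local table."""
    host = urlsplit(uri).hostname
    if host in HOSTS_TABLE:
        return HOSTS_TABLE[host]
    # A real deployment might fall back to DNS, WINS, an /etc/hosts file,
    # or whatever the local configuration says - the URI itself doesn't care.
    raise LookupError("no local mapping for %s" % host)

print(resolve_authority("http://example.org/doc/123"))  # -> 192.0.2.10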

The (draft) URI Schemes and Web Protocols TAG Finding makes similar statements (e.g. section 4.1):

A server MAY offer representations or operations on a resource using any protocol, regardless of URI scheme. For example, a server might choose to respond to HTTP GET requests for an ftp resource. Of course, this is possible only for protocols that allow references to URIs in the given scheme.
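As a rough illustration of what such a server might look like (a sketch only, using Python's standard library; the port number and the proxy-style behaviour are assumptions of mine, not anything mandated by the finding):

# Sketch of a server that answers HTTP GET requests for "ftp" resources,
# in the spirit of the finding quoted above. The port number and the
# proxy-style behaviour are illustrative assumptions.

from http.server import BaseHTTPRequestHandler, HTTPServer
import urllib.request

class FtpOverHttpHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # A proxy-style request carries the full URI as its target,
        # e.g. "GET ftp://example.org/pub/file.txt HTTP/1.1".
        if not self.path.startswith("ftp://"):
            self.send_error(400, "expected an ftp URI")
            return
        try:
            with urllib.request.urlopen(self.path) as ftp_response:
                body = ftp_response.read()
        except OSError:
            self.send_error(502, "could not reach the ftp server")
            return
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), FtpOverHttpHandler).serve_forever()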

I know that some people don't like this interpretation of http URIs, claiming that W3C (and presumably others) have changed their thinking over time. I remember Stu Weibel starting a presentation about URIs at a DCC meeting in Glasgow with the phrase:

Can you spell revisionist?

I disagree with this view. The world evolves. Our thinking evolves. This is a good thing, isn't it? It's only not a good thing if we refuse to acknowledge the new because we are too wedded to the old.


Comments

I am afraid you misunderstood me, Andy.

It should be clear from my comment that I meant "based on URI technology" when I mentioned "technology-dependent", as proven by my following sentence:

"Maybe exactly because of what I mentioned above: doi:10.1038/nature01234 is a technology-neutral variant, whereas both info:doi/10.1038/nature01234 and http://dx.doi.org/10.1038/nature01234 are (URI) technology based?"

I guess you were too keen of reading "based on HTTP" in it all. Sorry to spoil your pleasure. I know my comment was quite lengthy, but maybe next time you should give it a thorough read before you publicly lay words in my mouth.

I have much more to say about your above discourse, but I think I am going to try and enjoy my weekend instead.

Herbert,
hi. You are right, I have indeed misunderstood you and I apologise for that. To set the record straight, I've added an editorial note to the text of the blog post (above) indicating where I went wrong.

I hope that's OK?

As I say in the note, I stand by my wider assertion "that a mis-understanding around the relationship between http URIs and HTTP (the protocol) led the digital library community into a situation where it felt the need to invent alternative approaches to identification of digital content and, further, that those alternative approaches are both harmful to the Web and harmful to digital libraries. I think those mis-understandings are still widely held in the digital library community and I disagree with those people who continue to promote relatively new non-http forms of URIs for 'scholarly' content (by which I primarily mean info URIs and doi URIs) as though their use was good practice."

I think there is reasonable justification historically for why that mis-understanding arose, not least an evolving view of URIs by the W3C over the years, but I don't think that it is an ongoing justification for continuing to promote the use of info and doi URIs where http URIs would do the job just as well.

If I'm "too keen of reading 'based on HTTP' in it all", and you are probably right that I am, it's because I feel so frustrated by years of rather wasteful 'digital library' discussion and deployment of URI identifier schemes that effectively fracture the way the Web works.

I don't know if I should feel apologetic about feeling and venting that frustration or not - but I do accept that it is not an excuse for not reading your comment thoroughly enough.

That said, I still feel confused about where you are coming from here...

You argue that what I am calling a doi URI (a URI starting doi:...) is *not* a URI. You say, "It is an identifier, not a URI".

Here we do disagree. I think it is a URI, and I think that on the basis of what the DOI Handbook says. I've given a longer justification for this belief in response to your comment on the original post:

http://efoundations.typepad.com/efoundations/2009/05/the-nature-of-oai-identifiers-and-linked-data.html

But putting that particular disagreement to one side for a moment, you seem to argue that the doi: form of identifier is OK because it sits outside the URI system in a way that allows for "technology-neutral applications". Here I don't understand. You suggest that publishers want to be able to "print it [the identifier] on paper". I think the world is mature enough now to cope with writing http URIs on paper?

In short, I'm struggling to see what added-value a technology-neutral non-URI-based DOI has over a URI-based identifier and I'm struggling to see what added-value a non-http URI has over an http URI (in this case). All of which leads me to conclude that the world would be a better place if DOI was implemented only as an http URI. (And I would make the same argument about the info URI).

Now, people may argue that the horse has bolted, that there are too many deployed doi (and info) URIs to go back, and that we have to live with what we have. I tend to disagree (though clearly I am not the one who is going to have to clean up the mess) primarily because I think that the longer we leave it, the more mess is out there and the more difficult it will become to deal with the consequences.

This, I suppose, was the impetus for my original post - seeing a very influential player like Nature pushing out a large amount of metadata content containing (only) the doi: form of the identifier/URI (which is what they are doing with their use of oai_dc records) effectively pollutes (I know that is an emotive word but it's how I feel) the Web and, worse, encourages others to do likewise.
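To be concrete about the kind of change I'm arguing for, here is a trivial sketch (Python, with an invented record) of surfacing the http form of the identifier rather than the doi: form when generating something like an oai_dc record:

# Trivial sketch: rewriting the doi: form of an identifier as its http
# equivalent before it is surfaced in metadata. The record below is invented.

def doi_to_http(identifier):
    """Rewrite 'doi:10.1038/nature01234' as 'http://dx.doi.org/10.1038/nature01234'."""
    if identifier.startswith("doi:"):
        return "http://dx.doi.org/" + identifier[len("doi:"):]
    return identifier

record_identifiers = ["doi:10.1038/nature01234"]             # what gets exposed today
web_friendly = [doi_to_http(i) for i in record_identifiers]  # what I'd like to see
print(web_friendly)  # ['http://dx.doi.org/10.1038/nature01234']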

Andy,

I think you are overstating the distance between the http URI scheme and the http protocol. Quoting from RFC 2616 (http://www.ietf.org/rfc/rfc2616.txt), emphasis mine:

3.2.2 http URL

The "http" scheme is used to locate network resources via the HTTP protocol. This section defines the scheme-specific syntax and semantics for http URLs.

http_URL = "http:" "//" host [ ":" port ] [ abs_path [ "?" query ]]

If the port is empty or not given, port 80 is assumed. *The semantics are that the identified resource is located at the server listening for TCP connections on that port of that host*, and the Request-URI for the resource is abs_path (section 5.1.2). The use of IP addresses in URLs SHOULD be avoided whenever possible (see RFC 1900 [24]).

The RFC is pretty clear about the relationship between the scheme and the protocol. The Web Arch document (http://www.w3.org/TR/webarch/#dereference-uri, section 3.1) reinforces this relationship (again, emphasis mine):

Many URI schemes define a default interaction protocol for attempting access to the identified resource. That interaction protocol is often the basis for allocating identifiers within that scheme, *just as "http" URIs are defined in terms of TCP-based HTTP servers*.

The remainder of that section does expand on the RFC's position by acknowledging that http URIs could be subsumed into a future protocol/service, much as http has done for older protocols:

For example, information retrieval systems often make use of proxies to interact with a multitude of URI schemes, such as HTTP proxies being used to access "ftp" and "wais" resources. Proxies can also be used to provide enhanced services, such as annotation proxies that combine normal information retrieval with additional metadata retrieval to provide a seamless, multidimensional view of resources using the same protocols and user agents as the non-annotated Web. Likewise, future protocols may be defined that encompass our current systems, using entirely different interaction mechanisms, without changing the existing identifier schemes.

The relationship between the scheme and protocol is assumed in other TAG documents, for example the TAG Finding of 21 March 2004, "URIs, Addressability, and the use of HTTP GET and POST" (http://www.w3.org/2001/tag/doc/whenToUseGet.html).

Expanding the context of Roy's emails is also helpful. For example, quoting from http://lists.w3.org/Archives/Public/www-tag/2008Aug/0012.html:

The "http" URI is an interesting case. *It is a naming scheme that is delegated hierarchically by association with domain ownership, TCP port, and HTTP listener*. However, once named, HTTP is no more inherent in "http" name resolution than access to the U.S. Treasury is to SSN name resolution.

If I want to find an http-named resource, then I tell my software to get it (or metadata about it) from whatever set of repositories I may have access. Sometimes those repositories are local disks, sometimes they are remote proxies, sometimes they are archives stored on CD-ROM, sometimes they are queries with Google's spider cache, and sometimes they involve HTTP access over TCP port 80 to a server that my name lookup daemon has been told to be the owner of that URI.

The emphasized section admits to the existence of an http server during the naming phase. The next paragraph discusses scenarios where http URIs can be things like keys in local caches, databases, etc. This is the essence of the http://lists.w3.org/Archives/Public/www-tag/2008Jun/0078.html email as well.

(N.B. Social Security Numbers are not issued by the U.S. Treasury as Roy implies in his email; they're issued and managed by the Social Security Administration, which is an independent agency in the US Govt. To extend the metaphor, the SSA is the http server and the Treasury (read: IRS) is a non-http service. The IRS uses SSNs, but they would not exist without first being issued by the SSA. See: http://en.wikipedia.org/wiki/Social_security_number)

Finally, reading the full post of http://lists.w3.org/Archives/Public/www-tag/2008Aug/0044.html, the essence is that http URIs do not depend on DNS. Whenever the next globally deployed IP name -> address resolution mechanism comes along (read: not any time soon), then the http scheme will work just fine with that too.

This flavor is present in section 3.2.2 of RFC 3986 (http://www.ietf.org/rfc/rfc3986.txt), second-to-last paragraph, emphasis mine:

This specification does not mandate a particular registered name lookup technology and therefore does not restrict the syntax of reg-name beyond what is necessary for interoperability. Instead, it delegates the issue of registered name syntax conformance to the operating system of each application performing URI resolution, and that operating system decides what it will allow for the purpose of host identification. A URI resolution implementation might use DNS, host tables, yellow pages, NetInfo, WINS, or any other system for lookup of registered names. *However, a globally scoped naming system, such as DNS fully qualified domain names, is necessary for URIs intended to have global scope.* URI producers should use names that conform to the DNS syntax, even when use of DNS is not immediately apparent, and should limit these names to no more than 255 characters in length.

The issue of http scheme URIs and non-http scheme URIs is not really about DNS vs. WINS vs. /etc/hosts vs. some other rival globally scoped naming system. Instead, it is about whether or not there are URIs that should not be *required* to have a host name (or address) as part of the URI.

Using the DOI example, some publishers run their own resolvers. For example, the ACM would presumably prefer this URL:

http://doi.acm.org/10.1145/1065385.1065484

over:

http://dx.doi.org/10.1145/1065385.1065484

or:

http://hdl.handle.net/10.1145/1065385.1065484

or:

http://tiny.cc/hMRHK

One could stitch these together using owl:sameAs. Or you could use linkbun.ch. Or you could print "10.1145/1065385.1065484" (say, in a journal) and then the resolution mechanism becomes:

http://www.google.com/search?q=10.1145%2F1065385.1065484

Which gives better results than:

http://www.google.com/search?q=http%3A%2F%2Fdx.doi.org%2F10.1145%2F1065385.1065484

(although publishing this comment will probably actually change that).
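For what it's worth, here is a rough sketch (Python; purely illustrative) of generating those alternative resolution URLs from the bare DOI string:

# Rough, illustrative sketch: several ways of turning a bare DOI string into
# a dereferenceable http URI, mirroring the alternatives listed above.

from urllib.parse import quote

doi = "10.1145/1065385.1065484"

resolution_urls = [
    "http://dx.doi.org/" + doi,                               # IDF resolver
    "http://hdl.handle.net/" + doi,                           # Handle resolver
    "http://doi.acm.org/" + doi,                              # publisher resolver
    "http://www.google.com/search?q=" + quote(doi, safe=""),  # "just Google it"
]

for url in resolution_urls:
    print(url)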

In summary, while it is true that the http scheme and the http protocol can be decoupled, it appears that they are coupled at least at URI creation. And while the http scheme cannot *enforce* http as the resolution protocol, this is clearly the general case. On the other hand, if you *want* to decouple the scheme from the resolution protocol (at least at the time of creation), you can choose a non-http scheme (e.g., doi, tag, urn:isbn).

I obviously won't argue against the utility of http URIs, but I don't think the "danger" is http or DNS going away any time soon (I think they will both be like Fortran and be with us for a l-o-n-g time). But I do see it as more likely that doi.org will go out of business and/or forget to renew their domain (which would actually be much worse, since all those DOIs will now start giving 200 responses with spam content), or something to that effect. OK, admittedly that's unlikely for doi.org, but you never know. Exposing an additional URI *not* coupled to any particular host seems like a safe bet for the future.

regards,

Michael


I believe what Herbert means by "There is, however, an elephant in the room: doi:10.1038/nature01234 is not a URI in any registered URI scheme. It is an identifier, not a URI." is that the "doi:" scheme is not registered with IANA:

http://www.iana.org/assignments/uri-schemes.html

See also:

http://en.wikipedia.org/wiki/URI_scheme#Unofficial_but_common_URI_schemes

Michael,
thanks for the two replies which are very helpful. I take your point that "while it is true that the http scheme and the http protocol can be decoupled, it appears that they are coupled at least at URI creation" but I'm not totally clear where it leaves us in terms of the benefits vs. costs of inventing schemes other than http. It strikes me (still) that the costs are too high and the benefits largely unproven.

I also agree with you that the real persistence danger here is "that doi.org will go out of business and/or forget to renew their domain (which would actually be much worse, since all those DOIs will now start giving 200 responses with spam content), or something to that effect".

I don't want to do it here - because it needs a whiteboard and lots of armwaving - but I think there's a really interesting discussion to be had around what really happens when the IDF go out of business. My gut feel is that it is not clear cut whether existing doi: URIs or existing http://dx.doi.org URIs will be worse hit.

You present some arguments for why the doi: URIs will survive better - but I think they are based primarily on the fact that the doi: form doesn't work now... and therefore won't actually get much worse!

What I'd much prefer to see happening is that the experts we have in the digital library community put their efforts into making what seems to work now (http URIs) work in the long term, rather than into inventing multiple new forms of URI schemes which don't actually work very well now.

For example, one might try to involve long term organisations (like national libraries) in building/running the infrastructure and ensuring ownership for those domain names we really care about in the long term (e.g. as I have previously suggested for purl.org).

For example, I continue to believe that if the 'info' URI initiative had been based on http URIs (e.g. using http://info.net/..., or http://id.info/..., or whatever) with the engagement and commitment of a group of national libraries to ensure their future, they would have significantly more utility right now and no significantly worse potential for failure in the future. Ditto the DOI - though as you and Herbert both point out... the doi URI is in a significantly worse position than the info URI because they haven't even managed to register the URI scheme successfully. On that basis, it is the worst choice for Nature to surface in their oai_dc metadata records :-(

I'm still waiting for the US Department of the Interior to register their own URI scheme ("ooh look, the 'doi' URI scheme is still available") - an alternative, and equally unlikely(!), disaster scenario for the IDF to consider?

