
November 23, 2009

Memento and negotiating on time

Via Twitter, initially in a post by Lorcan Dempsey, I came across the work of Herbert Van de Sompel and his comrades from LANL and Old Dominion University on the Memento project:

The project has since been the topic of an article in New Scientist.

The technical details of the Memento approach are probably best summarised in the paper "Memento: Time Travel for the Web", and Herbert has recently made available a presentation which I'll embed here, since it includes some helpful graphics illustrating some of the messaging in detail:

Memento seeks to take advantage of the Web Architecture concept that interactions on the Web are concerned with exchanging representations of resources. And for any single resource, representations may vary - at a single point in time, variant representations may be provided, e.g. in different formats or languages, and over time, variant representations may be provided reflecting changes in the state of the resource. The HTTP protocol incorporates a feature called content negotiation which can be used to determine the most appropriate representation of a resource - typically according to variables such as content type, language, character set or encoding. The innovation that Memento brings to this scenario is the proposition that content negotiation may also be applied to the axis of date-time. i.e. in the same way that a client might express a preference for the language of the representation based on a standard request header, it could also express a preference that the representation should reflect resource state at a specified point in time, using a custom accept header (X-Accept-Datetime).
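To make the idea concrete, here is a minimal sketch of the request headers such a client might send. The function name `memento_request_headers` is hypothetical, and the sketch assumes the header value uses the RFC 1123 date format; only the header name X-Accept-Datetime comes from the Memento material itself.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def memento_request_headers(when, language="en"):
    """Build request headers expressing a language AND a datetime preference.

    `when` must be a timezone-aware datetime. It is converted to UTC and
    rendered as an RFC 1123 style date (assumed here to be the value
    format for the custom header).
    """
    return {
        # Standard content negotiation header.
        "Accept-Language": language,
        # Custom header proposed by Memento for the datetime axis.
        "X-Accept-Datetime": format_datetime(
            when.astimezone(timezone.utc), usegmt=True
        ),
    }


headers = memento_request_headers(datetime(2009, 1, 1, tzinfo=timezone.utc))
for name, value in headers.items():
    print(f"{name}: {value}")
```

The point of the sketch is only that the datetime preference travels in the request in the same way as the language preference, rather than being encoded in the URI.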

More specifically, Memento uses a flavour of content negotiation called "transparent content negotiation" where the server provides details of the variant representations available, from which the client can choose. Slides 26-50 in Herbert's presentation above illustrate how this technique might be applied to two different cases: one in which the server to which the initial request is sent is itself capable of providing the set of time-variant representations, and a second in which that server does not have those "archive" capabilities but redirects to (a URI supported by) a second server which does.

This does seem quite an ingenious approach to the problem, and one that potentially has many interesting applications, several of which Herbert alludes to in his presentation.

What I want to focus on here is the technical approach, which did raise a question in my mind. And here I must emphasise that I'm really just trying to articulate a question that I've been trying to formulate and answer for myself: I'm not in a position to say that Memento is getting anything "wrong", just trying to compare the Memento proposition with my understanding of Web architecture and the HTTP protocol, or at least the use of that protocol in accordance with the REST architectural style, and understand whether there are any divergences (and if there are, what the implications are).

In his dissertation in which he defines the REST architectural style, Roy Fielding defines a resource as follows:

More precisely, a resource R is a temporally varying membership function M_R(t), which for time t maps to a set of entities, or values, which are equivalent. The values in the set may be resource representations and/or resource identifiers. A resource can map to the empty set, which allows references to be made to a concept before any realization of that concept exists -- a notion that was foreign to most hypertext systems prior to the Web. Some resources are static in the sense that, when examined at any time after their creation, they always correspond to the same value set. Others have a high degree of variance in their value over time. The only thing that is required to be static for a resource is the semantics of the mapping, since the semantics is what distinguishes one resource from another.

On representations, Fielding says the following, which I think is worth quoting in full. The emphasis in the first and last sentences is mine.

REST components perform actions on a resource by using a representation to capture the current or intended state of that resource and transferring that representation between components. A representation is a sequence of bytes, plus representation metadata to describe those bytes. Other commonly used but less precise names for a representation include: document, file, and HTTP message entity, instance, or variant.

A representation consists of data, metadata describing the data, and, on occasion, metadata to describe the metadata (usually for the purpose of verifying message integrity). Metadata is in the form of name-value pairs, where the name corresponds to a standard that defines the value's structure and semantics. Response messages may include both representation metadata and resource metadata: information about the resource that is not specific to the supplied representation.

Control data defines the purpose of a message between components, such as the action being requested or the meaning of a response. It is also used to parameterize requests and override the default behavior of some connecting elements. For example, cache behavior can be modified by control data included in the request or response message.

Depending on the message control data, a given representation may indicate the current state of the requested resource, the desired state for the requested resource, or the value of some other resource, such as a representation of the input data within a client's query form, or a representation of some error condition for a response. For example, remote authoring of a resource requires that the author send a representation to the server, thus establishing a value for that resource that can be retrieved by later requests. If the value set of a resource at a given time consists of multiple representations, content negotiation may be used to select the best representation for inclusion in a given message.

So at a point in time t1, the "temporally varying membership function" maps to one set of values, and - in the case of a resource whose representations vary over time - at another point in time t2, it may map to another, different set of values. To take a concrete example, suppose at the start of 2009, I launch a "quote of the day", and I define a single resource that is my "quote of the day", to which I assign the URI http://example.org/qotd/. And I provide variant representations in XHTML and plain text. On 1 January 2009 (time t1), my quote is "From each according to his abilities, to each according to his needs", and I provide variant representations in those two formats, i.e. the set of values for 1 January 2009 is those two documents. On 2 January 2009 (time t2), my quote is "Those who do not move, do not notice their chains", and again I provide variant representations in those two formats, i.e. the set of values for 2 January 2009 (time t2) is two XHTML and plain text documents with different content from those provided at time t1.

So, moving on to that second piece of text I cited, my interpretation of the final sentence as it applies to HTTP (and, as I say, I could be wrong about this) would be that the RESTful use of the HTTP GET method is intended to retrieve a representation of the current state of the resource. It is the value set at that point in time which provides the basis for negotiation. So, in my example here, on 1 January 2009, I offer XHTML and plain text versions of my "From each according to his abilities..." quote via content negotiation, and on 2 January 2009, I offer XHTML and plain text versions of my "Those who do not move..." quotations. i.e. At two different points in time t1 and t2, different (sets of) representations may be provided for a single resource, reflecting the different state of that resource at those two different points in time, but at either of those points in time, the expectation is that each representation of the set available represents the state of the resource at that point in time, and only members of that set are available via content negotiation. So although representations may vary by language, content-type etc, they should be in some sense "equivalent" (Roy Fielding's term) in terms of their representation of the current state of the resource.
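My reading can be sketched in a few lines of Python. This is only an illustration of the interpretation above, not anything taken from Fielding or from Memento: the names `QUOTES`, `value_set` and `get` are hypothetical, and the media type strings are my own choice for "XHTML" and "plain text".

```python
from datetime import date

# The state of the "quote of the day" resource, keyed by the day it changed.
QUOTES = {
    date(2009, 1, 1): "From each according to his abilities, "
                      "to each according to his needs",
    date(2009, 1, 2): "Those who do not move, do not notice their chains",
}


def value_set(t):
    """M_R(t): the set of equivalent representations at time t."""
    days = [d for d in QUOTES if d <= t]
    if not days:
        return {}  # empty set: the resource has no realization yet
    quote = QUOTES[max(days)]
    return {
        "application/xhtml+xml": f"<p>{quote}</p>",
        "text/plain": quote,
    }


def get(now, accept):
    """Negotiate only among the members of the value set for `now`."""
    return value_set(now).get(accept)


print(get(date(2009, 1, 2), "text/plain"))
```

Under this reading, a GET on 2 January can never return the 1 January quote, because that representation is no longer a member of the value set being negotiated over.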

I think the Memento approach suggests that on 2 January 2009, I could, using the date-time-based negotiation convention, offer all four of those variants listed above (and on each day into the future, a set which increases in membership as I add new quotes). But it seems to me that is at odds with the REST style, because the Memento approach requires that representations of different states of the resource (i.e. the state of the resource at different points in time) are all made available as representations at a single point in time.

I appreciate that (even if my interpretation is correct, which it may not be) the constraints specified by the REST architectural style are just that: a set of constraints which, if observed, generate certain properties/characteristics in a system. And if some of those constraints are relaxed or ignored, then those properties change. My understanding is not good enough to pinpoint exactly what the implications of this particular point of divergence (if indeed it is one!) would be - though as Herbert notes in his presentation, it would appear that there would be implications for caching.

But as I said, I'm really just trying to raise the questions which have been running around my head and which I haven't really been able to answer to my own satisfaction.

As an aside, I think Memento could probably achieve quite similar results by providing some metadata (or a link to another document providing that metadata) which expressed the relationships between the time-variant resource and all the time-specific variant resources, rather than seeking to manage this via HTTP content negotiation.

Postscript: I notice that, in the time it has taken me to draft this post, Mark Baker has made what I think is a similar point in a couple of messages (first, second) to the W3C public-lod mailing list.


Comments


(this note is also available at: http://www.cs.odu.edu/~mln/memento/response-2009-11-24.html in case some of the formatting gets munged)


This note is in response to Pete Johnston's blog post about Memento: http://efoundations.typepad.com/efoundations/2009/11/memento-and-negotiating-on-time.html

1. Enumeration of URIs of Archived Resources

First, we'll address your suggestion that a "Link" header could be used to achieve what we propose doing by using datetime content negotiation, and in doing so we ignore for the moment that the Link header seems to be stuck in a perpetual draft status. The difficulty of listing the URIs of archived versions and their associated metadata in a document that is pointed at by a URI in a Link response header is that it is based on an implicit assumption that the number of options (i.e. archived resources) is small. The same assumption underlies the Alternates header in RFC 2295, which expects there to be no more than 10 or so options. Indeed, there just aren't that many media types, languages, encodings and charsets, and hence the associated 4D matrix of options for a resource is generally very sparse. Obviously, this is not true for the time dimension. Looking forward, there will be a countably infinite number of options. Looking backward, there is a finite number of options, bounded above by roughly 20 * 365 * 24 * 60 * 60 (assuming the first web server was running in 1989, and given that the RFC 1123 date format only supports second granularity).
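For a sense of scale, the backward-looking bound can be worked out directly. The variable names below are purely illustrative, and leap days are ignored, as they are in the expression above.

```python
# One distinct archived state per second, for 20 years of 365 days each.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60   # 31,536,000
upper_bound = 20 * SECONDS_PER_YEAR

print(f"{upper_bound:,}")  # 630,720,000
```

That is several hundred million potential datetime options for a single resource, against the ten or so that an Alternates header is expected to carry.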

Obviously, most resources will not have that many archived resources associated with them. But, as Wikipedia version pages show, they might easily have several thousand. While it would be possible to dereference "/qotd/" (possibly with a HEAD) and parse such an extensive list that is pointed at by a URI in the "Link" response header in search of an archived resource that meets the client's datetime preference, this could turn out to be rather challenging. Also, even though it might be possible to do so, it would certainly not be an efficient (or arguably even elegant) approach. Rather, if the client knows it wants a 2005-12-13 version of "/qotd/", there is no reason to not make that desire known at request time. This is what happens in the existing 4 dimensions of CN, using Accept-* headers. We note that we do propose the use of the Link header, as a means to try and honor the mandate expressed in RFC 2295 that an Alternates header must list all available options. Given the number of options, this is not possible for the datetime dimension. Hence, our approach to resolving this problem is to list in the "Alternates" response header only a "few" archived resources that are temporal neighbors to the datetime preference expressed by the client (allowing the server some leeway in how many are a "few"), and to provide a "Link" header with a URI of a TimeMap (an ORE ReM listing all available archived resources). Hence, the list is available via the "Link" header. But for reasons explained above, we do not think the "Link" header is the most appealing approach to allow a client to figure which resource meets its preference.
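The "few temporal neighbors" idea can be sketched as a window around the requested datetime in a sorted list of archived versions. This is a rough illustration only: the function name `temporal_neighbours`, the choice of two neighbors on each side, and the example URIs are assumptions, and the sketch deliberately returns plain data rather than guessing at the exact syntax of the "Alternates" or "Link" header values.

```python
from bisect import bisect_left
from datetime import datetime


def temporal_neighbours(archived, preferred, few=2):
    """Return up to `few` versions on each side of `preferred`.

    `archived` is a list of (datetime, uri) pairs sorted by datetime.
    """
    times = [t for t, _ in archived]
    # Index of the first archived version at or after the preference.
    i = bisect_left(times, preferred)
    return archived[max(0, i - few):min(len(archived), i + few)]


# Ten hypothetical daily versions of "/qotd/".
archived = [
    (datetime(2009, 1, d), "/qotd/index.html.en.200901%02d" % d)
    for d in range(1, 11)
]

for when, uri in temporal_neighbours(archived, datetime(2009, 1, 5, 12)):
    print(when.date(), uri)
```

A server could list only these entries in its response and point to the full TimeMap separately, so the size of the response does not grow with the size of the archive.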

2. The resource that delivers the archived representation.

Just to make sure there is no misunderstanding about this (as there has been on another forum) we want to state that in our proposed datetime content negotiation scheme, the archived representation will never be delivered by the original resource. Rather it is requested via the original resource but delivered by another resource, i.e. an archived resource that has its own URI. See http://www.mementoweb.org/guide/http/local/ and http://www.mementoweb.org/guide/http/remote/ .

3. The "current" state issue

It has been suggested here, and in another forum, that it might be inappropriate to request an archived version of a resource via the original resource, because doing so would be in conflict with the definitive documents (W3C Web Architecture, RFC 2616) and with Roy Fielding's Dissertation (REST).

On the Linked Data list, Mark Baker formulates the problem as follows:

Quote: My claim is simply that all HTTP requests, no matter the headers, are requests upon the current state of the resource identified by the Request-URI, and therefore, a request for a representation of the state of "Resource X at time T" needs to be directed at the URI for "Resource X at time T", not "Resource X".

We can understand this concern, and will try and show that the definitive documents are rather less firm regarding the "current state" issue than Mark Baker is in the above. But before doing so, we would like to point out that it seems rather logical (and even essential) to us to involve the original resource in the attempt to get to its prior versions. After all, it is the URI of the original resource by which the resource has been known as it evolved over time. Hence, it makes sense to be able to use that URI to try and get to its past versions. And by "get", we don't mean search for it, but rather use the network to get there. After all, we all go by the same name irrespective of the day you talk to us. Or we have the same Linked Data URI irrespective of the day it is dereferenced. Why would we suddenly need a new URI when we want to see what the Linked Data description for any of us was, say, a year ago? Why must we prevent this same URI from helping us to get to prior versions?

But back to the authoritative documents. It is our impression that neither RFC 2616 nor the W3C Web Arch document really defines or enforces the notion of *current* state when it comes to the representation that is returned in response to a GET on a resource.

3.1 W3C Web Architecture

The W3C Web Arch document is agnostic about "current" state. See bullets 3 and 4 from http://www.w3.org/TR/webarch/#dereference-details

Quote:

Precisely which representation(s) are retrieved depends on a number of factors, including:
1. Whether the URI owner makes available any representations at
all;
2. Whether the agent making the request has access privileges for
those representations (see the section on linking and access
control (§3.5.2));
3. If the URI owner has provided more than one representation (in
different formats such as HTML, PNG, or RDF; in different
languages such as English and Spanish; or transformed
dynamically according to the hardware or software capabilities
of the recipient), the resulting representation may depend on
negotiation between the user agent and server.
4. The time of the request; the world changes over time, so
representations of resources are also likely to change over
time.

Assuming that a representation has been successfully retrieved, the
expressive power of the representation's format will affect how
precisely the representation provider communicates resource state.


3.2 RFC 2616

RFC 2616 is pretty open ended about choosing a representation for a resource (emphasis ours): ...

resource
A network data object or service that can be identified by a URI,
as defined in section 3.2. Resources may be available in multiple
representations (e.g. multiple languages, data formats, size, and
resolutions) or **vary in other ways.**

content negotiation
The mechanism for selecting the **appropriate representation**
when servicing a request, as described in section 12. The
representation of entities in any response can be negotiated
(including error responses).
...

12 Content Negotiation

Most HTTP responses include an entity which contains information for
interpretation by a human user. Naturally, it is desirable to supply
the user with the "best available" entity corresponding to the
request. Unfortunately for servers and caches, not all users have the
same preferences for what is "best," and not all user agents are
equally capable of rendering all entity types. For that reason, HTTP
has provisions for several mechanisms for "content negotiation" --
** the process of selecting the best representation for a given response
when there are multiple representations available.**

...

However, an origin server is not limited to these dimensions and
MAY vary the response based on any aspect of the request,
including information outside the request-header fields or
**within extension header fields not defined by this specification.**


3.3 Fielding's Dissertation

Returning to Fielding's dissertation, it admittedly depends on how you read it, but we think that at the very least it does not preclude Memento. Re-quoting some of the relevant bits:

... a resource R is a temporally varying membership function M_R(t), which for time t maps to a set of entities, or values, which are equivalent. The values in the set may be resource representations and/or resource identifiers.

and:

If the value set of a resource at a given time consists of multiple representations, content negotiation may be used to select the best representation for inclusion in a given message.

Returning to the "/qotd/" resource: If you view "/qotd/" *as* the string "From each..." @ t1, and then *as* the string "Those who..." @ t2, then you're not going to like the proposed Memento approach. If you view "/qotd/" more abstractly -- say as "pithy quotations from left-wing German philosophers" -- then you'll have no problem with "/qotd/" negotiating to different strings @ t1, t2, etc. So *if* you subscribe to this notion of abstractness of the resource, then negotiating in the established 4 dimensions as well as in a 5th dimension, time, should be acceptable.

We feel that the abstract perspective of a resource is supported rather strongly by Fielding's perspective, when he states:

The key abstraction of information in REST is a resource. Any information that can be named can be a resource: a document or image, a temporal service (e.g. "today's weather in Los Angeles"), a collection of other resources, a non-virtual object (e.g. a person), and so on. In other words, any concept that might be the target of an author's hypertext reference must fit within the definition of a resource. A resource is a conceptual mapping to a set of entities, not the entity that corresponds to the mapping at any particular point in time.

And later on:

This abstract definition of a resource enables key features of the Web architecture. First, it provides generality by encompassing many sources of information without artificially distinguishing them by type or implementation. Second, it allows late binding of the reference to a representation, enabling content negotiation to take place based on characteristics of the request. Finally, it allows an author to reference the concept rather than some singular representation of that concept, thus removing the need to change all existing links whenever the representation changes (assuming the author used the right identifier).

So, assuming you buy into the 5 content negotiation dimensions, "/qotd/" will negotiate to these representations (each of which is a stand-alone resource in its own right):

/qotd/index.html.de.20090101
/qotd/index.html.de.20090102
/qotd/index.html.en.20090101
/qotd/index.html.en.20090102
/qotd/index.pdf.de.20090101
/qotd/index.pdf.de.20090102
/qotd/index.pdf.en.20090101
/qotd/index.pdf.en.20090102


(Note that the above notions of time (i.e., in the "X-Accept-Datetime" dimension) are different from the value that might be expressed in "Last-Modified". For example, I could fix a misspelling in a .20090101 version today, thus changing its modification time w/o changing its "X-Accept-Datetime" value.)

Which one to return when dereferencing "/qotd/"? Most people's browsers probably make this explicit, but even if they didn't you'd probably get the .html.en version as the default. We'll go further and say that for datetime you should get the most current version, .html.en.20091124 for example, as the default value when no datetime preference is specified. That is, in the absence of Memento headers, everything should operate as it normally does. In the future this should be formalized in a variant selection algorithm, such as those at:

http://www.ietf.org/rfc/rfc2296.txt
http://httpd.apache.org/docs/2.3/content-negotiation.html

On the other hand, if you intended to get another representation of that resource (e.g., .pdf.de.20090531), then you need to be explicit about your preferences in the request headers.
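The default-versus-explicit behaviour can be sketched as a toy selection function over the eight example URIs listed earlier. This is not one of the formal variant selection algorithms referenced above: the function name `select_variant` is hypothetical, and the rule "pick the latest version at or before the requested day, else the earliest one" is an assumption made purely for illustration.

```python
# The eight variants of "/qotd/" as (format, language, day) tuples.
VARIANTS = [
    (fmt, lang, day)
    for fmt in ("html", "pdf")
    for lang in ("de", "en")
    for day in ("20090101", "20090102")
]


def select_variant(fmt="html", lang="en", day=None):
    """Pick one variant URI; unspecified preferences fall back to defaults."""
    candidates = [v for v in VARIANTS if v[0] == fmt and v[1] == lang]
    if day is None:
        # No datetime preference: behave as usual, serve the most current.
        chosen = max(candidates, key=lambda v: v[2])
    else:
        # Assumed rule: latest version not after the requested day,
        # falling back to the earliest version if none qualifies.
        older = [v for v in candidates if v[2] <= day]
        if older:
            chosen = max(older, key=lambda v: v[2])
        else:
            chosen = min(candidates, key=lambda v: v[2])
    return "/qotd/index.%s.%s.%s" % chosen


print(select_variant())                         # default preferences
print(select_variant("pdf", "de", "20090101"))  # fully explicit request
```

With no preferences at all the function returns the current English HTML version, which mirrors the statement that nothing changes for clients that send no Memento headers.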

We believe the above section also addresses Mark Baker's concerns about what "Accept-" headers negotiate to:

http://lists.w3.org/Archives/Public/public-lod/2009Nov/0138.html
http://lists.w3.org/Archives/Public/public-lod/2009Nov/0140.html

To reiterate, in transparent content negotiation (i.e., RFC 2295), "Accept-" headers always end up negotiating from one URI to another URI. This is the purpose of the "Alternates" header: to enumerate and make transparent these URIs. While it is possible to implement content negotiation non-transparently (with anonymous representations, e.g. a cgi script that chooses a representation depending on Accept- headers), this 1) is not considered good practice (it is dangerously close to "cloaking"!), and 2) limits content negotiation to representations available on the same server (i.e., you can't negotiate from URI1 on server1 to URI2 on server2); this is not a limitation with content negotiation done with 302s.
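A rough sketch of negotiation by redirect, under several stated assumptions, might look as follows. The function `timegate_response`, the host `archive.example.net` and the archive URI layout are all hypothetical; the sketch also assumes an RFC 1123 style value in X-Accept-Datetime and omits the "Alternates" and "Link" headers discussed earlier.

```python
from email.utils import parsedate_to_datetime


def timegate_response(request_uri, headers,
                      archive_base="http://archive.example.net"):
    """Return (status, response_headers) for a request on the original URI."""
    wanted = headers.get("X-Accept-Datetime")
    if wanted is None:
        # No datetime preference: the original resource responds as usual.
        return 200, {}
    # Hypothetical layout: <archive>/<YYYYMMDD><original path>.
    stamp = parsedate_to_datetime(wanted).strftime("%Y%m%d")
    # Redirect to a separate archived resource, possibly on another server.
    return 302, {"Location": f"{archive_base}/{stamp}{request_uri}"}


print(timegate_response(
    "/qotd/", {"X-Accept-Datetime": "Thu, 01 Jan 2009 00:00:00 GMT"}
))
```

Because the response is a redirect to a distinct URI, the archived representation is delivered by the archived resource rather than by "/qotd/" itself, and that resource may live on a different server.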

We note that the first quote from Fielding's thesis is honored (i.e. not violated) by each of the resources that actually deliver representations in this scheme. And the second quote applies to the resource that is being negotiated on. In cases where that negotiable resource does not deliver any representations itself (the case of TimeGates for servers without internal archival capabilities), the first quote of Fielding applies in the sense that the set of values is (and always will be) empty as there are no representations. In cases where that negotiable resource does deliver a representation (the case of the TimeGate that coincides with the original resource for servers with internal archival capabilities), the representation that is served is always that of the current state, and hence the first quote of Fielding is honored by a set that contains this representation.

It helps to think of it like this: instead of "/qotd/" negotiating to "secondary" URIs such as:

/qotd/index.html.en.20090102
/qotd/index.pdf.de.20090101


it is better to think of the above URIs as primary, and "/qotd/" as the URI that is introduced to glue those more explicit URIs together. In a sense, "/qotd/" doesn't exist with its own representation, it just negotiates to another URI that does have a representation (although this could depend on how "/qotd/" is implemented).

In this sense, "/qotd/" perhaps has more in common with Linked Data "non-information resources" than with conventional document resources. If viewed on a continuum:

* resource
* transparently negotiable resource *
* non-information resource

Although perhaps this is a discussion for a different time...

* also called "content-type-generic" in http://www.w3.org/Provider/Style/URI

4. Summary

So ultimately, Memento will hinge on your comfort level with the notion of time as a negotiable dimension and the nature of resources. Of course, documents like the W3C Web Arch and RFC 2616 don't explicitly address this (otherwise we would not have had to introduce it), but we don't think they explicitly deny it either. As such, we think Datetime works nicely as a fifth dimension for Content Negotiation.

Michael, Herbert & Rob.
