[rfc-i] Can the web be archived?
phill at hallambaker.com
Thu Jan 22 07:43:10 PST 2015
The 'dead links' problem is a consequence of the 'scruffy links' approach
that made the Web possible in the first place.
Remember that in 1992 the Web lacked 80% of what 'everyone' believed was
'essential' in network hypertext. It turned out that search was better as
an add-on. I think archiving is another example of something better
provided as an add-on.
We can fix this: everything described in the attached proposal could be
implemented and deployed in a few months. There is a co-dependency problem
in that browser providers and sites need to co-operate a little. But the
synchronization required is really very small.
One way to fix the problem is with a 'Way Back' machine interface in the
browser. Most browsers have some sort of support for 'safe browsing' where
it pings some service to check to see if a link would take the user to a
malware site. We could beef that up.
Rather than having the site author register pages, this is something that
the archive service would enable automatically based on user behavior.
When a dead link is encountered, the browser would say '404 not found, but
here is a copy I took earlier'. There would be some sort of interface to
enable the user to pick the vintage of the link but by default it would be
the one closest in time to the last update time of the page.
This approach solves the problem in the large, but it requires
infrastructure provider support. The extra work required would be huge
unless you were already a search engine provider, in which case it is a
small incremental cost.
This approach can be further improved with HTML extensions to allow the
author to distinguish links to static documents from links to dynamic
resources. This does require the publisher to assist the process, but it
removes the reliance on infrastructure. Everything can be done by just the
publisher and the browser provider.
Consider the following:
<p><a href="http://ietf.org/rfc/rfc20">[RFC 20]</a> was mentioned on
<a href="http://cnn.com/">CNN</a> this morning!</p>
The first is a link to a static resource but the second is not.
A good starting point would be a <meta> tag in the HTML to say 'this is a
final static document'. But that can be problematic unless the meta tag is
inserted automatically by the publishing mechanism itself.
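A minimal sketch of such a tag, using a hypothetical metadata name
('document-status' is invented here for illustration, not a standardized
value):

```html
<!-- Hypothetical: declares that this page is final and will not change,
     so archivers and browsers may cache it indefinitely -->
<meta name="document-status" content="final-static">
```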
A better approach would be to sign the document; a signature does fix the
content in time.
Another approach (probably complementary) is to modify the link tag so that
it specifies both a name and a locator:
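A sketch of what such a dual link might look like, assuming a hypothetical
attribute (here 'name') carrying an RFC 6920 ni URI alongside the ordinary
locator; the digest is a placeholder, not the real hash of RFC 20:

```html
<!-- Hypothetical markup: href gives the location, the ni URI names the
     content by its SHA-256 digest (placeholder value) -->
<a href="http://ietf.org/rfc/rfc20"
   name="ni:///sha-256;PLACEHOLDER-DIGEST">[RFC 20]</a>
```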
Note the use of the ni scheme here, which is good stuff. The ni URI could
have an authority added to specify one download point for the document. So
let's say the reference was in a W3C document. They would probably run
their own local service:
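Continuing the sketch (again hypothetical markup with a placeholder
digest), an ni URI with an authority might look like this:

```html
<!-- The authority component (www.w3.org) names one service willing to
     serve the content identified by the digest -->
<a href="http://ietf.org/rfc/rfc20"
   name="ni://www.w3.org/sha-256;PLACEHOLDER-DIGEST">[RFC 20]</a>
```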
Generating 'hard links' to specific versions of the documents via digest is
the sort of thing that can be done automatically by the publishing engine.
One of the defects in XML, inherited from SGML, is that each attribute can
occur only once on an element. Having only one resource linked to a
document never made much sense to me. If an image is available in JPEG and
PNG then it might well be on two different servers.
This could be fixed by giving each <a> tag a unique ID and then having a
concordance section at the end giving alternative anchors for those links:
<a href="http://ietf.org/rfc/rfc20" id="autogen12">[RFC 20]</a>
The concordance would be given right at the end so as to minimize impact on
the page download. So while this is definitely metadata it is not data we
want in the <head>.
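A sketch of what such a concordance might look like, using invented
element names ('concordance', 'alt') purely for illustration:

```html
<!-- Placed just before </body> so it does not delay page rendering.
     Each entry lists an alternative anchor for the link with that id. -->
<concordance>
  <alt for="autogen12" href="https://mirror.example.net/rfc/rfc20" />
  <alt for="autogen12" href="ni:///sha-256;PLACEHOLDER-DIGEST" />
</concordance>
```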
I wish folk had been willing to spend a tenth of the time that went into
whizz-bangifying HTML as a presentation language on fixing the structural
part.