[rfc-i] "canonical" URI for RFCs, BCPs

Alfred Hönes ah at TR-Sys.de
Mon Feb 1 17:12:00 PST 2010


Folks,
I got the impression that the discussion on this thread suffers from
a lack of precision in the use of terms, and from overloading of
terms -- always a perfect trigger condition for never-ending debates.

Let me try to cut this almost Gordian Knot into pieces.

For the purpose of exposition in this posting, I purposely
distinguish between the more general term 'URI' (with emphasis
on 'I' = Identifier) and the traditional term 'URL', indicating a
precise Locator for a network accessible instance of a particular
resource.

IMO, Julian is not seeking a 'canonical' URI; he is seeking a
*generic* URI for each RFC, BCP, STD, FYI -- 'generic' in the sense
that it can be generated algorithmically, or even guessed, without
looking at the details.
Apparently, Joe essentially wants to talk about "canonical URLs".

As will be elaborated upon below, these two terms are very
different and should be kept apart; so it comes as no surprise
that the crossover form "canonical URI" used by both leads to
confusion and endless debates.

Based on its etymology and historical use over two millennia,
"canonical" designates a distinguished, authoritative form or
instance to which all derivative forms/copies should compare well.
"Canonicalization" is a complicated matter that usually cannot
be implemented in a 1-line algorithm (cf. the various Unicode
Normal Forms), and it needs to take into account many details.
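
To illustrate why canonicalization takes more than a naive string
comparison, here is a minimal Python sketch using the standard
`unicodedata` module.  The sample strings are chosen only for
illustration; the point is that two spellings that render identically
compare equal only after both are brought into the same normal form.

```python
import unicodedata

# Two spellings of the same name: precomposed U+00F6 versus
# "o" followed by U+0308 (combining diaeresis).
precomposed = "H\u00f6nes"
decomposed = "Ho\u0308nes"

# The raw code point sequences differ, although they render identically.
raw_equal = precomposed == decomposed

# After canonicalization to the same normal form they compare equal.
nfc_equal = (unicodedata.normalize("NFC", precomposed)
             == unicodedata.normalize("NFC", decomposed))
nfd_equal = (unicodedata.normalize("NFD", precomposed)
             == unicodedata.normalize("NFD", decomposed))

# The compatibility forms go further: NFKC folds the ligature U+FB01
# into the two letters "fi", whereas NFC leaves it untouched.
ligature = "\ufb01le"
nfc_ligature = unicodedata.normalize("NFC", ligature)
nfkc_ligature = unicodedata.normalize("NFKC", ligature)

print(raw_equal, nfc_equal, nfd_equal, nfkc_ligature)
```

Which normal form counts as "the" canonical one depends on the
application, which is exactly the kind of detail a 1-line algorithm
tends to gloss over.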

The RFC Editor most likely has valid reasons to indicate on the
.../info/rfcnnnn  pages the "canonical URL" of the corresponding
document as pointing to the "authoritative" version of the document.
Note they do _not_ say "canonical URI", they say "... URL".

These pages have been produced rather quickly during the final
stages of evolution of the draft that led to RFC 5741, and,
as we all know, rapid prototyping is not going to be perfect.
There are various artifacts on the /info/ pages that need to be
found and weeded out now, after the transition -- but give them some
time to catch up with the Queue that still suffers from the RFC 5378
induced backlogs and gets fuelled again heavily these days by a busy
IESG working towards the Spring IETF meeting (a common seasonal
effect that can be observed every year).
I would classify the current practice of giving a .txt URL (pointing
to a 'redirection' .txt file) as the "canonical URL" for those few
.PS-only RFCs as just such a flaw, one that will be fixed.

OTOH, as John Levine has pointed out, giving the file extension
in a URL is deemed useful for many purposes.  If not, why would
the file extension(s) have been a part of media type registrations
ever since these began?
The most important benefit (if managed properly) of a file extension
in the URL is the unambiguous indication of what to expect.
If I'm working on a low-bandwidth dial-up line, or if I want to
perform `grep`s and `diff`s (etc.) on the picked up document, or
excerpt IANA registrations not properly escrowed and filed at the
IANA site, and so on, I really do not want to get a .PDF scan of a
dusty, yellowed paper document of megabyte++ size, or a .PS file
(some of those are hardcoded to unexpectedly large paper formats and
do not print properly).  However, if I'm looking for weird details in a
typed-in early RFC that raise suspicion of severe clerical error, I
will need the .PDF format scan of the paper document for comparison
and to identify the flaws in the .txt file.
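
As a rough sketch of how a client could use the extension to know what
to expect before fetching anything, consider the following Python
fragment.  The function name, the small lookup table, and the
example.org URLs are assumptions made for illustration only; they do
not describe how the RFC Editor site is actually laid out.

```python
import posixpath
from urllib.parse import urlsplit

# Illustrative subset of extension-to-media-type pairs for the formats
# discussed here; a real client would consult a fuller table.
EXTENSION_TO_MEDIA_TYPE = {
    ".txt": "text/plain",
    ".pdf": "application/pdf",
    ".ps": "application/postscript",
}


def expected_media_type(url):
    """Return the media type suggested by the URL's file extension,
    or None if the path carries no known extension."""
    path = urlsplit(url).path
    extension = posixpath.splitext(path)[1].lower()
    return EXTENSION_TO_MEDIA_TYPE.get(extension)


# Hypothetical URLs, used only to exercise the function.
print(expected_media_type("http://example.org/rfc/rfc5741.txt"))
print(expected_media_type("http://example.org/rfc/rfc5741.PDF"))
print(expected_media_type("http://example.org/info/rfc5741"))
```

A URL without an extension yields None here, which matches the idea
that such a URL makes no promise about the representation returned.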

These details of the resource representation are of much less concern
for a "generic URI" leading to metadata of the documents proper.
IIRC, even DOI and scientific citation servers (quoted on this thread
a couple of times) give "typed" URLs for the various presentation
formats of the documents proper they refer to, as far as available,
with the media type shown by the file extension.

IMO, the  .../info/rfcnnnn  pages properly support the desires
of various users, in giving the "canonical URL" to the authoritative
representation of the documents and additional links to alternative
forms of the documents, each clearly indicating the media type to be
expected by clicking on the corresponding link -- of course, assuming
that in the long term, any remaining errors will be fixed.

Other SDOs serve archived documents in Postscript, PDF, M$ Word,
Word Perfect or TXT format, and/or related tables or machine-
readable excerpts or annexes (e.g., MIBs / PIBs, other ASN.1,
some XML, etc.) and/or collections of documents assembled into
ZIP archives or gzip-compressed tarballs, and they do that regularly
with 'expressive' URLs indicating the media type via the registered
file extension, and they make use of "guessable", neutral, generic
URIs for navigable access to information about particular documents,
which eventually hold these specific URLs -- cf. the ITU-T, IEEE 802,
ETSI, and many other freely available standards sites.
I assume that such sites, and the RFC Editor site as well, are capable
of making correct use of file extensions, and I do not see any reason
why the RFC Editor site should not use proper file extensions in URLs
as well.


Thus, summing up:

I conclude that Julian is looking for a "generic URI" that can
easily be generated algorithmically and/or guessed.
To avoid confusion, I recommend not using the adjective "canonical"
(i.e. "in unique, authoritative form") in this context.  I agree
that these generic URIs for RFCs (etc.) should not carry a file
extension, in order to emphasize the generic nature of the resource
(document metadata) to which such URI is expected to resolve.
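
What "generated algorithmically" could mean in practice is sketched
below in Python.  The /info/ path pattern follows the pages mentioned
in this posting; the base host (example.org), the function name, and
the choice to drop leading zeros from the number are assumptions made
only for the sake of the example.

```python
import re

# Accepts citations such as "RFC 5741", "bcp78" or "STD-5".
_CITATION = re.compile(r"\s*(rfc|bcp|std|fyi)[\s-]*(\d+)\s*", re.IGNORECASE)


def generic_uri(citation, base="http://example.org/info/"):
    """Build an extension-less generic URI from a document citation.

    The default base is a placeholder, not a real service.
    """
    match = _CITATION.fullmatch(citation)
    if match is None:
        raise ValueError("not an RFC/BCP/STD/FYI citation: %r" % citation)
    series = match.group(1).lower()
    number = int(match.group(2))  # assumption: leading zeros are dropped
    return "%s%s%d" % (base, series, number)


print(generic_uri("RFC 5741"))
print(generic_uri("bcp78"))
```

Because the result depends only on the series name and the number,
it can also be guessed by hand, and it says nothing about the format
of any particular representation of the document.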

However, I concur with the RFC Editor that the term "canonical URL"
is most properly used for a URI that can be directly resolved to
the authoritative presentation of the document.  If the document
is invariant (as, e.g., RFCs are), that URL should not change and
should always resolve to the same authoritative presentation, as used
at the time of creation of the document.  If the resource is expected
to vary over time (e.g., BCPs or STDs, which might be replaced or
amended over time), longevity of the pointed-to resource should not
be expected, and if at some point the IETF decided to publish the
current STDs and/or BCPs in a format other than .txt, these resources
and the related URLs could be expected to change accordingly.  For the
canonical URLs and the URLs used for alternative representations
of archival documents, IMO use of the extensions registered with
IANA for the respective media types is very useful and appreciated.

I would appreciate it if the RFC access and search pages on the
RFC Editor web site were amended to point to the "generic URI"
for the documents in order to provide the easiest access to
all relevant metadata, including RFC Errata (which apparently
are still getting missed by developers); optionally, the
"canonical URL" might be accessible there as well.
The traditional  .../cgi-bin/rfcdoctype.pl?... style URLs there
that eventually lead to a redirection to the "generic URI" of the
RFC seem less useful and most likely to cause some confusion, now
that the  /info/rfcnnnn  pages are available.
Additionally,  /info/stdxxx , /info/bcpnnn , and  /info/fyimmm
pages would be a welcome addition.


Kind regards,
  Alfred Hönes.

-- 

+------------------------+--------------------------------------------+
| TR-Sys Alfred Hoenes   |  Alfred Hoenes   Dipl.-Math., Dipl.-Phys.  |
| Gerlinger Strasse 12   |  Phone: (+49)7156/9635-0, Fax: -18         |
| D-71254  Ditzingen     |  E-Mail:  ah at TR-Sys.de                     |
+------------------------+--------------------------------------------+


