[rfc-i] issue: canonical formats

John Levine johnl at taugh.com
Thu May 31 19:36:05 PDT 2012

As I read the wiki page, I see notes about a canonical source version
and a canonical display version.  I would like to suggest that there
be only one canonical version, and whatever it is, it should be a
source version that has structure and metadata at roughly the level that
xml2rfc does.

That is, I would like a canonical version that makes it possible to
mechanically (i.e., using algorithms, not heuristics) identify that
this part is the abstract, that part is a paragraph of text, and this
other part is the second author's postal code.  It doesn't have to be
xml2rfc; a constrained HTML or XHTML subset could do the job.
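
To make "mechanically" concrete, here is a rough sketch in Python.  The
sample document is made up, and its element names (front, author,
postal, code, abstract, t) are my assumption of an xml2rfc-like
vocabulary, not a proposal for the actual format.  The extract()
helper is likewise hypothetical.  The point is only that fixed paths
find each part, with no guessing from layout.

```python
import xml.etree.ElementTree as ET

# Made-up sample in an assumed xml2rfc-like vocabulary (illustration only).
SAMPLE = """\
<rfc>
  <front>
    <title>An Example Document</title>
    <author fullname="Alice Example">
      <address><postal><code>12345</code></postal></address>
    </author>
    <author fullname="Bob Example">
      <address><postal><code>67890</code></postal></address>
    </author>
    <abstract><t>This is the abstract.</t></abstract>
  </front>
  <middle>
    <section title="Introduction">
      <t>First paragraph of text.</t>
    </section>
  </middle>
</rfc>
"""


def extract(source):
    """Locate parts of the document by structure alone (hypothetical helper)."""
    root = ET.fromstring(source)
    # The abstract is whatever paragraphs sit under <front>/<abstract>.
    abstract = [t.text for t in root.findall("./front/abstract/t")]
    # Body paragraphs are the <t> elements anywhere under <middle>.
    paragraphs = [t.text for t in root.findall("./middle//t")]
    # Authors appear in document order, so the second author is index 1.
    authors = root.findall("./front/author")
    second_code = authors[1].findtext("./address/postal/code")
    return {
        "abstract": abstract,
        "paragraphs": paragraphs,
        "second_author_postal_code": second_code,
    }


if __name__ == "__main__":
    for key, value in extract(SAMPLE).items():
        print(key, value)
```

Doing the same against a paginated text or PDF rendering would mean
guessing from indentation and position, which is exactly the sort of
heuristic I'd like to avoid.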

The problem with output formats is that they are simultaneously
overconstrained and undertagged.  Something like PDF/A prints nicely,
but it is full of stuff like fonts and line and page breaks that
aren't relevant to the semantics of the document, while missing the
metadata about the abstract and the postcode.

So I'd rather the canonical form be one that carries metadata but
doesn't attempt to do layout, and any other derived format is correct
to the extent that it correctly represents the contents of the
canonical version.  (This is not all that unusual.  Try figuring out
which of the
umpteen translations of a European Union law or regulation is the
canonical one.)

For long term stability, I'd also want the canonical format to be well
specified, and to be something a reasonably motivated person can
interpret without complex tools.  So XML or HTML, which you can look
at in any text editor and visually identify the text and the markup,
would be better than, say, PostScript, which you can look at in the
editor, but typically can't decode the text without running a lot of
code in your head, or PDF or Word, which need a hex editor if you
don't happen to have a rendering engine handy.

