[rfc-i] issue: canonical formats

Paul Hoffman paul.hoffman at vpnc.org
Fri Jun 1 08:42:38 PDT 2012

On May 31, 2012, at 7:36 PM, John Levine wrote:

> As I read the wiki page, I see notes about a canonical source version
> and a canonical display version.  I would like to suggest that there
> be only one canonical version, and whatever it is, it should be a
> source version that has structure and metadata at roughly the level that
> xml2rfc does.
>
> That is, I would like a canonical version that makes it possible to
> mechanically (i.e., using algorithms, not heuristics) identify that
> this part is the abstract, that part is a paragraph of text, and this
> other part is the second author's postal code.  It doesn't have to be
> xml2rfc; a constrained HTML or XHTML subset could do the job.
>
> The problem with output formats is that they are simultaneously
> overconstrained and undertagged.  Something like PDF/A prints nicely,
> but it is full of stuff like fonts and line and page breaks that
> aren't relevant to the semantics of the document, while missing the
> metadata about the abstract and the postcode.
>
> So I'd rather that a form with metadata but that doesn't attempt to do
> layout be canonical, and any other derived format is correct to the
> extent that it correctly represents the contents of the canonical
> version.  (This is not all that unusual.  Try figuring out which of the
> umpteen translations of a European Union law or regulation is the
> canonical one.)
>
> For long term stability, I'd also want the canonical format to be well
> specified, and possible for a reasonably motivated person to interpret
> without complex tools.  So XML or HTML, which you can look at in any
> text editor and visually identify the text and the markup, would be
> better than, say, Postscript, which you can look at in the editor, but
> typically can't decode the text without running a lot of code in your
> head, or PDF or Word, which need a hex editor if you don't happen to
> have a rendering engine handy.

+1 to all that. Although my current proposal has the canonical format as displayable, I think this proposal for the canonical format is better.
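
To make the "algorithms, not heuristics" point concrete, here is a minimal sketch of what mechanical identification could look like against an xml2rfc-style source. The element names (front, abstract, t, author, address, postal, code) follow the xml2rfc vocabulary; the sample document, the extract() helper, and the keys of the returned dictionary are invented for illustration and are not part of any proposal.

```python
import xml.etree.ElementTree as ET

# Hypothetical xml2rfc-style source. The element names follow the xml2rfc
# vocabulary; the content itself is made up for this example.
SAMPLE = """\
<rfc>
  <front>
    <title>An Example Document</title>
    <author fullname="First Author">
      <address><postal><code>12345</code></postal></address>
    </author>
    <author fullname="Second Author">
      <address><postal><code>94043</code></postal></address>
    </author>
    <abstract><t>This is the abstract.</t></abstract>
  </front>
  <middle>
    <section title="Introduction">
      <t>This is a paragraph of text.</t>
    </section>
  </middle>
</rfc>
"""


def extract(source):
    """Pull out the abstract, body paragraphs, and second author's postal code.

    Every lookup is a fixed path into the tree; nothing depends on fonts,
    indentation, or line and page breaks.
    """
    root = ET.fromstring(source)
    authors = root.findall("./front/author")
    return {
        "abstract": [t.text for t in root.findall("./front/abstract/t")],
        "paragraphs": [t.text for t in root.findall("./middle//t")],
        "second_author_postal_code": authors[1].findtext("./address/postal/code"),
    }


if __name__ == "__main__":
    for key, value in extract(SAMPLE).items():
        print(key, "=", value)
```

The same three lookups against a paginated text or PDF rendering would need guesses about layout, which is the undertagging problem described above.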

--Paul Hoffman
