[rfc-i] RFC editing tools

Ted Lemon mellon at fugue.com
Fri Dec 7 12:09:24 PST 2012


On Dec 7, 2012, at 2:52 PM, Paul Hoffman <paul.hoffman at vpnc.org> wrote:
>> But then there's the further complexity that once the HTML has been successfully transformed into a tree structure, the parser has to groom the tree structure, examining each element to see if it requires special parsing of the enclosed text and then, if so, doing that special parsing, for which there is no grammar—it's just free-form text.   Once this is done, we now have a tree structure containing the document which can be processed and spat out in a different form.
> 
> ...and this is identical to XML, yes?

No, I seem to be failing to communicate. Joe's proposed format has metadata that's not part of the grammar, e.g., the subject headings and the inter-section links. So with XML you parse the XML and you have a tree. With HTML, you parse the HTML into a tree, and then you parse the text nodes in the tree for more metadata. This is hard to get right, and easy to get wrong.
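
To make the difference concrete, here is a minimal Python sketch. Both markup samples are invented for illustration (neither is Joe's proposed format nor xml2rfc), and the function names and regular expressions are likewise assumptions. In the first sample the section number and the cross-reference are part of the markup; in the second they have to be recovered from free-form text.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical structured markup: section numbers and cross-references
# are carried by attributes, so the parsed tree already holds the metadata.
STRUCTURED = """
<doc>
  <section anchor="sec-3.1" number="3.1" title="Security Considerations">
    <p>See <xref target="sec-2"/> for background.</p>
  </section>
</doc>
"""

# Hypothetical HTML-style markup: the same metadata lives inside text nodes.
TEXTUAL = """
<html><body>
  <h2>3.1. Security Considerations</h2>
  <p>See Section 2 for background.</p>
</body></html>
"""


def structured_metadata(source):
    """One pass: parse the markup, then read the metadata off the tree."""
    root = ET.fromstring(source)
    sections = [(s.get("number"), s.get("title")) for s in root.iter("section")]
    links = [x.get("target") for x in root.iter("xref")]
    return sections, links


# Ad hoc patterns for the second pass. No grammar backs these up.
HEADING = re.compile(r"^(\d+(?:\.\d+)*)\.\s+(.*)$")
SECTION_REF = re.compile(r"\bSection\s+(\d+(?:\.\d+)*)")


def textual_metadata(source):
    """Two passes: parse the markup, then parse the text nodes again."""
    root = ET.fromstring(source)
    sections, links = [], []
    for heading in root.iter("h2"):
        match = HEADING.match(heading.text or "")
        if match:
            sections.append((match.group(1), match.group(2)))
    for para in root.iter("p"):
        links.extend(SECTION_REF.findall("".join(para.itertext())))
    return sections, links
```

With these samples both functions recover the same section number and title, but the second one silently returns nothing for a heading written as "3.1 Security Considerations" (no trailing dot), because the text conventions are enforced only by the regular expressions.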

Yes, there are complexities in the xml2rfc format. I am not fond of the xml2rfc format, and would not propose that it be the canonical format. But some format that can be parsed into a tree by a machine is the right format.

I think having that format look as much as possible like HTML is a win, but I would like it to be regular like XML, not special like HTML. Yes, there are several HTML parsers that work reasonably well, but part of what makes them work reasonably well is that they are overly permissive, and this has caused serious and well-documented problems in the past.
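
A small Python sketch of what "overly permissive" means in practice. The malformed input and the helper names are made up for illustration, and Python's html.parser is only a lenient tokenizer rather than a full browser-grade HTML parser, but it shows the contrast: the lenient parser accepts unclosed tags without complaint, while a strict XML parser rejects the same input.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Illustrative input: two <p> elements that are never closed.
MALFORMED = "<section><p>First paragraph<p>Second paragraph</section>"


class TagLogger(HTMLParser):
    """Record start and end tags in the order the lenient parser sees them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def lenient_events(source):
    """Tokenize with html.parser, which raises no error on unclosed tags."""
    parser = TagLogger()
    parser.feed(source)
    parser.close()
    return parser.events


def strict_parse_ok(source):
    """Return True only if the input is well-formed XML."""
    try:
        ET.fromstring(source)
        return True
    except ET.ParseError:
        return False
```

Here lenient_events(MALFORMED) reports two "p" start tags and no "p" end tags, leaving the caller to guess where each paragraph ends, whereas strict_parse_ok(MALFORMED) is False and the author finds out about the mistake immediately.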

The problems with xml2rfc stem from two factors. The first is that the tag names are gratuitously different from HTML's, something that a number of folks here have agreed is a problem worth fixing. The second is that it depends on DTD validation, a very restrictive form of validation that has produced an overly stilted syntax with a lot of usability problems. I don't think that the format we use should necessarily be validated by a DTD; it would be better to use a W3C schema validator or RELAX NG.

I'm not specifically advocating XML; I'd be perfectly happy with LISP sexprs. But I suspect XML is a more practical choice. I don't think HTML is a practical choice.


