[rfc-i] RFC editing tools

Joe Hildebrand (jhildebr) jhildebr at cisco.com
Sun Dec 9 13:10:40 PST 2012


On 12/9/12 8:45 AM, "Ted Lemon" <mellon at fugue.com> wrote:

>>The tooling is pretty trivial to fix all of that up.
>
>Yes, of course, it's a simple matter of programming.   I am not saying
>that what you propose is impossible; merely arguing that it's not the
>best solution.

Understood, but in this case, a) it really was simple code, and b) it's
already been written, so arguing it's hard doesn't make sense.

>> I don't know why you would ever edit section numbering by hand, even in
>> WYSIWYG mode.
>
>It ought to be possible to edit the XML or HTML source in a text editor.
>If section numbering is in the canonical form of the document, that's
>suddenly a whole lot harder.

I typed this:

<div class='section' id='md-ipr'>
  <h3>IPR Statements</h3>

The tooling turned it into this:

<h3><a class='self-ref' href='#md-ipr'>4.4.</a> IPR Statements</h3>

The version I submitted was nicely linked in to the TOC, with a correct
section number.  When I added a section above this, the new number got put
in place of 4.4.
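
To make that concrete, here is a minimal sketch of the idea, not the
actual tooling: the number comes from the section's position, so nobody
edits it by hand.  The helper names and the sibling ids other than md-ipr
are made up for illustration.

```javascript
// Hypothetical helper: build the generated heading for one section.
// No HTML escaping is done; the title is assumed to be safe text.
function renderHeading(id, number, title) {
  return "<h3><a class='self-ref' href='#" + id + "'>" + number + "</a> " + title + "</h3>";
}

// Hypothetical helper: number sibling sections by their position under a
// parent prefix such as "4.".
function renderSiblings(prefix, sections) {
  return sections.map((s, i) => renderHeading(s.id, prefix + (i + 1) + ".", s.title));
}

// Placeholder siblings; only md-ipr comes from the example above.
const before = [
  { id: "md-one", title: "One" },
  { id: "md-two", title: "Two" },
  { id: "md-three", title: "Three" },
  { id: "md-ipr", title: "IPR Statements" },
];
console.log(renderSiblings("4.", before)[3]);
// <h3><a class='self-ref' href='#md-ipr'>4.4.</a> IPR Statements</h3>

// Adding a section above shifts the number without any hand editing.
const after = [{ id: "md-new", title: "New" }, ...before];
console.log(renderSiblings("4.", after)[4]);
// <h3><a class='self-ref' href='#md-ipr'>4.5.</a> IPR Statements</h3>
```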

>> I agree that the current form blurs presentation and representation, and
>> I'm open to other HTML representations.  However, this doesn't seem *that*
>> complex a regular expression in practice:
>> 
>> /^(Appendix [A-Z]+\.)?([\d\.]+)?\s+/
>
>Okay, so what's the parsing/validation process?   Let's walk through it:

The code is here:

https://github.com/IETF-Formatters/html-rfc/blob/master/nits/toc.js


It's a prototype, so please excuse the lack of comments.
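
If you'd rather not read the prototype, here is a minimal sketch of how
the regular expression quoted above can be applied.  It is not taken from
toc.js; stripNumber is a made-up name, and the whitespace normalization
is my assumption about a reasonable first step.

```javascript
// The regular expression quoted above, unchanged.
const NUMBER_PREFIX = /^(Appendix [A-Z]+\.)?([\d\.]+)?\s+/;

// Hypothetical helper: collapse whitespace runs, trim the ends, then drop
// any generated "4.4." or "Appendix A." prefix from a heading's text.
function stripNumber(headingText) {
  const normalized = headingText.replace(/\s+/g, " ").trim();
  return normalized.replace(NUMBER_PREFIX, "");
}

console.log(stripNumber("4.4. IPR Statements"));     // IPR Statements
console.log(stripNumber("Appendix A.  Change Log")); // Change Log
console.log(stripNumber("IPR Statements"));          // IPR Statements
```

One caveat with this expression: a title that itself begins with digits
followed by a space (say "3 Reasons") would lose that leading number too.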

>1. Validate the XML using W3C schema or similar
>2. Parse the XML into a DOM.

That's pretty straightforward in an HTML processing environment.

>3. Recursively descend the DOM, looking for nodes that require special
>case handling.
>4. For each such node, look for a text sub-node.
>5. Normalize the text of the sub-node (convert all whitespace chunks to
>single spaces, delete leading and trailing whitespace).
>6. If there are multiple valid forms the text could take, determine which
>form the text has taken (e.g., Appendix versus Section)
>7. Based on this determination, validate the text syntactically.
>8. Turn the text node into an internal DOM node that contains the
>semantic information that was formerly represented as text
>9. Add the faked-up DOM node to a table of similar nodes.

Pretty close.  There are some extra bits because appendices are named
differently from other sections, and some recursion because sections
contain other sections.  All of that is made pretty straightforward by
using jQuery (http://jquery.com/), which offers relatively terse
mechanisms for querying and modifying HTML-like docs.
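
As a rough illustration of the appendix and recursion handling, here is a
sketch over plain objects rather than a DOM, so it runs without jQuery.
The input shape, the function name, and the "Appendix A.1." form for
appendix subsections are all assumptions, not what the prototype does.

```javascript
// Hypothetical input shape: { title, appendix?: boolean, children?: [...] }.
// Returns the numbered heading text for every section, in document order.
function numberSections(sections) {
  const out = [];
  let sectionCount = 0;
  let appendixCount = 0;

  // Recursive step: children are labeled "<parent>1.", "<parent>2.", ...
  function walk(children, prefix) {
    (children || []).forEach((child, i) => {
      const label = prefix + (i + 1) + ".";
      out.push(label + " " + child.title);
      walk(child.children, label);
    });
  }

  for (const s of sections) {
    let label;
    if (s.appendix) {
      // Appendices get letters; this sketch assumes at most 26 of them.
      label = "Appendix " + String.fromCharCode(65 + appendixCount) + ".";
      appendixCount += 1;
    } else {
      sectionCount += 1;
      label = sectionCount + ".";
    }
    out.push(label + " " + s.title);
    walk(s.children, label);
  }
  return out;
}

console.log(numberSections([
  { title: "Introduction" },
  { title: "Metadata", children: [{ title: "Authors" }, { title: "IPR Statements" }] },
  { title: "Change Log", appendix: true },
]));
```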

>Now, once we've processed the entire tree, for each set of semantically
>similar textually-parsed nodes, validate the semantics that were parsed
>out of text nodes and hence couldn't be validated by W3C schema, to wit:
>
>- Make sure that section numbers are sequential and that there are no gaps
>- Make sure that appendix numbers are sequential
>- Make sure that no appendixes appear before sections

Sure.
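
Those three checks are small.  A sketch, assuming the top-level labels
have already been pulled out of the headings in document order
(validateTopLevel and the label format are hypothetical, and it only
handles single-letter appendices):

```javascript
// Hypothetical validator. Input: top-level labels in document order,
// e.g. ["1.", "2.", "Appendix A."]. Output: a list of error strings.
function validateTopLevel(labels) {
  const errors = [];
  let nextSection = 1;   // next expected section number
  let nextAppendix = 0;  // next expected appendix, as an offset from "A"
  let seenAppendix = false;

  for (const label of labels) {
    const app = /^Appendix ([A-Z])\.$/.exec(label);
    const sec = /^(\d+)\.$/.exec(label);
    if (app) {
      seenAppendix = true;
      const expected = String.fromCharCode(65 + nextAppendix);
      if (app[1] !== expected) {
        errors.push("expected Appendix " + expected + "., got " + label);
      }
      nextAppendix = app[1].charCodeAt(0) - 65 + 1;
    } else if (sec) {
      if (seenAppendix) {
        errors.push("section " + label + " appears after an appendix");
      }
      const n = Number(sec[1]);
      if (n !== nextSection) {
        errors.push("expected section " + nextSection + "., got " + label);
      }
      nextSection = n + 1;
    } else {
      errors.push("unrecognized label: " + label);
    }
  }
  return errors;
}

console.log(validateTopLevel(["1.", "2.", "Appendix A."])); // []
console.log(validateTopLevel(["1.", "3."]));
```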

>Compare this to a pure XML doc with no semantics in any text nodes:
>
>1. Validate the XML using W3C schema or similar
>2. Parse the XML into a DOM.

I doubt most of our target market currently knows how to write an XML
schema correctly, use one for document generation, or make sense of its
error handling, and I don't think learning any of that would be easy.

>Why is the XML doc parsing and validation process so much shorter?   Two
>reasons.   First, xml tags can be validated by W3C schema; div tags with
>special meaning given by class attributes can't.   Second, because it
>doesn't contain any generated information that would need to be
>checked -- there are no section numbers, for instance.   Section numbers
>only appear in presentation docs, not in the canonical representation.

You're missing another reason: XSD is already widely implemented.

(However, it's also pretty widely reviled by people who don't have to use
it every day.)

>> I really didn't intend to define new HTML tags.  I thought that I had been
>> pretty careful about picking tags that were both standardized and
>> widely-implemented.  Could you please give me an example of what you're
>> talking about so I can fix it?
>
>You've said that you need additional standards docs to define things
>equivalent to the xml2rfc author tag.   Either you are defining new tags,
>or you are defining div tags with special semantics based on class
>attributes.   Again, these can't be validated by a schema.

The latter, as in most microformats in the HTML world.  They could be
validated by a sufficiently-complex XML schema, but I can't imagine anyone
bothering to do that, when writing jQuery in unit-test style would be much
easier.
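
To show what I mean by unit-test style, here is a sketch.  It uses plain
objects as a stand-in for the DOM so that it runs without a browser or
jQuery; with jQuery each check would be a selector query instead.  The
two rules and all of the names are made up for illustration.

```javascript
// Stand-in for a parsed document: { tag, classes?, id?, children? }.
// Collects every node in the tree that satisfies the predicate.
function findAll(node, predicate, found = []) {
  if (predicate(node)) found.push(node);
  (node.children || []).forEach((c) => findAll(c, predicate, found));
  return found;
}

const isSection = (n) => n.tag === "div" && (n.classes || []).includes("section");

// Each named check returns a list of failure messages (empty means pass).
const checks = {
  "every section has an id": (doc) =>
    findAll(doc, isSection)
      .filter((s) => !s.id)
      .map(() => "section without id"),
  "every section starts with a heading": (doc) =>
    findAll(doc, isSection)
      .filter((s) => !(s.children && s.children[0] && /^h[1-6]$/.test(s.children[0].tag)))
      .map((s) => "section " + (s.id || "(no id)") + " has no leading heading"),
};

// Run every check and gather the failures, prefixed with the check name.
function runChecks(doc) {
  const failures = [];
  for (const [name, check] of Object.entries(checks)) {
    for (const msg of check(doc)) failures.push(name + ": " + msg);
  }
  return failures;
}

const doc = {
  tag: "body",
  children: [
    { tag: "div", classes: ["section"], id: "md-ipr", children: [{ tag: "h3" }] },
    { tag: "div", classes: ["section"], children: [{ tag: "p" }] },
  ],
};
console.log(runChecks(doc));
```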

-- 
Joe Hildebrand




