[rfc-i] How "modern" word processors do it

Brian E Carpenter brian.e.carpenter at gmail.com
Sat May 26 01:55:49 PDT 2012


Does OOXML support containment? Not everywhere; as far as I can tell the
fragment below has the semantics of <h1>Introduction</h1> rather than
<section title="Introduction">. This would explain why MS Word is seemingly
unaware of overall document structure. Nevertheless, OOXML does seem to require
localised containment within these <w:...> elements.

Unsurprisingly, ODF is cleaner and simpler, but it is also <h1>-like:
<text:h text:style-name="Heading_20_1" text:outline-level="1">Introduction</text:h>
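
To make the <h1>-versus-<section> distinction concrete, here is a minimal
Python sketch of how a post-processor might recover containment from a flat
run of headings that carry only an outline level. The Section class, the
nest() function and the ("h", level, text) / ("p", text) tuple shape are all
invented for illustration; neither OOXML nor ODF defines them.

```python
from dataclasses import dataclass, field


@dataclass
class Section:
    title: str
    level: int
    body: list = field(default_factory=list)      # paragraphs directly in this section
    children: list = field(default_factory=list)  # nested Section objects


def nest(blocks):
    """Turn a flat, <h1>-like block sequence into nested sections.

    blocks: iterable of ("h", level, text) or ("p", text) tuples in
    document order, with heading levels starting at 1 (an assumption).
    """
    root = Section(title="", level=0)
    stack = [root]  # currently open sections, outermost first
    for block in blocks:
        if block[0] == "h":
            _, level, text = block
            # A new heading closes every open section at the same or deeper level.
            while stack[-1].level >= level:
                stack.pop()
            section = Section(title=text, level=level)
            stack[-1].children.append(section)
            stack.append(section)
        else:
            # Anything that is not a heading belongs to the innermost open section.
            stack[-1].body.append(block[1])
    return root


if __name__ == "__main__":
    doc = nest([
        ("h", 1, "Introduction"),
        ("p", "First paragraph."),
        ("h", 2, "Terminology"),
        ("p", "Second paragraph."),
        ("h", 1, "Security Considerations"),
    ])
    for top in doc.children:
        print(top.title, [child.title for child in top.children])
```

This only works because each heading carries a depth. It says nothing about
groupings such as lists that the flat markup never marks.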

My point is anyway that "WYSIWYG" always hides markup, and what really matters
is the richness or poverty of the underlying markup. I think we are going
about this all wrong, and should start by deciding what properties the
underlying markup MUST preserve. I am certain the list is significantly larger
than Joe's, and I would start with XML2RFC's capabilities and work from there.

<w:p w:rsidR="001F585B" w:rsidRPr="00B3566E" w:rsidRDefault="009F11DF" w:rsidP="00B3566E">
  <w:pPr>
    <w:pStyle w:val="Heading1" />
  </w:pPr>
  <w:r>
    <w:t xml:space="preserve"></w:t>
  </w:r>
  <w:r w:rsidR="001F585B" w:rsidRPr="00B3566E">
    <w:t>Intro</w:t>
  </w:r>
  <w:r w:rsidR="00373B98" w:rsidRPr="00B3566E">
    <w:t>duction</w:t>
  </w:r>
</w:p>
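
A small Python sketch of what a consumer has to do with such a fragment: find
the paragraph style and glue the runs back together to get the heading text.
It assumes the paragraph is wrapped with the usual WordprocessingML namespace
declaration, which the fragment as pasted lacks, and it drops the rsid
attributes for brevity. heading_of() is a hypothetical helper, not part of any
Word API.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, bound to the "w" prefix.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# The heading paragraph, reduced to its structure (rsid attributes omitted).
FRAGMENT = f"""
<w:p xmlns:w="{W}">
  <w:pPr><w:pStyle w:val="Heading1"/></w:pPr>
  <w:r><w:t xml:space="preserve"></w:t></w:r>
  <w:r><w:t>Intro</w:t></w:r>
  <w:r><w:t>duction</w:t></w:r>
</w:p>
"""


def heading_of(paragraph):
    """Return (style, text) for a w:p element, or None if it has no style."""
    style = paragraph.find("w:pPr/w:pStyle", NS)
    if style is None:
        return None
    # The visible text is split across runs, so concatenate every w:t in order.
    text = "".join(t.text or "" for t in paragraph.iterfind("w:r/w:t", NS))
    return style.get(f"{{{W}}}val"), text


if __name__ == "__main__":
    print(heading_of(ET.fromstring(FRAGMENT.strip())))
```

The result is a style name and a string, i.e. the <h1> reading. Nothing in the
paragraph says which following paragraphs belong to it.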


Regards
   Brian


On 2012-05-26 08:00, Joe Hildebrand wrote:
> On 5/26/12 12:40 AM, "Joe Touch" <touch at isi.edu> wrote:
> 
>> Here's the counterexample:
>>
>> heading
>> para
>> para
>> para
>> list item
>> list item
>> list item
>> list item
>>
>> Is that one list of four items? Is it two lists of two items each? Where is
>> the list container? Does the list belong to the paragraph that precedes it, or
>> as a separate container belonging to the heading level?
> 
> I was only talking about sections.  There's no good way in HTML to do list
> items without an ol or ul around them, and I don't believe that Word is
> generating lists without wrappers.  In a text-only format, this is exactly
> the sort of ambiguity that the doc doesn't have enough structure to answer.
> In the face of a lack of data, I'd say that it's one list with four items.
> If it doesn't matter to the author enough to use a tool that preserves his
> or her intent, then nobody else is likely to care about the difference
> downstream.
> 
>>>> E.g., Word doesn't use that structure.
>>> You post-process the output of Word anyway.  Whoever writes the
>>> post-processing tool is going to have to write a few lines of code.
>> Some of it is easy - as you note, I can generate tags that contain sections
>> within the headings that delimit them.
>>
>> I cannot generate section containers for groupings that cannot be indicated by
>> Word - as per the list above.
> 
> If Word is generating <li> without a <ul> or <ol> around it, there's a bug.
> 
>> Further, why group all the paragraphs under one heading? At least one output
>> from Word treats them as one long paragraph with BRs in between, rather than
>> as individual paragraphs.
> 
> English text contains paragraphs.  In RFCs, we often group multiple
> paragraphs together into a section; the lineprinter format uses a blank line
> to delineate a paragraph boundary.
> 
> Word knows how to deal with paragraphs.  It's inserting br's in order to
> gain control over line splitting, which is one of the things we're trying to
> solve for in the "reflowing" discussion.
> 
>> That isn't the same structure most people use when writing XML2RFC, but it is
>> just as valid.
> 
> If you know the size of the screen of the device that the reader is going to
> use, and there's only one such size, perhaps.  We don't live in that world
> anymore.
> 
>>>> I could add it for heading sections
>>>> (it's easy to generate from nested navigation tags), but cannot generate it
>>>> for lists or other sections - there's no way to differentiate between a set
>>>> of paragraphs that are not related and ones that are, so there's no way to
>>>> group them in a container.
>>> I assume the sections are separated by a header, which has a depth
>>> associated with it?  Everything between headers is in the same section.
>> But not necessarily the same container.
> 
> You can intuit a container (and add it if need be) if the sections are
> separated.  I'd walk through the logic for it, but you haven't been
> interested in algorithms to this point.  Perhaps you could either care, or
> take my word for it?
> 
>> You've only given the same reason repeatedly - editing. Support for editing
>> was not given for any formats except authoring, which we all seem to agree
>> ought to be up to authors.
> 
> No, I've given two.  Programmatic editing is one, and information extraction
> is the other.  The extraction function has nothing to do with editing, since
> it does not modify the file.
> 


