Thoughts on Non-Canonical Formats

This page is for keeping thoughts about the expected output formats *other than* XML.

The formats discussed so far are:

Well-structured HTML
Text (Unpaginated text and Paginated text)
PDF
EPUB

Well-structured HTML

Initial proposal: A strong design goal is that the conversion from canonical XML to HTML should be round-trippable; that is, it should be possible to convert the HTML back to XML with zero loss of semantic content. Conversion to and from the canonical XML might be done with XSLT. Response:

Round-tripping would require preserving non-semantic information.
For instance, it will be hard not to lose information from <author> elements, because their various aspects get rendered into different places. You would need either heuristics or additional markup to link these places together in order to recombine the information.
If round-tripping is not a goal, we need to make clear what kind of information we want represented in the HTML. “All semantic content” is too imprecise.
Semantic information *will* be lost during the transformation. The balancing act is making certain that enough semantic information is kept to make the HTML output useful for HTML-processing tools. The tough part is how to express that as a requirement.

 For the example of counter="requirement", there are ways that the
 information could be propagated, such as into a class name of
 list_counter_requirement, but that's kind of ugly and subject to issues
 when the namespace characters are different. (What do you do with a
 counter name that contains spaces or non-alphameric characters?) With a
 requirement of "all", each of these edge cases would need to be nailed
 down. But is it the type of semantic information that *needs* to be
 propagated? I really don't think so.
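
To make the edge cases above concrete, here is a minimal sketch of one way a counter name could be mapped to a class name and back. The escaping scheme (`_` + hex code point + `_`) and the function names are assumptions made purely for illustration; nothing on this page proposes them. Only the list_counter_ prefix comes from the example above.

```python
import re

PREFIX = "list_counter_"  # prefix taken from the example in the text

def counter_to_class(name: str) -> str:
    """Map a counter name to a class-name-safe token (hypothetical scheme).

    ASCII letters and digits pass through unchanged. Every other
    character, including '_' itself, becomes _<hex code point>_ so the
    mapping can be reversed without ambiguity.
    """
    def escape(ch: str) -> str:
        if ch.isascii() and ch.isalnum():
            return ch
        return "_{:x}_".format(ord(ch))
    return PREFIX + "".join(escape(ch) for ch in name)

def class_to_counter(cls: str) -> str:
    """Invert counter_to_class()."""
    if not cls.startswith(PREFIX):
        raise ValueError("not a counter class: " + cls)
    body = cls[len(PREFIX):]
    # Every underscore in the body belongs to an escape, so a simple
    # left-to-right substitution recovers the original characters.
    return re.sub(r"_([0-9a-f]+)_", lambda m: chr(int(m.group(1), 16)), body)

print(counter_to_class("requirement"))   # list_counter_requirement
print(counter_to_class("my counter"))    # list_counter_my_20_counter
```

Even this small scheme has to decide how to treat underscores, spaces, and non-ASCII letters, which illustrates how much would need to be nailed down under a requirement of "all".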

Initial proposal: Consider allowing (eventually) JavaScript. Response: No.

It would make people feel less safe when opening RFCs.
It would make it more difficult to ensure RFCs look the same in all environments.
We wouldn't be able to agree on what JavaScript to include.
It's completely unnecessary for a simple text document.
Think of the testing involved. Think of all the contexts in which an RFC might be consumed. If JavaScript is required to render the RFC, it will inevitably fail in some cases.

Text

ASCII vs. UTF-8 for Text Output

As of 2013-10-09, it is not clear whether the text output will be ASCII or UTF-8. The following assumes ASCII; if the format turns out to be UTF-8, it does not apply.

The text-only format must have the same character-set limitations as the current RFC format. For new RFCs that contain non-ASCII characters, each such character must be represented as [*U+xxxx*], where xxxx is a 4- or 6-character hex value. The use case is that it must be possible to convert every encoded non-ASCII character in the text-only document back to exactly the correct character in the canonical document. The choice of [*U+xxxx*] was made because it is extremely unlikely for that sequence to be part of a normal RFC, even one that talks about Unicode code points by their hex values. For example, an author's name that is represented in the canonical format as “Martin Dürst” would be represented in the text-only format as “Martin D[*U+00FC*]rst”. This requires that lines in the text-only format be allowed to exceed 80 columns if they contain non-ASCII characters.
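
The [*U+xxxx*] convention can be sketched as a pair of functions. This is only an illustration under stated assumptions: the function names are hypothetical, and the choices of uppercase hex and of 4 digits for code points up to U+FFFF (6 digits above that) are one reading of "4- or 6-character hex value".

```python
import re

def encode_ascii(text: str) -> str:
    """Replace each non-ASCII character with [*U+xxxx*]."""
    def escape(ch: str) -> str:
        cp = ord(ch)
        if cp < 0x80:
            return ch
        width = 4 if cp <= 0xFFFF else 6  # assumed rule for 4 vs. 6 digits
        return "[*U+{:0{w}X}*]".format(cp, w=width)
    return "".join(escape(ch) for ch in text)

# Matches exactly 6 or exactly 4 uppercase hex digits inside [*U+...*].
_ESCAPE = re.compile(r"\[\*U\+([0-9A-F]{6}|[0-9A-F]{4})\*\]")

def decode_ascii(text: str) -> str:
    """Turn every [*U+xxxx*] sequence back into the character it names."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)

name = "Martin D\u00fcrst"
print(encode_ascii(name))                        # Martin D[*U+00FC*]rst
print(decode_ascii(encode_ascii(name)) == name)  # True
```

Note that the decoder treats any matching sequence as an escape, so the round trip is exact only if the sequence never occurs literally in the canonical document, which is the assumption stated in the paragraph above.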

Dave thinks: I disagree with the [*U+xxxx*] paragraph above. I'm leaning towards saying there should be a separate UTF-8 (e.g. .utf8) text version. For either version, I don't think any U+ sequence should appear in a person's name.

Paul thinks: if there are two versions, the .txt should be UTF-8 and the ASCII version should be .asc. If there is an all-ASCII version, we need to ask the authors how they want their names (mis)spelled in ASCII.

Initial proposal: There should be multiple text outputs: ASCII-only with page breaks, ASCII-only without page breaks, UTF-8 with page breaks, UTF-8 without page breaks.

Response: Limit the .txt output to one option only, as similar as is reasonable to what is available today. That would be plain text, ASCII art only with links to images, and page breaks with headers and footers.

Avoiding Bad Breaks in Paginated Text

The paginated text format needs to deal with a paragraph or piece of artwork that would be split over a page break.

[PH] One way to eliminate the problem is simply to be willing to leave extra white space at the bottom of paginated pages. If a single paragraph or figure is too large to fit on a paginated page (the tool should warn about this every time it emits paginated text output), the Production Center can break the paragraph or split the figure into two.

[TH] (Widow: the bottom line of a paragraph that winds up alone on the next column or page. Orphan: the top line of a paragraph that is separated from the rest of the paragraph by a column or page break.) In most cases, both can be eliminated by not limiting yourself to a strict number of lines (N) on a page, but allowing yourself to go to N+1. If the paragraph is exactly 3 lines long, then a page length of N+2 can eliminate both the widow and the orphan.

If you must limit the page size to a maximum of N lines, then you can use a page length of N-1 lines to force another line onto the top of the next page. If headings occur prior to the orphan, then they must be moved to the next page as well. Paragraphs exactly 3 lines long that have been split in either direction would just be moved to the next page, along with any headings.
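
The two strategies above can be sketched as a single decision function. This is a minimal illustration, not a definitive pagination algorithm: the name split_point and its parameters are hypothetical, and moving headings along with a paragraph is not modeled. The function is given the length of a paragraph, the number of lines left on the current page at the nominal length N, and how far past N the page may stretch; it returns how many lines of the paragraph stay on the current page.

```python
def split_point(para_len: int, room: int, stretch: int = 1) -> int:
    """Return how many lines of a paragraph to keep on the current page.

    para_len: number of lines in the paragraph
    room:     lines left on the page at the nominal length N (>= 0)
    stretch:  extra lines allowed beyond N (0 means a strict maximum)
    """
    def acceptable(keep: int) -> bool:
        rest = para_len - keep
        # Either the paragraph is not split at all, or at least two
        # lines end up on each side of the break.
        return keep in (0, para_len) or (keep >= 2 and rest >= 2)

    candidates = [room]                               # break at exactly N
    candidates += range(room + 1, room + stretch + 1) # stretch to N+1, N+2
    candidates += range(room - 1, -1, -1)             # shorten to N-1, ...
    for keep in candidates:
        keep = min(keep, para_len)
        if acceptable(keep):
            return keep
    return 0  # unreachable in practice: keep == 0 is always acceptable

print(split_point(3, 2, stretch=1))  # 3: page grows to N+1, no split
print(split_point(3, 1, stretch=2))  # 3: page grows to N+2, no split
print(split_point(5, 4, stretch=0))  # 3: page ends at N-1, two lines move on
print(split_point(3, 2, stretch=0))  # 0: whole paragraph moves to next page
```

The candidate order encodes a preference: break at the nominal length when that is clean, stretch the page next, and shorten it only as a last resort.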

PDF

Initial proposal: The document needs to include live links

 For linking between RFCs, pointers to RFCs published before the format switchover will point to the TXT version
 For linking between RFCs, pointers to RFCs published after the format switchover will point to the PDF version and will allow for pointers to specific sections within a document
 The PDF version will include the standard front page header and include page numbers
 The PDF version will be sized for ???

Response: With HTML as an option, there is not a compelling case to require links in the PDF. One use case described was that of the IESG, several members of which choose to print out the PDF version for review. Links would not provide enough (any?) additional value to suggest we need to add this. Team suggests that the requirements for PDF do not actually need to change from what they are today: PDF as a direct copy of the TXT format with the inclusion of graphics.

Team also briefly considered how a tool like PrinceXML could generate PDF from HTML. That went too far into implementation, and it was left as an example of something that might be possible and that might limit the number of tools needing to be modified or created.

EPUB

If we also want to do MOBI (the native Amazon format), we might consider running the free-but-closed-source program from Amazon (http://www.amazon.com/gp/feature.html?ie=UTF8&docId=1000765211). If people are hesitant about us even partially supporting that program, we could consider having a pointer to the program on an advisory page on rfc-editor.org.

If the HTML output is designed well, it can be used to create EPUB output with few, if any, additional requirements.

RSE Wiki Archive
