Thoughts on Non-Canonical Formats

This page is for keeping thoughts about the expected output formats *other than* XML.

The formats discussed so far are:

Well-structured HTML
Text (Unpaginated text and Paginated text)
PDF
EPUB

Well-structured HTML

Initial proposal: A strong design goal is that the conversion from canonical XML to HTML should be round-trippable; that is, it should be possible to convert the HTML back to XML with zero loss of semantic content. Conversion from and to the canonical XML might be done with XSLT. Response:

Round-tripping would require preserving non-semantic information.
For instance, it will be hard not to lose information from <author> elements, because their various aspects get rendered into different places. You would need either heuristics or additional markup to link these places together in order to recombine the information.
If round-tripping is not a goal, we need to make clear what kind of information we want represented in the HTML. “All semantic content” is too imprecise.
Semantic information *will* be lost during the transformation. The balancing act is making certain that enough semantic information is kept to make the HTML output useful for HTML-processing tools. The tough part is how to express that as a requirement.

 For the example of counter="requirement", there are ways that the
 information could be propagated, such as into a class name of
 list_counter_requirement, but that's kind of ugly and subject to issues
 when the namespace characters are different. (What do you do with a
 counter name that contains spaces or non-alphameric characters?) With a
 requirement of "all", each of these edge cases would need to be nailed
 down. But is it the type of semantic information that *needs* to be
 propagated? I really don't think so.
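
To make the edge cases in the note above concrete, here is a minimal sketch of one way a counter name could be carried into a class name and recovered again. It is only an illustration, not a proposed requirement: the escaping scheme (`_xHEX_`) and the function names are assumptions made up for this example, and only the `list_counter_` prefix comes from the note.

```python
import re

PREFIX = "list_counter_"  # prefix from the note above; the rest is hypothetical


def counter_to_class(counter: str) -> str:
    """Encode a counter name as a class token using only [A-Za-z0-9_]."""
    parts = []
    for ch in counter:
        if ch.isascii() and ch.isalnum():
            parts.append(ch)
        else:
            # Escape everything else (spaces, punctuation, "_" itself) as
            # _xHEX_ so that two different counter names never collide.
            parts.append(f"_x{ord(ch):X}_")
    return PREFIX + "".join(parts)


def class_to_counter(class_name: str) -> str:
    """Invert counter_to_class, recovering the original counter name."""
    body = class_name[len(PREFIX):]
    return re.sub(r"_x([0-9A-F]+)_", lambda m: chr(int(m.group(1), 16)), body)


if __name__ == "__main__":
    for name in ["requirement", "my counter", "a_b c!"]:
        encoded = counter_to_class(name)
        print(repr(name), "->", encoded, "->", repr(class_to_counter(encoded)))
```

Even this small case needs an escaping rule, a reserved character, and a decoder, which supports the point that a requirement of "all" semantic information would force every such detail to be specified.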

Initial proposal: Consider allowing (eventually) JavaScript. Response: No.

This would make people feel less safe when opening RFCs.
It would make it more difficult to ensure that RFCs look the same in all environments.
We wouldn't be able to agree on what JavaScript to include.
It's completely unnecessary for a simple text document.
Think of the testing involved. Think of all the contexts in which an RFC might be consumed. If the JavaScript is required to render the RFC, it will inevitably fail in some cases.

Text

Initial proposal: There should be multiple text outputs: ASCII-only with page breaks, ASCII-only without page breaks, UTF-8 with page breaks, UTF-8 without page breaks.

Response: Limit the .txt output to one option only, as similar as is reasonable to what is available today. That would be plain text, with ASCII art only (plus links to images) and page breaks with headers and footers.

Avoiding Bad Breaks in Paginated Text

The paginated text format needs to deal with the issue of a paragraph or piece of artwork that would be split over a page break.

[PH] One way to eliminate the problem is to be willing to leave extra white space at the bottom of the paginated pages. If a single paragraph or figure is too large to fit on a paginated page (the tool should warn about this every time it emits paginated text output), the Production Center can break the paragraph or split the figure into two.

[TH] (widow == bottom line of a paragraph that winds up alone on the next column/page. orphan == top line of a paragraph that is separated from the rest of the paragraph by a column/page break.) In most cases, both can be eliminated by not limiting yourself to a strict number of lines (N) on a page, but allowing yourself to go to N+1. If the paragraph is exactly 3 lines long, then a page length of N+2 can eliminate both the widow and the orphan.

If you must limit the page size to a maximum of N lines, then you can use a page length of N-1 lines to force another line onto the top of the next page. If headings occur prior to the orphan, then they must be moved to the next page as well. Paragraphs exactly 3 lines long that have been split in either direction would just be moved to the next page, along with any headings.
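
The strict-maximum variant described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not a description of any existing tool: the function name `paginate`, the `min_keep` parameter, and the representation of a document as a list of paragraph line counts are all invented for the example, and headings, blank lines between paragraphs, and headers/footers are not modeled.

```python
def paginate(paragraphs, page_len, min_keep=2):
    """Assign paragraph lines to pages of at most page_len lines.

    paragraphs: list of line counts, one per paragraph (hypothetical input).
    Returns a list of pages; each page is a list of (paragraph_index, lines).
    A paragraph is never split so that fewer than min_keep lines end up
    on either side of a page break; pages are left short instead.
    """
    if page_len < 2 * min_keep:
        raise ValueError("page_len must be at least 2 * min_keep")

    pages = [[]]
    room = page_len  # lines still free on the current page
    for idx, total in enumerate(paragraphs):
        remaining = total
        while remaining > 0:
            if remaining <= room:
                # The rest of the paragraph fits on the current page.
                pages[-1].append((idx, remaining))
                room -= remaining
                remaining = 0
                continue
            take = room
            # Avoid a widow: leave at least min_keep lines for the next page.
            if remaining - take < min_keep:
                take = remaining - min_keep
            # Avoid an orphan: do not strand fewer than min_keep lines here.
            if take < min_keep:
                take = 0
            if take > 0:
                pages[-1].append((idx, take))
                remaining -= take
            # End the page early (possibly short) and start a new one.
            pages.append([])
            room = page_len
    return pages


if __name__ == "__main__":
    # A 7-line paragraph on 6-line pages: the first page stops at 5 lines
    # (N-1) so that two lines, not one, carry over.
    print(paginate([7], page_len=6))
    # A 3-line paragraph that does not fit is moved whole to the next page.
    print(paginate([5, 4, 3], page_len=6))
```

With `min_keep=2`, a 3-line paragraph can never be split, which matches the rule that such paragraphs are simply moved to the next page. The N+1 and N+2 approach from the previous paragraph would instead let `page_len` grow for individual pages.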

PDF

Initial proposal: The document needs to include live links

 For linking between RFCs, pointers to RFCs published before the format switchover will point to the TXT version
 For linking between RFCs, pointers to RFCs published after the format switchover will point to the PDF version and will allow for pointers to specific sections within a document
 The PDF version will include the standard front page header and include page numbers
 The PDF version will be sized for ???

Response: With HTML as an option, there is not a compelling case to require links in the PDF. One use case described was that of the IESG, several members of which choose to print out the PDF version for review. Links would not provide enough (any?) additional value to suggest we need to add this. Team suggests that the requirements for PDF do not actually need to change from what they are today: PDF as a direct copy of the TXT format with the inclusion of graphics.

Team also briefly considered how a tool like PrinceXML could generate PDF from HTML. That went too far into implementation, so it was left as an example of something that might be possible and that could limit the number of tools needing to be modified or created.

EPUB

If we also want to do MOBI (the native Amazon format), we might consider running the free-but-closed-source program from Amazon http://www.amazon.com/gp/feature.html?ie=UTF8&docId=1000765211. If people are hesitant about us even partially supporting that program, we could consider having a pointer to the program on an advisory page on rfc-editor.org.

If the HTML output is designed well, it can be used to create EPUB output with few, if any, additional requirements.
