Thoughts on Non-Canonical Formats

This page is for keeping thoughts about the expected output formats *other than* XML.

The formats discussed so far are:

  • Well-structured HTML
  • Unpaginated text
  • Paginated text
  • PDF
  • EPUB

Well-structured HTML

A strong design goal is that the conversion from canonical XML to HTML be round-trippable: it should be possible to convert the HTML back to XML with zero loss of semantic content. Conversion to and from the canonical XML might be done with XSLT.
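To make the round-trip property concrete, here is a minimal sketch in Python rather than XSLT. It assumes a deliberately crude mapping in which every XML element becomes an HTML div that records the original tag and attributes in data-* attributes; the attribute names (data-xml-tag, data-xml-attr-) and the sample input are hypothetical, and a real converter would emit proper semantic HTML elements instead of divs.

```python
import xml.etree.ElementTree as ET

# Hypothetical attribute names used to carry the XML semantics inside HTML.
TAG_ATTR = "data-xml-tag"
ATTR_PREFIX = "data-xml-attr-"

def xml_to_html(elem):
    """Map an XML element to a div that remembers its original tag and attributes."""
    html = ET.Element("div", {TAG_ATTR: elem.tag})
    for name, value in elem.attrib.items():
        html.set(ATTR_PREFIX + name, value)
    html.text = elem.text
    html.tail = elem.tail
    for child in elem:
        html.append(xml_to_html(child))
    return html

def html_to_xml(html):
    """Invert xml_to_html using only the data-* attributes."""
    elem = ET.Element(html.get(TAG_ATTR))
    for name, value in html.attrib.items():
        if name.startswith(ATTR_PREFIX):
            elem.set(name[len(ATTR_PREFIX):], value)
    elem.text = html.text
    elem.tail = html.tail
    for child in html:
        elem.append(html_to_xml(child))
    return elem

# Made-up sample input, not taken from any real vocabulary definition.
source = ('<section anchor="intro"><name>Introduction</name>'
          '<t>See <xref target="RFC2119"/> for terms.</t></section>')
original = ET.fromstring(source)
restored = html_to_xml(xml_to_html(original))
round_trip_ok = ET.tostring(restored) == ET.tostring(original)
```

A check like round_trip_ok could serve as a regression test for whatever conversion is eventually chosen: serialize, convert there and back, and compare.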

Joe says more here.

Unpaginated Text

Paginated Text

This is text with headers, footers, and page break characters.
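As a rough illustration of what such output might look like, the sketch below wraps already-paginated body lines in a header and footer and separates pages with a form feed. The function name, the header text, and the footer template are all hypothetical; the exact placement of blank lines and of the form feed is an assumption, not a decided layout.

```python
FORM_FEED = "\f"  # the ASCII page break character

def render_pages(pages, header, footer_template):
    """Join pages of body lines into one string with headers, footers, and form feeds.

    pages is a list of lists of lines; footer_template may contain {page}.
    """
    rendered = []
    for number, body_lines in enumerate(pages, start=1):
        lines = [header, ""]
        lines.extend(body_lines)
        lines.extend(["", footer_template.format(page=number)])
        rendered.append("\n".join(lines))
    # One form feed on its own line between consecutive pages.
    return ("\n" + FORM_FEED + "\n").join(rendered) + "\n"

text = render_pages(
    [["line 1", "line 2"], ["line 3"]],
    header="Draft Title",
    footer_template="[Page {page}]",
)
```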

Avoiding Bad Breaks in Paginated Text

The paginated text format needs to deal with a paragraph or piece of art that would otherwise be split across a page break.

[PH] One way to eliminate the problem is simply to be willing to leave extra white space at the bottom of paginated pages. If a single paragraph or figure is too large to fit on a paginated page (the tool should warn about this every time it emits paginated text output), the Production Center can break the paragraph or split the figure into two.
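A minimal sketch of this approach follows. It assumes the document has already been reduced to blocks (each paragraph or figure is a list of lines), that blocks are separated by one blank line, and that an oversized block is emitted on its own overlong page for manual splitting; the function name and these conventions are illustrative only.

```python
import warnings

def paginate_blocks(blocks, page_len):
    """Place whole blocks on pages, leaving white space rather than splitting a block."""
    pages, current = [], []
    for index, block in enumerate(blocks):
        if len(block) > page_len:
            # Too large for any page: warn and leave it for manual splitting.
            warnings.warn(
                f"block {index} has {len(block)} lines, "
                f"more than the page length of {page_len}"
            )
            if current:
                pages.append(current)
                current = []
            pages.append(list(block))
            continue
        needed = len(block) + (1 if current else 0)  # blank separator line
        if len(current) + needed > page_len:
            pages.append(current)  # the rest of this page stays blank
            current = []
        if current:
            current.append("")
        current.extend(block)
    if current:
        pages.append(current)
    return pages

pages = paginate_blocks(
    [["a1", "a2", "a3"], ["b1", "b2", "b3"], ["c1"]],
    page_len=5,
)
```

In this example the second block does not fit in the two lines left on the first page, so the first page ends early and the second block starts the next page.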

[TH] A widow is the bottom line of a paragraph that winds up on the next column or page; an orphan is the top line of a paragraph that is separated from the rest of the paragraph by a column or page break. In most cases, both can be eliminated by not limiting yourself to a strict number of lines (N) per page and instead allowing a page to run to N+1 lines. If the paragraph is exactly 3 lines long, a page length of N+2 can eliminate both the widow and the orphan.

If you must limit the page size to N lines, you can instead use a page length of N-1 lines to force another line onto the top of the next page. If headings immediately precede the orphan, they must be moved to the next page as well. Paragraphs exactly 3 lines long that would be split in either direction are simply moved to the next page, along with any headings.
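The two strategies can be sketched as a search over candidate page lengths: N, N+1, N+2 when pages may run long, or N, N-1, N-2 when N is a hard limit. The sketch below is a simplification under stated assumptions: paragraphs are lists of lines, blank lines between paragraphs and headings are ignored, and all function names are hypothetical.

```python
def flatten(paragraphs):
    """Return one (line_index, paragraph_length) pair per output line."""
    return [(i, len(p)) for p in paragraphs for i in range(len(p))]

def is_bad_break(flat, pos):
    """True if breaking just before flat[pos] leaves a widow or an orphan."""
    if pos <= 0 or pos >= len(flat):
        return False
    index, length = flat[pos]
    if index == 0:
        return False  # the break falls between two paragraphs
    orphan = index == 1          # only the top line stays on the old page
    widow = index == length - 1  # only the bottom line lands on the new page
    return orphan or widow

def page_breaks(paragraphs, n, strict=False):
    """Return the flat line positions at which each new page starts."""
    flat = flatten(paragraphs)
    offsets = (0, -1, -2) if strict else (0, 1, 2)
    breaks, start = [], 0
    while len(flat) - start > n:
        pos = start + n  # fallback if no candidate is clean
        for delta in offsets:
            candidate = start + n + delta
            if candidate > start and not is_bad_break(flat, candidate):
                pos = candidate
                break
        if pos >= len(flat):
            break  # the remaining lines fit on the stretched page
        breaks.append(pos)
        start = pos
    return breaks

paragraphs = [["a"] * 4, ["b"] * 3, ["c"] * 5]
flexible = page_breaks(paragraphs, 5)
fixed = page_breaks(paragraphs, 5, strict=True)
```

With paragraphs of 4, 3, and 5 lines and N = 5, the flexible mode stretches the first page to N+2 lines so the 3-line paragraph stays whole, while the strict mode shortens the first page to N-1 lines and moves the 3-line paragraph to the next page.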

PDF

There will be (at least) two formats:

  • PDF format 1 that looks much like how the HTML would look if printed, including having live links, text formatting (bold/italics, differing sizes), all art, and relevant headers/footers.
  • PDF format 2, hopefully having live links but not having text formatting or SVG art.

Both PDF formats are produced for US Letter and A4 page sizes.

We have talked about using PrinceXML to generate PDF.

EPUB

If we also want to do MOBI (the native Amazon format), we might consider running the free-but-closed-source program from Amazon (http://www.amazon.com/gp/feature.html?ie=UTF8&docId=1000765211). If people are hesitant about us even partially supporting that program, we could instead put a pointer to the program on an advisory page on rfc-editor.org.

design/formats.1378326679.txt.gz · Last modified: 2013/09/04 13:31 by paul