design:formats

Differences

This shows you the differences between two versions of the page.

design:formats [2013/08/16 16:51]
paul More thoughts from Paul
design:formats [2013/10/15 19:07] (current)
rsewikiadmin
Line 1: Line 1:
 ====== Thoughts on Non-Canonical Formats ======
- 
  
 This page is for keeping thoughts about the expected output formats *other than* XML.
Line 7: Line 6:
  
   * Well-structured HTML
-  * Unpaginated text
-  * Paginated text
+  * Text (Unpaginated text and Paginated text)
   * PDF
   * EPUB
- 
-Recent discussion was around generating the well-structured HTML from the XML, and the rest of the formats from the HTML, because there are usually more tools for conversion from HTML than XML. 
  
 ===== Well-structured HTML =====
  
-A strong design goal is that the conversion from canonical XML to HTML should be round-trippable, that is, that it should be possible to convert the HTML back to XML with literally zero loss of semantic content.
+Initial proposal: A strong design goal is that the conversion from canonical XML to HTML should be round-trippable, that is, that it should be possible to convert the HTML back to XML with literally zero loss of semantic content. Conversion from and to the canonical XML might be done with XSLT.
 +Response: 
 +  * Round-tripping would require preserving non-semantic information. 
 +  * For instance, it'll be hard not to lose information from <author> elements, because the various aspects get rendered into different places.  You'd need either heuristics or additional markup to link these places together in order to re-combine the information.
 +  * If round-tripping is not a goal, we need to make clear what kind of information we want to be represented in the HTML. "All semantic content" is too imprecise.
 +  * Semantic information *will* be lost during the transformation. The balancing act is making certain that enough semantic information is kept to make the HTML output useful for HTML-processing tools. The tough part is how to express that as a requirement.
  
-<del>Joe goes here.</del>
 +   For the example of counter="requirement", there are ways that the
 +   information could be propagated, such as into a class name of 
 +   list_counter_requirement, but that's kind of ugly and subject to issues 
 +   when the namespace characters are different. (What do you do with a 
 +   counter name that contains spaces or non-alphameric characters?) With a 
 +   requirement of "all", each of these edge cases would need to be nailed 
 +   down. But is it the type of semantic information that *needs* to be 
 +   propagated? I really don't think so.
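
To make the edge cases above concrete, here is a minimal sketch of one way a counter name could be turned into a class name. Only the `list_counter_` prefix comes from the text above; the function name `counter_class_name` and the `_XX` hex escaping scheme are assumptions made purely for illustration, not something the team has proposed.

```python
import re


def counter_class_name(counter: str) -> str:
    """Map a counter name to a class name (hypothetical scheme).

    ASCII letters and digits pass through unchanged. Every other
    character, including the underscore itself, is replaced by one
    "_XX" group per UTF-8 byte so that the mapping stays unambiguous.
    """
    def escape(match: re.Match) -> str:
        return "".join(f"_{byte:02X}" for byte in match.group(0).encode("utf-8"))

    return "list_counter_" + re.sub(r"[^A-Za-z0-9]", escape, counter)
```

Even this small sketch has to settle what happens to spaces, underscores, and non-ASCII characters, which is the kind of detail a requirement of "all" would force the specification to nail down.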
  
-===== Unpaginated Text =====
+Initial proposal: Consider allowing (eventually) JavaScript
 +Response: No
 +  * This would negatively impact people feeling safe when opening RFCs
 +  * Would make it more difficult to ensure RFCs look the same in all environments
 +  * We wouldn't be able to agree on what JavaScript to include.
 +  * It's completely unnecessary for a simple text document.
 +  * Think of the testing involved.  Think of all the contexts in which an RFC might be consumed.  If the JavaScript is required to render the RFC, it will inevitably fail in some cases.
  
-===== Paginated Text ===== 
  
-A way to eliminate the need for widow/orphan control or bad page breaks in figures is to just be willing to leave extra white space at the bottom of the paginated pages. If a single paragraph or figure is too large to fit on a paginated page (the tool should warn about this every time it emits paginated text output), the Production Center can break the paragraph or split the figure into two.
+===== Text =====
 +=== ASCII vs. UTF-8 for Text Output === 
 + 
 +As of 2013-10-09, it is not clear whether the text output will be ASCII or UTF-8. The following assumes ASCII. If the format is UTF-8, then the following is wrong.
 + 
 +The text-only format must have the same character-set limitations as the current RFC format. For new RFCs that have non-ASCII characters in them, each such character must be represented as //[*U+xxxx*]//, where //xxxx// is a 4- or 6-character hex value. The use case here is that it must be possible to convert all of the encoded versions of the non-ASCII characters in the text-only document exactly to the correct characters in the canonical document. The choice of //[*U+xxxx*]// was made because it is extremely unlikely for that sequence to be part of a normal RFC, even one that talks about Unicode code points by their hex values. For example, an author's name that is represented in the canonical format as "Martin Dürst" would be represented in the text-only format as "Martin D[*U+00FC*]rst". This requires allowing lines in the text-only format to be longer than 80 columns when those lines contain non-ASCII characters.
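
The escaping described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function names `to_ascii_text` and `from_ascii_text` are hypothetical, and the choice of 4 hex digits for code points up to U+FFFF and 6 digits above that is one reading of "4- or 6-character hex value", not a settled rule.

```python
import re

# Matches the [*U+xxxx*] escape with either 6 or 4 hex digits.
_ESCAPE = re.compile(r"\[\*U\+([0-9A-Fa-f]{6}|[0-9A-Fa-f]{4})\*\]")


def to_ascii_text(text: str) -> str:
    """Replace every non-ASCII character with a [*U+xxxx*] escape."""
    pieces = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            pieces.append(ch)
        else:
            # Assumption: 4 digits inside the BMP, 6 digits above it.
            width = 4 if cp <= 0xFFFF else 6
            pieces.append(f"[*U+{cp:0{width}X}*]")
    return "".join(pieces)


def from_ascii_text(text: str) -> str:
    """Turn [*U+xxxx*] escapes back into the characters they name."""
    def unescape(match: re.Match) -> str:
        cp = int(match.group(1), 16)
        # Leave anything beyond the Unicode range untouched.
        return chr(cp) if cp <= 0x10FFFF else match.group(0)

    return _ESCAPE.sub(unescape, text)
```

With these definitions, `to_ascii_text("Martin Dürst")` yields the "Martin D[*U+00FC*]rst" form given above (assuming the name uses the precomposed U+00FC), and `from_ascii_text` restores the original string.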
 + 
 +//Dave thinks: disagree with the above paragraph. I'm leaning towards saying there should be a separate UTF-8 (e.g. .utf8) text version.  And for either version I don't think any U+ sequence should appear for a person's name.// 
 + 
 +//Paul thinks: if there are two versions, the .txt should be UTF-8 and the ASCII version should be .asc. If there is an all-ASCII version, we need to ask the authors how they want their names (mis)spelled in ASCII.// 
 + 
 +Initial proposal: There should be multiple text outputs: ASCII-only with page breaks, ASCII-only without page breaks, UTF-8 with page breaks, UTF-8 without page breaks. 
 + 
 +Response: Limit the .txt output to one option only, as similar as reasonable to what is available today.  That would be text with ASCII art only (with links to images) and page breaks with headers and footers.
 + 
 +==== Avoiding Bad Breaks in Paginated Text ==== 
 + 
 +The paginated text format needs to deal with paragraphs or artwork that would be split over a page break.
 + 
 +[PH] A way to eliminate the problem is to just be willing to leave extra white space at the bottom of the paginated pages. If a single paragraph or figure is too large to fit on a paginated page (the tool should warn about this every time it emits paginated text output), the Production Center can break the paragraph or split the figure into two.
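
The [PH] suggestion can be sketched as a simple block-level paginator. This is a rough illustration, not a proposed tool: the function name `paginate`, the representation of paragraphs and figures as lists of lines, and the single blank separator line between blocks are all assumptions.

```python
def paginate(blocks, page_len):
    """Place whole blocks on pages, never splitting a block.

    blocks:   list of blocks, each block a list of text lines
              (a paragraph or a figure).
    page_len: maximum number of lines per page.

    Returns (pages, warnings). A block that does not fit in the space
    left on the current page moves to the next page, leaving white
    space behind. A block longer than a whole page is placed on a page
    of its own and reported, so that it can be split by hand.
    """
    pages, current, warnings = [], [], []
    for index, block in enumerate(blocks):
        if len(block) > page_len:
            warnings.append(
                f"block {index} has {len(block)} lines; page holds {page_len}"
            )
        # One blank line separates a block from the one before it.
        needed = len(block) + (1 if current else 0)
        if current and len(current) + needed > page_len:
            pages.append(current)
            current = []
        if current:
            current.append("")
        current.extend(block)
    if current:
        pages.append(current)
    return pages, warnings
```

The sketch trades page density for simplicity: it needs no widow or orphan logic at all, at the cost of occasionally short pages.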
 + 
 +[TH] (widow == bottom line of a paragraph that winds up in the next column/page. orphan == top line of a paragraph that is separated from the rest of the paragraph by a column/page break.) In most cases, both can be eliminated by not limiting yourself to a strict number of lines (N) on a page, but allowing yourself to go to N+1. If the paragraph is exactly 3 lines long, then a page length of N+2 can eliminate both the widow and orphan. 
 + 
 +If you must limit the page size to a maximum of N lines, then you can use a page length of N-1 lines to force another line onto the top of the next page. If headings occur prior to the orphan, then they must be moved to the next page as well. Paragraphs exactly 3 lines long that have been split in either direction would just be moved to the next page, along with any headings.
  
 ===== PDF ===== ===== PDF =====
  
-We have talked about using [[http://www.princexml.com|PrinceXML]] to generate PDF.
+Initial proposal: The document needs to include live links
 +   For linking between RFCs, pointers to RFCs published before the format switchover will point to the TXT version 
 +   For linking between RFCs, pointers to RFCs published after the format switchover will point to the PDF version and will allow for pointers to specific sections within a document 
 +   The PDF version will include the standard front page header and include page numbers 
 +   The PDF version will be sized for ???
  
-We need to have at least two formats: US-standard (8.5 x 11) and A4.
+Response: With HTML as an option, there is not a compelling case to require links in the PDF.  One use case described was that of the IESG, several members of which choose to print out the PDF version for review.  Links would not provide enough (any?) additional value to suggest we need to add this.   Team suggests that the requirements for PDF do not actually need to change from what they are today: PDF as a direct copy of the TXT format with the inclusion of graphics.
 + 
 +Team also briefly considered how a tool like [[http://www.princexml.com|PrinceXML]] could generate PDF from HTML.  That went too much into implementation, and it was left as an example of something that might be possible and limit the number of tools needing to be modified or created.
  
 ===== EPUB =====
  
 If we also want to do MOBI (the native Amazon format), we might consider running the free-but-closed-source program from Amazon [[http://www.amazon.com/gp/feature.html?ie=UTF8&docId=1000765211]]. If people are hesitant about us even partially supporting that program, we could consider having a pointer to the program on an advisory page on rfc-editor.org.
 +
 +If the HTML output is designed well, it can be used to create EPUB output with few, if any, additional requirements.
  
  
design/formats.1376697080.txt.gz · Last modified: 2013/08/16 16:51 by paul