design:utf-8

This is an old revision of the document!


Some discussion of requirements, goals, and desires around non-ASCII characters is at https://www.rfc-editor.org/rse/wiki/doku.php?id=formatreq. The table below summarizes a taxonomy of cases where (non-ASCII) UTF-8 might or might not be allowed, along with some thoughts. The intent is that each row represents a separate policy decision.

We expect that any case we mark “Yes” for Consensus will also require providing an ASCII-only transliteration unless we explicitly note otherwise.

| Case | Section | Use | Consensus | Comments |
| 1a | (title page) | Author name | Yes? | Answer should match (6a) |
| 1b | (title page) | Author affiliation |  | Answer should match (6b) |
| 1c | (title page) | Document title |  |  |
| 2 | Abstract | Prose |  | Could be same as (3e), but Abstracts may also be separately compiled into other indices, so could have a different answer |
| 3a | Body or Appendix | Example string |  | E.g. fictional person name, IRI, EAI, domain name, etc. Currently there's no XML markup to denote example strings, so hard to distinguish from (3c) |
| 3b | Body or Appendix | Code snippet |  |  |
| 3c | Body or Appendix | Literal protocol element |  | Required transliteration should use U+xxxx syntax |
| 3d | Body or Appendix | Document title of a cited document |  | Answer should match (4c) |
| 3e | Body or Appendix | Prose | No? | E.g. use of “naïve” in http://tools.ietf.org/html/rfc4690#section-1.5.5 |
| 3f | Body or Appendix | Section title |  |  |
| 4a | References | Author name | Yes? | Answer should match (1a) |
| 4b | References | Author affiliation |  | Answer should match (1b) |
| 4c | References | Document title |  | Not necessarily an RFC that's being referenced |
| 4d | References | Document IRI | No? |  |
| 5a | Acknowledgements | Person name | Yes? |  |
| 5b | Acknowledgements | Organization name |  |  |
| 6a | Authors' Addresses | Author name | Yes? |  |
| 6b | Authors' Addresses | Author affiliation |  |  |
| 6c | Authors' Addresses | Author email address (EAI) |  |  |
| 6d | Authors' Addresses | Author IRI | No? |  |
| 6e | Authors' Addresses | Author postal address |  |  |
| 7a | (page footer) | Author surname |  |  |
| 7b | (page header) | Abbreviated document name |  |  |
| 8a | (metadata) | Keywords |  |  |
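
To illustrate the comment on case 3c, here is a minimal Python sketch of one way a literal string could be spelled out in U+xxxx notation. The function name `code_points` and the choice to list every code point (not only the non-ASCII ones) are assumptions made for illustration, not an agreed format.

```python
def code_points(text):
    """Spell out every character of text as a space-separated U+XXXX sequence."""
    # At least four hex digits; code points above U+FFFF naturally get more.
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


# "naive" with a precomposed i-diaeresis (U+00EF), written with an escape
# so that this source file itself stays ASCII-only.
print(code_points("na\u00efve"))
```

Listing every code point avoids any ambiguity about where an inline U+xxxx token ends, at the cost of readability for long strings.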

Other open questions:

  1. Where UTF-8 is allowed, which normalization form(s) are acceptable (NFC, NFD, NFKC, or NFKD)?
    1. Paul thinks: irrelevant. Characters are characters.
  2. Can you reference an external document that contains a non-ASCII title, author, etc.? If you need a transliteration, where do you get it from?
    1. Paul thinks: Yes, definitely. This is important for non-ASCII author names. Transliteration can be guessed at by the RFC author.
    2. Heather: I would rather the author provide the transliteration. The RFC Editor shouldn't be guessing on behalf of the authors.
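
As background for the normalization question above, the following Python sketch shows how the four forms treat the same visible text, using only the standard `unicodedata` module. The sample strings are chosen for illustration; the sketch does not take a position on which form, if any, should be required.

```python
import unicodedata

composed = "\u00e9"      # "e with acute" as a single code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT
ligature = "\ufb01"      # LATIN SMALL LIGATURE FI, a compatibility character

# The two spellings render alike but compare unequal as raw strings.
print(composed == decomposed)

# NFC composes and NFD decomposes, so each maps one spelling onto the other.
print(unicodedata.normalize("NFC", decomposed) == composed)
print(unicodedata.normalize("NFD", composed) == decomposed)

# The compatibility forms (NFKC, NFKD) additionally replace characters such
# as ligatures with their plain equivalents; NFC and NFD leave them alone.
print(unicodedata.normalize("NFKC", ligature) == "fi")
print(unicodedata.normalize("NFC", ligature) == ligature)
```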

Strawman requirements (beyond those listed at https://www.rfc-editor.org/rse/wiki/doku.php?id=design:start):

  1. An implementer must be able to implement the specification without any confusion or ambiguity introduced by the use of UTF-8 rather than ASCII.
  2. Must be able to reference (cite) the document from elsewhere in a standard way, including from documents that only support ASCII.
  3. Must be able to reference (cite) other documents in an unambiguous way.
  4. Cross-references (including references to other documents) must be unambiguous even from a printed document.
  5. Must be able to index the document in various ways, so searching by keyword, author name, etc. can work.
  6. All documents will be UTF-8 encoded and MUST apply Normalization Form C to all metadata fields, such as document name, authors, and references, unless the RSE grants a specific exception. The body of the document MAY contain other normalization forms where the authors declare them necessary. Non-ASCII characters are allowed only in Author Names, Examples, and References. Each author name also requires an ASCII representation to encourage broader indexing.
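
Requirement 6 could be checked mechanically. The sketch below is one possible approach in Python, assuming metadata is available as a simple mapping of field name to string. The field names and the helper names `non_nfc_fields` and `has_ascii_form` are hypothetical and do not come from any existing tool.

```python
import unicodedata


def non_nfc_fields(fields):
    """Return the names of metadata fields whose values are not already in NFC."""
    return [
        name
        for name, value in fields.items()
        if unicodedata.normalize("NFC", value) != value
    ]


def has_ascii_form(ascii_name):
    """True if the supplied ASCII representation is non-empty and pure ASCII."""
    return bool(ascii_name) and ascii_name.isascii()


# Hypothetical metadata record; the field names are illustrative only.
metadata = {
    "title": "Example Document",
    "author": "Jose\u0301 Example",  # decomposed: "e" + U+0301, so not NFC
}

print(non_nfc_fields(metadata))

# Normalizing every value to NFC clears the report.
normalized = {k: unicodedata.normalize("NFC", v) for k, v in metadata.items()}
print(non_nfc_fields(normalized))

print(has_ascii_form("Jose Example"))
```

A check like this only reports or repairs the normalization form. It cannot generate the ASCII representation, which would still have to be supplied by a person.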

Strawman principles (similar to RFC 6912 approach):

  1. If something could affect interoperability or would block an implementer from being able to implement, any use of UTF-8 must be accompanied by an ASCII transliteration. The transliteration must be called out in a way that makes clear it is a transliteration (rather than part of the original item).
  2. Do not assume that any non-ASCII character will necessarily be rendered correctly (or at all).
design/utf-8.1381849536.txt.gz · Last modified: 2013/10/15 08:05 by rsewikiadmin