[rfc-i] Feedback on Section 3.4 in draft-iab-rfc-nonascii-02, U+ syntax

Sean Leonard dev+ietf at seantek.com
Wed Aug 31 21:58:04 PDT 2016


On 8/31/2016 12:25 PM, Paul Hoffman wrote:
> On 31 Aug 2016, at 10:02, Sean Leonard wrote:
>
>> /(Sent this to the authors, and the suggestion was that this is the 
>> right mailing list for public discussion.)/
>>
>> **********
>> Hello draft-iab-rfc-nonascii-02 people, here is feedback on 
>> draft-iab-rfc-nonascii-02.
>>
>> Section 3.4 of draft-iab-rfc-nonascii-02 provides no less than six 
>> preferred alternatives for how to represent a single Unicode 
>> character or code point. They all pretty much say “the ___ character 
>> (___)” in various permutations. None of these are inherently wrong.
>>
>> However, The Unicode Standard itself (9.0.0 and prior versions) 
>> provides a specific convention in Appendix A:
>> “U+[x][x]xxxx NAME OF CHARACTER”
>>
>> Notably, the convention does not use “the ___ character” formulation. 
>> Grammatically, the convention is a character, so an article is 
>> omitted. A conforming example would be:
>>
>>  1.  Temperature changes in the Temperature Control Protocol are
>>      indicated by U+2206 INCREMENT.
>>
>> I would like to propose that this be used as at least a priority 
>> alternative.
>
> Disagree. That formulation is harder to read in running text, and 
> running text is exactly the formulation we are aiming for. The fact 
> that TUC likes a particular format should not impinge on our choice 
> for readability.

I respectfully disagree.

As an editorial matter, draft-iab-rfc-nonascii does not express "our 
[the IETF's] choice for readability". It offers no fewer than six 
"preferred" options and one "acceptable" option, and then says that the 
choice is context-dependent.

My concern is that six or seven different options are not helpful in the 
text, especially when an option that is very commonly used in the 
industry (Appendix A of [UnicodeCurrent]) is not mentioned. I actually 
just searched the RFC series with the regex:

/U\+([0-9A-Fa-f]){4,6}/

and found that variations of U+hhhh[h][h] NAME have been very common. 
(Variations include putting the NAME first, putting either the U+ or the 
NAME in parens or quotes, etc., but in general, closer to Appendix A of 
[UnicodeCurrent] than to draft-iab-rfc-nonascii.) The second most common 
variation is straight up U+hhhh[h][h] notation with no further 
embellishments.
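
A search of this kind can be approximated with a minimal Python sketch. 
It is not the search actually run for this message: the names U_PLUS, 
U_PLUS_NAME and count_notations are hypothetical, and the pattern used 
to detect a trailing all-caps NAME is only an assumed heuristic that 
will miss or misclassify some variations.

```python
import re

# The regex from the message, adapted to Python's re module.
U_PLUS = re.compile(r"U\+[0-9A-Fa-f]{4,6}")

# Assumed heuristic: a code point followed by an all-caps name of at
# least two characters, optionally in parentheses ("U+2206 INCREMENT").
U_PLUS_NAME = re.compile(
    r"U\+[0-9A-Fa-f]{4,6}\s+\(?[A-Z][A-Z0-9-]+(?: [A-Z0-9-]+)*\)?"
)


def count_notations(text):
    """Return (bare, named) counts of U+ notations found in text."""
    total = len(U_PLUS.findall(text))
    named = len(U_PLUS_NAME.findall(text))
    return total - named, named


sample = (
    "indicated by U+2206 INCREMENT. "
    "The characters U+13DA U+13A2 look similar."
)
bare, named = count_notations(sample)
print(f"bare: {bare}, named: {named}")
```

Running it over the text of each RFC and summing the counts would give 
a rough tally of the two styles.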

My overall editorial proposal is that Section 3.4 be simplified to:

3.4.  Body of the Document

    When the mention of Unicode characters is required for correct
    protocol operation and understanding, the characters' Unicode
    character names or code points MUST be included in the text.
    For a single Unicode character, at least two of the following
    three pieces of data MUST be included: the character itself, the
    character name or name alias, and the character code point.

    o  Characters beyond the ASCII range will require identifying
       the Unicode code point.

    o  Use of the actual character (e.g., Δ) is encouraged so
       that a reader can more easily see what the character is, if their
       device can render the text.

    o  The use of the Unicode character names or name aliases like
       "INCREMENT" in addition to the use of Unicode code points is
       also encouraged.  When used, Unicode character names should be
       in all capital letters.

    Examples:

    OLD [RFC7564]:

    However, the problem is made more serious by introducing the full
    range of Unicode code points into protocol strings.  For example,
    the characters U+13DA U+13A2 U+13B5 U+13AC U+13A2 U+13AC U+13D2 from
    the Cherokee block look similar to the ASCII characters  "STPETER" as
    they might appear when presented using a "creative" font family.

    NEW/ALLOWED:

    However, the problem is made more serious by introducing the full
    range of Unicode code points into protocol strings.  For example,
    the characters U+13DA U+13A2 U+13B5 U+13AC U+13A2 U+13AC U+13D2
    (ᏚᎢᎵᎬᎢᎬᏒ) from the Cherokee block look similar to the ASCII
    characters "STPETER" as they might appear when presented using a
    "creative" font family.

    ALSO ACCEPTABLE:

    However, the problem is made more serious by introducing the full
    range of Unicode code points into protocol strings.  For example,
    the characters "ᏚᎢᎵᎬᎢᎬᏒ" (U+13DA U+13A2 U+13B5 U+13AC U+13A2
    U+13AC U+13D2) from the Cherokee block look similar to the ASCII
    characters "STPETER" as they might appear when presented using a
    "creative" font family.

    How the Unicode character, code point, and name or name alias are
    written in the body may depend on context and the specific
    character(s) in question.  All are acceptable within an RFC.
    BCP 137, "ASCII Escaping of Unicode Characters", describes the pros
    and cons of different options for identifying Unicode characters in
    an ASCII document [BCP137]; see also Appendix A of [UnicodeCurrent].
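
For authors who want to generate these forms rather than type them by 
hand, the following is a minimal Python sketch using the standard 
unicodedata module. The helper names u_plus and describe are 
hypothetical, and the names returned depend on the Unicode version 
bundled with the Python interpreter in use.

```python
import unicodedata


def u_plus(ch):
    """Return 'U+XXXX NAME' for one character (code point only if unnamed)."""
    code = f"U+{ord(ch):04X}"
    name = unicodedata.name(ch, "")
    return f"{code} {name}" if name else code


def describe(text):
    """Return the code points of text followed by the characters in parens."""
    codes = " ".join(f"U+{ord(c):04X}" for c in text)
    return f"{codes} ({text})"


print(u_plus("\u2206"))
print(describe("\u13da\u13a2\u13b5\u13ac\u13a2\u13ac\u13d2"))
```

The first helper pairs the code point with the character name; the 
second pairs the code points with the characters themselves, as in the 
NEW/ALLOWED example. Either pairing supplies two of the three pieces of 
data.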



With respect to Section 3.6:

3.6.  Code Components

    The RFC Editor encourages the use of the U+ notation (Appendix A of
    [UnicodeCurrent]) except within a code component where you must
    follow the rules of the programming language in which you are
    writing the code.

    Code components are generally expected to use fixed-width fonts.
    Where such fonts are not available for a particular script, the best
    script-appropriate font will be used for that part of the code
    component.
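
To illustrate the code-component exception, here is a short Python 
sketch. Python is only an assumed example language and the variable 
names are hypothetical. Inside the code, the language's own escape 
syntax replaces the U+ notation, which can still appear in comments.

```python
# Prose would say "U+2206 INCREMENT"; Python source uses its own escapes.
increment = "\u2206"             # four-hex-digit escape for U+2206
increment_by_name = "\N{INCREMENT}"  # named escape for the same character
first_cherokee = "\U000013DA"    # eight-hex-digit escape for U+13DA

# All three spellings denote single characters with the expected code points.
print(increment == increment_by_name)
print(f"U+{ord(first_cherokee):04X}")
```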


Regards,

Sean