[rfc-i] Feedback on Section 3.4 in draft-iab-rfc-nonascii-02, U+ syntax

Martin J. Dürst duerst at it.aoyama.ac.jp
Thu Sep 1 03:18:08 PDT 2016


I have to agree with Sean in the sense that the consistent use of the 
word 'character' may start to sound extremely repetitive if the number 
of characters crosses a certain limit.

My guess is that the document editors and the RFC editor will become 
aware of such cases quickly, and that in the end, the guidelines will be 
adapted.

Regards,   Martin.

P.S.: While I'm at it, in the sentence:
                               BCP 137, "ASCII Escaping of Unicode
    Character" describes the pros and cons of different options for
    identifying Unicode characters in an ASCII document BCP137 [BCP137].
there are a few too many instances of "BCP 137" for my (and I hope
everybody else's) taste. (Unless this is an error produced by the html
tools version.)


On 2016/09/01 04:25, Paul Hoffman wrote:
> On 31 Aug 2016, at 10:02, Sean Leonard wrote:
>
>> /(Sent this to the authors, and the suggestion was that this is the
>> right mailing list for public discussion.)/
>>
>> **********
>> Hello draft-iab-rfc-nonascii-02 people, here is feedback on
>> draft-iab-rfc-nonascii-02.
>>
>> Section 3.4 of draft-iab-rfc-nonascii-02 provides no fewer than six
>> preferred alternatives for how to represent a single Unicode character
>> or code point. They all pretty much say “the ___ character (___)” in
>> various permutations. None of these are inherently wrong.
>>
>> However, The Unicode Standard itself (9.0.0 and prior versions)
>> provides a specific convention in Appendix A:
>> “U+[x][x]xxxx NAME OF CHARACTER”
>>
>> Notably, the convention does not use “the ___ character” formulation.
>> Grammatically, the convention is a character, so an article is
>> omitted. A conforming example would be:
>>
>>  1.  Temperature changes in the Temperature Control Protocol are
>>      indicated by U+2206 INCREMENT.
>>
>> I would like to propose that this be used as at least a priority
>> alternative.
>
> Disagree. That formulation is harder to read in running text, and
> running text is exactly the formulation we are aiming for. The fact that
> TUC likes a particular format should not impinge on our choice for
> readability.
>
>>
>> In The Unicode Standard, two other conventions are noted:
>>
>> U+1F631 “😱” FACE SCREAMING IN FEAR
>>
>> U+1F631 “😱”
>>
>> These conventions show all-caps, and small-caps (which for PDF
>> presentation purposes, are actually stored as lowercase). They also
>> show curly quotes. I asked the Unicode mailing list over the weekend
>> and the general sense is that the uppercase is normative in plain text
>> (as shown in the UCD) but case distinctions, along with space and
>> (nearly all) hyphens, are not relevant for unambiguous identification.
>
> Neither of these is easier to read in running text than the ones in the
> draft.
>
>>
>> draft-iab-rfc-nonascii-02 is only concerned with characters, not
>> semantics or presentation formats (unlike xml2rfc format). Assuming
>> that plain text is the norm for purposes of draft-iab-rfc-nonascii-02,
>> I suppose that it is sufficient for the plain text to have an ALL-CAPS
>> name. I was going to suggest a novel xml2rfc element for Unicode code
>> points, such as <ucode name="yes">😱</ucode> that would be transformed
>> into the output above in plain text mode. However, the xml2rfc
>> transformer can detect such text by looking for the presence of
>> “U+1F631 FACE SCREAMING IN FEAR”, and apply CSS to it in the html
>> output instead, viz.:
>> span.uniname {                   /* CHAR STYLES */
>> text-transform: lowercase;
>> font-variant: small-caps;
>> font-size: 110%;
>> }
>>
>> As discussed here:
>> <http://www.unicode.org/mail-arch/unicode-ml/y2016-m08/0055.html>
>>
>> Personally I do not see the need for quotations around the character.
>> U+____ SP 😱 SP NAME ought to be good enough: the single 😱 is going
>> to be non-ASCII anyway. However there are implications for combining
>> marks, with or without quotes…this needs to be thought through. Consider:
>> U+0308 “◌̈” COMBINING DIAERESIS vs.
>> U+0308 ◌̈ COMBINING DIAERESIS vs.
>> U+0308 “̈” COMBINING DIAERESIS vs.
>> U+0308 ̈ COMBINING DIAERESIS.
>> See
>> <http://stackoverflow.com/questions/2224772/whats-the-unicode-glyph-used-to-indicate-combining-characters>
>>
>>
>> The question is what happens when the 😱 is a specific protocol
>> element, which frequently (but not always) is quoted, such as "+" and
>> treated as verbatim text <spanx style="verb"> or the new <tt> in
>> xml2rfc v3.
>
> This is another good reason for the current rules.
>
>>
>> Section 3.6 (and elsewhere) discusses “U+ notation” without a
>> reference. Appendix A of [UnicodeCurrent] is appropriate.
>
> That seems fine.
> _______________________________________________
> rfc-interest mailing list
> rfc-interest at rfc-editor.org
> https://www.rfc-editor.org/mailman/listinfo/rfc-interest
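
For readers who want to try out the notations discussed in the quoted
thread, here is a minimal Python sketch, not anything proposed by the
draft or by xml2rfc. It formats a single code point as "U+XXXX NAME",
optionally with a curly-quoted glyph (placing U+25CC DOTTED CIRCLE under
combining marks, which is one of the four variants Sean lists), and it
finds such "U+XXXX NAME" runs in plain text. The function names u_plus
and find_u_plus, the regular expression, and the choice of the dotted
circle are all assumptions made for illustration. Names come from the
Unicode database bundled with the Python interpreter, so very recent
characters may be missing.

```python
import re
import unicodedata

DOTTED_CIRCLE = "\u25cc"  # assumed base glyph for showing combining marks


def u_plus(ch, show_glyph=False):
    """Format one code point as 'U+XXXX NAME' (hypothetical helper)."""
    if len(ch) != 1:
        raise ValueError("expected exactly one code point")
    code = f"U+{ord(ch):04X}"  # at least four hex digits, uppercase
    name = unicodedata.name(ch, "<unnamed>")
    if not show_glyph:
        return f"{code} {name}"
    # General categories Mn, Mc and Me are the combining marks.
    is_mark = unicodedata.category(ch).startswith("M")
    glyph = DOTTED_CIRCLE + ch if is_mark else ch
    return f"{code} \u201c{glyph}\u201d {name}"


# 'U+' plus 4-6 hex digits, a space, then an all-caps name (assumed pattern).
U_PLUS_RE = re.compile(r"U\+([0-9A-F]{4,6}) ([A-Z0-9]+(?:[ -][A-Z0-9]+)*)")


def find_u_plus(text):
    """Return (code, name) pairs whose name matches the code point."""
    found = []
    for hexdigits, name in U_PLUS_RE.findall(text):
        try:
            ch = unicodedata.lookup(name)
        except KeyError:
            continue  # not a character name known to this Python build
        if ord(ch) == int(hexdigits, 16):
            found.append((f"U+{hexdigits}", name))
    return found


if __name__ == "__main__":
    print(u_plus("\u2206"))
    print(u_plus("\u0308", show_glyph=True))
    print(find_u_plus("indicated by U+2206 INCREMENT."))
```

Checking the name against the code point keeps the detector from
tagging arbitrary all-caps text that happens to follow a U+ value; a
real tool would need to decide how to handle a name that is followed by
further capitalized words.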

