[rfc-i] Unicode in ABNF (in RFC) draft-seantek-unicode-in-abnf-01.txt

Sean Leonard dev+ietf at seantek.com
Tue Oct 4 11:08:33 PDT 2016


On 10/3/2016 2:53 AM, Martin J. Dürst wrote:
> I don't see the need to use %su for Unicode strings. The code points 
> speak for themselves, just use %s. Leaving %i/%iu undefined for 
> Unicode is indeed advisable, although it could be based on default 
> case folding, but we know that this would be imperfect, in particular 
> for Turkish.

I like %su because it notifies the reader, and a parser, to expect UTF-8 
and "deal with it" in a way that %s alone doesn't. For example, accented 
e can be é (U+00E9) or é (U+0065 U+0301). When printed or in a medium 
that doesn't provide direct access to the encoded data (screenshot? 
mobile app? etc.), the quoted string is ambiguous. Saying %s"foo" means 
you know that foo is always in the ASCII range, and can't possibly be 
composed of anything else (including, for example, FULLWIDTH ASCII in 
U+FF01-U+FF5E). Are %s"foo" and %s"foo" the same? How about %s"·˙•․‥…‧"? 
%s"µ" and %s"μ"? And the bajillion different dashes? Then there is the 
issue that even if a code point is objectively, graphically distinct in 
this version of Unicode, some future version may assign a code point to 
a character that commonly looks exactly the same as an existing character.
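
To make the ambiguity concrete, here is a minimal Python sketch (not part 
of the draft) that uses only the standard unicodedata module. The helper 
name codepoints is made up for illustration. It shows that strings which 
print alike can differ in their encoded code points, and that they only 
compare equal after a normalization step that a bare quoted string does 
not ask for.

```python
import unicodedata

def codepoints(s):
    """Return the code points of s in U+XXXX notation (illustrative helper)."""
    return " ".join(f"U+{ord(c):04X}" for c in s)

precomposed = "\u00E9"   # e-acute as a single code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

# Same rendering, different encoded data.
print(codepoints(precomposed), "vs", codepoints(decomposed))
print(precomposed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True

micro = "\u00B5"  # MICRO SIGN
mu = "\u03BC"     # GREEK SMALL LETTER MU
print(micro == mu)                                 # False
print(unicodedata.normalize("NFKC", micro) == mu)  # True

fullwidth_foo = "\uFF46\uFF4F\uFF4F"  # FULLWIDTH LATIN SMALL LETTERS f, o, o
print(fullwidth_foo == "foo")                                 # False
print(unicodedata.normalize("NFKC", fullwidth_foo) == "foo")  # True
```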

Responding to your point, defining %su"" would mean %s"" is undefined 
for Unicode, which avoids the temptation of %i"" (or nothing "" aka the 
traditional approach). Perhaps if a need develops for a case-insensitive 
version in a protocol, %iu could take a parameter that indicates the 
language tailoring, such as %iu[tr]"çilek". (But, I suppose, one could 
make a converse argument that %i[tr]"çilek" would be a natural evolution.)
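
As a rough illustration of why a language parameter might matter, the 
following Python sketch compares default case folding (str.casefold) 
with a Turkic-tailored fold. The function turkish_fold is hypothetical 
and hand-rolled for this example; it applies the two Turkic mappings 
from Unicode's CaseFolding.txt (U+0049 to U+0131, U+0130 to U+0069) 
before the default fold. It is not something the draft defines.

```python
upper = "\u00C7\u0130LEK"  # "ÇİLEK", with LATIN CAPITAL LETTER I WITH DOT ABOVE
lower = "\u00E7ilek"       # "çilek"

# Default folding maps U+0130 to "i" plus U+0307, so the match fails.
print(upper.casefold() == lower.casefold())  # False

def turkish_fold(s):
    """Hypothetical Turkic-tailored fold: apply the two 'T' mappings first."""
    s = s.replace("I", "\u0131")   # I -> dotless i
    s = s.replace("\u0130", "i")   # I with dot above -> i
    return s.casefold()

print(turkish_fold(upper) == turkish_fold(lower))  # True
```

Under this assumption, %iu[tr]"çilek" would match "ÇİLEK" while an 
untailored case-insensitive match would not.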

Those are a couple of arguments. I am happy to go with whatever (rough) 
consensus emerges, however.

Regards,

Sean



