[rfc-i] Unicode in ABNF (in RFC) draft-seantek-unicode-in-abnf-01.txt
dev+ietf at seantek.com
Tue Oct 4 11:08:33 PDT 2016
On 10/3/2016 2:53 AM, Martin J. Dürst wrote:
> I don't see the need to use %su for Unicode strings. The code points
> speak for themselves, just use %s. Leaving %i/%iu undefined for
> Unicode is indeed advisable, although it could be based on default
> case folding, but we know that this would be imperfect, in particular
> for Turkish.
I like %su because it notifies the reader, and a parser, to expect UTF-8
and "deal with it" in a way that %s alone doesn't. For example, accented
e can be é (U+00E9) or é (U+0065 U+0301). When printed or in a medium
that doesn't provide direct access to the encoded data (screenshot?
mobile app? etc.), the quoted string is ambiguous. Saying %s"foo" means
you know that foo is always in the ASCII range, and can't possibly be
composed of anything else (including, for example, FULLWIDTH ASCII in
U+FF01-U+FF5E). Are %s"foo" and %s"ｆｏｏ" the same? How about %s"·˙•․‥…‧"?
%s"µ" and %s"μ"? And the bajillion different dashes? Then there is the
issue that even if a code point is objectively, graphically distinct in
this version of Unicode, some future version may assign a code point to
a character that commonly looks exactly the same as an existing character.
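To make the ambiguity concrete, here is a minimal Python sketch. It is not part of the draft; the helper names same_code_points and same_after are hypothetical, and the strings are written with escapes so that the code points are unambiguous even in print. It compares the pairs mentioned above code point by code point and then under Unicode normalization.

```python
import unicodedata

# Two encodings of "accented e" that normally render identically.
precomposed = "\u00e9"      # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"      # U+0065 followed by U+0301 COMBINING ACUTE ACCENT

# ASCII "foo" versus its fullwidth look-alike.
ascii_foo = "foo"
fullwidth_foo = "\uff46\uff4f\uff4f"   # FULLWIDTH LATIN SMALL LETTER F, O, O

# Two characters that look like the letter mu.
micro_sign = "\u00b5"       # U+00B5 MICRO SIGN
greek_mu = "\u03bc"         # U+03BC GREEK SMALL LETTER MU


def same_code_points(a, b):
    """Exact comparison: what a literal, code-point-exact match would do."""
    return a == b


def same_after(form, a, b):
    """Comparison after normalizing both sides to the given form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


if __name__ == "__main__":
    pairs = [
        ("e-acute", precomposed, decomposed),
        ("foo", ascii_foo, fullwidth_foo),
        ("mu", micro_sign, greek_mu),
    ]
    for name, a, b in pairs:
        print(
            name,
            "exact:", same_code_points(a, b),
            "NFC:", same_after("NFC", a, b),
            "NFKC:", same_after("NFKC", a, b),
            "utf-8:", a.encode("utf-8"), b.encode("utf-8"),
        )
```

None of the three pairs match exactly. The two forms of accented e match under NFC, while the fullwidth and mu pairs match only under the compatibility form NFKC. Whether a rule such as %s"foo" is meant to tolerate any of this is precisely what the printed grammar cannot tell you.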
Responding to your point: defining %su"" would leave %s"" undefined
for Unicode, which avoids the temptation of %i"" (or of a bare "", i.e. the
traditional approach). Perhaps if a need develops for a case-insensitive
version in a protocol, %iu could take a parameter that indicates the
language tailoring, such as %iu[tr]"çilek". (But, I suppose, one could
make a converse argument that %i[tr]"çilek" would be a natural evolution.)
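As a rough illustration of why a language parameter might matter, the following Python sketch contrasts default case folding with a Turkish-tailored fold. The functions fold_default and fold_tr are hypothetical names, and the tailoring covers only the dotted/dotless i mappings; it is an assumption about what a "[tr]" parameter could mean, not something the draft specifies.

```python
def fold_default(s):
    """Default (untailored) Unicode case folding, via str.casefold()."""
    return s.casefold()


def fold_tr(s):
    """Hypothetical Turkish tailoring: map the two capital I letters
    the Turkish way first, then apply default folding to the rest."""
    s = s.replace("\u0130", "i")      # U+0130 capital dotted I -> i
    s = s.replace("I", "\u0131")      # capital I -> U+0131 dotless i
    return s.casefold()


if __name__ == "__main__":
    cilek_lower = "\u00e7ilek"        # the lowercase word from the example
    cilek_upper = "\u00c7\u0130LEK"   # its Turkish uppercase, with U+0130

    # Default folding turns U+0130 into "i" + U+0307, so it misses.
    print(fold_default(cilek_upper) == cilek_lower)
    # The tailored fold maps U+0130 straight to "i".
    print(fold_tr(cilek_upper) == cilek_lower)

    # Conversely, default folding sends plain "I" to "i", which is the
    # wrong letter for Turkish text.
    print(fold_default("ILIK"), fold_tr("ILIK"))
```

With default folding the uppercase form does not match the lowercase literal, and plain capital I folds to the dotted i rather than the dotless one. A tailored fold fixes both cases, but only because the matcher was told which language to assume.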
Those are a couple of arguments. I am happy to go with whatever (rough)
consensus emerges, however.