[rfc-i] Unicode in ABNF (in RFC) draft-seantek-unicode-in-abnf-01.txt

Martin J. Dürst duerst at it.aoyama.ac.jp
Mon Oct 3 02:53:55 PDT 2016

Hello Sean,

A few quick comments from a cursory reading:

First, I note that the choice you have made for representing Unicode 
codepoints seems to be the same that we made for RFC 3987, which is the 
one that I think RFC 5234 and its predecessors also implicitly suggest. 
If you have seen some discrepancies, I would appreciate a pointer. You 
may also want to reference some of the

In the Introduction, you mention security problems, but they are not 
detailed (no specifics, no examples) either there or in the Security 
Considerations section.

In contrast to ASCII, Unicode (in any of its encoding forms) essentially 
introduces multiple levels at which protocols can be described: bytes, 
[code units (in the case of UTF-16xx),] code points, grapheme 
clusters,... I'm fine with limiting this document to the code point 
level, which is clearly what we need now, but it would be good to say 
somewhere at least that this document doesn't deal with other levels.
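
To illustrate the levels listed above, here is a minimal Python sketch (not taken from the draft). It counts the same short string as UTF-8 bytes, UTF-16 code units, and code points. The function name `levels` and the sample string are chosen only for this illustration. Grapheme clusters are not computed, because the Python standard library has no segmentation routine for them.

```python
def levels(text: str) -> dict:
    """Count one string at three of the levels named above."""
    utf8 = text.encode("utf-8")
    utf16 = text.encode("utf-16-be")  # big-endian, no byte order mark
    return {
        "utf8_bytes": len(utf8),
        "utf16_code_units": len(utf16) // 2,  # each code unit is 2 bytes
        "code_points": len(text),  # a Python str is a sequence of code points
    }


# Hypothetical sample: "e" + U+0301 COMBINING ACUTE ACCENT + U+1F600.
# A reader perceives two characters (two grapheme clusters), but the
# string has three code points.
sample = "e\u0301\U0001F600"
counts = levels(sample)
print(counts)
```

The three counts differ for this sample, which is why a grammar has to state the level at which it operates.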

Starting sections/paragraphs with parentheticals (e.g. "(Consult Section 
2.3 of [RFC5234] in relation to this paragraph.)") is far from good 
writing. At a minimum, put these parentheticals at the end of the 
paragraphs, but even better would be to convert them to actual text (in 
most cases still at the end of the paragraphs) and say explicitly what 
the "relation" is. (RFC 7405 looks much better in this respect.)

In the appendix, there are a lot of monstrosities such as 
"UVCHARBEYONDLATIN1". Why not change that to something a bit more 
readable, at a minimum something like UV_CHAR_BEYOND_LATIN_1 or so?

I don't see the point of defining aliases for C1 controls; it should be 
difficult to use these explicitly, not easy.

For some of the aliases, a property-based approach seems to be the right 
thing to do, although this may be difficult to align with the ABNF 

The draft says:
    Formally, this document updates [RFC5234] but does not modify it in
    situ. Authors need to reference this document if they want to include
    these enhancements; bare references to [RFC5234] do not include this
    specification (or, for that matter, [RFC7405]).
There's no text whatsoever in RFC 7405 that would say that it doesn't 
update RFC 5234 directly. But I may be missing something. Please clarify.

I don't see the need to use %su for Unicode strings. The code points 
speak for themselves, just use %s. Leaving %i/%iu undefined for Unicode 
is indeed advisable, although it could be based on default case folding, 
but we know that this would be imperfect, in particular for Turkish.
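
The Turkish problem can be shown with a small Python sketch. It assumes that Python's `str.casefold()`, which applies Unicode full case folding without any locale tailoring, stands in for "default case folding"; the helper name `caseless_equal` is hypothetical.

```python
def caseless_equal(a: str, b: str) -> bool:
    """Compare two strings after default (locale-independent) case folding."""
    return a.casefold() == b.casefold()


# ASCII behaves as expected: "I" and "i" fold to the same string.
ascii_match = caseless_equal("I", "i")

# Turkish pairs dotted U+0130 with "i", and "I" with dotless U+0131.
# Default folding honors neither pairing.
dotted_match = caseless_equal("\u0130", "i")   # U+0130 folds to "i" + U+0307
dotless_match = caseless_equal("I", "\u0131")  # U+0131 folds to itself

print(ascii_match, dotted_match, dotless_match)
```

Under default folding, "I" matches "i", which is wrong for Turkish text, while neither Turkish pair matches.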

Section 6 uses an example with actual Unicode characters. I'd definitely 
wait for the new way of publishing drafts/RFCs before the final 
publication of this document, so that this example (and hopefully a few 
more) can use actual Unicode characters.
(I'd also change 'notated' to 'annotated'. (several occurrences))

That's about it, hope it helps.

Regards,   Martin.

On 2016/10/03 15:28, Sean Leonard wrote:
> Dear ABNF-Discuss (and rfc-interest):
> This draft by Chris Newman and I addresses an interesting topic: how to
> do Unicode in ABNF. Unicode has shown up in several different ways in
> protocols that are described in ABNF. These ways are not consistent
> across the RFC series, but now that Unicode is a pretty stable standard
> (for its basic parts) and now that UTF-8 RFCs are becoming a reality per
> draft-iab-rfc-nonascii-02, it is a good time to look at this issue. This
> is a fork from draft-seantek-abnf-more-core-rules.
> This draft is currently proposed as Experimental. Special thanks to Paul
> Kyzivat for discussing the matters in this draft, although he is not
> formally a co-author.
> The draft tries to be very conservative in its approach. Please read the
> draft for details. Some stuff was intentionally omitted as out-of-scope
> or too complicated for a general-purpose ABNF syntax parser, whether
> humans or machines.
> Comments and feedback are appreciated.
> Regards,
> Sean
> ********
> A new version of I-D, draft-seantek-unicode-in-abnf-01.txt
> has been successfully submitted by Sean Leonard and posted to the
> IETF repository.
> Name:        draft-seantek-unicode-in-abnf
> Revision:    01
> Title:        Unicode in ABNF
> Document date:    2016-10-01
> Group:        Individual Submission
> Pages:        11
> URL:
> https://www.ietf.org/internet-drafts/draft-seantek-unicode-in-abnf-01.txt
> Status:
> https://datatracker.ietf.org/doc/draft-seantek-unicode-in-abnf/
> Htmlized:
> https://tools.ietf.org/html/draft-seantek-unicode-in-abnf-01
> Diff:
> https://www.ietf.org/rfcdiff?url2=draft-seantek-unicode-in-abnf-01
> Abstract:
>    This experimental document adds support for Unicode strings in ABNF
>    (Augmented Backus-Naur Form), and provides certain symbols related to
>    Unicode code point ranges.
> _______________________________________________
> rfc-interest mailing list
> rfc-interest at rfc-editor.org
> https://www.rfc-editor.org/mailman/listinfo/rfc-interest
> .

Martin J. Dürst
Department of Intelligent Information Technology
College of Science and Engineering
Aoyama Gakuin University
Fuchinobe 5-1-10, Chuo-ku, Sagamihara
252-5258 Japan
