[rfc-i] Unicode in ABNF (in RFC) draft-seantek-unicode-in-abnf-01.txt

Sean Leonard dev+ietf at seantek.com
Mon Oct 3 16:44:03 PDT 2016

Thanks for checking it out!

> On Oct 3, 2016, at 2:53:55.000AM, Martin J. Dürst <duerst at it.aoyama.ac.jp> wrote:
> Hello Sean,
> A few quick comments from a cursory reading:
> First, I note that the choice you have made for representing Unicode codepoints seems to be the same that we made for RFC 3987, which is the one that I think RFC 5234 and its predecessors also implicitly suggest. If you have seen some discrepancies, I would appreciate a pointer. You may also want to reference some of the

(It looks like this sentence got cut off?)

The main alternate syntax is direct UTF-8 encoding, where each terminal symbol represents an octet. I did a regular expression search through the RFC series for VCHAR and friends (%x21-7E, %d33-126, permutations thereof, etc.), and saw a lot come up.

See RFC 3629:

   UTF8-octets = *( UTF8-char )
   UTF8-char   = UTF8-1 / UTF8-2 / UTF8-3 / UTF8-4
   UTF8-1      = %x00-7F
   UTF8-2      = %xC2-DF UTF8-tail
   UTF8-3      = %xE0 %xA0-BF UTF8-tail / %xE1-EC 2( UTF8-tail ) /
                 %xED %x80-9F UTF8-tail / %xEE-EF 2( UTF8-tail )
   UTF8-4      = %xF0 %x90-BF 2( UTF8-tail ) / %xF1-F3 3( UTF8-tail ) /
                 %xF4 %x80-8F 2( UTF8-tail )
   UTF8-tail   = %x80-BF

Some RFCs cite RFC 3629’s ABNF directly. Others copy it. Yet others use some variation that boils down to UTF-8 in intent, but is not the modern formulation of UTF-8. For example, I have seen RFCs (mostly pre-2000) that allow up to 6 octets, formatted according to pre-modern UTF-8 (and also, notably, that permit “ill-formed” encodings). Some RFCs are really lazy and say *(%x80-FF) is “UTF-8”.
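To make the gap between these formulations concrete, here is a minimal Python sketch that transcribes the RFC 3629 rules quoted above into a byte-level regular expression and compares it with the lazy *(%x80-FF) reading. The names TAIL, ALTERNATIVES, is_rfc3629, and is_lazy are mine, purely for illustration; nothing here comes from the draft.

```python
import re

# Each entry mirrors one alternative of the RFC 3629 ABNF quoted above.
TAIL = rb"[\x80-\xBF]"                      # UTF8-tail
ALTERNATIVES = [
    rb"[\x00-\x7F]",                        # UTF8-1
    rb"[\xC2-\xDF]" + TAIL,                 # UTF8-2
    rb"\xE0[\xA0-\xBF]" + TAIL,             # UTF8-3
    rb"[\xE1-\xEC]" + TAIL + rb"{2}",
    rb"\xED[\x80-\x9F]" + TAIL,
    rb"[\xEE-\xEF]" + TAIL + rb"{2}",
    rb"\xF0[\x90-\xBF]" + TAIL + rb"{2}",   # UTF8-4
    rb"[\xF1-\xF3]" + TAIL + rb"{3}",
    rb"\xF4[\x80-\x8F]" + TAIL + rb"{2}",
]
UTF8_OCTETS = re.compile(rb"(?:" + rb"|".join(ALTERNATIVES) + rb")*")

# The "lazy" formulation: any run of octets with the high bit set.
LAZY = re.compile(rb"[\x80-\xFF]*")


def is_rfc3629(data: bytes) -> bool:
    """True if data matches UTF8-octets as transcribed above."""
    return UTF8_OCTETS.fullmatch(data) is not None


def is_lazy(data: bytes) -> bool:
    """True if data matches *(%x80-FF)."""
    return LAZY.fullmatch(data) is not None


samples = {
    "u-umlaut": "\u00fc".encode("utf-8"),   # well-formed, 2 octets
    "overlong slash": b"\xc0\xaf",          # ill-formed: overlong encoding
    "surrogate": b"\xed\xa0\x80",           # ill-formed: encodes U+D800
    "past U+10FFFF": b"\xf4\x90\x80\x80",   # ill-formed: out of range
}
for label, data in samples.items():
    print(label, is_rfc3629(data), is_lazy(data))
```

All four samples pass the lazy rule, but only the first one passes the RFC 3629 rules, which is the kind of discrepancy that shows up when comparing RFCs.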

Specifying actual UTF-8 octets is not necessarily “wrong”. Some might argue it is “better” or has certain things to commend it. I could go on with various arguments for and against, but suffice it to say, it’s different.

That style of specifying Unicode is not addressed in draft-seantek-unicode-in-abnf-01. We may well address it in the next draft.

> In the Introduction, you mention security problems, but they are not detailed (no specifics, no examples) there and neither in the Security Considerations section.

I will let Chris Newman chime in on that one. :)

> In contrast to ASCII, Unicode (in any of its encoding forms) essentially introduces multiple levels at which protocols can be described: bytes, [code units (in the case of UTF-16xx),] code points, grapheme clusters,... I'm fine with limiting this document to the code point level, which is clearly what we need now, but it would be good to say somewhere at least that this document doesn't deal with other levels.

Good point; yes that should be mentioned explicitly.

> Starting sections/paragraphs with parentheticals (e.g. "(Consult Section 2.3 of [RFC5234] in relation to this paragraph.)") is far away from good writing. At the minimum, put these parentheticals at the end of the paragraphs, but even better would be to convert them to actual text (in most cases still at the end of the paragraphs) and say explicitly what the "relation" is. (RFC 7405 looks much better in this respect.)


Basically, that part was trying to get at the relationship between “terminal values” (which are just non-negative integers in ABNF land) and “external encodings” (bits and bytes on the wire). When dealing with an “ASCII”-oriented protocol, an assumption crops up that we are dealing in 8-bit bytes/code units and that the high-order bit is always 0 when it’s an ASCII encoding. This is because the 8-bit byte “won” the format wars, but that was not always the case. When dealing with Unicode code points and characters, 8-bit byte != code unit. This is related to the UTF-8 issue above.
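As a rough illustration of the terminal value versus external encoding distinction, the Python sketch below takes one terminal value (a code point, i.e. a non-negative integer) and shows the different octet sequences it turns into on the wire. The variable names and the choice of encodings are my own, picked only for the example.

```python
# One ABNF terminal value: the integer 0xFC (U+00FC, the "ü" in "Dürst").
code_point = 0x00FC
character = chr(code_point)

# The same terminal value under several external encodings.
encodings = {
    name: character.encode(name)
    for name in ("latin-1", "utf-8", "utf-16-be", "utf-32-be")
}
for name, octets in encodings.items():
    print(f"{name:10} {octets.hex(' ')}")

# For an ASCII terminal value, the single octet always has its
# high-order bit clear, which is the assumption described above.
ascii_octet = "A".encode("ascii")[0]
print(ascii_octet & 0x80 == 0)
```

The grammar only ever sees the integer 0xFC; whether that is one, two, or four octets is a property of the encoding, not of the ABNF.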

> In the appendix, there are a lot of monstrosities such as "UVCHARBEYONDLATIN1". Why not change that to something a bit more readable, at the minimum something like UV_CHAR_BEYOND_LATIN_1 or so?

I don’t know about “monstrosities”; how about “gnarly” ? 🤘✌️🏄

It’s worth a discussion.

The low line _ is not a valid ABNF rule name character. Only ALPHA, DIGIT, and - are permitted. One could use - I suppose.
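For reference, RFC 5234 defines rulename as ALPHA *(ALPHA / DIGIT / "-"). A quick Python sketch of that check follows; the function name is_valid_rulename is hypothetical, made up for this example.

```python
import re

# rulename = ALPHA *(ALPHA / DIGIT / "-")   (RFC 5234, Section 4)
RULENAME = re.compile(r"[A-Za-z][A-Za-z0-9-]*")


def is_valid_rulename(name: str) -> bool:
    """True if name is a syntactically valid ABNF rule name."""
    return RULENAME.fullmatch(name) is not None


for candidate in (
    "UVCHARBEYONDLATIN1",
    "UV_CHAR_BEYOND_LATIN_1",
    "UV-CHAR-BEYOND-LATIN-1",
):
    print(candidate, is_valid_rulename(candidate))
```

The underscore variant is rejected; the hyphenated variant is accepted.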

Appendix A is a set of Core Rules, just like RFC 5234 Appendix B.1. The aesthetic preference for ABNF’s Core Rules appears to be SHORTALLCAPSWITHNODELIMITERS. Examples: CHAR, CRLF, DQUOTE, HEXDIG, LWSP, VCHAR, WSP.

Basically, draft-01’s Appendix A tries to preserve that aesthetic. Hence: ASCII -> UNICODE, VCHAR -> UVCHAR, CHAR -> UCHAR, etc. The new thing is specifying ranges beyond the “Basic” ranges of %x00-7F (ASCII), %x00-9F (C1 controls), %x00-FF (Latin-1), and %x00-FFFF (BMP).
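The nesting of those “Basic” ranges can be sketched in Python as below. The labels and the function smallest_basic_range are placeholders of my own, not rule names from the draft; only the numeric bounds come from the ranges listed above, plus U+10FFFF as the top of the Unicode code space.

```python
# Upper bounds of the nested ranges, smallest first.
# Labels are illustrative only, not rule names from the draft.
BASIC_RANGES = [
    (0x7F, "ASCII"),
    (0x9F, "through C1 controls"),
    (0xFF, "Latin-1"),
    (0xFFFF, "BMP"),
    (0x10FFFF, "beyond BMP"),
]


def smallest_basic_range(code_point: int) -> str:
    """Return the label of the smallest listed range containing code_point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    for upper, label in BASIC_RANGES:
        if code_point <= upper:
            return label
    raise AssertionError("unreachable")


for cp in (0x41, 0x85, 0xFC, 0x3042, 0x1F596):
    print(f"U+{cp:04X}", smallest_basic_range(cp))
```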

I also admit that I was in a Star Trek mood. 🖖 (But I did have a discussion about “Beyond” sometime last year on the Unicode mailing list.)

> I don't see the point of defining aliases for C1 controls; it should be difficult to use these explicitly, not easy.

Basically I respectfully disagree about this.

Some people love to hate on the control characters, but people and protocols do use them. For some protocols (JSON-text-sequence), a control character (RS) turns out to be the natural choice. The same could be said about sentinel characters (aka non-characters). Judicious use of unused code points in a protocol means you don’t have to escape or quote, which means less processing power and less risk of security bugs due to improper escaping or quoting.
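To illustrate the RS point, here is a small Python sketch of the framing used by JSON text sequences (RFC 7464), where each JSON text is preceded by RS (%x1E) and followed by LF. The function names are my own, and this is a simplified sketch rather than a full RFC 7464 parser (it does no recovery from truncated texts).

```python
import json

RS = "\x1e"  # RECORD SEPARATOR, the framing character of RFC 7464


def encode_sequence(values) -> str:
    """Frame each value as RS JSON-text LF."""
    return "".join(RS + json.dumps(value) + "\n" for value in values)


def decode_sequence(text: str) -> list:
    """Split on RS and parse each non-empty chunk as a JSON text."""
    return [json.loads(chunk) for chunk in text.split(RS) if chunk.strip()]


values = [{"a": 1}, "contains \x1e inside a string", [1, 2, 3]]
framed = encode_sequence(values)

# JSON requires control characters inside strings to be escaped, so a raw
# RS never appears within a JSON text and splitting on it is unambiguous.
print(framed.count(RS))
print(decode_sequence(framed) == values)
```

No quoting or escaping layer is needed for the framing itself, because the delimiter cannot occur unescaped in the payload.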

The C1 controls are not used nearly as much; however, that is all the more reason to use them for out-of-band purposes (as RS is used in JSON text sequence). Another way to put it is: “why would anyone define an ABNF rule <SS2> that means anything other than SINGLE SHIFT 2”? Essentially this draft makes it a point to discourage that sort of thing, by having a common language.

We can discuss whether it makes sense to define a larger number of rules for the undisplayable characters. Beyond C1 (compare with C0 in draft-seantek-abnf-more-core-rules), the only ones I/we defined are NBSP (obvious), SHY (obvious), and LS and PS (obvious because SP and HT / HTAB are defined). I think that is a good number. For line-oriented protocols, PS should never appear, hence another good reason to call it out.

> For some of the aliases, a property-based approach seems to be the right thing to do, although this may be difficult to align with the ABNF straightjacket.


> The draft says:
>   Formally, this document updates [RFC5234] but does not modify it in
>   situ. Authors need to reference this document if they want to include
>   these enhancements; bare references to [RFC5234] do not include this
>   specification (or, for that matter, [RFC7405]).
> There's no text whatsoever in RFC 7405 that would say that it doesn't update RFC 5234 directly. But I may be missing something. Please clarify.

I understand that there is some kind of crazy thread on ietf at ietf.org about “what is the meaning of an RFC Update?” that I am not participating in because I am not presently subscribed to that list. Whatever comes out of that conversation is probably relevant to that text.

I think the point is that if an RFC references [RFC7405], then %s"foo" and %i"foo" are fair game in the ABNF. But if it only references [RFC5234], then that syntax is not supposed to appear. Ditto here (in intent).
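For readers unfamiliar with the RFC 7405 notation, the Python sketch below models how the three literal forms compare against input: %s"foo" is case-sensitive, while %i"foo" and a bare "foo" (per RFC 5234) are case-insensitive. The helper names literal_matches and ascii_fold are hypothetical, and the sketch only handles quoted-string literals.

```python
import re

# Optional %s / %i prefix, then a quoted string of %x20-21 / %x23-7E.
LITERAL = re.compile(r'(%[si])?"([\x20-\x21\x23-\x7E]*)"')

# Fold only A-Z, so that non-ASCII characters such as U+212A KELVIN SIGN
# are not treated as equal to ASCII letters.
_ASCII_FOLD = str.maketrans(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ", "abcdefghijklmnopqrstuvwxyz"
)


def ascii_fold(text: str) -> str:
    return text.translate(_ASCII_FOLD)


def literal_matches(abnf_literal: str, text: str) -> bool:
    """True if text matches the given ABNF quoted-string literal."""
    m = LITERAL.fullmatch(abnf_literal)
    if m is None:
        raise ValueError(f"not a quoted-string literal: {abnf_literal!r}")
    prefix, body = m.groups()
    if prefix == "%s":
        return text == body                      # case-sensitive
    return ascii_fold(text) == ascii_fold(body)  # %i or bare string


for literal in ('%s"foo"', '%i"foo"', '"foo"'):
    print(literal, literal_matches(literal, "FOO"))
```

A parser that only implements RFC 5234 would reject the first two literals outright, which is why the reference an RFC cites matters.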

> Section 6 uses an example with actual Unicode characters. I'd definitely wait for the new way of publishing drafts/RFCs before the final publication of this document, so that this example (and hopefully a few more) can use actual Unicode characters.

Yes, that’s fine.

> (I'd also change 'notated' to 'annotated'. (several occurrences))

Okay; I will take a look.

Best regards,


(%su"foo" issue in next reply)

> That's about it, hope it helps.
> Regards,   Martin.
> On 2016/10/03 15:28, Sean Leonard wrote:
>> Dear ABNF-Discuss (and rfc-interest):
>> This draft by Chris Newman and me addresses an interesting topic: how to
>> do Unicode in ABNF. Unicode has shown up in several different ways in
>> protocols that are described in ABNF. These ways are not consistent
>> across the RFC series, but now that Unicode is a pretty stable standard
>> (for its basic parts) and now that UTF-8 RFCs are becoming a reality per
>> draft-iab-rfc-nonascii-02, it is a good time to look at this issue. This
>> is a fork from draft-seantek-abnf-more-core-rules.
>> This draft is currently proposed as Experimental. Special thanks to Paul
>> Kyzivat for discussing the matters in this draft, although he is not
>> formally a co-author.
>> The draft tries to be very conservative in its approach. Please read the
>> draft for details. Some stuff was intentionally omitted as out-of-scope
>> or too complicated for a general-purpose ABNF syntax parser, whether
>> humans or machines.
>> Comments and feedback are appreciated.
>> Regards,
>> Sean
>> ********
>> A new version of I-D, draft-seantek-unicode-in-abnf-01.txt
>> has been successfully submitted by Sean Leonard and posted to the
>> IETF repository.
>> Name:        draft-seantek-unicode-in-abnf
>> Revision:    01
>> Title:        Unicode in ABNF
>> Document date:    2016-10-01
>> Group:        Individual Submission
>> Pages:        11
>> URL:
>> https://www.ietf.org/internet-drafts/draft-seantek-unicode-in-abnf-01.txt
>> Status:
>> https://datatracker.ietf.org/doc/draft-seantek-unicode-in-abnf/
>> Htmlized:
>> https://tools.ietf.org/html/draft-seantek-unicode-in-abnf-01
>> Diff:
>> https://www.ietf.org/rfcdiff?url2=draft-seantek-unicode-in-abnf-01
>> Abstract:
>>   This experimental document adds support for Unicode strings in ABNF
>>   (Augmented Backus-Naur Form), and provides certain symbols related to
>>   Unicode code point ranges.
>> _______________________________________________
>> rfc-interest mailing list
>> rfc-interest at rfc-editor.org
>> https://www.rfc-editor.org/mailman/listinfo/rfc-interest
>> .
> -- 
> Martin J. Dürst
> Department of Intelligent Information Technology
> College of Science and Engineering
> Aoyama Gakuin University
> Fuchinobe 5-1-10, Chuo-ku, Sagamihara
> 252-5258 Japan
