[rfc-i] Unicode in ABNF (in RFC) draft-seantek-unicode-in-abnf-01.txt

Martin J. Dürst duerst at it.aoyama.ac.jp
Mon Oct 3 21:05:36 PDT 2016

Hello Sean,

On 2016/10/04 08:44, Sean Leonard wrote:
> Thanks for checking it out!
>> On Oct 3, 2016, at 2:53:55.000AM, Martin J. Dürst <duerst at it.aoyama.ac.jp> wrote:
>> Hello Sean,
>> A few quick comments from a cursory reading:
>> First, I note that the choice you have made for representing Unicode codepoints seems to be the same that we made for RFC 3987, which is the one that I think RFC 5234 and its predecessors also implicitly suggest. If you have seen some discrepancies, I would appreciate a pointer. You may also want to reference some of the
> ? got cut off?

Yes, I meant to say that you might want to cite some of these RFCs as 

> The approach to specify actual UTF-8 octets is not necessarily “wrong”. Some might argue it is “better” or has certain things to commend to it. I could go on with various arguments for and against, but suffice to say, it’s different.

I agree with "not necessarily wrong". I think the main point is that it 
may be appropriate in a protocol that's otherwise just binary.

The main problem is not only that one has to repeat the definition of 
UTF-8 in terms of bytes, but that it becomes more and more difficult to 
specify various categories of characters.
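
As an illustration of that point (not taken from the draft), here is a minimal Python sketch showing how a single code point range has to be split once it is written in terms of UTF-8 octets. The helper name utf8_lead_bytes is hypothetical and exists only for this example.

```python
def utf8_lead_bytes(first, last):
    """Return the distinct UTF-8 lead bytes (as hex strings) used by the
    code points first..last, inclusive."""
    leads = sorted({chr(cp).encode("utf-8")[0] for cp in range(first, last + 1)})
    return ["{:02X}".format(b) for b in leads]

# At the code point level, one ABNF range is enough:   %xA0-FF
# At the byte level, the same set needs two alternatives:
#   %xC2 %xA0-BF / %xC3 %x80-BF
print(utf8_lead_bytes(0xA0, 0xFF))  # ['C2', 'C3']
```

Every additional character category multiplies this kind of splitting, which is the growing difficulty described above.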

> That style of specifying Unicode is not addressed in draft-seantek-unicode-in-abnf-01. We may well address it in the next draft.

If you mean that you point to the advantages and disadvantages of such a 
style, I'm all for it. But I don't think you need to go any further.

>> In contrast to ASCII, Unicode (in any of its encoding forms) essentially introduces multiple levels at which protocols can be described: bytes, [code units (in the case of UTF-16xx),] code points, grapheme clusters,... I'm fine with limiting this document to the code point level, which is clearly what we need now, but it would be good to say somewhere at least that this document doesn't deal with other levels.
> Good point; yes that should be mentioned explicitly.
>> Starting sections/paragraphs with parentheticals (e.g. "(Consult Section 2.3 of [RFC5234] in relation to this paragraph.)") is far from good writing. At the minimum, put these parentheticals at the end of the paragraphs, but even better would be to convert them to actual text (in most cases still at the end of the paragraphs) and say explicitly what the "relation" is. (RFC 7405 looks much better in this respect.)
> Okay.
> Basically that part was trying to get at the relationship between “terminal values” (which are just non-negative integers in ABNF land) and “external encodings” (bits and bytes on the wire). An assumption crops up, when dealing with an “ASCII”-oriented protocol, that we are dealing in 8-bit bytes/code units and that the high-order bit is always 0 when it’s an ASCII encoding. This is because the 8-bit byte “won” in the format wars, but it was not always the case. When dealing with Unicode code points and characters, 8-bit byte != code unit. This is related to the UTF-8 issue above.

I didn't check what the relationship was in particular, but rather than 
just point to a section in RFC 5234, it might be better to say 
explicitly what the relationship is.
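
To make the different levels concrete (bytes, code units, code points), here is a small Python sketch. It is only an illustration, the function name lengths is hypothetical, and grapheme clusters are mentioned only in a comment because the standard library does not segment them.

```python
def lengths(text):
    """Count the same string at three of the levels discussed above."""
    return {
        "utf8_bytes": len(text.encode("utf-8")),
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,
        "code_points": len(text),
    }

print(lengths("A"))           # 1 byte, 1 code unit, 1 code point
print(lengths("\u00E9"))      # 2 bytes, 1 code unit, 1 code point
print(lengths("e\u0301"))     # 3 bytes, 2 code units, 2 code points
                              # (but a single grapheme cluster)
print(lengths("\U0001F596"))  # 4 bytes, 2 code units, 1 code point
```

Only for ASCII do all of these counts coincide, which is why the "8-bit byte = character" assumption is easy to make there.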

>> In the appendix, there are a lot of monstrosities such as "UVCHARBEYONDLATIN1". Why not change that to something a bit more readable, at the minimum something like UV_CHAR_BEYOND_LATIN_1 or so?
> I don’t know about “monstrosities”; how about “gnarly” ? 🤘✌️🏄
> It’s worth a discussion.
> The low line _ is not a valid ABNF rule name character. Only ALPHA, DIGIT, and - are permitted. One could use - I suppose.

If the '_' is prohibited, but '-' is okay, then please use '-'.
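
For reference, RFC 5234 defines rulename = ALPHA *(ALPHA / DIGIT / "-"). A minimal Python sketch of that check follows; the names RULENAME and is_valid_rulename are made up for this example.

```python
import re

# rulename = ALPHA *(ALPHA / DIGIT / "-")   (RFC 5234, Section 4)
RULENAME = re.compile(r"[A-Za-z][A-Za-z0-9-]*")

def is_valid_rulename(name):
    """Return True if name is a syntactically valid ABNF rule name."""
    return RULENAME.fullmatch(name) is not None

print(is_valid_rulename("UVCHARBEYONDLATIN1"))      # True
print(is_valid_rulename("UV_CHAR_BEYOND_LATIN_1"))  # False: "_" not allowed
print(is_valid_rulename("UV-CHAR-BEYOND-LATIN-1"))  # True
```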

> Appendix A is a set of Core Rules, just like RFC 5234 Appendix B.1. The aesthetic preference for ABNF’s Core Rules appears to be SHORTALLCAPSWITHNODELIMITERS. Examples: CHAR, CRLF, DQUOTE, HEXDIG, LWSP, VCHAR, WSP.

That "aesthetic" preference makes sense for the actual examples because 
they are extremely short. But it makes less and less sense the longer 
these become.

> Basically, draft-01’s Appendix A tries to preserve that aesthetic. Hence: ASCII -> UNICODE, VCHAR -> UVCHAR, CHAR -> UCHAR, etc. The new thing is specifying ranges beyond the “Basic” ranges of %x00-7F (ASCII) %x00-9F (C1 controls), %x00-FF (Latin-1), and %x00-FFFF (BMP).

For UCHAR, and maybe even for UVCHAR, it should work. But for longer 
ones, the eyes and the brains of the readers would definitely appreciate 
some help with parsing.

> I also admit that I was in a Star Trek mood. 🖖 (But I did have a discussion about “Beyond” sometime last year on the Unicode mailing list.)

As long as the BEYOND is clearly separated and thus easily recognizable, 
I'm fine with it.

>> I don't see the point of defining aliases for C1 controls; it should be difficult to use these explicitly, not easy.
> Basically I respectfully disagree about this.
> Some people love to hate on the control characters, but people and protocols do use them.

Yes, but how often?

> For some protocols (JSON-text-sequence), a control character (RS) turns out to be the natural choice.

Yes, but how many similar protocols are there?
(I already think that JSON-text-sequence was overkill; it could have 
been done as a zip file or as a JSON array. But that's not the point here.)
We should have actual usage examples (and not just one or two) to 
support listing all these definitions. A format that decides it needs RS 
can always just define it locally at no additional cost to the whole 

> The same could be said about sentinel characters (aka non-characters).

Non-characters may be used internally in an application. So NEVER use 
them in an actual protocol.

> Judicious use of unused code points in a protocol,

Unused code points may be assigned sooner or later, or may have been set 
aside for a specific purpose. If not, then they may still be used by 
somebody already. Using them as separators in the hope that nobody else 
uses them is just betting on unlucky circumstances.

> means you don’t have to escape or quote, which means less processing power and less risk of security bugs due to improper escaping or quoting.

Improper escaping is a problem that can be fixed. Missing escaping when 
it may still be needed is more difficult to fix, so we had better not 
start down that path.

> The C1 controls are not used nearly as much; however, that is all the more reason to use them for out-of-band purposes (as RS is used in JSON text sequence). Another way to put it is: “why would anyone define an ABNF rule <SS2> that means anything other than SINGLE SHIFT 2”? Essentially this draft makes it a point to discourage that sort of thing, by having a common language.

The main point is not "why would anybody use SS2 for anything other than 
SINGLE SHIFT 2?". The main point is "if anybody ever gets the idea of 
using SS2, then let that be their problem, let's not pretend we suggest 
using SS2 by giving it an alias." Unicode already lists SS2 for U+008E. 
Everybody interested in knowing that can easily find it. The common 
language is already there. We don't need to repeat it for the odd case 
that somebody eventually may use it.

By the way, "gov't health warning: figment" is difficult to understand; 
it would be better to be more explicit.

> We can discuss whether it makes sense to define a larger number of rules for the undisplayable characters. Beyond C1 (compare with C0 in draft-seantek-abnf-more-core-rules), the only ones I/we defined are NBSP (obvious), SHY (obvious), and LS and PS (obvious because SP and HT / HTAB are defined). I think that is a good number. For line-oriented protocols, PS should never appear, hence another good reason to call it out.

Why don't you just mention a convention for converting Unicode names 
into ABNF aliases, and leave it to the specifications that need such 
aliases to define them?
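
One possible shape for such a convention, sketched in Python purely as an assumption (neither the draft nor this message specifies one): take the Unicode character name and replace spaces with hyphens, which stays within the ALPHA / DIGIT / "-" alphabet of ABNF rule names. The function name abnf_alias is hypothetical.

```python
import unicodedata

def abnf_alias(cp):
    """Build an ABNF rule line from a code point's Unicode character name.

    Assumed convention: spaces in the name become hyphens.
    Raises ValueError for code points without a character name,
    such as the C0 and C1 controls.
    """
    name = unicodedata.name(chr(cp))
    return "{} = %x{:X}".format(name.replace(" ", "-"), cp)

print(abnf_alias(0x2028))  # LINE-SEPARATOR = %x2028
print(abnf_alias(0x00A0))  # NO-BREAK-SPACE = %xA0
```

A specification that needs, say, LINE SEPARATOR could then define just that one rule locally.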

>> For some of the aliases, a property-based approach seems to be the right thing to do, although this may be difficult to align with the ABNF straitjacket.
> Correct.
>> The draft says:
>>   Formally, this document updates [RFC5234] but does not modify it in
>>   situ. Authors need to reference this document if they want to include
>>   these enhancements; bare references to [RFC5234] do not include this
>>   specification (or, for that matter, [RFC7405]).
>> There's no text whatsoever in RFC 7405 that would say that it doesn't update RFC 5234 directly. But I may be missing something. Please clarify.
> I understand that there is some kind of crazy thread on ietf at ietf.org about “what is the meaning of an RFC Update?” that I am not participating in because I am not presently subscribed to that list. Whatever comes out of that conversation is probably relevant to that text.

I see. Let's wait for that, then.

Regards,   Martin.

> I think the point is that if an RFC references [RFC7405], then %s"foo" and %i"foo" are fair game in the ABNF. But if it only references [RFC5234], then that syntax is not supposed to appear. Ditto here (in intent).
>> Section 6 uses an example with actual Unicode characters. I'd definitely wait for the new way of publishing drafts/RFCs before the final publication of this document, so that this example (and hopefully a few more) can use actual Unicode characters.
> Yes, that’s fine.
>> (I'd also change 'notated' to 'annotated'. (several occurrences))
> Okay; I will take a look.
> Best regards,
> Sean
> (%su"foo" issue in next reply)
