[rfc-i] [Json] v3imp #8 Fragment tagging on sourcecode

Sean Leonard dev+ietf at seantek.com
Sat Jan 31 07:13:16 PST 2015


On 1/30/2015 3:32 PM, Bjoern Hoehrmann wrote:
> * Sean Leonard wrote:
>> On 1/28/2015 3:02 PM, Nico Williams wrote:
>>>> As a broader useful point (not directed specifically to
>>>> draft-ietf-json-text-sequence-13), I think it would be nice if
>>>> future RFC ABNF can assume that the symbols in RFC 20 for %x00-20 /
>>>> %x7F are rules that can be used as-is. Hence NUL = %x00, SUB = %x1A,
>>>> DEL = %x7F, etc.
>>> I agree.  These should be added to RFC5234 (i.e., we should publish an
>>> update RFC listing them).
>> I would be willing to write up/work on a concise RFC (Internet-Draft)
>> that does this, prior to IETF 92. I would not mind adding a few
>> additional rules to the Core Rules if there is near-unanimous support
>> for such rules. (Hard to think of any in particular, except maybe some
>> generic UTF8 character rules. Last I checked, ABNF is still
>> octet-oriented and US-ASCII focused.)
> It is integer-oriented, you have to define the alphabet in prose and are
> free to say it's "octets" or "Unicode scalar values" or whatever else.

OK. I was not aware of that; however, it is stated clearly enough in 
RFC 5234, Section 2.3.

Many specifications use ABNF as octets, and define UTF-8 data in terms 
of octets. E.g., LDAP [RFC4512]; see "UTFMB". It makes equal sense (in 
some contexts, perhaps better sense) to use ABNF in integer form for 
integers 0 - 0x10FFFF minus the surrogates 0xD800 - 0xDFFF, i.e., 
Unicode scalar values.
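
To make the difference concrete, here is a small Python sketch (an 
illustration only, nothing from RFC 5234 itself; the function names are 
made up for this example). It shows the same character producing 
different terminal values depending on which alphabet the prose picks.

```python
def as_octets(text: str) -> list[int]:
    """Terminal values when the alphabet is defined as UTF-8 octets."""
    return list(text.encode("utf-8"))


def as_scalar_values(text: str) -> list[int]:
    """Terminal values when the alphabet is Unicode scalar values."""
    return [ord(ch) for ch in text]


def abnf_terminal(values: list[int]) -> str:
    """Render integers as an ABNF hexadecimal terminal, e.g. %xC3.A9."""
    return "%x" + ".".join(f"{v:X}" for v in values)


ch = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE
print(abnf_terminal(as_octets(ch)))         # %xC3.A9
print(abnf_terminal(as_scalar_values(ch)))  # %xE9
```

A rule written for one alphabet therefore does not match the same data 
under the other, even though both are valid ABNF.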

However, those folks who want to define formal import syntaxes will 
have to grapple with this issue--namely, that integer values do not 
represent the same thing across different specifications. I see this as 
another reason not to pursue import syntaxes.

> I do not think an RFC that updates RFC 5234 just to add names for three
> symbols is a good idea.

*Only* the following symbols are being proposed:
NUL = %d0
SOH = %d1
STX = %d2
ETX = %d3
EOT = %d4
ENQ = %d5
ACK = %d6
BEL = %d7
BS = %d8
HT = %d9 ; also defined as HTAB
LF = %d10 ; already defined
VT = %d11
FF = %d12 ; (literally used in every RFC)
CR = %d13 ; already defined
SO = %d14
SI = %d15
DLE = %d16
DC1 = %d17
DC2 = %d18
DC3 = %d19
DC4 = %d20
NAK = %d21
SYN = %d22
ETB = %d23
CAN = %d24
EM = %d25
SUB = %d26
ESC = %d27
FS = %d28
GS = %d29
RS = %d30
US = %d31
SP = %d32 ; already defined
DEL = %d127

These are all taken from RFC 20, and (mercifully) have the same symbol 
names in Unicode.
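
Because the names simply follow code order, the whole list can be 
generated rather than typed. A minimal Python sketch, assuming the 
names and values exactly as listed above (RULES and abnf_lines are 
hypothetical names for this example):

```python
# Names in code order 0..32, as in the list above (taken from RFC 20).
C0_AND_SP = (
    "NUL SOH STX ETX EOT ENQ ACK BEL BS HT LF VT FF CR SO SI "
    "DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US SP"
).split()

# Hypothetical table: rule name -> integer value.
RULES = {name: value for value, name in enumerate(C0_AND_SP)}
RULES["DEL"] = 127  # the one symbol outside the contiguous 0..32 range


def abnf_lines(rules: dict[str, int]) -> list[str]:
    """Render each entry as an ABNF rule using a decimal terminal."""
    return [f"{name} = %d{value}" for name, value in rules.items()]


print("\n".join(abnf_lines(RULES)))
```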

> It's very rare that the particular characters
> are useful,

NUL is used more often than you (or many others) might care to admit.

draft-ietf-json-text-sequence-13 makes use of RS, and that usage is very 
appropriate given the engineering problem (that the WG has probably 
already explored).
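
For readers who have not looked at the draft: each JSON text is 
preceded by RS and followed by LF. The Python sketch below is a 
simplified illustration of that framing, not a conforming 
implementation; it ignores the draft's handling of truncated or 
malformed texts, and the function names are assumptions of this example.

```python
import json

RS = b"\x1e"  # record separator, %d30
LF = b"\x0a"  # line feed, %d10


def encode_sequence(values) -> bytes:
    """Frame each value as RS + JSON text + LF."""
    return b"".join(RS + json.dumps(v).encode("utf-8") + LF for v in values)


def decode_sequence(data: bytes) -> list:
    """Split on RS and parse every non-empty chunk as a JSON text."""
    out = []
    for chunk in data.split(RS):
        if chunk.strip():  # skip the empty chunk before the first RS
            out.append(json.loads(chunk.decode("utf-8")))
    return out


print(encode_sequence([{"a": 1}, 2]))
```

RS works as a separator because JSON requires control characters inside 
strings to be escaped, so a raw 0x1E never occurs within a JSON text.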

ESC is used in ANSI escape codes, which form an integral part of a 
variety of standards (maybe few IETF standards, but ISO and 
industry-specific standards such as those in the healthcare industry) 
and are used in modern command-lines and consoles.
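
As a tiny illustration of the console case, here is a Python sketch 
that wraps text in the SGR "bold" and "reset" sequences. The helper 
name bold is made up for this example.

```python
ESC = "\x1b"     # %d27
CSI = ESC + "["  # Control Sequence Introducer


def bold(text: str) -> str:
    """Wrap text in SGR 1 (bold) and SGR 0 (reset)."""
    return f"{CSI}1m{text}{CSI}0m"


print(repr(bold("hi")))  # '\x1b[1mhi\x1b[0m'
```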

SO and SI are used in ISO 2022 character sets.

There are probably a number of historical examples.

>   and ordinarily you will have to define other rules on your
> own aswell;

Well yes, but now you will not need to define as many 
rules--particularly rules for unprintable characters.

I note also that XML 1.0 prohibits every character in the list above 
except HT, LF, CR, SP, and DEL. Why the standard prohibits FF and BS, 
but lets DEL in, is rather mysterious to me. However, if you want to 
represent these characters for their semantics, say, in xml2rfc V3, the 
rule name RS is possible but the actual character 0x1E is a harder sell.
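
The XML 1.0 restriction comes from its Char production, which can be 
checked mechanically. A minimal Python sketch (the function name is an 
assumption of this example):

```python
def is_xml10_char(cp: int) -> bool:
    """True if the code point matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


# Test every symbol from the proposed list: 0..32 plus DEL.
allowed = [cp for cp in list(range(0x21)) + [0x7F] if is_xml10_char(cp)]
print(allowed)  # [9, 10, 13, 32, 127]
```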

XML 1.1 permits all C0 control characters except NUL (most of them only 
as character references). Since NUL can therefore never occur inside an 
XML 1.1 document, an XML fragment streaming format akin to 
draft-ietf-json-text-sequence could easily use NUL to separate records 
on the wire.
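
To be clear, no such format exists; the Python sketch below only shows 
how little machinery NUL framing would need. The frame and unframe 
names are invented for this example, and ElementTree is used merely as 
a convenient parser for the test fragments.

```python
import xml.etree.ElementTree as ET

NUL = b"\x00"  # %d0


def frame(fragments) -> bytes:
    """Terminate each serialized XML fragment with a NUL octet."""
    return b"".join(fragment + NUL for fragment in fragments)


def unframe(stream: bytes) -> list:
    """Split on NUL and parse every non-empty record."""
    return [ET.fromstring(chunk) for chunk in stream.split(NUL) if chunk]


records = unframe(frame([b"<a>1</a>", b"<b x='2'/>"]))
print([r.tag for r in records])  # ['a', 'b']
```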

>   on the software side you would then have ABNF tools that do
> not know the new Core rules, I would have to update my Parse::ABNF Perl
> module for instance,

"lazy"? :) Laziness is not a valid reason to prohibit new work...

For some of these rules (e.g., DC1) perhaps the principal reason to 
include them in the Core Rules is to discourage future spec writers from 
using that rule name to define something else.

>   and there is the issue that some specifications may
> already be using the symbol names, and dealing with clashes with Core
> rules is a bit nasty.

As with the "lazy" objection, there may be clashes with the Core Rules, 
but that can be taken as evidence that the Core Rules should have been 
extended (or defined) earlier--not as evidence to prohibit new work. As 
it is, if you use a Core Rule (as draft-josefsson-pkix-textual does), 
you can't take the reference for granted; the IESG will ask you to 
reference [RFC5234].

The only spec that I am aware of that would cause problems is LDAP's RFC 
4512, which defines ESC = %x5C ("\" backslash). However, RFC 4512 also 
defines a lot of completely unnecessary rule names (NULL, SPACE, 
DOLLAR, USCORE, EQUALS, DOT), restates at least one Core Rule (DQUOTE), 
and has a bona fide conflict with existing Core Rules (SP and WSP). 
Since RFC 5234 was defined after 4512, this is evidence in support of 
standardization for future RFCs.
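
Checking a spec for such clashes is easy to automate. The Python sketch 
below compares rule names only (ABNF rule names are case-insensitive), 
so it flags a harmless restatement such as DQUOTE the same way as a 
real conflict. CORE_RULES is the list from RFC 5234 Appendix B.1, 
PROPOSED is the list from earlier in this message, and RFC4512_NAMES 
holds only the RFC 4512 names mentioned above, not that document's full 
grammar.

```python
CORE_RULES = {
    "ALPHA", "BIT", "CHAR", "CR", "CRLF", "CTL", "DIGIT", "DQUOTE",
    "HEXDIG", "HTAB", "LF", "LWSP", "OCTET", "SP", "VCHAR", "WSP",
}

PROPOSED = set(
    "NUL SOH STX ETX EOT ENQ ACK BEL BS HT LF VT FF CR SO SI "
    "DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US SP DEL".split()
)

RFC4512_NAMES = [
    "ESC", "NULL", "SPACE", "DOLLAR", "USCORE", "EQUALS", "DOT",
    "DQUOTE", "SP", "WSP",
]


def clashes(spec_names, reserved) -> list[str]:
    """Return the spec's rule names that collide with reserved names."""
    upper = {name.upper() for name in reserved}
    return sorted(name for name in spec_names if name.upper() in upper)


print(clashes(RFC4512_NAMES, CORE_RULES))             # ['DQUOTE', 'SP', 'WSP']
print(clashes(RFC4512_NAMES, CORE_RULES | PROPOSED))  # ['DQUOTE', 'ESC', 'SP', 'WSP']
```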

Cheers,

Sean


