[rfc-i] Re: ABNF (RFC2234) vs HTTP's augmented BNF syntax (RFC822 + RFC2616)

Bruce Lilly blilly at erols.com
Tue Feb 15 07:39:58 PST 2005


On Mon February 14 2005 13:05, Keith Moore wrote:

> not clear.
[...]
> The problem with 822's syntax isn't that it uses the implied LWS rule,

It's certainly not the only problem, and if all developers were
careful to observe the implications, it wouldn't be a practical
problem.  However, it has proven to be a problem in practice;
ignoring the issues with numeric components vs. month names for
the moment, whitespace around the colons in the time has been
an issue overlooked by some developers, leading to interoperability
problems (and colon is an RFC 822 "special", so there's no "atom"
excuse).  Note that RFC 2822 is clear that for generation there
should be no comments, whitespace, or line folding around the
colons (to avoid tickling bugs in some deployed parsers), while
parsers are required (via 2822's obs- rules) to recognize CFWS
around the colons (for compatibility with RFC 822-legal content
in archived messages).
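
To make the generate/parse split concrete, here is a minimal sketch in
Python.  It is not taken from RFC 2822; the names STRICT_TIME,
LENIENT_TIME, parse_time and generate_time are made up for
illustration, and the lenient form handles only space and tab around
the colons (comments and folded lines, which the obs- rules also
permit, are left out).

```python
import re

# Generation form: no whitespace around the colons.
STRICT_TIME = re.compile(r"([0-9]{2}):([0-9]{2})(?::([0-9]{2}))?")

# Parsing form (simplified assumption): optional space or tab around
# each component.  Comments and line folding are NOT handled here.
_WS = r"[ \t]*"
LENIENT_TIME = re.compile(
    _WS + r"([0-9]{2})" + _WS + ":" + _WS + r"([0-9]{2})" + _WS
    + r"(?::" + _WS + r"([0-9]{2})" + _WS + r")?"
)

def parse_time(text):
    """Accept the lenient form; return (hour, minute, second or None)."""
    m = LENIENT_TIME.fullmatch(text)
    if m is None:
        return None
    hour, minute, second = m.groups()
    return (int(hour), int(minute),
            int(second) if second is not None else None)

def generate_time(hour, minute, second=None):
    """Emit only the strict form, with nothing around the colons."""
    out = "%02d:%02d" % (hour, minute)
    if second is not None:
        out += ":%02d" % second
    return out
```

A parser built this way accepts "07 : 39 :58" while a generator never
emits it, which is the asymmetry described above.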

Whitespace, comments, and line-folding around the special '.'
in domain names is another practical interoperability problem
related to the 822 implicit rules that has been clarified by RFC
2822.  If you want to go back to 822's predecessor, RFC 733,
things were worse still; it permitted whitespace within atoms,
so
   at at at at at at at
could be
   at at atatatatat
or
   atat at atatatat
or
   atatat at atatat
etc.  That combination of permitting whitespace within atoms with
special treatment of " at " wasn't merely a recipe for
interoperability problems, it was a fundamental unresolvable
ambiguity in the specification.
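
The ambiguity can be enumerated mechanically.  The following Python
sketch uses a deliberately simplified model (an assumption for
illustration, not the RFC 733 grammar): words are separated by
whitespace, any interior word "at" may be the separator, and the
remaining words on each side are joined into one atom.  The function
name readings is hypothetical.

```python
def readings(words):
    """Return every (local, domain) split where some interior word
    equal to "at" is taken as the separator and all other words on
    each side are concatenated into a single atom."""
    result = []
    for i in range(1, len(words) - 1):
        if words[i].lower() == "at":
            local = "".join(words[:i])
            domain = "".join(words[i + 1:])
            result.append((local, domain))
    return result

for local, domain in readings("at at at at at at at".split()):
    print(local, "at", domain)
```

Under that model the seven-word input has five distinct readings, and
nothing in the input selects one of them.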

> it's that (a) it doesn't make the distinction between lexical analysis
> and parsing sufficiently clear (the rules for parsing and the rules
> for lexical analysis are intermixed)

To a large extent that is, and should be, an implementation
detail.  Ideally a specification should permit a variety of
implementations (rather than forcing a particular one) while
making crystal clear what the permitted syntactic variations
are.

> (b) by using different tokens  
> for dates (including dates in received fields)

Actually, 822 and 2822 are consistent w.r.t. date-time in
Received vs. other fields (mind you, there are some errors in
822, as noted in 1123, though 1123 botched the relevant section
number).  (821 and 2821) vs. (822 and 2822) is another matter;
821 doesn't specify any "implied" rules w.r.t. comments or
line-folding, and specifies only a single SP as a separator
between field body components in Received fields. 821 had no
provision for day-of-week, and the seconds component was
mandatory rather than optional.

And 822 could have required space via specification of
mandatory SP, as its companion RFC 821 did in Received
fields, if it was intended that space was required rather
than optional between day and month, and between month and
year.

> it forces lexical analysis to be 
> context-sensitive.

There are many things that force lexical analysis to be
context sensitive.  Parentheses bracket comments in some
structured fields, but not in others (MIXER fields forbid
comments and use parentheses for other purposes; RFC 2533
as used by 2530, 2912, 3297, and the recent fax-esmtp-conneg
draft uses parentheses heavily, but is silent about comments).
RFC 2047 encoded-words have meaning in unstructured fields,
in an 822 "phrase" and in comments, except in comments in a
Received field.  As noted above, determining whether or not
there *are* comments depends on whether the field is structured
or unstructured, and if structured it depends on details of
the specification (as in MIXER and feature-based fields);
handling encoded-words in comments in structured fields requires
yet another special-case exception for Received fields.  Context
sensitivity is a fact of life in the Internet message format.

> In other words, the way 822's grammar is written 
> the lexical analyzer has to know that it's expecting a date and 
> recognize "1" as 1*2DIGIT rather than recognizing "1Jan2004" as atom.

Given that a lexical analyzer may need to know
a. whether or not the field is structured
b. if structured, whether or not parentheses delimit comments
c. where comments, line-folding, and whitespace are permitted/
   forbidden/required
the knowledge that a date-time is being parsed is a minor
issue.  Strictly speaking, the lexical analyzer need not
be informed that date-time is being parsed; if it returns
numerical tokens separately from alphabetic ones, the higher-
level parsing can take care of handling those tokens as
individual units or by recombining them, depending on context.
Indeed, handling numbers separately from letters, and spaces
separately from horizontal tabs, is a virtual necessity because
of the way some field bodies are structured, at least for
validating parsers.
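
As a rough illustration of that division of labor, here is a Python
sketch.  The token class names (NUM, ALPHA, SP, HT, OTHER, ATOM) and
both function names are invented for this example, and the regrouping
step ignores the other characters an 822 atom may contain.

```python
import re

# Context-free lexer: digit runs, letter runs, space runs and tab runs
# are returned as separate tokens; anything else is one character.
TOKEN_RE = re.compile(
    r"(?P<NUM>[0-9]+)|(?P<ALPHA>[A-Za-z]+)|(?P<SP> +)|(?P<HT>\t+)|(?P<OTHER>.)",
    re.DOTALL,
)

def tokenize(text):
    """Return (kind, value) pairs without knowing the field context."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)]

def regroup_atoms(tokens):
    """Higher-level step for contexts that want atoms: merge adjacent
    NUM/ALPHA tokens into a single ATOM token."""
    out = []
    for kind, value in tokens:
        if kind in ("NUM", "ALPHA"):
            if out and out[-1][0] == "ATOM":
                out[-1] = ("ATOM", out[-1][1] + value)
            else:
                out.append(("ATOM", value))
        else:
            out.append((kind, value))
    return out
```

A date-time parser would consume the tokens from tokenize("1Jan2004")
as three units, whereas a parser expecting an atom would call
regroup_atoms first and see one; the lexer itself is the same in both
cases.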
 
> Either that or you have to rewrite the grammar [...]

A bit late for that particular approach; about the best that
can be done is something along the RFC 2822 lines of specifying
generate and parse grammars to address interoperability and
compatibility issues, making syntax clear while not unduly
constraining methods of implementation.

> and verify the individual tokens outside of the lexical analyzer.

That's another fact of life; some lexically valid components
are illegal per higher-level syntax (e.g. 31 Feb) that a
lexical analyzer can't reasonably be expected to catch.
Overzealousness at low levels can result in errors such as
declaring legal content illegal, as in the legal field:
   Date: 31 Dec 1998 23:59:60 -0000
where 60 in the seconds field denotes the (most recent as
of this time) leap second.
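
A small Python sketch of keeping the two levels apart follows.  The
function names are hypothetical, and the checks are only examples of
where each kind of test might live, not a complete date-time
validator.

```python
import calendar
import re

def seconds_lexically_valid(text):
    """Low-level check: two digits in the range 00-60.  The upper
    bound is 60, not 59, so that a leap second is not rejected."""
    return re.fullmatch(r"[0-9]{2}", text) is not None and int(text) <= 60

def day_valid(day, month, year):
    """Higher-level check: the day must exist in the given month.
    calendar.monthrange returns (weekday of day 1, days in month)."""
    return 1 <= day <= calendar.monthrange(year, month)[1]
```

With these, "60" passes the low-level seconds check, while 31 Feb is
caught only by day_valid, which needs the month and year that a lexer
looking at a two-digit token does not have.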

> There's nothing inherently wrong with the implied LWS rule.  Almost
> every programming language uses a similar rule and manages to do so
> without creating interoperability problems.

The ones I've seen generally call out the rule explicitly in
BNF, where BNF is provided.  Interoperability problems have
existed and do exist in the programming language world.

> And 822's grammar is  
> much cleaner and easier to understand than 2822's grammar.  

While there are some difficulties with 2822 grammar (hence the
effort to simplify and remove ambiguities), 822's grammar was
hopelessly vague (your "not clear" statement about sums it up).
Had it in fact been easy to understand, there would not have
been interoperability problems.

Speaking of grammars, ABNF, and ambiguities, RFC 2234's ABNF
for ABNF has several problems; I've started to look into some,
so this is just a preliminary list:
a. The text specifies that comments must be delimited on their
   lines by a semicolon; however, the ABNF ABNF contains several
   lines apparently intended as comments that are not so delimited:
                                  without DQUOTE
                                  without angles
                                  last resort
                                  excluding NUL
   (those are noted on the RFC Editor Errata page, but the content
   of the "prose-val" production is changed significantly, and it
   is unclear whether the very different syntax on the Errata page
   was intended by the RFC 2234 authors; either way, the issue ought
   to have been addressed promptly by a revision -- the issue was
   reported more than 26 months ago, and RFC 2234 has been at Proposed
   for more than seven years (vs. the RFC 2026 requirement for review
   after 24 months at Proposed and every year thereafter)).
b. It specifies that ABNF lines end with CRLF, whereas RFCs that
   use ABNF may end lines with CRLF or with a newline alone;
   indeed, the canonical form at ISI.edu uses newline line endings
   (RFC 2223).
c. There is no provision for whitespace before the "rulename"
   identifier in a rule! (Every instance of ABNF that I know of
   has at least one rule that has leading whitespace; in RFC 2822
   it is the rule:
   obs-FWS         =       1*WSP *(CRLF 1*WSP)
^^^
   .)
d. It has a number of shift/reduce and reduce/reduce conflicts
   which complicate parsing (LR(1) parsing is precluded, for
   example).  Consider the legal ABNF:
a
=
b
c
d
=
f
   in which d cannot be determined to be part of a "rule" as
   opposed to an "element" until after reading the "=", which
   then affects handling of 'c' and 'd' (and the example might
   have included whitespace and/or comments on various lines,
   further complicating matters).
e. There are some understandable but odd (from a parsing POV)
   inconsistencies in handling whitespace (including line
   endings). E.g. while the example above and
a = b c
d = f
   are legal and equivalent,
a = b c d = f
   which differs only slightly, is illegal.
f. Context sensitivity exists in the ABNF ABNF also; whether
   '0' is a "BIT" or "DIGIT" or "HEXDIG" or "OCTET" or "VCHAR"
   or "CHAR" depends on context, for example.  In ABNF parsing,
   one *must* treat numerical strings separately from alphabetic
   strings or make lexical analysis context dependent, else one
   will botch constructs like "3DIGIT" (repetition).
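
For point f, a Python sketch of separating the repetition prefix from
a rule name is given here.  It covers only the case where the repeated
element is a rulename (not a group, option, or literal), and the names
REPETITION_RE and split_repetition are invented for illustration.

```python
import re

# repeat   = 1*DIGIT / (*DIGIT "*" *DIGIT)
# rulename = ALPHA *(ALPHA / DIGIT / "-")
# The "*" alternative is tried first so that "1*2" is taken whole.
REPETITION_RE = re.compile(
    r"(?P<repeat>[0-9]*\*[0-9]*|[0-9]+)?"
    r"(?P<element>[A-Za-z][A-Za-z0-9-]*)"
)

def split_repetition(text):
    """Split e.g. "3DIGIT" into ("3", "DIGIT").  Returns None when the
    text is not an optional repeat followed by a rulename."""
    m = REPETITION_RE.fullmatch(text)
    if m is None:
        return None
    return (m.group("repeat"), m.group("element"))
```

A lexer that instead returned every alphanumeric run as one token
would hand "3DIGIT" to the parser as a single unit, and since a
rulename must begin with a letter, that unit would match nothing.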

