[rfc-i] ABNF (RFC2234) vs HTTP's augmented BNF syntax (RFC822 + RFC2616)

Bruce Lilly blilly at erols.com
Thu Feb 17 07:20:35 PST 2005


On Tue February 15 2005 14:45, Keith Moore wrote:

> In over 20 years of Internet mail I have never seen a date generated
> that didn't contain white space between the day of the month, 
> name of the month, year, and time.

I have. Unfortunately, some developers of applications using
Internet protocols tend to take your "I have never seen..." and
assume that such things never happen. And lo and behold, an
interoperability problem is born.

> [...] produced by implementations whose creators apparently didn't bother
> to read the specification at all.

There's the problem. To some extent, examples in specifications
may be counterproductive, because of the lazy developer issue.
 
> It's tempting to speculate on why removing LWS and comments during
> lexical analysis is a problem for authors of email readers and not 
> a problem for implementors of programming languages. 

I think Dave came pretty close; compiler (etc.) writers writing
for a well-defined language are dealing in their area of expertise.
Application developers may be parsing experts or experts in the
specific problem domain; experts in both areas are rare.  Compounding
the problem is vague definitions which do not lend themselves to
analysis for ambiguities and conflicts, and/or straightforward
translation into a parser.

> The implementation shouldn't be required to separate
> lexical analysis and parsing in the same way as the specification does,
> as long as it produces equivalent results.

The problem is that a vague specification doesn't specify results
other than vaguely.

> But it makes sense for the 
> specification to separate things in such a way that it's easy to 
> write efficient implementations that obviously conform to the 
> specification.

The easiest way would be to have BNF which is suitable for automatic
generation of a parser; compiler writers solved that problem 3 decades
ago with yacc and related programs.  RFC 2234 ABNF can't easily be
turned into a yacc specification (indeed, as noted in the case of
ABNF itself, it's not amenable to LR(1) parsing).
 
> 822 had a clean grammar but didn't make the specification between
> lexical analysis and parsing clear, so implementors tended to write
> ad hoc parsers rather than recognizers for the grammar.

Agreed that 822's lack of clarity has stymied attempts to produce
a formal grammar-based parser.  I seem to recall some attempts in
the 90's, and it is something that is periodically requested by
developers.

> 2822 has 
> a very complex grammar, and my belief is that people are even less
> likely to try to implement a recognizer for its grammar than they 
> were for 822's grammar.

In mid-2001, I tired of writing (and rewriting) ad-hoc parsers and
decided to write a formal grammar-based parser.  In researching the
issue (to avoid reinventing the wheel) I found none of the attempts
that I had vaguely recalled from the 90's, but found that 2821 and
2822 had been recently issued.  2822 made implementation considerably
easier than it would have been with only 822 and the couple of dozen
RFCs that amended it (not to mention those, like 821, that
contradicted it).  However, 2822 has a number of ambiguities which
have led to the attempt to simplify its grammar by removing those
ambiguities. [Incidentally, there are still conflicts between
2821 and 2822; Messrs. Resnick and Klensin are aware of them.]

> What 2822 does is to further restrict 
> the language that senders can use in order to minimize the chance
> that a poor implementation on the receiver end will mis-parse
> the message, and that is probably a good idea.  Unfortunately,
> this got specified in the grammar rather than elsewhere.

Given a specification with:
1. examples
2. ABNF
3. normative prose
it appears that most developers consult those in the order given
above, proceeding to the next step only if they find themselves
in a bind.  The 2822 clarifications absolutely belong in the ABNF;
burying them elsewhere would have meant that they would have been
ignored.  In many ways 2822 ABNF doesn't go far enough; it ignores
MIME, including the (rather complex) rules for handling encoded-words
in field bodies.

> So 
> while senders might start producing cleaner messages, recognizers
> aren't likely to improve.

A few issues:
1. 2822 changes the rules, e.g. by now forbidding NUL in places
   where it has traditionally been legal; to that extent, new
   development based solely on 2822 will not be able to parse
   some messages which were legal according to 822 and its
   predecessors.
2. Part of the problem is inertia.  Existing parsers are unlikely
   to change (full stop).
3. Parsing is inherently a more difficult problem than generation
   (printf is easy).
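
To illustrate point 3, here is a rough Python sketch (mine, not taken
from any of the RFCs). The generator is a single format string that
emits exactly one layout; the parsing side is delegated to the
standard library's email.utils.parsedate_to_datetime, which has to
cope with the variations other generators produce. The function name
generate_date and the two lookup tables are made up for this example.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

def generate_date(dt):
    """Emit one fixed date-time layout for a timezone-aware datetime.

    Generation is the easy half: a single printf-style format.
    """
    return "%s, %02d %s %04d %02d:%02d:%02d %s" % (
        DAYS[dt.weekday()], dt.day, MONTHS[dt.month - 1], dt.year,
        dt.hour, dt.minute, dt.second, dt.strftime("%z"))

# The date of this message, in the sender's zone (UTC-8).
sent = datetime(2005, 2, 17, 7, 20, 35,
                tzinfo=timezone(timedelta(hours=-8)))
text = generate_date(sent)

# Parsing is the hard half; here it is left to the library.
parsed = parsedate_to_datetime(text)
```

The generator only ever needs to know one form; whatever sits behind
the parsing call has to know all of them.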

> I generally tell people that they should use 822 as a guide to
> writing a recognizer, and 2822 as a guide to writing a generator.

There is actually good advice in 2822's obs- rules for parsing.
Ignoring those rules may lead to interoperability problems.

> Ideally the specification should make it easy to write an obviously
> correct implementation on most platforms.

See above re. ABNF vs. yacc.

> Take MD5, which for practical purposes is defined
> by a C program.  It's been implemented on a wide variety of platforms
> with few interoperability problems.  (yes, I've read the specification 
> and am aware that the C program is not the actual definition of MD5,
> but for practical purposes that's what tends to be used, even when
> the implementation language isn't C).

Sure, whether called such or not, it's a reference implementation,
and such things tend to get used rather than reinvented (sometimes,
as with SNMP v. 1, with awful results).

> You're missing the point I was trying to make which is that
> 822's lexical analysis is context-sensitive.

Got that; agreed.

> [...] what you'd
> like to do is have a generalized lexical analyzer for any
> structured field.

Nice idea (perhaps something for mail-ng), but the current crop
of Internet protocols hasn't developed that way.

> Those are later botches, not botches in RFC 822.
> 
> (but apropos of this thread, those botches are good examples 
> of what can happen if later extensions don't adhere to the 
> style and syntax conventions of the core protocol)

I wouldn't necessarily call those all "botches", and some of
them date to the same time frame as 822 (the first MIXER RFC
was RFC 987).
  
> > RFC 2047 encoded-words have meaning in unstructured fields,
> > in an 822 "phrase" and in comments, except in comments in a
> > Received field.
> 
> ...but they don't affect parsing or message semantics.

Depends on the application; if one is parsing fields for
presentation, parsing is absolutely affected.  If one is
parsing encoded-words for presentation and encounters a
Received field, one has to invoke special handling. Indeed,
if one encounters an extension field (not user-defined) but
does not know the precise field syntax, one is out of luck
w.r.t. encoded-words (as you know, RFC 2047 has an escape
clause for user-defined fields, but is silent about extension
fields with unknown syntax).
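
A minimal Python sketch of that context dependence, under loud
assumptions: the set KNOWN_TEXT_FIELDS and the function display_body
are hypothetical names of mine, and only fields known to be
unstructured are decoded. Structured fields (where encoded-words may
appear in a phrase or a comment) would need field-specific parsing
that is not shown. The decoding itself uses decode_header and
make_header from Python's email.header module.

```python
from email.header import decode_header, make_header

# Assumption for illustration only: fields whose bodies are known
# to be unstructured text.
KNOWN_TEXT_FIELDS = {"subject", "comments"}

def display_body(field_name, raw_body):
    """Return a field body prepared for presentation.

    Encoded-words are decoded only where the field syntax is known
    to permit them.  Received, and extension fields whose syntax is
    unknown, are returned untouched.
    """
    if field_name.lower() in KNOWN_TEXT_FIELDS:
        return str(make_header(decode_header(raw_body)))
    return raw_body

subject = display_body("Subject", "=?iso-8859-1?q?caf=E9?=")
received = display_body(
    "Received", "from a.example (=?iso-8859-1?q?caf=E9?=) by b.example")
```

The same character sequence is decoded in one field and left alone in
the other; the decision cannot be made without knowing which field is
being processed.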
 
> > Context
> > sensitivity is a fact of life in the Internet message format.
> 
> Yes, and it can be dealt with if the implementor understands that
> this is what is necessary.  But neither 822 nor 2822 really make
> this clear. 

True to the extent that context can be determined (see above re.
encoded-words and extension fields). That is a by-product of the
informality of early specifications (822 and predecessors); even
RFC 561, which attempted formal BNF specification, contains some
vagueness (e.g. "a standard host name").

> > Strictly speaking, the lexical analyzer need not
> > be informed that date-time is being parsed; if it returns
> > numerical tokens separately from alphabetic ones, the higher-
> > level parsing can take care of handling those tokens as
> > individual units or by recombining them, depending on context.
> 
> Sure, if you rewrite the grammar and implement a recognizer
> for the rewritten grammar.

Rewriting the grammar isn't necessary.  There is always an
engineering tradeoff between what is handled in lexical analysis
vs. what is handled in parsing.  Some issues, e.g. the introduction
of '*' as a special symbol by RFC 2184, and the inconsistent
handling of digits when RFC 2184 was obsoleted by RFC 2231, strongly
encourage handling numbers separately from alphabetic strings,
leaving the parser to sort out legal combinations of letters and digits
and other symbols (another reason to handle numbers and alphabetics
separately is the changes in rules for domain names).
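
As a rough Python sketch of that division of labor (the token class
names, tokenize and parse_date are all invented for this example, and
the grammar is reduced to the day/month/year prefix of a date-time):
the lexical analyzer knows nothing about dates and simply returns
runs of digits, runs of letters, and other symbols as separate
tokens; the parser decides which combinations are legal.

```python
import re

# Hypothetical token classes; the names are illustrative only.
TOKEN_RE = re.compile(
    r"(?P<num>[0-9]+)|(?P<alpha>[A-Za-z]+)|(?P<ws>[ \t]+)|(?P<sym>.)")

MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]

def tokenize(text):
    """Context-free lexer: yield (kind, value), dropping whitespace."""
    for m in TOKEN_RE.finditer(text):
        if m.lastgroup != "ws":
            yield m.lastgroup, m.group()

def parse_date(text):
    """Parser: accept day, month name, year from the leading tokens.

    Raises ValueError if the tokens do not form such a sequence.
    """
    tokens = list(tokenize(text))[:3]
    if [kind for kind, _ in tokens] != ["num", "alpha", "num"]:
        raise ValueError("expected day, month name, year")
    day = int(tokens[0][1])
    month = MONTHS.index(tokens[1][1].lower()) + 1  # ValueError if unknown
    year = int(tokens[2][1])
    return day, month, year

spaced = parse_date("17 Feb 2005 07:20:35")
packed = parse_date("17Feb2005 07:20:35")
```

Both inputs yield the same token sequence, so the parser never needs
to know whether whitespace was present between day, month and year.
A year run directly into the hour would come back as a single numeric
token; splitting it using the fixed width of the hour is a parser
decision that this sketch does not attempt.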

> If you're trying to write a clear specification what
> you want to do is minimize the opportunity for errors by making
> it easy for an implementor to translate the specification into
> correct code.

See above again re. ABNF vs. yacc.  It is easier to translate a
complete specification into correct code. It is nearly impossible
to translate a vague, incomplete specification into code that can
convincingly be argued as being correct. 

> I don't recall any ambiguity in the 822 grammar.  The ambiguities exist
> outside of the grammar.

What about the very issue we've been discussing: whether or not
whitespace is required between day and month and between month
and year (or, for that matter, between year and hour; N.B. the
hour component has fixed width)?

