[rfc-i] Re: ABNF (RFC2234) vs HTTP's augmented BNF syntax (RFC822 + RFC2616)

Keith Moore moore at cs.utk.edu
Tue Feb 15 11:45:27 PST 2005


warning - long discussion about obscure details follows.  you probably
don't want to read this unless you care a lot about email specifically.

executive summary:

1. when extending an existing protocol, stick to the style and conventions
   of the core protocol rather than writing the extensions according to
   the latest fashion.

2. be careful about citing 2822 in general, or 2822's decision to make
   white space, comments, etc. explicit parts of the grammar, as a
   good example of how to write a specification.  the jury is still
   out.

*************************************************************************

> > not clear.
> [...]
> > The problem with 822's syntax isn't that it uses the implied LWS
> > rule,
> 
> It's certainly not the only problem, and if all developers were
> careful to observe the implications, it wouldn't be a practical
> problem.  However, it has proven to be a problem in practice;

In over 20 years of Internet mail I have never seen a date generated
that didn't contain white space between the day of the month, 
name of the month, year, and time.  It's simply not a problem in
practice.  There are lots of other problems with dates that occur 
quite frequently, notably incorrect timezone syntax (generating
correct timezones is apparently much harder than it seems) and dates 
produced by implementations whose creators apparently didn't bother
to read the specification at all. 

> ignoring the issues with numeric components vs. month names for
> the moment, whitespace around the colons in the time has been
> an issue overlooked by some developers, leading to interoperability
> problems (and colon is an RFC 822 "special", so there's no "atom"
> excuse).  Note that RFC 2822 is clear that for generation there
> should be no comments, whitespace, or line folding around the
> colons (to avoid tickling bugs in some deployed parsers), while
> parsers are required (via 2822's obs- rules) to recognize CFWS
> around the colons (for compatibility with RFC 822-legal content
> in archived messages).

It's tempting to speculate on why removing LWS and comments during
lexical analysis is a problem for authors of email readers and not 
a problem for implementors of programming languages. 

> Whitespace, comments, and line-folding around the special '.'
> in domain names is another practical interoperability problem
> related to the 822 implicit rules that has been clarified by RFC
> 2822.  If you want to go back to 822's predecessor, RFC 733,
> things were worse still; it permitted whitespace within atoms,
> so
>    at at at at at at at
> could be
>    at at atatatatat
> or
>    atat at atatatat
> or
>    atatat at atatat
> etc.  That combination of permitting whitespace within atoms with
> special treatment of " at " wasn't merely a recipe for
> interoperability problems, it was a fundamental unresolvable
> ambiguity in the specification.

Yes, that was a botch, but it hasn't been a _practical_ 
interoperability problem in at least 20 years.  733 was replaced
by 822, and implementations had been updated to generate 822
syntax, long before there was any significant growth in the Internet.

> > it's that (a) it doesn't make the distinction between lexical
> > analysis and parsing sufficiently clear (the rules for parsing and
> > the rules for lexical analysis are intermixed)
> 
> To a large extent that is, and should be, an implementation
> detail. 

I'm talking about clarity of the specification, not how to implement it.
The problem is that 822 is written with the assumption that there will
be a separate lexical analysis phase, so that things like white space
and comments in structured fields are treated only as separators and
not as lexical tokens.  That's a good way to make the grammar simpler
and easier to understand.  It happens that it's also a good way to write
a mail parser.  The implementation shouldn't be required to separate
lexical analysis and parsing in the same way as the specification does,
as long as it produces equivalent results.  But it makes sense for the
specification to separate things in such a way that it's easy to 
write efficient implementations that obviously conform to the 
specification.
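
To illustrate what such a separate lexical analysis phase might look
like, here is a minimal sketch in Python.  It is not taken from 822 or
from any real mailer; the function name tokenize, the token labels and
the simplified character classes are assumptions made for the example.
White space and (nested) comments are consumed as separators, so the
parser only ever sees atoms, specials and quoted strings.

```python
SPECIALS = set('()<>@,;:\\".[]')   # roughly the RFC 822 "specials"
WHITESPACE = " \t\r\n"

def tokenize(field_body):
    """Split a structured field body into tokens, dropping LWS and comments."""
    tokens = []
    i, n = 0, len(field_body)
    while i < n:
        ch = field_body[i]
        if ch in WHITESPACE:
            i += 1                      # white space is only a separator
        elif ch == "(":
            depth = 0                   # comments nest; backslash quotes a char
            while i < n:
                c = field_body[i]
                if c == "\\":
                    i += 1              # skip the quoted character as well
                elif c == "(":
                    depth += 1
                elif c == ")":
                    depth -= 1
                i += 1
                if depth == 0:
                    break
        elif ch == '"':
            j = i + 1
            while j < n and field_body[j] != '"':
                j += 2 if field_body[j] == "\\" else 1
            tokens.append(("quoted-string", field_body[i:j + 1]))
            i = j + 1
        elif ch in SPECIALS:
            tokens.append(("special", ch))
            i += 1
        else:
            j = i
            while j < n and field_body[j] not in SPECIALS \
                    and field_body[j] not in WHITESPACE:
                j += 1
            tokens.append(("atom", field_body[i:j]))
            i = j
    return tokens

print(tokenize("jdoe (a (nested) comment) @ example . com"))
```

With this split, "jdoe@example.com" and the commented, spaced-out form
above produce the same token sequence, which is the property the
implied LWS rule is meant to give the grammar.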

822 had a clean grammar but didn't make the distinction between
lexical analysis and parsing clear, so implementors tended to write
ad hoc parsers rather than recognizers for the grammar.  2822 has
a very complex grammar, and my belief is that people are even less
likely to try to implement a recognizer for its grammar than they 
were for 822's grammar.  What 2822 does is to further restrict
the language that senders can use in order to minimize the chance
that a poor implementation on the receiver end will mis-parse
the message, and that is probably a good idea.  Unfortunately,
this got specified in the grammar rather than elsewhere.  So
while senders might start producing cleaner messages, recognizers
aren't likely to improve.

I generally tell people that they should use 822 as a guide to
writing a recognizer, and 2822 as a guide to writing a generator.
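
As a rough illustration of that generator/recognizer split (not
something from either RFC), Python's standard email.utils module can be
used this way: format_datetime emits one fixed form, while
parsedate_to_datetime accepts more than that one form.  The variable
names and the example timestamp are arbitrary.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime, parsedate_to_datetime

# An arbitrary example timestamp at UTC-8.
pst = timezone(timedelta(hours=-8))
sent = datetime(2005, 2, 15, 11, 45, 27, tzinfo=pst)

# Generator side: always emit the same conservative form.
header_value = format_datetime(sent)
print(header_value)

# Recognizer side: accept the generated form, and also a form
# without the optional day of the week.
parsed = parsedate_to_datetime(header_value)
no_weekday = parsedate_to_datetime("15 Feb 2005 11:45:27 -0800")
print(parsed == sent, no_weekday == sent)
```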

> Ideally a specification should permit a variety of
> implementations (rather than forcing a particular one) while
> making crystal clear what the permitted syntactic variations
> are.

Ideally the specification should make it easy to write an obviously
correct implementation on most platforms.  It's useful to permit 
some variability in how things are implemented for the sake of 
efficiency or adaptability to a variety of programming languages, 
but ease of correct implementation may be more important than this
kind of flexibility.  Take MD5, which for practical purposes is defined
by a C program.  It's been implemented on a wide variety of platforms
with few interoperability problems.  (yes, I've read the specification 
and am aware that the C program is not the actual definition of MD5,
but for practical purposes that's what tends to be used, even when
the implementation language isn't C).

> > (b) by using different tokens  
> > for dates (including dates in received fields)
> 
> Actually, 822 and 2822 are consistent w.r.t. date-time in
> Received vs. other fields (mind you, there are some errors in
> 822, as noted in 1123, though 1123 botched the relevant section
> number). 

You're missing the point I was trying to make, which is that
822's lexical analysis is context-sensitive.  A lexical analyzer
that has seen Date: or a ";" within a Received field needs to 
start scanning for 1*2DIGIT rather than atoms, white space, 
comments, etc.  This is sort of a pain because what you'd
like to do is have a generalized lexical analyzer for any
structured field.
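
A small Python sketch of the mode problem, under heavy simplifying
assumptions: comments and quoted strings are ignored, the regular
expressions only approximate the 822 character classes, and the names
GENERIC, DATE_TOKEN, scan and scan_field are made up for the example.
The same input yields different tokens depending on whether the
scanner has been told it is inside a date.

```python
import re

# Generic structured-field mode: an atom, or a single special.
GENERIC = re.compile(r'[^\s()<>@,;:\\".\[\]]+|[()<>@,;:\\".\[\]]')
# Date mode: digit runs and letter runs are separate tokens.
DATE_TOKEN = re.compile(r'\d+|[A-Za-z]+|[:,+-]')

def scan(text, date_mode=False):
    """Return the tokens of text under the selected scanning mode."""
    pattern = DATE_TOKEN if date_mode else GENERIC
    return pattern.findall(text)

def scan_field(name, body):
    """Choose the scanning mode from the field name (and ';' in Received)."""
    if name.lower() == "date":
        return scan(body, date_mode=True)
    if name.lower() == "received" and ";" in body:
        head, _, stamp = body.rpartition(";")
        return scan(head) + [";"] + scan(stamp, date_mode=True)
    return scan(body)

print(scan("1Jan2004 10:30"))                  # generic mode
print(scan("1Jan2004 10:30", date_mode=True))  # date mode
```

In generic mode "1Jan2004" comes back as one atom; in date mode it
comes back as "1", "Jan", "2004".  The scanner cannot pick between
the two without knowing which field, and which part of the field, it
is looking at.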

> > it forces lexical analysis to be 
> > context-sensitive.
> 
> There are many things that force lexical analysis to be
> context sensitive.  Parentheses bracket comments in some
> structured fields, but not in others (MIXER fields forbid
> comments and use parentheses for other purposes; RFC 2533
> as used by 2530, 2912, 3297, and the recent fax-esmtp-conneg
> draft uses parentheses heavily, but is silent about comments).

Those are later botches, not botches in RFC 822.

(but apropos of this thread, those botches are good examples 
of what can happen if later extensions don't adhere to the 
style and syntax conventions of the core protocol)

> RFC 2047 encoded-words have meaning in unstructured fields,
> in an 822 "phrase" and in comments, except in comments in a
> Received field.

...but they don't affect parsing or message semantics.

> As noted above, determining whether or not
> there *are* comments depends on whether the field is structured
> or unstructured, and if structured it depends on details of
> the specification (as in MIXER and feature-based fields);
> handling encoded-words in comments in structured fields requires
> yet another special-case exception for Received fields.  Context
> sensitivity is a fact of life in the Internet message format.

Yes, and it can be dealt with if the implementor understands that
this is what is necessary.  But neither 822 nor 2822 really makes
this clear. 
 
> > In other words, the way 822's grammar is written 
> > the lexical analyzer has to know that it's expecting a date and 
> > recognize "1" as 1*2DIGIT rather than recognizing "1Jan2004" as
> > atom.
> 
> Given that a lexical analyzer may need to know
> a. whether or not the field is structured
> b. if structured, whether or not parentheses delimit comments
> c. where comments, line-folding, and whitespace are permitted/
>    forbidden/required
> the knowledge that a date-time is being parsed is a minor
> issue. 

Actually it needs to know more than that, because the set of
terminal symbols for MIME header fields is different from those
for 822 header fields, even though the two can be mixed in the
same message header.  Having the lexical analysis be sensitive
to date-time parsing is different in that it either requires 
feedback from within the parser or it requires a lexical analyzer
that knows enough about the language that it knows when to 
switch modes.  (feedback from the parser is easier).

> Strictly speaking, the lexical analyzer need not
> be informed that date-time is being parsed; if it returns
> numerical tokens separately from alphabetic ones, the higher-
> level parsing can take care of handling those tokens as
> individual units or by recombining them, depending on context.

Sure, if you rewrite the grammar and implement a recognizer
for the rewritten grammar.  But 2822's grammar is already
too complex, and rewriting it just increases the opportunity for
errors.  If you're trying to write a clear specification what
you want to do is minimize the opportunity for errors by making
it easy for an implementor to translate the specification into
correct code.

> > and verify the individual tokens outside of the lexical analyzer.
> 
> That's another fact of life; some lexically valid components
> are illegal per higher-level syntax (e.g. 31 Feb) that a
> lexical analyzer can't reasonably be expected to catch.

Yes, indeed.  But since some tokens need this verification anyway,
this is an argument for not trying to do as much verification as
possible in the grammar.
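
For instance, a check like the following (a Python sketch; the name
check_date and the MONTHS table are assumptions for the example) is
done on tokens the grammar has already accepted.  No reasonable
grammar rejects "31 Feb", but a calendar check does.

```python
import datetime

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

def check_date(day, month, year):
    """Return a date for syntactically valid tokens, or None if the
    combination does not name a real calendar day."""
    try:
        return datetime.date(int(year), MONTHS.index(month) + 1, int(day))
    except ValueError:
        # Unknown month name, non-numeric field, or impossible day.
        return None

print(check_date("29", "Feb", "2004"))   # a leap day
print(check_date("31", "Feb", "2004"))   # lexically fine, not a real date
```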

> > There's nothing inherently wrong with the implied LWS rule.  Almost
> > every programming language uses a similar rule and manages to do so
> > without creating interoperability problems.
> 
> The ones I've seen generally call out the rule explicitly in
> BNF, where BNF is provided.  

Some of them are written so that the terminal symbols are individual 
characters, but most seem to be written so that terminal symbols 
can consist of multiple characters.  It's easier to distinguish
"while" (a keyword) from "whil3" (an identifier) in the lexical
analysis phase than in a parser.
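
A sketch of why, in Python, with an invented keyword set and the
invented names KEYWORDS, WORD and classify: once the scanner has the
whole word in hand, the keyword test is a single set lookup, whereas a
grammar whose terminals are individual characters would have to spell
out every keyword and every identifier that merely starts like one.

```python
import re

KEYWORDS = {"while", "if", "else", "return"}   # hypothetical language
WORD = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def classify(text):
    """Label each word in text as a keyword or an identifier."""
    return [("keyword" if word in KEYWORDS else "identifier", word)
            for word in WORD.findall(text)]

print(classify("while whil3"))
```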

> Interoperability problems have
> existed and do exist in the programming language world.

But parsing problems are comparatively rare.

> > And 822's grammar is  
> > much cleaner and easier to understand than 2822's grammar.  
> 
> While there are some difficulties with 2822 grammar (hence the
> effort to simplify and remove ambiguities), 822's grammar was
> hopelessly vague (your "not clear" statement about sums it up).
> Had it in fact been easy to understand, there would not have
> been interoperability problems.

I don't recall any ambiguity in the 822 grammar.  The ambiguities exist
outside of the grammar.


