[rfc-i] New version: draft-hoffman-utf8-rfcs-04.txt

John C Klensin john+rfc at jck.com
Tue Nov 4 07:52:17 PST 2008



--On Tuesday, 04 November, 2008 00:28:51 -0800 Joe Touch
<touch at ISI.EDU> wrote: 

>>> If support for UTF-8 was in fact as universal as asserted in
>>> this doc, why is a BOM needed at all?
>> 
>> That has nothing to do with UTF-8 support being universal or
>> not.
>> 
>> The issue is that once encoding information is lost (such as
>> when transferred via FTP, or loaded from the file system),
>> many clients use a default encoding. 
> 
> So the default is ASCII, not UTF-8.

If we are talking about text/plain, it is clearly and
unambiguously ASCII if a charset parameter is not present.
Every attempt someone has made to change that has led to
problems.  The question, however, is what a local system decides
to do if it receives data identified as text/plain, with no
charset indicator, but that still contains octets with the high
bit turned on.  There is (IMO correctly) no standard for that
situation.  Different systems make different assumptions, often
involving a local character coding convention and sometimes
involving UTF-8.  Other systems may simply treat those octets as
unknown and undisplayable and do whatever they normally do in
that situation (typically dropping them or substituting an
indicator character), and a few may just drop the high bit and
treat the result as ASCII.
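
To make the divergence concrete, here is a rough Python sketch of
those guesses applied to the same unlabeled octets.  The function
name, the dictionary keys, and the choice of ISO-8859-1 as the
"local convention" are assumptions for illustration only; no
particular system is known to behave exactly this way.

```python
def interpretations(data: bytes) -> dict:
    """Decode unlabeled octets the way different receivers might guess."""
    return {
        # Assume a local convention (ISO-8859-1 is only an example).
        "local": data.decode("latin-1"),
        # Assume UTF-8, substituting U+FFFD if the octets are not valid.
        "utf8": data.decode("utf-8", errors="replace"),
        # Treat non-ASCII octets as unknown: substitute an indicator.
        "substitute": data.decode("ascii", errors="replace"),
        # Treat non-ASCII octets as unknown: drop them.
        "drop": data.decode("ascii", errors="ignore"),
        # Clear the high bit and treat the result as ASCII.
        "strip_high_bit": bytes(b & 0x7F for b in data).decode("ascii"),
    }


if __name__ == "__main__":
    # "café" encoded as UTF-8, arriving with no charset parameter.
    for name, text in interpretations(b"caf\xc3\xa9").items():
        print(f"{name:15} {text!r}")
```

Five receivers, five different strings from the same four-letter
word, which is the practical cost of having no standard here.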

> Text/plain is ASCII; UTF-8 creates the problem by deliberately
> overloading text/plain to also mean UTF-8.

Or we created the problem in 1992 by not defining text/plain
differently or by presuming that parameters would be processed
more carefully and carried around much more than they are today.
It is a little late to try to change that.

I don't have a strong opinion on the BOM issue.  On the one hand,
BOMs are ugly and I've seen an editor or two over the years that
could handle UTF-8 reasonably well but would choke on them.  On
the other, a BOM might be the best pragmatic approach for the RFC
case (if one were willing to accept UTF-8 RFCs at all, which I
still don't believe has been settled).
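
For what the pragmatic approach would amount to on the reading
side, a minimal Python sketch follows.  The function name and the
choice to fall back to US-ASCII are assumptions (the fallback just
mirrors the text/plain default discussed above); it is not taken
from the draft.

```python
import codecs


def sniff_charset(data: bytes) -> tuple[str, bytes]:
    """Return (charset guess, payload) for an unlabeled text file."""
    # The UTF-8 BOM is the three octets EF BB BF at the very start.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8", data[len(codecs.BOM_UTF8):]
    # No BOM: fall back to the text/plain default.
    return "us-ascii", data


if __name__ == "__main__":
    print(sniff_charset(b"\xef\xbb\xbfRFC text"))
    print(sniff_charset(b"RFC text"))
```

A tool that does not strip the three octets will hand them to
whatever reads the file next, which is presumably how the editors
mentioned above came to choke.  Python's "utf-8-sig" codec does the
stripping during decoding.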

    john



More information about the rfc-interest mailing list