[rfc-i] Byte Order Marks for UTF-8

Dave Thaler dthaler at microsoft.com
Wed Jul 18 18:50:10 PDT 2012


> -----Original Message-----
> From: "Martin J. Dürst" [mailto:duerst at it.aoyama.ac.jp]
> Sent: Wednesday, July 18, 2012 6:10 PM
> To: Dave Thaler
> Cc: Paul Hoffman; Tim Bray; rfc-interest at rfc-editor.org
> Subject: Re: [rfc-i] Byte Order Marks for UTF-8
> 
> On 2012/07/19 3:57, Dave Thaler wrote:
> >> -----Original Message-----
> >> From: rfc-interest-bounces at rfc-editor.org
> >> [mailto:rfc-interest-bounces at rfc-editor.org] On Behalf Of Paul
> >> Hoffman
> >> Sent: Wednesday, July 18, 2012 9:39 AM
> >> To: Tim Bray
> >> Cc: rfc-interest at rfc-editor.org
> >> Subject: Re: [rfc-i] Byte Order Marks for UTF-8
> >>
> >> On Jul 18, 2012, at 9:23 AM, Tim Bray wrote:
> >>
> >>> That's probably a good recommendation, if we couple it with a
> >>> mandate to never generate UTF-16.
> >>
> >> Did I misread the messages from yesterday? I thought some
> >> text-reading software worked when it saw a UTF8 BOM but not if it
> >> didn't. If I misunderstood, then Phill's idea (don't include it in
> >> generated text formats) is fine. If not, the RFC Editor should
> >> investigate further.
> >
> > Right, there's plenty of software that displays UTF8 text correctly
> > when a UTF8 BOM is present and does not display it correctly when it's
> > absent.  (Usually because there are many possible encodings, and UTF8
> > isn't the default guess of that software.)
> 
> Wordpad was mentioned yesterday, and confirmed. Notepad was mentioned
> too, but it seems it can do without. Do you know others? Even if UTF-8 isn't
> the default guess, it's the encoding that's easiest to guess because of its
> regular structure.

We should never expect all apps to "guess". A few will, yes.
But we should never rely on it, and in any case guessing can be error-prone.
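
To make the difference between an explicit signature and a guess concrete,
here is a minimal Python sketch. The helper name `sniff_utf8` and its return
labels are made up for illustration; this is not how Wordpad, Notepad, or any
other application mentioned in this thread is known to work. It treats the
three-byte UTF-8 signature EF BB BF as a definite marker and falls back to a
strict decode as the "guess".

```python
import codecs

def sniff_utf8(data: bytes) -> str:
    """Classify bytes as 'bom', 'guess', 'ascii', or 'not-utf8'.

    Hypothetical helper, for illustration only.
    """
    # A leading EF BB BF is an explicit signal; no guessing is needed.
    if data.startswith(codecs.BOM_UTF8):
        return "bom"
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return "not-utf8"
    # Pure US-ASCII decodes the same way under many encodings,
    # so it carries no evidence either way.
    if all(b < 0x80 for b in data):
        return "ascii"
    # Well-formed multi-byte sequences: probably UTF-8, but only probably.
    return "guess"
```

For example, the bytes C3 BC are "ü" in UTF-8 but are also a legal two-character
sequence in ISO-8859-1, so `sniff_utf8(b"D\xc3\xbcrst")` can only report
"guess", while the same bytes preceded by a BOM report "bom".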

-Dave

> It may depend on how much actual UTF-8 (not just US-ASCII, which is
> trivially valid UTF-8) appears, and how early. It may be that the 'ü' in my
> name is not enough in some cases, but the Japanese name of my University
> helped. But it would be difficult to require everybody (or their
> organizations) to adopt a non-Latin name just for the purpose of
> UTF-8 detection.
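
The dependence on how early non-ASCII text appears can be sketched as follows.
This assumes a detector that inspects only a fixed-size prefix of the file; the
function name `guess_from_prefix` and the 16-byte default window are invented
for illustration and do not describe any particular application.

```python
import codecs

def guess_from_prefix(data: bytes, window: int = 16) -> str:
    """Guess from the first `window` bytes only (hypothetical detector)."""
    sample = data[:window]
    # An all-ASCII prefix gives the detector nothing to go on.
    if all(b < 0x80 for b in sample):
        return "undecided"
    # The incremental decoder (final=False by default) tolerates a
    # multi-byte sequence that is cut off at the edge of the window.
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        decoder.decode(sample)
    except UnicodeDecodeError:
        return "not-utf8"
    return "likely-utf8"
```

With this sketch, a file whose first non-ASCII character comes after the
window is "undecided", while the same file examined with a larger window is
"likely-utf8".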
> 
> I can easily make available a version with BOM, and/or a version without my
> University in Kanji, but I'll only tell this list. I don't want to have umpteen
> minor variants mentioned in the draft itself.
> 
> Regards,   Martin.


