[rfc-i] Byte Order Marks for UTF-8

"Martin J. Dürst" duerst at it.aoyama.ac.jp
Wed Jul 18 18:09:44 PDT 2012

On 2012/07/19 3:57, Dave Thaler wrote:
>> -----Original Message-----
>> From: rfc-interest-bounces at rfc-editor.org [mailto:rfc-interest-bounces at rfc-editor.org]
>> On Behalf Of Paul Hoffman
>> Sent: Wednesday, July 18, 2012 9:39 AM
>> To: Tim Bray
>> Cc: rfc-interest at rfc-editor.org
>> Subject: Re: [rfc-i] Byte Order Marks for UTF-8
>> On Jul 18, 2012, at 9:23 AM, Tim Bray wrote:
>>> That's probably a good recommendation, if we couple it with a mandate to
>>> never generate UTF-16.
>> Did I misread the messages from yesterday? I thought some text-reading
>> software worked when it saw a UTF8 BOM but not if it didn't. If I
>> misunderstood, then Phill's idea (don't include it in generated text formats) is
>> fine. If not, the RFC Editor should investigate further.
> Right, there's plenty of software that displays UTF8 text correctly when a UTF8
> BOM is present and does not display it correct when it's absent.  (Usually because
> there's many possible encodings, and UTF8 isn't the default guess of that software.)

WordPad was mentioned yesterday, and confirmed to need the BOM. Notepad
was mentioned too, but it seems to manage without one. Do you know of
others? Even if UTF-8 isn't the default guess, it's the easiest encoding
to detect because of its regular structure.
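
As a minimal sketch of why that regular structure helps (Python; the
function name looks_like_utf8 is made up for illustration and is not how
any of the programs mentioned actually work): a strict UTF-8 decoder
rejects almost any non-ASCII text in a legacy encoding, so a successful
decode that includes at least one non-ASCII character is strong evidence.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Heuristic: True if data is well-formed UTF-8 AND contains at
    least one non-ASCII character. Pure US-ASCII returns False, since
    it gives no evidence either way."""
    try:
        text = data.decode("utf-8")  # strict: raises on malformed sequences
    except UnicodeDecodeError:
        return False
    return any(ord(ch) > 0x7F for ch in text)


name = "Martin J. Dürst"
print(looks_like_utf8(name.encode("utf-8")))    # well-formed, has non-ASCII
print(looks_like_utf8(name.encode("latin-1")))  # lone 0xFC byte is malformed UTF-8
print(looks_like_utf8(b"plain ASCII"))          # no evidence
```

Treating pure ASCII as "no evidence" is a choice made in this sketch; a
real detector might instead fall back to its default encoding there.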

It may depend on how much actual non-ASCII UTF-8 appears, and how early
in the file (US-ASCII alone is already valid UTF-8, so it gives no
hint). It may be that the 'ü' in my name is not enough in some cases,
but that the Japanese name of my University helped. But it would be
difficult to require everybody (or their organizations) to adopt a
non-Latin name just for the purpose of UTF-8 detection.
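
To make "how much, how early" concrete, here is a rough sketch in Python.
It assumes a detector that only sniffs a fixed-size prefix of the file;
the 512-byte window and the function name non_ascii_evidence are
assumptions for illustration, not the behaviour of any particular
program.

```python
def non_ascii_evidence(data: bytes, window: int = 512):
    """Return (offset of first non-ASCII byte, number of non-ASCII bytes)
    within the first `window` bytes. The offset is None if the window
    is pure US-ASCII."""
    head = data[:window]
    offsets = [i for i, b in enumerate(head) if b >= 0x80]
    first = offsets[0] if offsets else None
    return first, len(offsets)


# A single 'ü' contributes only two non-ASCII bytes...
print(non_ascii_evidence("Martin J. Dürst".encode("utf-8")))
# ...while six Kanji contribute eighteen (three bytes each).
print(non_ascii_evidence("青山学院大学".encode("utf-8")))
```

A detector with a threshold on the byte count would accept the second
input and might reject the first, which would match the observation
above.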

I can easily make available a version with BOM, and/or a version without 
my University in Kanji, but I'll only tell this list. I don't want to 
have umpteen minor variants mentioned in the draft itself.
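
For anyone who wants to produce such variants themselves, a small sketch
in Python, using its standard "utf-8-sig" codec; the sample string and
the helper name strip_bom are placeholders, not the actual draft or
tooling.

```python
import codecs

sample = "Martin J. Dürst"  # placeholder text, not the draft itself

without_bom = sample.encode("utf-8")
with_bom = sample.encode("utf-8-sig")  # same bytes, prefixed with EF BB BF


def strip_bom(data: bytes) -> bytes:
    """Remove a leading UTF-8 BOM if present; otherwise return data unchanged."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


print(with_bom[:3] == codecs.BOM_UTF8)
print(strip_bom(with_bom) == without_bom)
```

Decoding with "utf-8-sig" accepts both variants, dropping the BOM when
it is there.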

Regards,   Martin.
