[rfc-i] draft-flanagan-rfc-framework-00 and byte order mark (BOM)

Heather Flanagan (RFC Series Editor) rse at rfc-editor.org
Thu Sep 11 12:03:47 PDT 2014


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 9/11/14, 11:59 AM, Russ Housley wrote:
> Heather:
>
>>> In the discussion of plain text files,
>>> draft-flanagan-rfc-framework-00 says:
>>>
>>> o  A Byte Order Mark (BOM) will be added at the start of each file
>>>
>>>
>>> This seems like it will hinder transition because many editors
>>> will display the BOM as a few nonsensical characters.
>>>
>>> The Unicode Standard permits the BOM in UTF-8; however, it does
>>> not require or even recommend its use.  So, the Unicode Standard
>>> does not seem to be the reason to include a BOM.
>>>
>>> I think we should have a UTF-8 file that is most likely to be
>>> consumed by widely deployed plaintext editors.
>>>
>>
>> As you might expect, discussion of whether or not to include a BOM was
>> an active topic within the design team.  Thanks to testing by Dave
>> Thaler, we concluded that including a BOM would allow for the widest
>> support possible for viewing the plain-text files.
>>
>> His research is included below, with permission:
>>
>> ========
>> I just ran a test with two UTF-8 files, one with a BOM and one without.
>>
>> In case you want to try them yourself, they're at
>>
>> http://research.microsoft.com/~dthaler/Utf8NoBom.txt
>>
>> http://research.microsoft.com/~dthaler/Utf8WithBom.txt
>>
>> They include Latin, Greek, and Cyrillic text.
>>
>> I tried opening them with a bunch of utilities, and browsers (opening
>> local files not using HTTP), and used browsershots.org to get
>> screenshots of HTTP access across many browsers and platforms.
>>
>> Note the HTTP server provides no charset information in its headers,
>> so it's up to the app to detect the encoding.
>>
>> I just copied the files to a generic web server, and we may expect
>> others would do the same with their own I-Ds and RFC mirrors.
>>
>> Results:
>>
>> 1) Some apps worked fine with both files.  These include Notepad,
>> Outlook, Word, File Explorer, and Visual Studio 2012.
>>
>> 2) Some apps failed with both files (probably written to be ASCII
>> only).  These include WinDiff, stevie (a vi clone), TextPad, the
>> Links browser (on Ubuntu), and the Konqueror browser (on Ubuntu).
>>
>> 3) Everything else, including almost all browsers, only displayed the
>> file correctly with the BOM
>>
>> This included:
>>
>> Windows apps: WordPad
>> Windows using local files (no HTTP): IE, Firefox, Chrome
>> Windows using HTTP: IE, Firefox, Chrome, Navigator
>> Mac OS X: Safari, Camino
>> Debian: Opera, Dillo
>> Ubuntu: Luakit, Iceape
>>
>> Conclusion: If we want people to use UTF-8 RFCs and I-Ds with existing
>> tools and browsers today, any UTF-8 text format needs to include a BOM.
>
> Thanks.  This is a solid analysis.
>
> It seems that the MIME type (text/plain vs. text/html with
> charset=utf-8) becomes quite important.  Since ASCII is a subset of
> UTF-8, maybe the answer is to always include the charset.  Otherwise,
> some database needs to know which charset is appropriate when
> delivering each .txt file.
>
>
Is there a way I don't know about to include charset in a .txt file?  I
didn't think there was...?
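
For reference, the only in-band signal a plain .txt file can carry is
the BOM itself, which in UTF-8 is the three bytes EF BB BF at the very
start of the file.  A minimal Python sketch of that idea follows.  It is
illustrative only and is not taken from the draft or from Dave's test;
the function names and the sample string are hypothetical, while
codecs.BOM_UTF8 and the "utf-8-sig" codec are part of the Python
standard library.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the byte string starts with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


def add_utf8_bom(data: bytes) -> bytes:
    """Prepend the UTF-8 BOM unless it is already present."""
    return data if has_utf8_bom(data) else codecs.BOM_UTF8 + data


def decode_utf8(data: bytes) -> str:
    """Decode UTF-8 text, dropping a leading BOM if there is one."""
    # "utf-8-sig" accepts input with or without a BOM.
    return data.decode("utf-8-sig")


# Hypothetical sample mixing Latin, Greek, and Cyrillic text.
sample = "Latin, Ελληνικά, Кириллица\n".encode("utf-8")
with_bom = add_utf8_bom(sample)
```

Both byte strings decode to the same text; they differ only in the
three leading bytes that a consuming application may use as an encoding
hint.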

- -Heather

-----BEGIN PGP SIGNATURE-----
Version: GnuPG/MacGPG2 v2.0.22 (Darwin)
Comment: GPGTools - http://gpgtools.org

iQEcBAEBAgAGBQJUEfITAAoJEER/xjINbZoGFqQH/2avcfiAuDhVT6abMkvO5Gg0
T54BOIGcQ9Y106qERwhRbLm7FdZMTgq+OeU30kWL9OwSvXpufEnUPPeCBjVm8TyU
y1eQIjW4DtUS6h0BtvpBGlWEH5PkCEMUfo9F/37RkwH54L4BmMN0nfeAfyg/jKSu
2jp+OsFcGz2I0T9UAGkWPcntSM4V36pfm3lsXYJO9piqi8OBEgahuYfuyYDrw4HZ
FY9e4LfopIkDVdIHTfRoewAdWjiyDWeNDcHB628XX5mRAO0sn2Wyh8aVgz3W30Tl
DOp9TNBwxq/VF3J/ham6Mn6w+QCoD8f6Vx/cNe6dE/wgepRW7XXYBgd8ROH7Zj0=
=KLLp
-----END PGP SIGNATURE-----
