[rfc-i] draft-flanagan-rfc-framework-00 and byte order mark (BOM)

Russ Housley housley at vigilsec.com
Thu Sep 11 11:59:19 PDT 2014


Heather:

>> In the discussion of plain text files,
>> draft-flanagan-rfc-framework-00 says:
>> 
>> o  A Byte Order Mark (BOM) will be added at the start of each file
>> 
>> 
>> This seems like it will hinder transition because many editors
>> will display the BOM as a few nonsensical characters.
>> 
>> The Unicode Standard permits the BOM in UTF-8; however, it does
>> not require or even recommend its use.  So the Unicode Standard
>> does not seem to be the reason to include a BOM.
>> 
>> I think we should have a UTF-8 file that is most likely to be 
>> consumed by widely deployed plaintext editors.
>> 
> 
> As you might expect, discussion of whether or not to include a BOM was
> an active topic within the design team.  Thanks to testing by Dave
> Thaler, we concluded that including a BOM would allow for the widest
> support possible for viewing the plain-text files.
> 
> His research is included below, with permission:
> 
> ========
> I just ran a test with two UTF-8 files, one with a BOM and one without.
> 
> In case you want to try them yourself, they're at
> 
> http://research.microsoft.com/~dthaler/Utf8NoBom.txt
> 
> http://research.microsoft.com/~dthaler/Utf8WithBom.txt
> 
> They include Latin, Greek, and Cyrillic text.
> 
> I tried opening them with a bunch of utilities, and browsers (opening
> local files not using HTTP), and used browsershots.org to get
> screenshots of HTTP access across many browsers and platforms.
> 
> Note the HTTP server provides no charset information in its headers,
> so it's up to the app to detect the encoding.
> 
> I just copied the files to a generic web server, and we may expect
> others would do the same with their own I-Ds and RFC mirrors.
> 
> Results:
> 
> 1) Some apps worked fine with both files.  These include things like
> Notepad, Outlook, Word, File Explorer, and Visual Studio 2012.
> 
> 2) Some apps failed with both files (probably written to be ASCII
> only).  These include things like Windiff, stevie (a vi clone),
> textpad, the Links browser (on Ubuntu), and the Konqueror browser
> (on Ubuntu).
> 
> 3) Everything else, including almost all browsers, only displayed the
> file correctly with the BOM
> 
> This included:
> 
> Windows apps: Wordpad
> Windows using local files (no HTTP): IE, Firefox, Chrome
> Windows using HTTP: IE, Firefox, Chrome, Navigator
> Mac OS X: Safari, Camino
> Debian: Opera, Dillo
> Ubuntu: Luakit, Iceape
> 
> Conclusion: If we want people to use UTF-8 RFCs and I-Ds with existing
> tools and browsers today, any UTF-8 text format needs to include a BOM.

Thanks.  This is a solid analysis.
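
For anyone who wants to see what is at stake in bytes, here is a minimal Python sketch.  The helper names (has_utf8_bom, add_utf8_bom) and the sample text are my own and purely illustrative; they are not part of the draft or of Dave's test.  It shows the three-byte UTF-8 BOM, how a UTF-8 aware reader can drop it, and the characters a Latin-1 editor would display in its place.

```python
import codecs

def has_utf8_bom(data: bytes) -> bool:
    """Return True if the byte string starts with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)

def add_utf8_bom(data: bytes) -> bytes:
    """Prepend the UTF-8 BOM unless one is already present."""
    return data if has_utf8_bom(data) else codecs.BOM_UTF8 + data

# Hypothetical sample containing Latin, Greek, and Cyrillic letters.
sample = "Latin \u00e9, Greek \u03bb, Cyrillic \u0416\n".encode("utf-8")
with_bom = add_utf8_bom(sample)

# A UTF-8 aware reader can strip the BOM with the "utf-8-sig" codec.
decoded = with_bom.decode("utf-8-sig")

# An editor that assumes Latin-1 renders the BOM as three stray characters.
stray = with_bom[:3].decode("latin-1")

if __name__ == "__main__":
    print(with_bom[:3].hex())          # the BOM bytes in hex
    print(decoded == sample.decode("utf-8"))
    print(repr(stray))
```

The last value is the kind of "nonsensical characters" I was worried about in my original note.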

It seems that the MIME type (text/plain vs. text/html with charset=utf-8) becomes quite important.  Since ASCII is a subset of UTF-8, maybe the answer is to always include the charset.  Otherwise, some database needs to know which charset is appropriate when delivering each .txt file.
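
To make the "always include the charset" idea concrete, here is a rough Python sketch.  The function name and the sample bytes are hypothetical, and this is only an illustration of the reasoning, not a proposal for how any particular server should be configured.  Because every ASCII file is also valid UTF-8, a server could label every .txt file the same way without consulting a per-file database.

```python
def content_type_for_txt() -> str:
    """Hypothetical helper: one Content-Type value for every .txt file.

    Assumes all served files are either pure ASCII or UTF-8.
    """
    return "text/plain; charset=utf-8"

# A legacy ASCII-only file (sample text is illustrative).
ascii_file = b"Network Working Group\n"

# A newer UTF-8 file containing a non-ASCII character.
utf8_file = "Caf\u00e9\n".encode("utf-8")

# ASCII bytes decode to the same text under either label, so declaring
# charset=utf-8 for the legacy file loses nothing.
same_text = ascii_file.decode("ascii") == ascii_file.decode("utf-8")

if __name__ == "__main__":
    for body in (ascii_file, utf8_file):
        print("Content-Type:", content_type_for_txt())
        print(body.decode("utf-8"), end="")
    print(same_text)
```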

Russ




