[rfc-i] draft-flanagan-rfc-framework-00 and byte order mark (BOM)

Tim Bray tbray at textuality.com
Thu Sep 11 12:31:03 PDT 2014


Very few things about the Web work properly without Internet Media Types;
that’s why we put so much work into them here at the IETF.  If you don’t
have one, you’re relying on the client software sniffing inside the
document to figure out what it is. This is a bad, insecure practice.
Fortunately, on almost every web server, it is easy to set up content
types, and in most cases the right thing happens by default.  For example,
virtually every web server I’ve been near will by default serve files whose
names end in “.txt” with Content-type: text/plain, and then everything will
work.  I was shocked when I probed with curl: you’re right, there is no
Content-type header, which I’d call broken.  (Also, both files are now
empty…)

Anyhow, the notion that we should try to make our stuff work without
Internet Media Types by encouraging client sniffing is really lame.
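
To make the point concrete, here is a minimal Python sketch of what a
client can do with a declared media type instead of sniffing.  The function
name describe_content_type is hypothetical, and parsing the header value
with the standard library's email.message.Message is just one convenient
approach, not the only one.

```python
from email.message import Message


def describe_content_type(header_value):
    """Return (media_type, charset) for a Content-Type header value.

    Both items are None when the server sent no Content-Type header;
    charset alone is None when the header carries no charset parameter.
    """
    if not header_value:
        # No header at all: the client would be left to sniff the body.
        return (None, None)
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_type() and get_content_charset() both lower-case results.
    return (msg.get_content_type(), msg.get_content_charset())


if __name__ == "__main__":
    print(describe_content_type("text/plain; charset=UTF-8"))
    print(describe_content_type("text/plain"))
    print(describe_content_type(None))
```

With a header such as "text/plain; charset=UTF-8" the client knows both
what the document is and how to decode it.  With no header, as in the curl
probe above, it knows neither.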

On Thu, Sep 11, 2014 at 12:03 PM, Heather Flanagan (RFC Series Editor) <
rse at rfc-editor.org> wrote:

>
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On 9/11/14, 11:59 AM, Russ Housley wrote:
> > Heather:
> >
> > >>> In the discussion of plain text files,
> >>> draft-flanagan-rfc-framework-00 says:
> >>>
> >>> o  A Byte Order Mark (BOM) will be added at the start of each file
> >>>
> >>>
> >>> This seems like it will hinder transition because many editors
> >>> will display the BOM as a few nonsensical characters.
> >>>
> >>> The Unicode Standard permits the BOM in UTF-8; however, it does
> > >>> not require or even recommend its use.  So, the Unicode Standard
> > >>> does not seem to be the reason to include a BOM.
> >>>
> >>> I think we should have a UTF-8 file that is most likely to be
> >>> consumed by widely deployed plaintext editors.
> >>>
> >>
> >> As you might expect, discussion of whether or not to include a BOM was
> >> an active topic within the design team.  Thanks to testing by Dave
> >> Thaler, we concluded that including a BOM would allow for the widest
> >> support possible for viewing the plain-text files.
> >>
> >> His research is included below, with permission:
> >>
> >> ========
> >> I just ran a test with two UTF-8 files, one with a BOM and one without.
> >>
> >> In case you want to try them yourself, they're at
> >>
> >> http://research.microsoft.com/~dthaler/Utf8NoBom.txt
> >>
> >> http://research.microsoft.com/~dthaler/Utf8WithBom.txt
> >>
> >> They include Latin, Greek, and Cyrillic text.
> >>
> >> I tried opening them with a bunch of utilities, and browsers (opening
> >> local files not using HTTP), and used browsershots.org to get
> >> screenshots of HTTP access across many browsers and platforms.
> >>
> >> Note the HTTP server provides no character encoding (charset)
> >> information in its headers, so it's up to the app to detect.
> >>
> >> I just copied the files to a generic web server, and we may expect
> >> others would do the same with their own I-Ds and RFC mirrors.
> >>
> >> Results:
> >>
> >> 1) Some apps worked fine with both files.  These include things like
> >> notepad, outlook, Word, file explorer, Visual Studio 2012
> >>
> >> 2) Some apps failed with both files (probably written to be ASCII
> >> only). These include things like Windiff, stevie (a vi clone),
> >> textpad, the Links browser (on Ubuntu), and the Konqueror browser
> >> (on Ubuntu)
> >>
> >> 3) Everything else, including almost all browsers, only displayed the
> >> file correctly with the BOM
> >>
> >> This included:
> >>
> >> Windows apps: Wordpad
> >> Windows using local files (no HTTP): IE, Firefox, Chrome
> >> Windows using HTTP: IE, Firefox, Chrome, Navigator
> >> Mac OSX: Safari, Camino
> >> Debian: Opera, Dillo
> >> Ubuntu: Luakit, Iceape
> >>
> >> Conclusion: If we want people to use UTF-8 RFCs and I-Ds with existing
> >> tools and browsers today, any UTF-8 text format needs to include a BOM.
> >
> > Thanks.  This is a solid analysis.
> >
> > It seems that the MIME type (text/plain vs. text/html with
> > charset=utf-8) becomes quite important.  Since ASCII is a subset of
> > UTF-8, maybe the answer is to always include the charset.  Otherwise,
> > some database needs to know which charset is appropriate when delivering
> > each .txt file.
> >
> >
> Is there a way I don't know about to include charset in a .txt file?  I
> didn't think there was...?
>
> - -Heather
>
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG/MacGPG2 v2.0.22 (Darwin)
> Comment: GPGTools - http://gpgtools.org
>
> iQEcBAEBAgAGBQJUEfITAAoJEER/xjINbZoGFqQH/2avcfiAuDhVT6abMkvO5Gg0
> T54BOIGcQ9Y106qERwhRbLm7FdZMTgq+OeU30kWL9OwSvXpufEnUPPeCBjVm8TyU
> y1eQIjW4DtUS6h0BtvpBGlWEH5PkCEMUfo9F/37RkwH54L4BmMN0nfeAfyg/jKSu
> 2jp+OsFcGz2I0T9UAGkWPcntSM4V36pfm3lsXYJO9piqi8OBEgahuYfuyYDrw4HZ
> FY9e4LfopIkDVdIHTfRoewAdWjiyDWeNDcHB628XX5mRAO0sn2Wyh8aVgz3W30Tl
> DOp9TNBwxq/VF3J/ham6Mn6w+QCoD8f6Vx/cNe6dE/wgepRW7XXYBgd8ROH7Zj0=
> =KLLp
> -----END PGP SIGNATURE-----
>
> _______________________________________________
> rfc-interest mailing list
> rfc-interest at rfc-editor.org
> https://www.rfc-editor.org/mailman/listinfo/rfc-interest
>
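
For anyone who wants to check their own files: the UTF-8 BOM discussed in
the quoted thread is the three bytes EF BB BF at the start of the file.
Here is a minimal Python sketch for detecting, adding, and transparently
decoding it.  The helper names are hypothetical, and this is only an
illustration, not the tooling the design team used.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """True if the byte string starts with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


def add_utf8_bom(data: bytes) -> bytes:
    """Prepend a UTF-8 BOM unless one is already present."""
    return data if has_utf8_bom(data) else codecs.BOM_UTF8 + data


def decode_utf8(data: bytes) -> str:
    """Decode UTF-8 text, dropping a leading BOM if there is one."""
    # "utf-8-sig" behaves like "utf-8" but skips an initial BOM when decoding.
    return data.decode("utf-8-sig")


if __name__ == "__main__":
    sample = "Latin, Ελληνικά, Кириллица".encode("utf-8")
    with_bom = add_utf8_bom(sample)
    print(has_utf8_bom(sample), has_utf8_bom(with_bom))
    print(decode_utf8(with_bom) == decode_utf8(sample))
```

A tool that instead decodes those three bytes as Latin-1 shows them as
"ï»¿", which is the kind of stray characters Russ describes.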



-- 
- Tim Bray (If you’d like to send me a private message, see
https://keybase.io/timbray)