[rfc-i] Unicode or UTF-8

Dave Thaler dthaler at microsoft.com
Thu Mar 29 02:45:16 PDT 2012

Paul Hoffman writes: 
> On Mar 28, 2012, at 6:57 PM, Tim Bray wrote:
> > I confess that I can never resist a chance at character-encoding
> > pedantry.  The BOM is actually not there to identify UTF-8, it's there
> > because the BOM character exists to help sort out byte order in other
> > encodings that actually have byte-order issues (UTF-8 doesn't) and
> > since it's a Unicode character, there's a UTF-8 encoding for it.  The
> > issue of how you identify the encoding of a chunk of bytes,
> > particularly in the Web context, is a vexed one, particularly with
> > XML, which makes the encoding of a document self-identifying; so
> > should you believe what the doc says about itself, or the server's
> > opinion as expressed in the Content-type; but I digress... -T
> +1. RFC 3829 says that using the BOM in a UTF-8 file "is useless". Let's not go
> there.

I assume you meant RFC 3629, not RFC 3829.

I'm trying to find the statement you refer to, and so far I'm not seeing it.
I can find a statement that identifying *byte-order* is useless, but that's
far from saying the BOM in a UTF-8 file is useless.

I was paraphrasing from the Unicode FAQ. The exact text is:
"a BOM can be used as a signature no matter how the Unicode text is 
transformed: UTF-16, UTF-8, or UTF-32. The exact bytes comprising the 
BOM will be whatever the Unicode character U+FEFF is converted into 
by that transformation format. In that form, the BOM serves to indicate
both that it is a Unicode file, and which of the formats it is in."
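
As a rough illustration of the quoted passage (my own sketch in Python, not
part of the FAQ text), the snippet below encodes U+FEFF in each transformation
format and prints the resulting bytes. The helper name bom_bytes is made up
for this example; the codec names are the standard Python ones.

```python
# Sketch: show what bytes U+FEFF (the BOM character) turns into under
# each Unicode transformation format. bom_bytes is a hypothetical helper.
BOM = "\ufeff"

def bom_bytes(encoding: str) -> bytes:
    # The explicit -be/-le codecs do not add a BOM of their own, so the
    # result is exactly the encoding of the single character U+FEFF.
    return BOM.encode(encoding)

for enc in ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le"):
    print(f"{enc:10} {bom_bytes(enc).hex(' ')}")
```

For UTF-8 this gives the three bytes EF BB BF, which carry no byte-order
information but still work as a signature.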

The last sentence matches what I was referring to.  The fact that the BOM
tells the byte order of UTF-16 and UTF-32 is not relevant here; what matters
is that it indicates the file is encoded in UTF-8 and not some odd ANSI or
other encoding, without relying on something external like a .utf8
filename extension.

The UTF-8 BOM in files is often used for that purpose.
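
A minimal sketch of that kind of signature check, assuming Python and its
standard codecs module (the function name sniff_bom and the table are mine,
for illustration only):

```python
import codecs

# Check longer signatures first: the UTF-32-LE BOM (FF FE 00 00) begins
# with the UTF-16-LE BOM (FF FE), so order matters.
_SIGNATURES = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

def sniff_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if absent."""
    for signature, name in _SIGNATURES:
        if data.startswith(signature):
            return name
    return None  # no signature; the caller needs some other hint

print(sniff_bom(b"\xef\xbb\xbfhello"))  # file starting with the UTF-8 BOM
print(sniff_bom(b"hello"))              # no BOM present
```

This is only a heuristic: a missing BOM says nothing about the encoding, and
UTF-16-LE text whose first character after the BOM is U+0000 would be
misread as UTF-32-LE.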

