[rfc-i] Unicode or UTF-8

Paul Hoffman paul.hoffman at vpnc.org
Thu Mar 29 04:56:31 PDT 2012


On Mar 29, 2012, at 11:45 AM, Dave Thaler wrote:

> Paul Hoffman writes: 
>> On Mar 28, 2012, at 6:57 PM, Tim Bray wrote:
>>> I confess that I can never resist a chance at character-encoding
>>> pedantry.  The BOM is actually not there to identify UTF-8, it's there
>>> because the BOM character exists to help sort out byte order in other
>>> encodings that actually have byte-order issues (UTF-8 doesn't) and
>>> since it's a Unicode character, there's a UTF-8 encoding for it.  The
>>> issue of how you identify the encoding of a chunk of bytes,
>>> particularly in the Web context, is a vexed one, particularly with
>>> XML, which makes the encoding of a document self-identifying; so
>>> should you believe what the doc says about itself, or the server's
>>> opinion as expressed in the Content-type; but I digress... -T
>> 
>> +1. RFC 3829 says that using the BOM in a UTF-8 file "is useless". Let's not go
>> there.
> 
> I assume you meant RFC 3629, not RFC 3829.

Yes.

> I'm trying to find the statement you refer to, and so far I'm not seeing it.

First paragraph of Section 6:
   The UCS character U+FEFF "ZERO WIDTH NO-BREAK SPACE" is also known
   informally as "BYTE ORDER MARK" (abbreviated "BOM").  This character
   can be used as a genuine "ZERO WIDTH NO-BREAK SPACE" within text, but
   the BOM name hints at a second possible usage of the character:  to
   prepend a U+FEFF character to a stream of UCS characters as a
   "signature".  A receiver of such a serialized stream may then use the
   initial character as a hint that the stream consists of UCS
   characters and also to recognize which UCS encoding is involved and,
   with encodings having a multi-octet encoding unit, as a way to
   recognize the serialization order of the octets.  UTF-8 having a
   single-octet encoding unit, this last function is useless and the BOM
   will always appear as the octet sequence EF BB BF.
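
To illustrate that last sentence, here is a minimal sketch in Python. It is not part of the RFC, and the helper name bom_octets is hypothetical. It encodes U+FEFF under several UCS encoding forms and prints the resulting octets. The explicit -be/-le codec names are used so that Python does not prepend a BOM of its own.

```python
def bom_octets(encoding: str) -> str:
    """Return the octets of U+FEFF in the given encoding, as spaced hex."""
    return " ".join(f"{b:02X}" for b in "\ufeff".encode(encoding))


# Only the encodings with a multi-octet encoding unit give an
# order-dependent result; UTF-8 has a single possible octet sequence.
for enc in ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le"):
    print(f"{enc:10} {bom_octets(enc)}")
```

For UTF-8 this prints EF BB BF, matching the sequence the RFC gives.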


> I can find a statement that identifying *byte-order* is useless, but that's
> far from saying the BOM in a UTF-8 file is useless.

Using the character in a file is not useless; using it as a byte order mark is useless.

> I was paraphrasing from the Unicode FAQ:
> http://www.unicode.org/faq/utf_bom.html#bom4
> 
> The exact text is
> "a BOM can be used as a signature no matter how the Unicode text is 
> transformed: UTF-16, UTF-8, or UTF-32. The exact bytes comprising the 
> BOM will be whatever the Unicode character U+FEFF is converted into 
> by that transformation format. In that form, the BOM serves to indicate
> both that it is a Unicode file, and which of the formats it is in."
> 
> The last sentence matches what I was referring to.   The fact that it tells
> byte-order of UTF-16 and UTF-32 is not relevant here, just the fact that
> it indicates that the file is encoded in UTF-8 and not some odd ANSI or
> whatever encoding, without relying on something else like a .utf8
> filename extension.
> 
> The UTF-8 BOM in files is often used for that purpose.


I think you are saying that a BOM at the beginning of a UTF-8 file is good for encoding-sniffing. That's true, but recommending it may also be proposing a bad practice.
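
For readers unfamiliar with the term, a rough sketch of BOM-based encoding-sniffing in Python follows. The function name sniff_bom and the returned labels are assumptions made for illustration, not anything defined in RFC 3629 or the Unicode FAQ. The BOM constants come from Python's standard codecs module.

```python
import codecs
from typing import Optional

# UTF-32 must be checked before UTF-16: the UTF-32-LE BOM (FF FE 00 00)
# begins with the UTF-16-LE BOM (FF FE).
_SIGNATURES = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def sniff_bom(data: bytes) -> Optional[str]:
    """Guess an encoding from a leading BOM; None means no BOM was found.

    This is only a hint. FF FE 00 00 could also be UTF-16-LE text that
    starts with U+0000, and None says nothing about the real encoding.
    """
    for signature, label in _SIGNATURES:
        if data.startswith(signature):
            return label
    return None


sample = codecs.BOM_UTF8 + "caf\u00e9".encode("utf-8")
print(sniff_bom(sample))            # label guessed from the signature
print(repr(sample.decode("utf-8")))      # U+FEFF stays in the text
print(repr(sample.decode("utf-8-sig")))  # signature stripped on decode
```

One practical consequence the sketch shows: a plain "utf-8" decode keeps the U+FEFF as the first character of the text, so a consumer that does not expect the signature has to strip it explicitly (Python's "utf-8-sig" codec does this).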

--Paul Hoffman


