[rfc-i] Unicode or UTF-8

Tony Hansen tony at att.com
Thu Mar 29 05:56:59 PDT 2012


If it's utf8-encoded, it doesn't matter whether the document starts with 
a BOM character or not. After all, it really is just a zero-width 
no-break space character (U+FEFF), and that's exactly how it should be 
treated in a utf8-encoded document: a character that doesn't really do 
anything.
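
To make that concrete, here is a minimal Python sketch of a reader that
treats a leading BOM as a no-op. The helper name decode_utf8_ignoring_bom
is made up for illustration; the "utf-8-sig" codec and codecs.BOM_UTF8
are part of the Python standard library.

```python
import codecs


def decode_utf8_ignoring_bom(data: bytes) -> str:
    """Decode UTF-8 bytes, dropping one leading BOM (EF BB BF) if present."""
    # "utf-8-sig" skips a single leading U+FEFF signature on decode and
    # otherwise behaves like plain "utf-8".
    return data.decode("utf-8-sig")


with_bom = codecs.BOM_UTF8 + "RFC text".encode("utf-8")
without_bom = "RFC text".encode("utf-8")

# Both inputs decode to the same string.
same = decode_utf8_ignoring_bom(with_bom) == decode_utf8_ignoring_bom(without_bom)

# A plain "utf-8" decode keeps the BOM as an ordinary U+FEFF character.
first_char = with_bom.decode("utf-8")[0]
```

With the plain "utf-8" codec the character survives decoding, so a
consumer that wants it ignored has to skip it explicitly.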

     Tony

On 3/29/2012 8:49 AM, Joe Hildebrand wrote:
> Strongly recommend that whatever format we pick, it's ALWAYS utf8-encoded,
> so no need for a BOM.  If it's HTML, I'd recommend adding:
>
> <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
>
> to the header section, in case the file is loaded directly from disk.  In
> such cases there's no chance for HTTP headers to help out.
>
>
> On 3/29/12 1:56 PM, "Paul Hoffman"<paul.hoffman at vpnc.org>  wrote:
>
>> On Mar 29, 2012, at 11:45 AM, Dave Thaler wrote:
>>
>>> Paul Hoffman writes:
>>>> On Mar 28, 2012, at 6:57 PM, Tim Bray wrote:
>>>>> I confess that I can never resist a chance at character-encoding
>>>>> pedantry.  The BOM is actually not there to identify UTF-8, it's there
>>>>> because the BOM character exists to help sort out byte order in other
>>>>> encodings that actually have byte-order issues (UTF-8 doesn't) and
>>>>> since it's a Unicode character, there's a UTF-8 encoding for it.  The
>>>>> issue of how you identify the encoding of a chunk of bytes,
>>>>> particularly in the Web context, is a vexed one, particularly with
>>>>> XML, which makes the encoding of a document self-identifying; so
>>>>> should you believe what the doc says about itself, or the server's
>>>>> opinion as expressed in the Content-type; but I digress... -T
>>>> +1. RFC 3829 says that using the BOM in a UTF-8 file "is useless". Let's
>>>> not go there.
>>> I assume you meant RFC 3629, not RFC 3829.
>> Yes.
>>
>>> I'm trying to find the statement you refer to, and so far I'm not seeing it.
>> First paragraph of Section 6:
>>     The UCS character U+FEFF "ZERO WIDTH NO-BREAK SPACE" is also known
>>     informally as "BYTE ORDER MARK" (abbreviated "BOM").  This character
>>     can be used as a genuine "ZERO WIDTH NO-BREAK SPACE" within text, but
>>     the BOM name hints at a second possible usage of the character:  to
>>     prepend a U+FEFF character to a stream of UCS characters as a
>>     "signature".  A receiver of such a serialized stream may then use the
>>     initial character as a hint that the stream consists of UCS
>>     characters and also to recognize which UCS encoding is involved and,
>>     with encodings having a multi-octet encoding unit, as a way to
>>     recognize the serialization order of the octets.  UTF-8 having a
>>     single-octet encoding unit, this last function is useless and the BOM
>>     will always appear as the octet sequence EF BB BF.
>>
>>
>>> I can find a statement that identifying *byte-order* is useless, but that's
>>> far from saying the BOM in a UTF-8 file is useless.
>> Using the character in a file is not useless; using it as a byte order mark is
>> useless.
>>
>>> I was paraphrasing from the Unicode FAQ:
>>> http://www.unicode.org/faq/utf_bom.html#bom4
>>>
>>> The exact text is
>>> "a BOM can be used as a signature no matter how the Unicode text is
>>> transformed: UTF-16, UTF-8, or UTF-32. The exact bytes comprising the
>>> BOM will be whatever the Unicode character U+FEFF is converted into
>>> by that transformation format. In that form, the BOM serves to indicate
>>> both that it is a Unicode file, and which of the formats it is in."
>>>
>>> The last sentence matches what I was referring to.   The fact that it tells
>>> byte-order of UTF-16 and UTF-32 is not relevant here, just the fact that
>>> it indicates that the file is encoded in UTF-8 and not some odd ANSI or
>>> whatever encoding, without relying on something else like a .utf8
>>> filename extension.
>>>
>>> The UTF-8 BOM in files is often used for that purpose.
>>
>> I think you are saying that a BOM at the beginning of a UTF-8 file is good for
>> encoding-sniffing. That's true, but it may also be promoting a bad practice.
>>
>> --Paul Hoffman
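
For reference, the encoding-sniffing use discussed above could look
roughly like the following Python sketch. The function name sniff_bom and
the table layout are assumptions made for illustration; the BOM constants
come from the standard codecs module. It only reports what a leading
signature suggests and returns (None, 0) when there is none.

```python
import codecs

# Longer signatures are checked first: the UTF-32-LE BOM (FF FE 00 00)
# begins with the same two bytes as the UTF-16-LE BOM (FF FE).
_SIGNATURES = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return (encoding_name, bom_length) from a leading BOM, else (None, 0)."""
    for signature, name in _SIGNATURES:
        if data.startswith(signature):
            return name, len(signature)
    return None, 0


# Example: a UTF-8 file that starts with EF BB BF.
encoding, skip = sniff_bom(b"\xef\xbb\xbfhello")
text = b"\xef\xbb\xbfhello"[skip:].decode(encoding)
```

This is only a hint, not proof: UTF-16-LE text whose first character
after the BOM is U+0000 would be misread as UTF-32-LE, and a file with no
BOM tells the sniffer nothing.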