[rfc-i] Unicode or UTF-8

Tim Bray tbray at textuality.com
Wed Mar 28 09:57:43 PDT 2012


I confess that I can never resist a chance at character-encoding
pedantry.  The BOM is actually not there to identify UTF-8.  It exists
to help sort out byte order in other encodings that actually have
byte-order issues (UTF-8 doesn’t), and since it’s a Unicode character,
there’s a UTF-8 encoding for it.  The issue of how you identify the
encoding of a chunk of bytes is a vexed one, particularly in the Web
context and particularly with XML, which makes the encoding of a
document self-identifying.  So should you believe what the doc says
about itself, or the server’s opinion as expressed in the
Content-Type?  But I digress... -T
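
To make the point above concrete, here is a minimal Python sketch. It
assumes nothing beyond the standard library; the helper names
bom_bytes and strip_utf8_bom are hypothetical, chosen only for
illustration. It shows that the BOM is simply the character U+FEFF,
that its byte order differs between the two UTF-16 flavours, and that
its UTF-8 encoding is the fixed three-byte sequence EF BB BF quoted
below.

```python
import codecs

BOM = "\ufeff"  # the BOM is just the Unicode character U+FEFF


def bom_bytes():
    """Encode U+FEFF in several encodings (illustrative helper)."""
    return {
        "utf-8": BOM.encode("utf-8"),          # no byte-order issue
        "utf-16-be": BOM.encode("utf-16-be"),  # big-endian order
        "utf-16-le": BOM.encode("utf-16-le"),  # little-endian order
    }


def strip_utf8_bom(data: bytes) -> bytes:
    """Drop a leading UTF-8 BOM if present (illustrative helper)."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


if __name__ == "__main__":
    for name, raw in bom_bytes().items():
        print(name, raw.hex(" "))
    # Pure-ASCII text yields identical bytes under ASCII and UTF-8.
    print("hello".encode("ascii") == "hello".encode("utf-8"))
```

A file with no code points above 127 and no leading BOM is therefore
byte-for-byte the same in US-ASCII and UTF-8; only an optional BOM at
the start makes the two differ.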

On Wed, Mar 28, 2012 at 9:47 AM, Dave Thaler <dthaler at microsoft.com> wrote:
> Iljitsch van Beijnum writes:
> [...]
>> If we want to go beyond ASCII, UTF-8 is a no-brainer, because there is no
>> difference between a file that is in US ASCII and a file that is in UTF-8 but just
>> happens to have no code points > 127
> [...]
>
> Not entirely true.  A file that is in UTF-8 may start with a 3-byte BOM (EF BB BF)
> that identifies it as being encoded in UTF-8.
>
> -Dave

