[rfc-i] Byte order marks

John C Klensin john+rfc at jck.com
Wed Nov 5 15:56:26 PST 2008

--On Wednesday, 05 November, 2008 13:25:38 -0500 Tony Hansen
<tony at att.com> wrote:

> Some more random thoughts:
> While it would be best if we could just say ".txt means utf8",
> I'm becoming convinced that we won't get there. If we went the
> path of a .utf8 file, the .txt file *could* be considered
> secondary to the .utf8 and even auto-generated from the .utf8.
> Consider this scenario:
>   * I-D upload accepts .utf8 files as a primary source
>   * the .txt version is auto-generated,
>       o replacing each utf8 sequence with U+####
>       o add a note somewhere (say, as the very first line)
>         indicating that the authoritative version is the UTF8
>         version
> This could be a potential way forward.

FWIW, this makes a lot of sense to me, especially if
incorporated into a transition strategy in which we deploy the
UTF-8 versions as secondary first, get some experience with
them, and then switch what is primary.  Note that an obvious
variation of the model of auto-generating one file from the
other (or generating them in parallel) is that there could also
be a note in the UTF-8 file pointing to the text file with
appropriate prose, i.e., pointers in both directions.

This pairing would also satisfy the need I tried to describe in
an earlier note: having U+#### or some other notation available
to those who need the information in a UTF-8 file but who can,
for whatever reason, only see indications of undisplayable
characters.  Making that process automatic would also reduce the
potential for errors.
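
As a rough illustration of the automatic generation discussed
above, here is a minimal Python sketch.  It assumes that "each
utf8 sequence" means each non-ASCII character, that the note goes
on the very first line, and that a leading byte order mark should
be dropped rather than rendered.  The function name
to_ascii_fallback and the wording of the note are hypothetical;
nothing here reflects existing I-D submission tooling.

```python
# Hypothetical wording for the pointer to the authoritative file.
NOTE = "NOTE: the authoritative version of this document is the UTF-8 version."


def to_ascii_fallback(utf8_bytes: bytes, note: str = NOTE) -> str:
    """Return an ASCII-only rendering of a UTF-8 document.

    Each non-ASCII character is replaced by U+ followed by its code
    point in hexadecimal (at least four digits), and a note is placed
    on the first line.
    """
    # "utf-8-sig" decodes UTF-8 and silently drops a leading BOM if present.
    text = utf8_bytes.decode("utf-8-sig")

    pieces = []
    for ch in text:
        if ord(ch) < 128:
            pieces.append(ch)                    # plain ASCII passes through
        else:
            pieces.append("U+%04X" % ord(ch))    # e.g. U+00E9

    return note + "\n" + "".join(pieces)
```

Calling to_ascii_fallback("café".encode("utf-8")) would yield the
note line followed by "cafU+00E9".  A real generator would also
have to decide how to delimit the notation from adjacent text and
how to re-wrap lines that grow past the usual width; this sketch
does neither.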

