[rfc-i] Byte order marks
John C Klensin
john+rfc at jck.com
Wed Nov 5 15:56:26 PST 2008
--On Wednesday, 05 November, 2008 13:25:38 -0500 Tony Hansen
<tony at att.com> wrote:
> Some more random thoughts:
>
> While it would be best if we could just say ".txt means utf8",
> I'm becoming convinced that we won't get there. If we went the
> path of a .utf8 file, the .txt file *could* be considered
> secondary to the .utf8 and even auto-generated from the .utf8.
> Consider this scenario:
>
> * I-D upload accepts .utf8 files as a primary source
> * the .txt version is auto-generated,
>     o replacing each utf8 sequence with U+####
>     o add a note somewhere (say, as the very first line)
>       indicating that the authoritative version is the UTF8
>       version
>
> This could be a potential way forward.
FWIW, this makes a lot of sense to me, especially if
incorporated into a transition strategy in which we deploy the
UTF-8 versions as secondary first, get some experience with
them, and then switch what is primary. Note that an obvious
variation of the model of auto-generating one file from the
other (or generating them in parallel) is that there could also
be a note in the UTF-8 file pointing to the text file with
appropriate prose, i.e., pointers in both directions.
This pairing would also satisfy the need I tried to describe in
an earlier note: making U+#### or some other notation available
to those who need the information present in a UTF-8 file but
who can, for whatever reason, see only indications of
undisplayable characters. Making that process automatic would
also reduce the potential for errors.
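
As a rough illustration of the auto-generation step in Tony's
scenario, a minimal sketch in Python might look like the
following. The function name, the wording of the note, and the
sample file name are assumptions made for illustration only;
nothing in the proposal fixes them, and dropping a leading byte
order mark is likewise an assumption rather than something the
scenario specifies.

```python
def utf8_to_ascii_txt(utf8_bytes: bytes, utf8_name: str) -> str:
    """Render a UTF-8 document as ASCII text with U+#### notation.

    Hypothetical helper: the note wording and the handling of a
    leading byte order mark are assumptions, not part of the proposal.
    The note line stays ASCII only if utf8_name is itself ASCII.
    """
    # "utf-8-sig" decodes UTF-8 and drops a leading BOM if one is present.
    text = utf8_bytes.decode("utf-8-sig")

    # First line points the reader at the authoritative UTF-8 version.
    note = ("NOTE: The authoritative version of this document is "
            f"the UTF-8 file {utf8_name}.\n")

    # Keep ASCII characters as they are; replace every other code point
    # with U+ followed by at least four uppercase hex digits.
    body = "".join(
        ch if ord(ch) < 128 else f"U+{ord(ch):04X}"
        for ch in text
    )
    return note + body


if __name__ == "__main__":
    sample = "Zo\u00eb wrote na\u00efve".encode("utf-8")
    print(utf8_to_ascii_txt(sample, "draft-example-00.utf8"))
```

One limitation of this sketch is that a bare U+#### is not
self-delimiting when hex digits or letters follow it, so a real
converter would probably want some bracketing convention around
the notation.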
john