[rfc-i] UTF-8 and Unicode examples
Henning Schulzrinne
hgs at cs.columbia.edu
Mon May 3 13:50:34 PDT 2004
I suspect we're talking about (at least) two different things. I'm
specifically *not* suggesting that RFCs contain actual UTF-8 (or other
non-ASCII) symbols, for the reasons you allude to.
On the contrary, I'm hoping for a common "escape" notation that will
allow writers to indicate the presence of a non-ASCII character (e.g.,
encoded as UTF-8, as that's the most common case) while sticking to
ASCII characters. As noted, Unicode itself provides such a notation:
"U+" followed by the character's code point in hexadecimal. For
example, U+00B0 is the degree sign
(http://www.unicode.org/charts/PDF/U0080.pdf).
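To make the distinction between the code point and its encoding
concrete, here is a minimal sketch in Python. The helper name
describe() is made up for this illustration; it takes a U+ string and
returns the character, its Unicode name, and the octets of its UTF-8
encoding. The source is kept in pure ASCII.

```python
import unicodedata


def describe(notation):
    """Return (character, Unicode name, UTF-8 octets in hex) for 'U+XXXX'.

    describe() is a hypothetical helper for illustration only.
    """
    if not notation.upper().startswith("U+"):
        raise ValueError("expected a string of the form U+XXXX")
    code_point = int(notation[2:], 16)   # the hexadecimal part names the code point
    ch = chr(code_point)
    octets = " ".join(f"{b:02X}" for b in ch.encode("utf-8"))
    return ch, unicodedata.name(ch), octets


name, octets = describe("U+00B0")[1:]
print(name, octets)   # DEGREE SIGN C2 B0
```

Note that the UTF-8 octets (C2 B0) are not the same as the hexadecimal
digits in the U+ notation (00 B0).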
There are several plausible solutions:
(1) Simply state that the U+ notation is to be used, even though the
actual (UTF-8) encoding will not consist of the two octets written
(U+00FC is encoded in UTF-8 as the octets C3 BC). For example, one
might write
M+00FCnchen
to indicate the German city Muenchen (Munich).
(2) Designate another escape convention that indicates that a non-ASCII
UTF-8 sequence should be at that point in the document.
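As a rough sketch of how a reader or tool might process option (1),
the Python below expands escapes in an otherwise ASCII string and then
produces the UTF-8 octets. It rests on assumptions not settled above:
the escape is spelled with the full "U+" prefix followed by four to six
hexadecimal digits (so Munich would be written MU+00FCnchen), and the
function names expand_escapes() and to_utf8() are invented for this
example.

```python
import re

# Assumed escape syntax: "U+" followed by 4 to 6 hexadecimal digits.
_ESCAPE = re.compile(r"U\+([0-9A-Fa-f]{4,6})")


def expand_escapes(text):
    """Replace each assumed U+XXXX escape with the character it names."""
    def _substitute(match):
        code_point = int(match.group(1), 16)
        if code_point > 0x10FFFF:        # beyond the Unicode range: leave as-is
            return match.group(0)
        return chr(code_point)
    return _ESCAPE.sub(_substitute, text)


def to_utf8(text):
    """Expand the escapes, then encode the result as UTF-8 octets."""
    return expand_escapes(text).encode("utf-8")


print(to_utf8("MU+00FCnchen"))   # b'M\xc3\xbcnchen'
```

One weakness of an undelimited escape shows up here: if the text that
follows the escape begins with a letter from a to f or a digit, the
expander cannot tell where the code point ends. A convention along the
lines of option (2) could avoid that by delimiting the escape.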
Henning
Bob Braden wrote:
> The issue of extended character sets has been on the back-burner at the
> RFC Editor for the past several years. Yes, it "would be nice".
> Unfortunately, it appears to us that this path is full of deep, deep
> pits of non-interoperability! There are no widely-available tools for
> reading, searching, comparing, editing, or printing documents
> containing UTF-8 and friends, as far as we know. (We would be happy to
> be proven wrong.)
>
> Bob Braden