[rfc-i] UTF-8 and Unicode examples

Henning Schulzrinne hgs at cs.columbia.edu
Mon May 3 13:50:34 PDT 2004


I suspect we're talking about (at least) two different things. I'm 
specifically *not* suggesting that RFCs contain actual UTF-8 (or other 
non-ASCII) symbols, for the reasons you allude to.

Rather, I'm hoping for a common "escape" notation that will 
allow writers to indicate the presence of a non-ASCII character (e.g., 
encoded as UTF-8, as that's the most common case) while sticking to 
ASCII characters. As noted, Unicode itself provides such a notation, 
writing the code point in hexadecimal. For example, U+00B0 is the 
degree sign (http://www.unicode.org/charts/PDF/U0080.pdf).

There are several plausible solutions:

(1) Simply state that the U+ notation is to be used, even though its 
hexadecimal digits give the Unicode code point rather than the octets 
of the actual UTF-8 encoding. For example, one might write

MU+00FCnchen

to indicate the German city Muenchen (Munich).

(2) Designate another escape convention that indicates that a non-ASCII 
UTF-8 sequence should be at that point in the document.
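
To make option (1) concrete, here is a minimal Python sketch of how a 
reader's tool might expand the U+ notation and show the UTF-8 octets 
behind it. The function names are hypothetical, and the rule that an 
escape is "U+" followed by exactly four hexadecimal digits is an 
assumption made for this example, not something the notation itself 
guarantees.

```python
import re

# Assumption for this sketch: an escape is "U+" followed by exactly
# four hexadecimal digits (enough for U+00B0 and U+00FC).
_ESCAPE = re.compile(r"U\+([0-9A-Fa-f]{4})")


def expand_escapes(text: str) -> str:
    """Replace each U+XXXX escape with the character it names."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


def utf8_octets(text: str) -> str:
    """Return the UTF-8 encoding of text as space-separated hex octets."""
    return " ".join(f"{b:02X}" for b in text.encode("utf-8"))


city = expand_escapes("MU+00FCnchen")
print(len(city))                              # 7
print(utf8_octets(expand_escapes("U+00FC")))  # C3 BC
print(utf8_octets(expand_escapes("U+00B0")))  # C2 B0
```

The octets C3 BC differ from the digits 00FC written in the escape, 
which is the mismatch option (1) accepts. The fixed four-digit rule 
also matters: in running text a letter such as "c" or "e" directly 
after the escape is itself a hexadecimal digit, so a variable-length 
escape would be ambiguous. That is one possible argument for a 
delimited convention along the lines of option (2).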

Henning

Bob Braden wrote:

> The issue of extended character sets has been on the back-burner at the
> RFC Editor for the past several years.  Yes, it "would be nice".
> Unfortunately, it appears to us that this path is full of deep, deep
> pits of non-interoperability!  There are no widely-available tools for
> reading, searching, comparing, editing, or printing documents
> containing UTF-8 and friends, as far as we know. (We would be happy to
> be proven wrong.)
> 
> Bob Braden


