[rfc-i] UTF-8 and Unicode examples

Julian Reschke julian.reschke at gmx.de
Tue May 4 11:13:12 PDT 2004


Alex Rousskov wrote:
> On Tue, 4 May 2004, Henning Schulzrinne wrote:
> 
> 
>>Thanks for the pointer. I would suggest that the following convention be
>>adopted:
>>
>>- Unicode strings use the <U+1234,U+1234> notation, as in "M<U+00FC>nchen"
>>
>>- UTF-8 strings use the <xx xx> notation, where xx are hexadecimal digits.
>>
>>- The literal < just uses the Unicode rendition in those cases where
>>this can be misinterpreted, i.e., where it is followed by U+ or a hex digit.
>>
>>Does this work?
> 
> 
> Using [] instead of <> might be a good idea to reduce the number
> of confused applications that would try to XML-ify the escape
> sequence.
> 
> Should tools like xml2rfc accept/interpret raw UTF-8, the escape
> sequence above, or both? This matters because these tools produce both
> ASCII text and HTML versions of specs.

I'd find it very dangerous if tools like xml2rfc kept the non-ASCII
characters in HTML output but escaped them in TXT output. People
frequently check only the HTML output, but in the end what matters is
readable TXT output.

On the other hand, I think it would make a *lot* of sense to discuss 
allowing at least certain non-ASCII characters inside TXT versions 
(encoded as UTF-8).
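
For illustration, the escape convention quoted above could be sketched
roughly as follows. This is only a minimal sketch in Python; the function
names (escape_unicode, escape_utf8) and the helper are hypothetical and
not part of xml2rfc or any other tool. It also assumes that an ambiguous
literal "<" is written as <U+003C> in both notations, which is one
possible reading of the third rule.

```python
HEX_DIGITS = "0123456789abcdefABCDEF"


def _ambiguous_lt(text, i):
    """True if text[i] is a literal '<' followed by 'U+' or a hex digit."""
    if text[i] != "<":
        return False
    rest = text[i + 1:i + 3]
    return rest.startswith("U+") or (rest != "" and rest[0] in HEX_DIGITS)


def escape_unicode(text):
    """Write non-ASCII runs as <U+XXXX,U+XXXX> (code points)."""
    out, run = [], []
    for i, ch in enumerate(text):
        if ord(ch) > 0x7F or _ambiguous_lt(text, i):
            run.append("U+%04X" % ord(ch))
            continue
        if run:  # close the pending escape before copying ASCII
            out.append("<" + ",".join(run) + ">")
            run = []
        out.append(ch)
    if run:
        out.append("<" + ",".join(run) + ">")
    return "".join(out)


def escape_utf8(text):
    """Write non-ASCII runs as <xx xx> (UTF-8 bytes in hex)."""
    out, run = [], []
    for i, ch in enumerate(text):
        if ord(ch) > 0x7F:
            run.extend("%02x" % b for b in ch.encode("utf-8"))
            continue
        if run:  # close the pending escape before copying ASCII
            out.append("<" + " ".join(run) + ">")
            run = []
        # Assumption: an ambiguous '<' uses the Unicode rendition here too.
        out.append("<U+003C>" if _ambiguous_lt(text, i) else ch)
    if run:
        out.append("<" + " ".join(run) + ">")
    return "".join(out)


print(escape_unicode("M\u00fcnchen"))  # M<U+00FC>nchen
print(escape_utf8("M\u00fcnchen"))     # M<c3 bc>nchen
```

Note that the same character comes out as U+00FC in the first notation
but as the bytes c3 bc in the second, so the two are easy to mix up when
writing examples by hand.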

Best regards, Julian


-- 
<green/>bytes GmbH -- http://www.greenbytes.de -- tel:+492512807760


