[rfc-i] UTF-8 and Unicode examples
hgs at cs.columbia.edu
Tue May 4 07:19:12 PDT 2004
Thanks for the pointer. I would suggest that the following convention be used:
- Unicode strings use the <U+1234,U+1234> notation, as in "M<U+00FC>nchen"
- UTF-8 strings use the <xx xx> notation, where xx are hexadecimal digits.
- A literal < is written in its Unicode rendition, <U+003C>, in those cases where
it could be misinterpreted, i.e., where it is followed by U+ or a hex digit.
Does this work?
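
To make the proposal concrete, here is a rough Python sketch of how a string
could be rendered under this convention. The function names (needs_escape,
render, unicode_notation, utf8_notation) are made up for illustration. Two
points are assumptions on my part rather than settled parts of the proposal:
adjacent escaped characters are grouped into one bracket, and the UTF-8 form
writes an ambiguous < as its octet <3c>.

```python
import string

HEX_DIGITS = set(string.hexdigits)


def needs_escape(text, i):
    """True if text[i] must be written in bracket notation."""
    ch = text[i]
    if ord(ch) > 0x7F:
        return True  # any non-ASCII character
    if ch == "<":
        # A literal '<' is ambiguous only when followed by "U+" or a hex digit.
        nxt = text[i + 1:i + 3]
        return nxt == "U+" or nxt[:1] in HEX_DIGITS
    return False


def render(text, fmt_run):
    """Copy plain characters through; hand each run of escaped ones to fmt_run."""
    out, run = [], []
    for i, ch in enumerate(text):
        if needs_escape(text, i):
            run.append(ch)
            continue
        if run:
            out.append(fmt_run("".join(run)))
            run = []
        out.append(ch)
    if run:
        out.append(fmt_run("".join(run)))
    return "".join(out)


def unicode_notation(text):
    # <U+1234,U+1234>: code points, comma separated (assumed grouping).
    return render(
        text, lambda run: "<" + ",".join(f"U+{ord(c):04X}" for c in run) + ">"
    )


def utf8_notation(text):
    # <xx xx>: the UTF-8 octets of the run in lowercase hex.
    return render(
        text,
        lambda run: "<" + " ".join(f"{b:02x}" for b in run.encode("utf-8")) + ">",
    )


print(unicode_notation("M\u00fcnchen"))  # M<U+00FC>nchen
print(utf8_notation("M\u00fcnchen"))     # M<c3 bc>nchen
print(unicode_notation("a<b"))           # a<U+003C>b
print(unicode_notation("a < b"))         # a < b
```

A < followed by a space or a non-hex letter passes through untouched, so
ordinary prose such as "a < b" is not affected.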
Kurt D. Zeilenga wrote:
> Why not just use conventions established for use in the Unicode
> standard itself? For instance, M\u00FFnchen.
> Note that, at times, these conventions may interfere with the
> syntax of the protocol. In such cases, other conventions
> will have to be designed which do not interfere with the syntax
> of the protocol. For instance, if the protocol itself supports
> such escaping, then the specification has to say something like:
> The Unicode string "M<U+00FF>nchen", where <U+00FF> is
> the LATIN SMALL LETTER Y WITH DIAERESIS character,
> is transferred as the ASCII string "M\u00FFnchen".
> And sometimes, it's good to use base16 or base64 to
> show proper UTF-8 encoding. For instance, in a protocol
> which supports transfer of UTF-8 encoded Unicode:
> The Unicode string "M\u00FFnchen", where \u00FF is
> the LATIN SMALL LETTER Y WITH DIAERESIS character, is
> transferred as the UTF-8 encoded Unicode represented
> in the following octet sequence (base 16):
> 4d c3 bf 6e 63 68 65 6e
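
The two transfer forms quoted above can be checked with a short Python
sketch. The helper names (to_ascii_escaped, to_hex_octets) are hypothetical,
and the escaping helper is deliberately minimal: it assumes code points no
larger than U+FFFF and does not escape a literal backslash.

```python
def to_ascii_escaped(text):
    """Write each non-ASCII character as \\uXXXX (BMP only, by assumption)."""
    return "".join(
        ch if ord(ch) < 0x80 else f"\\u{ord(ch):04X}" for ch in text
    )


def to_hex_octets(text):
    """UTF-8 encode the string and list the octets in base 16."""
    return " ".join(f"{b:02x}" for b in text.encode("utf-8"))


s = "M\u00ffnchen"  # U+00FF LATIN SMALL LETTER Y WITH DIAERESIS
print(to_ascii_escaped(s))  # M\u00FFnchen
print(to_hex_octets(s))     # 4d c3 bf 6e 63 68 65 6e
```

The second line of output matches the octet sequence given in the quoted
example: the single code point U+00FF becomes the two octets c3 bf.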
> At 04:30 PM 5/3/2004, Bob Braden wrote:
>> *> There are several plausible solutions:
>> *> (1) Simply state that the U+ notation is to be used, even though the
>> *> actual (UTF-8) encoding will not consist of two octets. For example, one
>> *> might write
>> *> M+00FCnchen
>>That is unpleasantly ambiguous looking. A computer knows to look for
>>4 hex characters, but to a human it is harder to parse. Maybe:
>> M+00FC'nchen? or M'00FC'nchen?
>>and of course you have to be able to escape the +.