[rfc-i] UTF-8 and Unicode examples

Henning Schulzrinne hgs at cs.columbia.edu
Tue May 4 07:19:12 PDT 2004


Thanks for the pointer. I would suggest that the following convention be 
adopted:

- Unicode strings use the <U+1234,U+1234> notation, as in "M<U+00FC>nchen"

- UTF-8 strings use the <xx xx> notation, where xx are hexadecimal digits.

- A literal < is written in its Unicode rendition (<U+003C>) in those cases 
where it could be misinterpreted, i.e., where it is followed by U+ or a hex digit.

Does this work?
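
To make the proposal concrete, here is a minimal sketch of how the three rules 
might be applied mechanically. The function names (unicode_notation, 
utf8_notation) and the choice to bracket only non-ASCII characters are my own 
assumptions for illustration, not part of any agreed convention.

```python
HEX_DIGITS = set("0123456789abcdefABCDEF")


def _ambiguous_lt(text, i):
    """True if text[i] is a '<' that could be mistaken for the start of a bracket."""
    if text[i] != "<":
        return False
    rest = text[i + 1:]
    return rest.startswith("U+") or rest[:1] in HEX_DIGITS


def _render(text, encode_char, sep):
    """Bracket runs of non-ASCII (or ambiguous '<') characters as <...>."""
    out, run = [], []
    for i, ch in enumerate(text):
        if ord(ch) > 0x7F or _ambiguous_lt(text, i):
            run.append(encode_char(ch))
        else:
            if run:
                out.append("<" + sep.join(run) + ">")
                run = []
            out.append(ch)
    if run:
        out.append("<" + sep.join(run) + ">")
    return "".join(out)


def unicode_notation(text):
    """Code points, e.g. <U+1234,U+1234> (hypothetical helper)."""
    return _render(text, lambda ch: f"U+{ord(ch):04X}", ",")


def utf8_notation(text):
    """UTF-8 octets as hex, e.g. <xx xx> (hypothetical helper)."""
    return _render(
        text,
        lambda ch: " ".join(f"{b:02x}" for b in ch.encode("utf-8")),
        " ",
    )


print(unicode_notation("M\u00fcnchen"))  # M<U+00FC>nchen
print(utf8_notation("M\u00fcnchen"))     # M<c3 bc>nchen
```

Note that under this sketch a < followed by a space or a non-hex letter is left 
alone, while "a<b" becomes "a<U+003C>b" because b is a hex digit.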

Kurt D. Zeilenga wrote:

> Why not just use conventions established for use in the Unicode
> standard itself?  For instance, M\u00FFnchen.
> 
> http://www.unicode.org/versions/Unicode4.0.0/Preface.pdf#G1771
> 
> Note that at times, these conventions may interfere with the
> syntax of the protocol.  In such cases, other conventions
> will have to be designed which do not interfere with the syntax
> of the protocol.  For instance, if the protocol itself supports
> such escaping, then the specification has to say something like:
> 
>   The Unicode string "M<U+00FF>nchen", where <U+00FF> is
>   the LATIN SMALL LETTER Y WITH DIAERESIS character,
>   is transferred as the ASCII string "M\u00FFnchen".
> 
> And sometimes, it's good to use base16 or base64 to
> show proper UTF-8 encoding.  For instance, in a protocol
> which supports transfer of UTF-8 encoded Unicode:
> 
>   The Unicode string "M\u00FFnchen", where \u00FF is
>   the LATIN SMALL LETTER Y WITH DIAERESIS character, is
>   transferred as the UTF-8 encoded Unicode represented
>   in the following octet sequence (base 16):
>       4d c3 bf 6e 63 68 65 6e
> 
> Kurt
> 
> 
> At 04:30 PM 5/3/2004, Bob Braden wrote:
> 
> 
>>  *> 
>> *> There are several plausible solutions:
>> *> 
>> *> (1) Simply state that the U+ notation is to be used, even though the 
>> *> actual (UTF-8) encoding will not consist of two octets. For example, one 
>> *> might write
>> *> 
>> *> M+00FCnchen
>>
>>That is unpleasantly ambiguous looking.  A computer knows to look for
>>4 hex characters, but to a human it is harder to parse.  Maybe:
>>
>>       M+00FC'nchen? or M'00FC'nchen?
>>
>>and of course you have to be able to escape the +.
>>
>>Bob
>>_______________________________________________
>>rfc-interest mailing list
>>rfc-interest at rfc-editor.org
>>http://www.rfc-editor.org/mailman/listinfo/rfc-interest


