[rfc-i] UTF-8 and Unicode examples

Kurt D. Zeilenga Kurt at OpenLDAP.org
Mon May 3 21:23:54 PDT 2004


Why not just use the conventions established in the Unicode
standard itself?  For instance, M\u00FCnchen.

http://www.unicode.org/versions/Unicode4.0.0/Preface.pdf#G1771

Note that at times these conventions may interfere with the
syntax of the protocol.  In such cases, other conventions
will have to be designed which do not interfere with the syntax
of the protocol.  For instance, if the protocol itself supports
such escaping, then the specification has to say something like:

  The Unicode string "M<U+00FC>nchen", where <U+00FC> is
  the LATIN SMALL LETTER U WITH DIAERESIS character,
  is transferred as the ASCII string "M\u00FCnchen".
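
As a rough illustration of producing such an escaped ASCII form, here
is a minimal Python sketch.  The function name escape_non_ascii, the
doubling of a literal backslash, and the \UXXXXXXXX form for code
points beyond U+FFFF are assumptions made for this example; a real
protocol would have to define its own rules.

```python
def escape_non_ascii(text: str) -> str:
    """Render text as ASCII, writing each non-ASCII character as \\uXXXX."""
    out = []
    for ch in text:
        cp = ord(ch)
        if ch == "\\":
            # Assumed rule: a literal backslash is doubled so it cannot
            # be mistaken for the start of an escape.
            out.append("\\\\")
        elif cp < 0x80:
            # Plain ASCII passes through unchanged.
            out.append(ch)
        elif cp <= 0xFFFF:
            # Four uppercase hex digits, as in M\u00FCnchen.
            out.append(f"\\u{cp:04X}")
        else:
            # Assumed form for code points that need more than four digits.
            out.append(f"\\U{cp:08X}")
    return "".join(out)


print(escape_non_ascii("M\u00FCnchen"))
```

The escaping of the backslash itself matters for the same reason raised
in the quoted message below: whatever character introduces the escape
must also be representable as ordinary data.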

And sometimes, it's good to use base16 or base64 to
show proper UTF-8 encoding.  For instance, in a protocol
which supports transfer of UTF-8 encoded Unicode:

  The Unicode string "M\u00FCnchen", where \u00FC is
  the LATIN SMALL LETTER U WITH DIAERESIS character, is
  transferred as the UTF-8 encoded Unicode represented
  in the following octet sequence (base 16):
      4d c3 bc 6e 63 68 65 6e
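
An octet listing like this can be generated rather than written by
hand.  The following is a minimal Python sketch; the function name
hex_octets is made up for this example.  It prints the listing for
two different code points to show how the second octet of the
two-octet sequence changes with the character.

```python
def hex_octets(text: str) -> str:
    """Return the UTF-8 encoding of text as space-separated base-16 octets."""
    return " ".join(f"{octet:02x}" for octet in text.encode("utf-8"))


# U+00FC LATIN SMALL LETTER U WITH DIAERESIS
print(hex_octets("M\u00FCnchen"))
# U+00FF LATIN SMALL LETTER Y WITH DIAERESIS
print(hex_octets("M\u00FFnchen"))
```

The first call yields 4d c3 bc 6e 63 68 65 6e and the second yields
4d c3 bf 6e 63 68 65 6e.  In both cases a single character above
U+007F occupies two octets, which is the point the octet listing is
meant to make visible.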

Kurt


At 04:30 PM 5/3/2004, Bob Braden wrote:

>   *> 
>  *> There are several plausible solutions:
>  *> 
>  *> (1) Simply state that the U+ notation is to be used, even though the 
>  *> actual (UTF-8) encoding will not consist of two octets. For example, one 
>  *> might write
>  *> 
>  *> M+00FCnchen
>
>That is unpleasantly ambiguous looking.  A computer knows to look for
>4 hex characters, but to a human it is harder to parse.  Maybe:
>
>        M+00FC'nchen? or M'00FC'nchen?
>
>and of course you have to be able to escape the +.
>
>Bob



