[rfc-i] Draft: Representation of Unicode and UTF-8 characters

Alex Rousskov rousskov at measurement-factory.com
Fri May 14 21:03:11 PDT 2004


On Fri, 14 May 2004, Henning Schulzrinne wrote:

> * Representation of Unicode and UTF-8 characters
>
> For names in acknowledgements and protocol examples, it is often
> desirable to represent Unicode characters, either abstractly or as
> the character would be coded in UTF-8 (RFC 3629).

FWIW, I would suggest making the above more generic to avoid an
implication that "names in acknowledgments and protocol examples" are
the only use cases. I am also not sure what "abstract representation"
is. How about something along these lines:

	IETF documents may need to represent Unicode characters while
	obeying the US-ASCII encoding rules. Unicode use cases include
	protocol examples and human names in acknowledgments. To
	represent Unicode characters in IETF documents, authors should
	use the following conventions:

> To avoid violating the US-ASCII-only rule for RFCs, it is suggested
> to write these characters using the following textual conventions:
>
> - Unicode strings use the <U+1234,U+1234> notation suggested by the
> Unicode specification
> (http://www.unicode.org/versions/Unicode4.0.0/Preface.pdf#G1771), for
> example "M<U+00FC>nchen" for the Bavarian city Munich.

Why is U+1234 repeated above? If you meant to illustrate that "," is a
character separator, consider using different characters.
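
For illustration, a minimal Python sketch of the quoted notation. It
assumes that a comma separates adjacent code points inside a single
pair of angle brackets, which is only one reading of the draft. The
function name to_uplus is hypothetical, and the sketch does not handle
escaping of a literal "<".

```python
def to_uplus(text: str) -> str:
    """Rewrite non-ASCII characters as <U+XXXX> (assumed reading of the draft)."""
    out = []
    run = []  # code points of the current run of non-ASCII characters

    def flush() -> None:
        # Emit the pending run as one bracketed, comma-separated group.
        if run:
            out.append("<" + ",".join(f"U+{cp:04X}" for cp in run) + ">")
            run.clear()

    for ch in text:
        if ord(ch) < 128:
            flush()
            out.append(ch)
        else:
            run.append(ord(ch))
    flush()
    return "".join(out)


print(to_uplus("\u00a9 2004"))    # copyright sign followed by ASCII text
print(to_uplus("\u03b1\u03b2"))   # two distinct adjacent characters
```

Using two different characters in the second call shows the separator
without repeating the same code point.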

> - UTF-8 strings enumerate the bytes as uppercase hexadecimal digits in
> angled brackets, e.g., <C2 A9> for the <U+00A9> (copyright) character.

Is space the only correct delimiter here? Why is it inconsistent with
the comma delimiter used above?
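
A minimal Python sketch of the byte notation as quoted, assuming a
single space between bytes and one pair of angle brackets around the
whole string. The function name to_utf8_hex is hypothetical.

```python
def to_utf8_hex(text: str) -> str:
    """Render the UTF-8 encoding of text as uppercase hex bytes in angle brackets."""
    encoded = text.encode("utf-8")
    return "<" + " ".join(f"{b:02X}" for b in encoded) + ">"


print(to_utf8_hex("\u00a9"))  # the copyright character from the draft's example
```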

> - The literal < uses the Unicode rendition <U+003C> in those cases
> where this can be misinterpreted, i.e., where the open angle bracket
> is followed by U+ or a hex digit.

Could somebody with direct access to a large RFC repository please
scan it for "<U\+" and "<[A-F0-9]" patterns? I wonder how likely it
is that an RFC unaware of this convention already uses such sequences
[for something else].

To reduce the number of violations, should conflicts be restricted to
cases where the whole <> sequence matches the pattern and not just the
prefix?
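
The difference between the two matching rules can be sketched in
Python as follows. The exact grammar is an assumption: 4 to 6 uppercase
hex digits per code point, comma-separated code points, and
space-separated two-digit bytes. The names PREFIX, WHOLE, and
count_hits are hypothetical.

```python
import re

# Prefix rule: "<" followed by "U+" or by an uppercase hex digit.
PREFIX = re.compile(r"<(?:U\+|[0-9A-F])")

# Whole-sequence rule: the complete <...> group must fit the notation.
WHOLE = re.compile(
    r"<(?:U\+[0-9A-F]{4,6}(?:,U\+[0-9A-F]{4,6})*"  # <U+00A9,U+00FC>
    r"|[0-9A-F]{2}(?: [0-9A-F]{2})*)>"             # <C2 A9>
)


def count_hits(text: str) -> tuple[int, int]:
    """Return (prefix-rule hits, whole-sequence-rule hits) for text."""
    return len(PREFIX.findall(text)), len(WHOLE.findall(text))


sample = "mail <A.B@example.com>, bytes <C2 A9>, char <U+00A9>"
print(count_hits(sample))
```

In the sample, the e-mail address trips the prefix rule but not the
whole-sequence rule, which is the kind of false conflict the
restriction would avoid.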

> Documents may choose a different convention, but then need to
> explain the notation.

To make this work reliably, especially with automated tools, we need a
blob of text that authors can include to indicate they _are_ following
the above convention. RFCs without the blob would be assumed not to
follow the convention (by default). Otherwise, it might not be clear
whether an RFC is unaware of the above convention or knowingly uses
it!

An alternative is mandatory usage enforced by the RFC Editor, but I am
guessing we do not want to go that far.

$0.02,

Alex.

