[rfc-i] open issues: character sets of examples

Bjoern Hoehrmann derhoermi at gmx.net
Thu May 31 12:23:09 PDT 2012

* Andrew Sullivan wrote:
>This is another attempt to discuss an issue raised by the RSE as not
>having consensus.  In this case, it is "Want the ability to denote
>protocol examples using the character sets those examples support";
>and, by implication, "Want broader character encoding for body of

http://www.w3.org/MarkUp/html-spec/charset-harmful.html I found the list
of issues in the wiki confusing because the terms are a little bit off.
If a protocol supports the character encoding scheme ISO-8859-1, you
will have a hard time showing that "literally" in a plain text document.
I take it the issue is really about using Unicode characters outside the
US-ASCII range.

>I don't personally care about diagrams; I don't think in diagrams, and
>I don't find them that helpful.  I am most comfortable with words.  As
>a result, I find examples helpful, and one way I find examples to be
>helpful is that they actually portray the case under discussion.
>Since I sometimes work on internationalization issues, this
>necessarily entails Unicode code points outside the ASCII repertoire.  

I have generally found that people are not good at picking characters
to illustrate a point, or at showing restraint, once they venture beyond
Latin letters outside the US-ASCII range. Support in the software and
configurations people typically use is also spotty. To give some
examples, I would not want to see, for the time being, people using
box drawing characters, mathematical symbols, typographic white space,
and similar things for formatting; nor would I want authors to pick
characters for humorous effect (the pile of poo comes to mind). Showing
homoglyphs by mixing Latin and Cyrillic is also likely to confuse many

>The counter argument appears to be that there is no reason to do this,
>because one can specify the code points without actually displaying
>them.  While this is true, it is not terribly convincing to say (for
>instance) that U+02BC looks a lot like U+0027.  If, however, I say
>that the character U+02BC, MODIFIER LETTER APOSTROPHE (?) often
>resembles the character U+0027, APOSTROPHE ('), then the claim will
>perhaps be more convincing (to those using Unicode in their display).

It's been some time since I tested it, but I suspect your U+02BC will
come out as a question mark in this response (sorry for that). As I
said, software support isn't very good: if I put my real name into the
From header, I suspect I'll see it mangled within 24 hours, either in a
reply or in an online archive.
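
A minimal Python sketch of this kind of mangling, assuming the lossy
step is a conversion to US-ASCII that substitutes "?" for anything it
cannot represent (real mail software may fail in other ways). The
helper name to_ascii_lossy is made up for illustration.

```python
import unicodedata


def to_ascii_lossy(text: str) -> str:
    """Encode to US-ASCII, replacing every unrepresentable character with '?'."""
    return text.encode("ascii", errors="replace").decode("ascii")


# The quoted sentence, with U+02BC written as an escape so this source
# file itself stays within US-ASCII.
original = "MODIFIER LETTER APOSTROPHE (\u02bc) vs APOSTROPHE (')"
mangled = to_ascii_lossy(original)

print(unicodedata.name("\u02bc"))  # MODIFIER LETTER APOSTROPHE
print(mangled)                     # the U+02BC has become '?'
print(to_ascii_lossy("Bj\u00f6rn H\u00f6hrmann"))  # both o-umlauts are lost
```

Once the replacement has happened, the original code point cannot be
recovered from the text alone.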

>"Ah," the counter-argument says, "but not everyone is using Unicode!"
>Surely, however, this is a case where an encoding tag solves that
>problem?  We seem to be capable of handling this in nearly every
>browser I have seen in many years.  Even my email client of choice --
>mutt -- has been able to cope with this for over 10 years on every
>terminal I have used.  Perhaps someone can make the counter-argument
>clearer to me?

I think you are confusing concepts here. Clearly we would use Unicode
with UTF-8, or US-ASCII and some escaping scheme, to reference the
characters in the "over the wire" format, and indicate that as
appropriate, so this is more a matter of the range of characters that
can be handled. As a simple example, my smartphone cannot render
combining characters, so while "ö" renders fine in the browser on it,
"o" + U+0308 renders as "o" plus "missing glyph".
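
The two spellings of "ö" in that example can be told apart in a short
Python sketch using the standard unicodedata module. The helper name
code_points is made up for illustration. Whether a given device renders
the decomposed form correctly is a font and rendering question that
this code cannot show.

```python
import unicodedata


def code_points(text: str) -> list[str]:
    """Return 'U+XXXX NAME' for each code point in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


precomposed = "\u00f6"   # single code point
decomposed = "o\u0308"   # base letter followed by a combining mark

print(code_points(precomposed))
print(code_points(decomposed))

# The strings differ code point by code point and in their UTF-8 bytes...
print(precomposed == decomposed)          # False
print(precomposed.encode("utf-8").hex())  # c3b6
print(decomposed.encode("utf-8").hex())   # 6fcc88

# ...but they are canonically equivalent: normalization maps one to the other.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
```

Normalizing to NFC before publication would avoid the combining
sequence in this particular case, but not every combining sequence has
a precomposed equivalent.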

So, I am open to the idea, but this would need very clear guidelines to
avoid problems and annoyances, and I am not sure we would be better off
doing it now than revisiting the issue in a couple of years. I would be
supportive of allowing US-ASCII plus all Latin letters as an experiment,
if you will, to see what issues we may encounter there, which tools need
to be upgraded, and so on, and, once those issues are fixed, extending
the set.
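
One possible reading of "US-ASCII plus all Latin letters" can be
sketched as a checker in Python. The rule used here is an assumption,
not an agreed definition: a character passes if it is in the US-ASCII
range, or if it has a Unicode letter category and a character name
beginning with "LATIN ". The function names is_allowed and disallowed
are made up for illustration.

```python
import unicodedata


def is_allowed(ch: str) -> bool:
    """Assumed rule: any US-ASCII character, or any letter named 'LATIN ...'."""
    if ord(ch) < 0x80:
        return True
    is_letter = unicodedata.category(ch).startswith("L")
    name = unicodedata.name(ch, "")  # "" for unnamed code points
    return is_letter and name.startswith("LATIN ")


def disallowed(text: str) -> list[tuple[int, str]]:
    """Return (index, 'U+XXXX') for every character outside the assumed set."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if not is_allowed(ch)
    ]


print(disallowed("Bj\u00f6rn H\u00f6hrmann"))  # precomposed letters pass
print(disallowed("o\u0308"))                   # combining diaeresis is flagged
print(disallowed("p\u0430ypal"))               # Cyrillic homoglyph is flagged
print(disallowed("\U0001F4A9"))                # pile of poo is flagged
```

Under this rule a combining sequence is rejected even when its
precomposed form would pass, so input would need NFC normalization
first if that distinction is not wanted.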
Björn Höhrmann · mailto:bjoern at hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/ 
