[rfc-i] New version: draft-hoffman-utf8-rfcs-04.txt

Tim Bray tbray at textuality.com
Wed Nov 5 16:28:44 PST 2008


I helped Paul with this draft and have been watching the conversation
with interest.  I thought an examination of costs and benefits might
be helpful.

1. Benefits arising from UTF8-enabling the I-D/RFC series

1.1 Ability to identify references accurately where the required text
includes non-ASCII characters.  Benefit: Minor, because enough
information can be provided in almost all cases to work around this
shortcoming and allow an intelligent reader to locate the referenced
object.

1.2 Ability to spell the names of contributors to IETF specifications
correctly.  Benefit: Dependent on one's world-view. For example, I
find it unacceptable, verging on bigotry, that in RFC5023, the name of
one of its editors is spelled incorrectly.  When I tell my colleagues
who are not IETF veterans that the IETF allows only ASCII in working
documents, the typical reaction is along the lines of "You have *got*
to be kidding."  Apparently there are other members of the community
who do not see this problem as material.

1.3 Ability to include accurate, readable examples of the use of
non-ASCII characters in IETF protocols.  Benefit: Major.  In practice,
internationalization is observed to be a frequent source of
interoperability difficulties on the Internet.  Whereas the users of
IETF specifications, in theory, would be happy to work from the
normative prose and formal specifications, in practice the usability
of specifications is found to be increased by the inclusion of
high-quality examples.  In theory, such examples need not actually
contain non-ASCII characters; the familiar U+XXXX notation can be used
to stand in for them.  In practice, readability and terseness are
observed to improve the usability of specifications.
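
To make the trade-off in 1.3 concrete, here is a minimal Python sketch
of the U+XXXX stand-in notation.  The function name to_u_notation and
the sample string are hypothetical, chosen only for illustration; they
do not come from any IETF tool.

```python
def to_u_notation(text):
    """Replace every non-ASCII character with its U+XXXX code point."""
    return "".join(
        ch if ord(ch) < 128 else "U+{:04X}".format(ord(ch))
        for ch in text
    )


# "caf\u00e9" is "cafe" with an acute accent on the final letter.
print(to_u_notation("caf\u00e9"))  # prints: cafU+00E9
```

The ASCII-only form is unambiguous, but a reader has to look up each
code point to see what the example actually says, which is the
readability cost described above.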

2. Problems induced by UTF-enabling the I-D/RFC series

2.1 Problems for authors in inserting non-ASCII characters on ASCII
keyboards.  Seriousness: Minor.  Essentially all modern text authoring
systems have one way or another of inserting characters which are not
available on the keyboard.  In the most extreme case where a spec
author doesn't have such tools, the RFC Editor could be instructed as
to which characters should be inserted and where.

2.2 Problems in displaying non-ASCII characters on-screen.
2.2.1 Display problem because system lacks the capacity to render the
characters.  Seriousness: Minor.  Virtually all modern computer
display subsystems can handle a wide range of non-ASCII characters
correctly.  However, best practices would be to avoid extremely
obscure characters, in particular non-BMP characters, for which the
availability of fonts may be problematic.
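
A rough sketch of how an author or editor might flag non-BMP
characters, assuming the text is already available as a Python string.
The helper name non_bmp_characters is hypothetical.  Characters outside
the BMP are those with code points above U+FFFF.

```python
def non_bmp_characters(text):
    """Return (index, 'U+XXXX') for each character outside the BMP."""
    return [
        (i, "U+{:04X}".format(ord(ch)))
        for i, ch in enumerate(text)
        if ord(ch) > 0xFFFF
    ]


# 'A', e with acute accent, then MUSICAL SYMBOL G CLEF (U+1D11E).
sample = "A\u00e9\U0001D11E"
print(non_bmp_characters(sample))  # prints: [(2, 'U+1D11E')]
```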
2.2.2 Display problem because system doesn't know the doc is in UTF-8.
Seriousness: Intermediate.  If the spec is being viewed in its HTML
version or being delivered over the Web, the problem mostly vanishes
because in these scenarios there are places to put character-set
metadata, enabling clients to act appropriately.  The problem arises
when an RFC is saved in its plain-text form to disk and thus no
metadata is available to alert the display subsystem that the text is
in UTF-8.  Note that this is a general problem with plain-text files
and not specific to IETF documents.  This is a real problem.  It is
ameliorated by three things:
- Viewing RFCs in plain-text form as opposed to HTML is becoming
less common
- Systems which fail to display non-ASCII UTF-8 usually do so in an
obviously "broken" fashion
- Users who actually care about the non-ASCII characters are typically
aware of how to work around this problem by informing the system that
the file in question is UTF-8.
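
The situation in 2.2.2 can be sketched in Python.  With no metadata
available, one common heuristic is to attempt a strict UTF-8 decode;
the helper name looks_like_utf8 is hypothetical, and this is a
heuristic rather than a guarantee (pure ASCII also passes, since ASCII
is a subset of UTF-8).

```python
def looks_like_utf8(data):
    """Return True if the bytes decode cleanly as strict UTF-8."""
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


# "na\u00efve" is "naive" with a diaeresis on the i.
utf8_bytes = "na\u00efve".encode("utf-8")
latin1_bytes = "na\u00efve".encode("latin-1")

print(looks_like_utf8(utf8_bytes))    # prints: True
print(looks_like_utf8(latin1_bytes))  # prints: False

# A system that wrongly assumes Latin-1 shows two characters where one
# was intended, which is the obviously "broken" display noted above.
mojibake = utf8_bytes.decode("latin-1")
print(len("na\u00efve"), len(mojibake))  # prints: 5 6
```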

2.3 Problems in printing non-ASCII characters.
The issues here are very similar to those around on-screen display.
Most modern computer systems are quite capable of accurately printing
UTF-8, unless extremely rare characters are used which fall outside
the font repertoire, and as long as they know that the data is in
UTF-8.  There is one other issue which is relevant here: A
substantial proportion of the community is already unable to print
plain-text RFCs correctly (I personally have rarely been able to get
this to work on either Windows or Mac) and thus resorts to HTML when
printing is required, which also removes the UTF-8 problem.

3. Finding a balance

Perhaps I'm moralizing, but I find it entirely unacceptable that the
organization which standardizes Internet protocols, a majority of
whose textual payload is now non-ASCII, restricts the character
repertoire of its specifications in a fashion that can be seen as
discriminating against the majority of Internet users.   As a designer
and developer of Internet protocols, I find that the benefits
described in section 1 above entirely outweigh the problems described
in section 2.

To those who believe that this does not actually constitute
discrimination, or that the discrimination is excused by the benefits
of the ASCII restriction, I would advise polishing your arguments,
because the issue will continue to be raised on a regular basis by
those who do perceive discrimination.

4. Suggestion

I suggest that those RFC sections which currently impose the
ASCII-only restriction on I-Ds and RFCs no longer enjoy consensus
support among the community.  Not even rough consensus.  Not even
close. Could we examine that issue?

 -Tim

