[rfc-i] Character sets, was Comments on draft-iab-rfcformat

Phillip Hallam-Baker hallam at gmail.com
Wed Dec 19 07:57:28 PST 2012


On Wed, Dec 19, 2012 at 3:09 AM, Brian E Carpenter <
brian.e.carpenter at gmail.com> wrote:

> On 19/12/2012 04:59, John R Levine wrote:
> ...
> >> I am a developer, and when I have to carefully read each word and
> >> analyze the meaning of each sentence, there is no better support for
> >> me than a printed page in ASCII.
> >
> > I also write my share of code that implements various standards, and I
> > prefer a printed page that looks typeset, not like a 1960s line printer,
> > or maybe something on my tablet, or legible on my screen.
>
> Isn't the point that the normative text MUST be unambiguous and
> universally displayable/printable? I don't think that imposes ASCII,
> but it is still a very strong constraint. In a sense, ASCII is the
> lazy way to implement that constraint. Saying "any Unicode and any
> font is allowed" clearly does not meet the constraint. We have to be
> somewhere in between.
>

Why does normative text need to be displayable on every machine?

If I get a punch card reader, does the IETF have to support that? How about
paper tape?

The technical change that is now driving this is the proliferation of
smartphone-type devices, which have the interesting property that they
support virtual keyboards. Until now, the only way to input many languages
into a computer has been through a keyboard with a Latin alphabet and
Arabic numerals, so Japanese and Chinese users of the Internet have become
familiar with ASCII transliteration hacks out of necessity.

With the virtual keyboard, that requirement goes away... So assuming that
the Japanese speaker will know how to use a Western alphabet is going to be
an increasingly bad assumption.


Yesterday my wife discovered that the reason she could not use the company
VPN from home was that she has an ampersand in her password (they do demand
a non-alphabetic character). The problem was that the developer had only
tested with a limited character set (and quite likely there is a SQL
injection issue as well).
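
A rough sketch of what getting this right looks like (Python with an
in-memory SQLite table; the schema and names are purely hypothetical, and a
real system would of course store a password hash rather than the password):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, password TEXT)")
    conn.execute("INSERT INTO users VALUES (?, ?)", ("alice", "s3cret&more"))

    def check_login(name, password):
        # Bound parameters: '&', quotes and so on are treated as data,
        # never as SQL syntax, so the full character set works and an
        # injection attempt simply fails to match.
        row = conn.execute(
            "SELECT 1 FROM users WHERE name = ? AND password = ?",
            (name, password),
        ).fetchone()
        return row is not None

    print(check_login("alice", "s3cret&more"))   # True
    print(check_login("alice", "' OR '1'='1"))   # False

With bound parameters there is no reason to forbid any character in the
first place.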


What concerns me is that we get coverage of the significant use cases. If
we are going to write specs that can be used by German speakers in their
own language, we have to go beyond 7-bit ASCII. If we are going to support
languages that have a non-Latin alphabet, we need to go beyond 8-bit codes
to Unicode.
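
To make those two thresholds concrete (just a quick Python sketch, not tied
to any particular tooling):

    # German fits an 8-bit Latin code but not 7-bit ASCII;
    # Cyrillic fits neither, so only a Unicode encoding covers both.
    german = "Größe"
    russian = "Борис"

    german.encode("latin-1")           # fine: an 8-bit code is enough
    try:
        german.encode("ascii")
    except UnicodeEncodeError:
        print("German does not fit in 7-bit ASCII")

    try:
        russian.encode("latin-1")
    except UnicodeEncodeError:
        print("Cyrillic does not fit in an 8-bit Latin code")

    print(russian.encode("utf-8"))     # UTF-8 represents both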

I don't think we need to support Zapf Dingbats, but we should choose at
least one non-English language that uses a Latin character set plus accents
and one that uses a non-Latin character set.

The point here is to make the specs better by encouraging developers to
consider non-English usage examples. We use Alice and Bob as users all the
time; we could do with adding Anaïs and Benoît for French, and Алла and
Борис for Russian.
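
That is exactly the sort of thing that flushes out corner cases. One
concrete example (again just a Python sketch): "Anaïs" has two valid
Unicode spellings, precomposed and with a combining diaeresis, and they
compare unequal until you normalize them, which is something an all-ASCII
example will never reveal.

    import unicodedata

    composed   = "Ana\u00efs"     # LATIN SMALL LETTER I WITH DIAERESIS
    decomposed = "Anai\u0308s"    # plain 'i' plus COMBINING DIAERESIS

    print(composed == decomposed)                       # False
    print(unicodedata.normalize("NFC", composed) ==
          unicodedata.normalize("NFC", decomposed))     # True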

Unicode is pretty regular. If something works for French and Russian, it is
pretty likely to work for any written language based on an alphabet, and
quite likely for Japanese and Chinese. Add in one of those two and you can
be confident of supporting pretty much any living language other than
Korean.

One of the features of communications is that the smaller the community of
speakers, the less complexity can be tolerated. A language like Chinese or
Korean, which has been written for thousands of years and has had millions
of literate speakers, can be very complex. A language like Cornish, which
has a few thousand speakers, is forced to use an existing alphabet and
can't make any special typographic demands if it is going to survive.

So from the point of view of covering possible corner cases, the big
languages are more likely to give rise to them than the small ones.


-- 
Website: http://hallambaker.com/
