[rfc-i] draft-iab-rfc-nonascii-00

Heather Flanagan (RFC Series Editor) rse at rfc-editor.org
Tue Mar 1 10:38:11 PST 2016


On 3/1/16 10:10 AM, Julian Reschke wrote:
> On 2016-03-01 18:58, Heather Flanagan (RFC Series Editor) wrote:
>>>> People expect search engines to be able to perform searches such that
>>>> searching on "GEANT", for example, will return matches for both "GEANT"
>>>> and "GÉANT". The reverse would also be true. I expect this is
>>>> established enough behavior that we do not need to define it in more
>>>> detail (insert implied question here).
>>>
>>> OK, but how exactly does that affect the vocabulary?
>>
>> The XML vocabulary? I don't see why it would affect the vocabulary
>> beyond what we have already anticipated; we have the ascii attribute to
>> help clarify where necessary. Or do you mean something else?
> 
> I don't think the requirement above actually justifies the complexity we
> added to the vocabulary. Search engines and databases have been able to
> deal with these things without having an explicit ASCII alternative.
> 
> So what I'm trying to understand is who's the audience for this
> requirement? If it was removed, what effect would that have?

The audience is anyone or anything that still has difficulty rendering
non-ASCII characters. We want a way to unambiguously and consistently
provide an alternative that people can refer to if they need to clarify
what characters are in use that they cannot render.

Yes, I know that's an increasingly slim edge case. We may eventually
decide that it is such a slim case that we do not need to support it any
more. I think the ascii tagging is a sensible way for now to transition
from a forty-year-old ASCII model to something else.
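
To make the search expectation quoted at the top concrete (a query for
"GEANT" matching "GÉANT" and vice versa), here is a minimal Python
sketch of diacritic folding. It is an illustration only, not how any
particular search engine works; the function names fold() and matches()
are hypothetical.

```python
import unicodedata


def fold(text):
    """Return a search key with combining marks removed and case folded."""
    # NFD splits a precomposed letter such as "É" into "E" + U+0301.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks, keeping only the base characters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def matches(query, candidate):
    """True if the folded query occurs in the folded candidate text."""
    return fold(query) in fold(candidate)


print(matches("GEANT", "The GÉANT network"))  # True
print(matches("GÉANT", "geant"))              # True
```

Folding of this kind only helps for letters that have a canonical
decomposition. A letter such as "Ł" has none, so fold("Łukasz") still
contains a non-ASCII character.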

> 
>>>> ...
>>>>> "For names that include characters outside of the Unicode Latin and
>>>>> Latin Extended script, an author-provided, ASCII-only identifier is
>>>>> required to assist in search and indexing of the document."
>>>>>
>>>>> It would be good to be more precise about what non-ASCII characters
>>>>> are allowed (range?).
>>>>>
>>>>> <http://greenbytes.de/tech/webdav/draft-iab-rfc-nonascii-00.html#rfc.section.3.4.p.12>:
>>>>>
>>>>>
>>>>>
>>>>
>>>> My understanding is that "Latin Extended" is a reasonable way to capture:
>>>> Basic Latin (ASCII)
>>>> Latin-1 Supplement
>>>> Latin Extended-A
>>>> Latin Extended-B
>>>> Latin Extended-C
>>>> Latin Extended-D
>>>> Latin Extended-E
>>>> Latin Extended Additional
>>>
>>> OK, so the code ranges as per <http://www.unicode.org/charts/>, we may
>>> want to include those over here.
>>>
>>> (I also note that there's an "IPA Extensions" code page I'll have to
>>> look into...)
>>>
>>
>> Does the following change seem reasonable?
>>
>> OLD:
>> Person names may appear in several places within an RFC. In both the
>> front page header and the references section, when a non-Latin script is
>> used, the fullname of the author is required. Initials are supported and
>> encouraged if available. In all cases, valid Unicode is required. For
>> names that include characters outside of the Unicode Latin and Latin
>> Extended script, an author-provided, ASCII-only identifier is required to
>> assist in improving general readability as well as the searchability and
>> indexing of the document.
>>
>> PROPOSED:
>> Person names may appear in several places within an RFC. In both the
>> front page header and the references section, when a non-Latin script is
>> used, the fullname of the author is required. Initials are supported and
>> encouraged if available. In all cases, valid Unicode is required. For
>> names that include characters outside of the Unicode Latin and Latin
>> Extended script (Basic Latin (ASCII), Latin-1 Supplement, Latin
>> Extended-A, Latin Extended-B, Latin Extended-C, Latin Extended-D,
>> Latin Extended-E, Latin Extended Additional) an author-provided,
>> ASCII-only identifier is required to assist in improving general
>> readability as well as the searchability and indexing of the document
>> <xref target="UNICODE-CHART"/>.
> 
> I'd move the <xref> closer to the text it refers to, so at the end of
> the script enumeration.

I don't like citation tags in the middle of a sentence if they can be
avoided.

> 
> This is better, but I think it would be even better to have a table that
> people can look at to see what *exact* character ranges are covered.
> 
> (I can make a proposal if you like)

The RSE can make the decision regarding what characters are allowed, but
this RSE at least would very much like to point to the source rather
than reinvent those lists. It should be on me to review what Unicode
has, but recreating and maintaining a separate list seems like a bad
idea. Or are you thinking of some other way to represent this?
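
For illustration only, the rule in the proposed text (an ASCII-only
identifier is required when a name uses characters outside the listed
Latin blocks) could be checked roughly as follows. The code point
ranges are copied from the Unicode block charts for the eight blocks
named above and would need to be re-verified against the Unicode source
rather than treated as authoritative here. The function names are
hypothetical, and normalizing to NFC before checking is an assumption
of this sketch, not something the draft specifies.

```python
import unicodedata

# (first code point, last code point, block name); illustrative copy of
# the Unicode block ranges, to be checked against the Unicode charts.
LATIN_BLOCKS = [
    (0x0000, 0x007F, "Basic Latin (ASCII)"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0100, 0x017F, "Latin Extended-A"),
    (0x0180, 0x024F, "Latin Extended-B"),
    (0x1E00, 0x1EFF, "Latin Extended Additional"),
    (0x2C60, 0x2C7F, "Latin Extended-C"),
    (0xA720, 0xA7FF, "Latin Extended-D"),
    (0xAB30, 0xAB6F, "Latin Extended-E"),
]


def outside_latin_blocks(name):
    """Return the characters of name that fall outside the blocks above."""
    # Assumption: compose first, so "E" + U+0301 is checked as "É"
    # instead of tripping on the combining mark.
    composed = unicodedata.normalize("NFC", name)
    return [
        ch for ch in composed
        if not any(lo <= ord(ch) <= hi for lo, hi, _ in LATIN_BLOCKS)
    ]


def needs_ascii_identifier(name):
    """True if the name has at least one character outside the blocks."""
    return bool(outside_latin_blocks(name))


print(needs_ascii_identifier("GÉANT"))     # False
print(needs_ascii_identifier("山田太郎"))   # True
```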

> 
>>>>   ...
>>>>> "Keywords and citation tags must be ASCII only."
>>>>>
>>>>> What does "Keywords" refer to? The things we put into the xml2rfc
>>>>> <keyword> element?
>>>>
>>>> Yes.
>>>
>>> Ok. Maybe state that, as the keywords currently are invisible in the
>>> specs, so people might not get what this is about...
>>
>> OLD:
>> Keywords and citation tags must be ASCII only.
>>
>> NEW:
>> Keywords, as tagged with the <keyword> element in XML, and citation tags
>> must be ASCII only.
> 
> OK, maybe
> 
> "Keywords (as tagged with the <keyword> element in XML), and citation
> tags (as defined in the anchor attributes of <reference> elements) must
> be ASCII only."

That works for me. Will change for -02.
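
As a rough illustration of the agreed wording, the following Python
sketch flags non-ASCII <keyword> content and non-ASCII anchor
attributes on <reference> elements. The sample document is invented for
the example and is not a complete xml2rfc file. The function name is
hypothetical, and str.isascii() requires Python 3.7 or later.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed input; real xml2rfc files carry much more.
SAMPLE = """<rfc>
  <front>
    <keyword>GÉANT</keyword>
    <keyword>unicode</keyword>
  </front>
  <back>
    <references>
      <reference anchor="UNICODE-CHART"/>
      <reference anchor="RÉF1"/>
    </references>
  </back>
</rfc>"""


def non_ascii_keywords_and_anchors(xml_text):
    """Return (kind, value) pairs for keywords or anchors that are not ASCII."""
    root = ET.fromstring(xml_text)
    problems = []
    # Keywords are the text content of <keyword> elements.
    for kw in root.iter("keyword"):
        text = kw.text or ""
        if not text.isascii():
            problems.append(("keyword", text))
    # Citation tags come from the anchor attribute of <reference> elements.
    for ref in root.iter("reference"):
        anchor = ref.get("anchor", "")
        if not anchor.isascii():
            problems.append(("anchor", anchor))
    return problems


print(non_ascii_keywords_and_anchors(SAMPLE))
# [('keyword', 'GÉANT'), ('anchor', 'RÉF1')]
```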

-Heather


