[rfc-i] Draft: Representation of Unicode and UTF-8 characters
Henrik Levkowetz
henrik at levkowetz.com
Sat May 15 02:15:15 PDT 2004
Hi Alex,
Friday 14 May 2004, Alex Rousskov wrote:
> > - The literal < uses the Unicode rendition <U+003C> in those cases
> > where this can be misinterpreted, i.e., where the open angle bracket
> > is followed by U+ or a hex digit.
>
> Could somebody with direct access to a large RFC repository please
> scan it for "<U\+" and "<[A-H0-9]" patterns? I wonder what is the
> probability that an unaware RFC uses the above convention [for
> something else]?
"<U+" occurs on 7 lines in RFCs, all in the same one (RFC 3454), and all
using it to indicate Unicode.
"<[A-H0-9][A-H0-9]( [A-H0-9][A-H0-9])*>" occurs on 444 lines in RFCs,
most of which use it to indicate carriage advance (<CA>), form feed (<FF>),
or HTML markup (e.g. <H1>). All those RFCs are numbered below 1000.
Henrik
---- Rawer data -------------------------------------------------------
$ grep "<U+" rfc*.txt
rfc3454.txt: "hemoglobin" vs. "h<U+00E6>moglobin" in American vs. British English.
rfc3454.txt: example,"<U+7EDF><U+4E00><U+7801>") vs. the equivalent traditional
rfc3454.txt: Chinese spelling (for example, "<U+7D71><U+4E00><U+78BC>").
rfc3454.txt: Language-specific equivalences such as "Aepfel" vs. "<U+00C4>pfel",
rfc3454.txt: definitions; Latin digits (<U+0030> through <U+0039>) are examples of
rfc3454.txt: Note that requirement 3 prohibits strings such as <U+0627><U+0031>
rfc3454.txt: ("aleph 1") but allows strings such as <U+0627><U+0031><U+0628>
$ grep "<[A-H0-9][A-H0-9]\( [A-H0-9][A-H0-9]\)*>" rfc*.txt | wc -l
444
$ grep "<[A-H0-9][A-H0-9]\( [A-H0-9][A-H0-9]\)*>" rfc*.txt | grep "<CA>" | wc -l
238
$ grep "<[A-H0-9][A-H0-9]\( [A-H0-9][A-H0-9]\)*>" rfc*.txt | grep "<H[0-9]>" | wc -l
60
$ grep "<[A-H0-9][A-H0-9]\( [A-H0-9][A-H0-9]\)*>" rfc*.txt | grep "<FF>" | wc -l
28
$ grep "<[A-H0-9][A-H0-9]\( [A-H0-9][A-H0-9]\)*>" rfc*.txt | tail
rfc732.txt: <17><IAC><SE>Telephone number:
rfc732.txt: <IAC><SB><DET><MOVE CURSOR><32><4><IAC><SE>
rfc732.txt: <24><IAC><SE>Social Security Number:
rfc732.txt: <0><11><IAC><SE> [Establish a field that
rfc732.txt: <IAC><SB><DET><MOVE CURSOR><32><5><IAC><SE>
rfc732.txt: Intensity=1><0><29><IAC><SE>
rfc732.txt: <IAC><GA>
rfc765.txt: (i.e., <CR>, <LF>, <NL>, <VT>, <FF>) which the printer
rfc929.txt: 4 Format effectors <BS> <CR> <LF> <FF> <HT> <VT>
rfc959.txt: (i.e., <CR>, <LF>, <NL>, <VT>, <FF>) which the printer
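The scans above can be tried on a small scale. What follows is a minimal sketch, not the commands that were actually run: it assumes a handful of sample lines in place of the real rfc*.txt corpus, and it adds a stricter [A-F0-9] pattern that is not part of the original scan. Since G and H are not hex digits, the [A-H0-9] class also catches tokens such as <H1> and <GA>; the stricter class shows how many lines still match when only real hex pairs count.

```shell
# Sample lines standing in for rfc*.txt. Four are taken from the grep
# output above; the <H1> line is made up for illustration.
sample=$(printf '%s\n' \
  'definitions; Latin digits (<U+0030> through <U+0039>) are examples of' \
  '<H1>Example heading</H1>' \
  '(i.e., <CR>, <LF>, <NL>, <VT>, <FF>) which the printer' \
  '<IAC><GA>' \
  '<17><IAC><SE>Telephone number:')

# Pattern used in the scan above: pairs drawn from A-H and 0-9.
wide='<[A-H0-9][A-H0-9]\( [A-H0-9][A-H0-9]\)*>'
# Assumed stricter variant: pairs of real hex digits only.
strict='<[A-F0-9][A-F0-9]\( [A-F0-9][A-F0-9]\)*>'

# LC_ALL=C keeps the bracket ranges to plain ASCII; grep -c counts
# matching lines, and -F treats "<U+" as a fixed string.
uplus_lines=$(printf '%s\n' "$sample" | LC_ALL=C grep -cF '<U+')
wide_lines=$(printf '%s\n' "$sample" | LC_ALL=C grep -c "$wide")
strict_lines=$(printf '%s\n' "$sample" | LC_ALL=C grep -c "$strict")

echo "<U+ lines:         $uplus_lines"
echo "A-H pattern lines: $wide_lines"
echo "A-F pattern lines: $strict_lines"
```

On this sample the A-H pattern matches four lines and the A-F pattern two, because <H1> and <GA> drop out. Against the real corpus the same grep calls would presumably take rfc*.txt as file arguments instead of the piped sample.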