[rfc-i] Another example of a draft with non-ASCII characters (draft-ietf-iri-3987bis-12.txt)

Joe Hildebrand (jhildebr) jhildebr at cisco.com
Wed Jul 18 11:14:18 PDT 2012


On 7/18/12 1:53 AM, "Martin J. Dürst" <duerst at it.aoyama.ac.jp> wrote:


>> It has no context to go on, so it's having to sniff
>> out the encoding and guess based on the first bit of the file.
>
>I'm not sure why the dumb Notepad gets it, but Wordpad doesn't. But then
>I'm not using either very much.

Hypothesis: Notepad reads more of the file to perform heuristics than
Wordpad. Often software that knows it's going to be fed poorly-described
encodings will try reading the first N bytes, seeing if there are any
octets > 127, then trying different encodings until it finds one that
might fit. If there are a couple of things that are legal UTF-8, and no
strings of octets that are not UTF-8 in those N bytes, the software can
guess UTF-8.
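
The heuristic above can be sketched roughly as follows. This is only an
illustration of the general idea, not what Notepad or Wordpad actually
do; the function name guess_encoding, the sample_size default, and the
"cp1252" fallback are all assumptions made up for the example.

```python
import codecs


def guess_encoding(data: bytes, sample_size: int = 4096,
                   fallback: str = "cp1252") -> str:
    """Guess an encoding by sniffing the first sample_size bytes.

    Hypothetical helper: the name, the default sample size, and the
    fallback encoding are assumptions for illustration only.
    """
    sample = data[:sample_size]

    # No octets above 127 in the sample: nothing distinguishes it from ASCII.
    if all(octet < 0x80 for octet in sample):
        return "ascii"

    # Check whether the sample is legal UTF-8. If the sample is only a
    # prefix of the file, a multi-byte sequence may be cut off at the end,
    # so decode incrementally and only demand a complete final sequence
    # when the whole file was read.
    whole_file = len(data) <= sample_size
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        decoder.decode(sample, final=whole_file)
        return "utf-8"
    except UnicodeDecodeError:
        # Some string of octets is not UTF-8, so fall back to a guess.
        return fallback
```

Note that a file whose first non-ASCII character appears after the
first sample_size bytes is reported as "ascii", so two programs that
differ only in how much of the file they read can reach different
conclusions about the same file.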

Again, this is a major limitation of any plaintext format in the modern
(post-English-only Internet) world.

-- 
Joe Hildebrand




More information about the rfc-interest mailing list