[rfc-i] Another example of a draft with non-ASCII characters (draft-ietf-iri-3987bis-12.txt)
Joe Hildebrand (jhildebr)
jhildebr at cisco.com
Wed Jul 18 11:14:18 PDT 2012
On 7/18/12 1:53 AM, "Martin J. Dürst" <duerst at it.aoyama.ac.jp> wrote:
>> It has no context to go on, so it's having to sniff
>> out the encoding and guess based on the first bit of the file.
>
>I'm not sure why the dumb Notepad gets it, but Wordpad doesn't. But then
>I'm not using either very much.
Hypothesis: Notepad reads more of the file to perform heuristics than
Wordpad. Often software that knows it's going to be fed poorly-described
encodings will try reading the first N bytes, seeing if there are any
octets greater than 127, then trying different encodings until it finds
one that might fit. If there are a couple of things that are legal UTF-8,
and no strings of octets that are not UTF-8 in those N bytes, the software
can guess UTF-8.
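A minimal sketch of that kind of heuristic, in Python. This is only an
illustration under stated assumptions, not what Notepad or Wordpad
actually do: the function name guess_encoding, the default sample size,
and the cp1252 fallback are all hypothetical choices.

```python
import codecs


def guess_encoding(data: bytes, sample_size: int = 4096,
                   fallback: str = "cp1252") -> str:
    """Guess an encoding by sniffing the first sample_size bytes.

    Hypothetical helper: the name, sample size, and fallback are
    assumptions made for illustration.
    """
    sample = data[:sample_size]

    # No octets above 127 in the sample: nothing to distinguish, call it ASCII.
    if all(octet < 128 for octet in sample):
        return "ascii"

    # Check whether the sample is legal UTF-8. If the sample was cut short,
    # pass final=False so a multi-byte sequence split at the boundary is
    # buffered instead of being reported as an error.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    try:
        decoder.decode(sample, final=len(data) <= sample_size)
    except UnicodeDecodeError:
        # Some octets are not valid UTF-8: fall back to a legacy guess.
        return fallback
    return "utf-8"
```

With a small sample_size, a file whose first non-ASCII octet appears after
the sample is reported as ASCII, which would be consistent with one
program reading more of the file than the other.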
Again, this is a major limitation of any plaintext format in the modern
(post-English-only Internet) world.
--
Joe Hildebrand