[rfc-i] v3imp #6 Byte preservation for figs

Sean Leonard dev+ietf at seantek.com
Fri Jan 23 01:06:16 PST 2015

Improvement Need
#6 Byte preservation for figs

This improvement calls for making principled distinctions between 
octet-oriented (byte-oriented) data and character-oriented data, and 
preserving the data accordingly.

<artwork> consists of characters--no complaints there.

However, along with Improvement #5 comes the need to incorporate octets 
exactly as-is in the canonical format. Right now, when RFCs need to 
include octet-oriented data, there are various "hacks" employed to 
illustrate the data. Specifically, the spec-text makes some statements 
about "such-and-such figure is base64 encoded" or "such-and-such figure 
shows the octets in hexadecimal". These hacks are not amenable to 
automated processing.

When content (such as within the putative <msg>/<content> or <file> 
elements) is included in XML, it is safe to treat the stream as an 
abstract sequence of Unicode code points, i.e., Unicode characters. This 
will work pretty well for things like source code.

But if you want to include things like CBOR, DNS records, TCP/IP 
packets, X.690 (BER/CER/DER), or text that is not in Unicode (e.g., OEM 
Code Page 437), you need other options that can carry the full 
0x00-0xFF octet range. For illustrative purposes, I will use <file>. I 
suggest:

<file encoding="base64">...base64-encoded data...</file>

<file encoding="quoted-printable">...quoted-printable data...</file>

<file encoding="iso-8859-1">...U+0080 - U+00FF map to 0x80 - 0xFF; 
regrettably 0x00-0x1F are not cleanly representable unless you use the 
ASN.1 XML trick of <soh/> etc. elements...</file>
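
As a rough illustration of how a processor might recover the octets 
from the three variants above, here is a minimal Python sketch. The 
<file> element and its encoding attribute are only the putative names 
used in this message, and the function name file_octets is made up for 
the example; none of this is an existing vocabulary or tool.

```python
import base64
import quopri
import xml.etree.ElementTree as ET


def file_octets(xml_text: str) -> bytes:
    """Recover the octets carried by a hypothetical <file encoding="..."> element."""
    elem = ET.fromstring(xml_text)
    encoding = elem.get("encoding", "")
    text = elem.text or ""
    if encoding == "base64":
        # Line wrapping inside the element is not significant, so drop whitespace.
        return base64.b64decode("".join(text.split()), validate=True)
    if encoding == "quoted-printable":
        # Quoted-printable content is pure ASCII; =XX escapes carry the raw octets.
        return quopri.decodestring(text.encode("ascii"))
    if encoding == "iso-8859-1":
        # Code points up to U+00FF map one-to-one onto octets; anything
        # above U+00FF raises UnicodeEncodeError.
        return text.encode("iso-8859-1")
    raise ValueError(f"unsupported encoding: {encoding!r}")
```

For example, file_octets('<file encoding="base64">AAH/gA==</file>') 
yields the four octets 0x00 0x01 0xFF 0x80, which the iso-8859-1 
variant could not carry as literal characters.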

I also suggest that data pulled from <file src=... /> always be treated 
as octet-oriented data. (By contrast, data pulled from <sourcecode 
src=... /> could always be treated as UTF-8 character-oriented data; 
checking for ill-formed UTF-8 sequences is thus a validation issue for 
<sourcecode src=""> but not for <file src="">.)
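
The distinction can be sketched in a few lines of Python. This assumes 
the octets behind the src attribute have already been fetched; the 
function name load_src is hypothetical, and the element names are the 
putative ones from this message.

```python
def load_src(element_name: str, raw: bytes):
    """Interpret octets fetched via a src attribute on a putative element."""
    if element_name == "file":
        # Octet-oriented: every byte sequence is acceptable as-is.
        return raw
    if element_name == "sourcecode":
        # Character-oriented: strict decoding raises UnicodeDecodeError
        # on ill-formed UTF-8, which is the validation step in question.
        return raw.decode("utf-8", errors="strict")
    raise ValueError(f"unexpected element: {element_name!r}")
```

With this split, load_src("file", b"\xff\xfe") simply returns the two 
octets, while load_src("sourcecode", b"\xff\xfe") fails validation.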

Consensus should be reached on whether "files" (i.e., elements marked 
as <file> or some such) are always octet-oriented, or whether they can 
be labeled explicitly as character-oriented or octet-oriented.

Regarding Internet message data (<content> or <msg>): consensus should 
be reached on how to handle the UTF-8 (EAI/SMTPUTF8) and BINARYMIME 
(BDAT) extensions. I am ideologically in favor of both, especially when 
the data is delivered via a @src attribute. However, mixing both has 
some editing "issues" unless your editor really knows what it's doing.

