[rfc-i] v3imp #6 Byte preservation for figs
dev+ietf at seantek.com
Fri Jan 23 01:06:16 PST 2015
#6 Byte preservation for figs
This improvement calls for making principled distinctions between
octet-oriented (byte-oriented) data and character-oriented data, and
preserving the data accordingly.
<artwork> is composed of characters--no complaints there.
However, along with Improvement #5 comes the need to incorporate octets
exactly as-is in the canonical format. Right now, when RFCs need to
include octet-oriented data, there are various "hacks" employed to
illustrate the data. Specifically, the spec-text makes some statements
about "such-and-such figure is base64 encoded" or "such-and-such figure
shows the octets in hexadecimal". These hacks are not amenable to
When content (such as within the putative <msg>/<content> or <file>
elements) is included in XML, it is safe to treat the stream as an
abstract sequence of Unicode code points, i.e., Unicode characters. This
will work pretty well for things like source code.
But if you want to include things like CBOR, DNS records, TCP/IP
packets, X.690 (BER/CER/DER), or text that is not in Unicode (e.g., OEM
Code Page 437), you need other options that can carry the full
0x00-0xFF octet range. For illustrative purposes, I will use <file>. I suggest:
<file encoding="base64">...base64-encoded data...</file>
<file encoding="quoted-printable">...quoted-printable data...</file>
<file encoding="iso-8859-1">...U+0080 - U+00FF map to 0x80 - 0xFF;
regrettably 0x00-0x1F are not cleanly representable unless you use the
ASN.1 XML trick of <soh/> etc. elements...</file>
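To make the three labels above concrete, here is a minimal Python sketch of how a processor might turn the character content of a <file> element back into octets. The function name decode_file_payload is hypothetical, and nothing here is part of any v3 vocabulary; it only assumes that the encoding attribute takes the three values suggested above.

```python
import base64
import quopri


def decode_file_payload(text: str, encoding: str) -> bytes:
    """Recover the octets from the character content of a hypothetical <file> element.

    `encoding` is assumed to be one of the three labels suggested in this post.
    """
    if encoding == "base64":
        # XML content is often wrapped and indented, so strip whitespace first.
        compact = "".join(text.split())
        return base64.b64decode(compact, validate=True)
    if encoding == "quoted-printable":
        # Quoted-printable is pure ASCII on the wire.
        return quopri.decodestring(text.encode("ascii"))
    if encoding == "iso-8859-1":
        # Each code point U+0000..U+00FF maps to the octet of the same value.
        # Anything above U+00FF raises UnicodeEncodeError.
        return text.encode("latin-1")
    raise ValueError(f"unsupported encoding label: {encoding!r}")
```

Note that the iso-8859-1 branch can only ever see what the XML parser hands over, so the octets that XML 1.0 cannot carry as characters (most of 0x00-0x1F) never reach it; that is the limitation mentioned above.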
I also suggest that data pulled from <file src=... /> always be treated
as octet-oriented data. (Comparatively, data pulled from <sourcecode
src=... /> could always be treated as UTF-8 character-oriented
data--thus checking for ill-formed UTF-8 sequences is a validation issue
for <sourcecode src=""> but not for <file src="">.)
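As a rough illustration of that distinction, the sketch below treats data already fetched from @src as raw bytes and only validates it as UTF-8 when the referencing element is <sourcecode>. The helper name load_external is hypothetical, and the element names are simply the ones used in this message.

```python
from typing import Union


def load_external(element: str, raw: bytes) -> Union[bytes, str]:
    """Interpret bytes fetched via a src attribute, depending on the element.

    <file>       -> octet-oriented: returned untouched, no validation.
    <sourcecode> -> character-oriented: must be well-formed UTF-8.
    """
    if element == "file":
        return raw
    if element == "sourcecode":
        # Strict decoding raises UnicodeDecodeError on ill-formed sequences,
        # which is the validation failure described above.
        return raw.decode("utf-8", errors="strict")
    raise ValueError(f"unexpected element: {element!r}")
```

With this split, the byte sequence FF 00 is acceptable for <file src=...> but is rejected for <sourcecode src=...>.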
Consensus should be reached to address whether "files" (i.e., elements
marked as <file> or some such) are always octet-oriented, or whether
they can be labeled explicitly as character-oriented or octet-oriented.
Regarding Internet message data (<content> or <msg>): consensus should
be reached to address UTF-8 (EAI/SMTPUTF8) and BINARYMIME (BDAT)
extensions. I am ideologically in favor of both, especially when the
data is delivered via a @src attribute. However, mixing both has some
editing "issues" unless your editor really knows what it's doing.