RFC Errata


Found 2 records.

Status: Reported (2)

RFC 9485, "I-Regexp: An Interoperable Regular Expression Format", October 2023

Source of RFC: jsonpath (art)

Errata ID: 7990
Status: Reported
Type: Technical
Publication Format(s) : TEXT, PDF, HTML

Reported By: Bjoern Hoehrmann
Date Reported: 2024-06-13

Section 5.1 says:

5.1.  Multi-Character Escapes

   I-Regexp does not support common multi-character escapes (MCEs) and
   character classes built around them.  These can usually be replaced
   as shown by the examples in Table 1.

                      +============+===============+
                      | MCE/class: | Replace with: |
                      +============+===============+
                      | \S         | [^ \t\n\r]    |
                      +------------+---------------+
                      | [\S ]      | [^\t\n\r]     |
                      +------------+---------------+
                      | \d         | [0-9]         |
                      +------------+---------------+

It should say:

5.1.  Multi-Character Escapes

   I-Regexp does not support common multi-character escapes (MCEs) and
   character classes built around them.  These can usually be replaced
   as shown by the examples in Table 1.

                      +============+===============+
                      | MCE/class: | Replace with: |
                      +============+===============+
                      | \d         | [0-9]         |
                      +------------+---------------+

Notes:

`\S` does not match the form feed and vertical tabulation characters in Perl, ECMAScript, and other implementations, whereas the suggested replacement `[^ \t\n\r]` does match them. Because the entire document is about interoperable regular expressions, misrepresenting the common definition of `\S` runs counter to its purpose. Including form feed and vertical tabulation literally in the replacement expression is not likely to be helpful, so removing the misleading rows seems to be the best option.
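
The mismatch described in these notes can be reproduced with a short script. This is only a sketch: it uses Python's `re` module as a stand-in for the "other implementations" mentioned above (Python is not named in the erratum), and the helper name `disagreements` is made up for illustration.

```python
import re

# Replacement that Table 1 of RFC 9485 suggests for \S.
SUGGESTED = re.compile(r"[^ \t\n\r]")
# \S as Python's re module defines it for str patterns
# (assumption: Python stands in for Perl/ECMAScript here).
NATIVE = re.compile(r"\S")


def disagreements(chars):
    """Return the characters on which the two patterns disagree."""
    return [
        c for c in chars
        if bool(NATIVE.fullmatch(c)) != bool(SUGGESTED.fullmatch(c))
    ]


# Form feed (0x0c) and vertical tab (0x0b) are the only disagreements
# among these sample characters.
print([hex(ord(c)) for c in disagreements(" \t\n\r\f\vaZ0")])
# prints ['0xc', '0xb']
```

In this engine the suggested class matches form feed and vertical tab, while `\S` does not, which is the discrepancy the report points out.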

Errata ID: 8505
Status: Reported
Type: Technical
Publication Format(s) : TEXT, PDF, HTML

Reported By: Joe Hildebrand
Date Reported: 2025-07-05

Section 5.1 says:

   The construct \p{IsBasicLatin} is essentially a reference to legacy
   ASCII; it can be replaced by the character class [\u0000-\u007f].

It should say:

See the three possible approaches in the Notes section.

Notes:

Neither "\p{IsBasicLatin}" nor "[\u0000-\u007f]" is a valid i-regexp. I see three possible approaches:

1) This section implies that the sentence describes how to convert a regexp that works in an existing regex engine to i-regexp. If that is the actual intent, then the best fix is for the SingleCharEsc ABNF rule to support Unicode character escapes. That rule is pulled in to the charClassExpr rule, which would make this example correct. When this is fixed, some thought should be given to non-BMP characters, which would need to be escaped either as two UTF-16 code units (a surrogate pair) or with one escape of the form \u{xxxxx}.

2) If the intent of this section is to describe how to convert an i-regexp to the syntax of an existing regexp engine, then the charProp rule will need to be expanded to support "IsBasicLatin", which it currently does not. This will be difficult to get correct and have it stay correct over time as Unicode adds new properties. It also places a relatively difficult burden on implementers.

3) Remove this sentence entirely. Presumably this was added because \p{IsBasicLatin} comes up often enough that this would otherwise be a frequently-asked question. That means that the spec should be fixed for this important use case, rather than ignoring the problem, in my opinion.

I believe the first option is correct, and will fix a hole in the RFC, which currently has inadequate support for the many Unicode characters that are difficult to enter, visualize, and document in their unescaped forms. As an example, getting the suggested \u0000 into a string without escaping is difficult on systems that use null-terminated strings.

Furthermore, if even the *authors* of the spec believe that the i-regexp variant should have had Unicode escapes, this feels like a mistake in the ABNF that warrants a -bis version of this RFC.
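
To make the two escape forms mentioned under option 1 concrete, the following sketch prints both spellings for a given code point. Neither form is part of the RFC 9485 ABNF; the function name `escape_code_point` and its `braces` parameter are hypothetical and exist only for this illustration.

```python
def escape_code_point(cp: int, braces: bool = False) -> str:
    """Spell a Unicode scalar value as a \\u escape.

    Hypothetical helper: RFC 9485 defines no such escapes. It only
    illustrates the two forms discussed in the notes above.
    """
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if braces:
        # Single escape of the form \u{xxxxx}.
        return "\\u{%x}" % cp
    if cp <= 0xFFFF:
        # BMP character: one four-digit escape.
        return "\\u%04x" % cp
    # Non-BMP character: split into a UTF-16 surrogate pair.
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)
    low = 0xDC00 + (offset & 0x3FF)
    return "\\u%04x\\u%04x" % (high, low)


print(escape_code_point(0x0000))                # prints \u0000
print(escape_code_point(0x1F600))               # prints \ud83d\ude00
print(escape_code_point(0x1F600, braces=True))  # prints \u{1f600}
```

The surrogate-pair form needs two escapes per non-BMP character, which is why the notes suggest weighing it against the braced form.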
