Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Readability vs Context-Sensitive Validity #29

Closed
benjamingr opened this issue Jun 20, 2015 · 62 comments
Closed

Readability vs Context-Sensitive Validity #29

benjamingr opened this issue Jun 20, 2015 · 62 comments
Labels

Comments

@benjamingr
Copy link
Collaborator

Currently I see two opposites:

  • Minimal and readable output - We provide an implementation that only escapes what it must via RegExp.escape to support passing in a context-free way to the RegExp constructor. This means dropping ] and } from the escaped set and telling people that they should be aware of context. This would provide a very readable output.
  • Maximal and safe - We provide an implementation that deals with context and is context sensitive, we additionally cater for eval cases and returns of regular expressions from the server. This would provide a much less readable but safer RegExp. Doing this would require escaping numerics to literals to avoid capturing groups as @nikic points out, escaping / to allow eval** as @allenwb has suggested and escaping capturing group identifiers : and !.

It sounds like more people tend to prefer the latter, I'm not convinced because of lacking usage data indicating that the problems it solves happen in real code (data indicates otherwise) but a single counter-example would go a long way to convince me that this is a problem we need to address. I'm very tempted by the safety guarantees it provides for context sensitivity.

  • Escaping everything was ruled out as every single system that did it moved away from it.

** but not whitespace as there is no ״ignore whitespace mode" in JS.

@cscott
Copy link

cscott commented Jun 20, 2015

I think it's reasonable and possible to "escape everything" without impairing readability too much. Numeric literals are the most "surprising", but they would only need be escaped if they are the first character in the string. Similarly, I believe only : ! also only need to be escaped at the beginning of the string.

Being able to use the output in character classes and after '(' in groups seems definitely worth it, in terms of making this useful to programmers without surprising corner cases. The output is not meant to be human readable.

@ljharb
Copy link
Member

ljharb commented Jun 20, 2015

What is the perceived value in readability of the escaped string? It seems to me that the primary usage is going to be along the lines of new RegExp(RegExp.escape(str)) or similar - I can't think of a use case for working with the escaped string in any form except as an intermediary.

Surely the final regex will be potentially less readable, but again, a dynamic regex from user input? I imaging that will be immediately used for matches, a test, or a replacement, and the intermediary regex discarded.

@benjamingr
Copy link
Collaborator Author

@cscott

What is the perceived value in readability of the escaped string?

I often had to debug the resulting string of a compound regular expression, having a less readable string is a real pain and the discussions in Python and Perl prove that I'm not the only one with that problem as they've opted out of that strategy.

Numeric literals are the most "surprising", but they would only need be escaped if they are the first character in the string. Similarly, I believe only : ! also only need to be escaped at the beginning of the string.

Yes, numeric literals would need escaping at the beginning of the string only as would : and !.

Being able to use the output in character classes and after '(' in groups seems definitely worth it, in terms of making this useful to programmers without surprising corner cases.

I tend to think so too. What do you think about the eval use case @allenwb talked about (and the evidence)?

The output is not meant to be human readable.

Well, I'd like to do my best to keep it as readable as possible given the changes and trends in other languages. I hold Python in high regard and they've reached a reverse conclusion (moving from "escape everything" to "escape only a small set"). (cc @ljharb )

@cscott
Copy link

cscott commented Jun 20, 2015

I have no problem with escaping /. Basically, I don't see any big readability problem with escaping any punctuation characters. The numeric literals gave me more pause, since I'd naively expect RegExp.escape("123") to result in 123, but I can live with \x3123.

So basically: put me in the camp that says we should escape everything that might be a gotcha in a reasonable use case, including "["+RegExp.escape(x)+"]" and "/"+RegExp.escape+"/", but I think we can/should make concessions to readability for string-initial escapes, such as digits, : and !.

@minitech
Copy link

Is new RegExp("\\u111" + RegExp.escape(s)) worth considering, where a–f (and A–F) would additionally need to be escaped? /\u111/ on its own is a valid regular expression matching the literal text u111.

I also don’t see how ! or : need to be escaped.

@benjamingr
Copy link
Collaborator Author

@minitech good point, the :! Escaping is for looks lookahead/behind

@cscott
Copy link

cscott commented Jun 25, 2015

Sure, escape leading a-f.

@ljharb
Copy link
Member

ljharb commented Jun 25, 2015

Wouldn't it need to escape any hex character then? eg, [a-fA-F0-9]?

@minitech
Copy link

@benjamingr, could you give an example of where escaping !/: would be necessary?

@benjamingr
Copy link
Collaborator Author

Sure

var a = "(?" + RegExp.escape(":") + ")";
var b = RegExp(a); // should error, but doesn't

@cscott
Copy link

cscott commented Jun 25, 2015 via email

@benjamingr
Copy link
Collaborator Author

To clarify, does this mean we'd rather go in the "escape on the really safe side" way and not the "escape without context assumptions" way?

@cscott
Copy link

cscott commented Jun 25, 2015

Yes. Encode the fewest characters consistent with safety. I expect that
the first (and last?) positions in the string will thus require special
escapes that area not necessary for must characters in the string.
On Jun 25, 2015 1:21 PM, "Benjamin Gruenbaum" [email protected]
wrote:

To clarify, does this mean we'd rather go in the "escape on the really
safe side" way and not the "escape without context assumptions" way?


Reply to this email directly or view it on GitHub
#29 (comment)
.

@minitech
Copy link

@benjamingr, I don’t think it’s important to do that; /(?x)/ is a syntax error, for example. In (), only ? needs to be escaped – and it would be anyway for its other meaning.

@benjamingr
Copy link
Collaborator Author

Right, you'd sometimes get a syntax error, and sometimes not - depending on
if you're trying to escape something that'd be interpreted literally.

On Fri, Jun 26, 2015 at 1:27 AM, Ryan O’Hara [email protected]
wrote:

@benjamingr https:/benjamingr, I don’t think it’s important
to do that; /(?x)/ is a syntax error, for example.


Reply to this email directly or view it on GitHub
#29 (comment)
.

@jdalton
Copy link
Member

jdalton commented Jun 27, 2015

I don't think hex need escaping as they're different from the capture group number reference because \1 is complete and adding 2 to it changes its meaning, \12, where as \uffff is complete and adding a to it doesn't change its meaning.

@cscott
Copy link

cscott commented Jun 27, 2015 via email

@jdalton
Copy link
Member

jdalton commented Jun 27, 2015

Naw, you're getting into completing partially escaped things which is a rabbit hole. What if they did

RegExp('\\' + RegExp.escape(whatever))

I think it's reasonable to scope it to completed escape sequences that would have their meaning changed if an unescape character were added.

@cscott
Copy link

cscott commented Jun 27, 2015

I agree it's a corner case. It would be nice to draw a bright line here.
But your counter-example isn't actually a counter-example, because
is valid, and not an error. So although your example is dodgy,
and definitely could affect the semantics of the result, it doesn't
introduce a nondeterministic exception. That's the place I'd like to draw
the line.

But if you'd like to say that the first character (whatever it is) should
always be escaped, in order to get even-more-deterministic behavior in the
face of odd prefixes, I could get on board with that.
--scott

@minitech
Copy link

The reason I brought it up is that /\ufff/ (and /\ufff%!~/) is valid, whereas /\/ isn’t. ! and : are being escaped, too, even though (?${RegExp.escape(x)}) contains the same sort of mistake.

@cscott
Copy link

cscott commented Jun 27, 2015 via email

@jdalton
Copy link
Member

jdalton commented Jun 27, 2015

@cscott

But your counter-example isn't actually a counter-example, because
is valid, and not an error.

It may not error but it's the beginning of an escape sequence so many characters have very different meanings after that.

@minitech /\ufff/ is matching 'ufff' but I see your angle.

@cscott
Copy link

cscott commented Jun 27, 2015 via email

@benjamingr
Copy link
Collaborator Author

@cscott the first guarantee I'd like to give is that passing the result of RegExp.escape to the regexp constructor and then testing it against the source string always matches true.

A safety guarantee would look like "the behavior should be identical to the behavior you'd get if the escaped sequence fulfills the first guarantee and in addition it is wrapped in a non-capturing group". An octal escape guarantee is also OK I guess.

That said, an ad-hoc set of escapes is perfectly fine, there are only very few special cases in RegExp in JavaScript with context since RegExp syntax is pretty simple after all. You're either inside a character set or not, and you might be added as part of an escape sequence. Those are pretty much all the special cases that need to be addresse.

@cscott
Copy link

cscott commented Jun 29, 2015

@minitech, @benjamingr: Consider:

RegExp.escape(str, /[0-9]$/)

versus:

RegExp.escape(str).replace(/[0-9]$/, /* what goes here? */);

and then what happens with the latter code if str is "\\010" (ie, using a literal backlash) or $ (since RegExp.escape('$') == "\\024").

It would also be nice to be able to do:

str.replace(/something/, (c) => RegExp.escape(c, /[^]/g));

that is, to be able to easily get a version of RegExp.escape that safely encodes every character it is given.

@domenic
Copy link
Member

domenic commented Jun 30, 2015

IMO it would suffice to just have two modes: escape every character ever, and escape the characters that are necessary in new RegExp(RegExp.escape(s)).

@benjamingr
Copy link
Collaborator Author

@cscott interesting, would you mind coming up with a use case where we'd want to escape every character, or escape every number?

Also, what would RegExp.escape escape with this mode? How does the default escape set look?

@cscott
Copy link

cscott commented Jul 1, 2015

@benjamingr The use case to escape every character is to accomodate users like the str.replace example I gave. That is, in order to decouple your code, it is occasionally useful to have an "escape everything" routine available that you can pass into str.replace or into your yacc-generated tokenizer or whatever complicated thing you have which is identifying some substring to escape.

I think allowing a flexible "extra escape" mode would eliminate the need for some of the oddball escapes, such as @allenwb's /' escape. So I believe I am proposing something like:

// "escapeAll" -- encode all characters as shortest-possible escape sequence
var escapeHelper = function(x) {
    var cp = x.codePointAt(0), len = cp < 0x10000 ? 1 : 2;
    var rest = x.length > len ? escapeHelper(x.slice(len)) : '';
    if (cp == 0) {
      return '\\0' + rest; // XXX is this desirable?
    } else if (cp <= 26) {
      return '\\c' + String.fromCodePoint(64 + cp) + rest; // XXX is this desirable?
    } else if (cp <= 0x7F && !/^[DdWwSstrnvf0CcXxUuBb0-9]/.test(x)) {
      return '\\' + String.fromCodePoint(cp) + rest; // XXX is this desirable?
    } else {
      var result = x.toString(16);
      switch(result.length) {
      case 1: return '\\x0' + result + rest;
      case 2: return '\\x' + result + rest;
      case 3: return '\\u0' + result + rest;
      case 4: return '\\u' + result + rest;
      default: return '\\u{' + result + '}' + rest; // requires u flag to RegExp
      }
  }
}
RegExp.escape = function(str, regex) {
  // XXX: type check str and regexp?
  if (typeof regex == 'string') {
     // optional, but it illustrates an interesting use case for RegExp.escape
     regex = new RegExp('[' + RegExp.escape(regex) + ']', 'gu');
  }
  var re = /[-\\^$*+?.()[\]{}]|^[0-9A-Fa-f]/gu;
  if (regex!==undefined) re = re.alt(regex); // RegExp combinator
  // XXX should we force 'g' and 'u' flags on re?
  return String(s).replace(re, escapeHelper);
}

I'm inclined to think it would be better if RegExp.escape (via escapeHelper) always used 6-character unicode escapes for any escaped character, rather than trying to get clever and use the shortest-possible escape string. That would aid predictability of the result and allow easier post-processing of the escape string (although the second parameter is intended to avoid the need for such).

This uses RegExp#alt which combines two regular expressions via an alternation. Here's one implementation:

RegExp.prototype.alt = function(re) {
  return new RegExp('(?:'+this.source+')|(?:'+RegExp(re).source+')', this.flags + re.flags);
}

I think there are some questions about how the flags should be combined here, which generalize to questions about the second parameter of RegExp.escape. Should RegExp.escape(str, /a/) mean that we escape only the first occurrence of a? Or should we use force the use of the g (and u?) flags? I'm inclined to think we should force g and u.

@benjamingr
Copy link
Collaborator Author

@cscott why is var re = /[-\\^$*+?.()[\]{}]|^[0-9A-Fa-f]/gu; (the second part with the ^ I mean)?

@cscott
Copy link

cscott commented Jul 1, 2015

@benjamingr As noted above, we need to protect leading hex digits, in case of:

r1 = new RegExp('\1' + RegExp.escape(str));
r2 = new RegExp('\ufff' + RegExp.escape(str));

The first one was uncontroversial, I think. The second follows from @bergus' formalization of the goal.

@benjamingr
Copy link
Collaborator Author

Oh, you are absolutely right. I mistook what the ^ meant with the gu flags. That makes sense.

So, you're proposing:

  • Escape hex sequences for capturing groups, nulls and other edge cases demonstrated here.
  • Allow for users to escape additional characters with a RegExp like \ or whitespace for eval use cases.

I'm leaning towards this idea at the moment minus the string as the second parameter. Why are : and ! not escaped by default in the first character by the way?

@cscott
Copy link

cscott commented Jul 1, 2015

@benjamingr I'm using @bergus' statement of the goal, which means that new RegExp('(?' + RegExp.escape(str) + ')') is "silly" (really, "a syntax fragment whose concatentation with an arbitrary Disjunction may produce a syntax error"), and so we don't need to escape : and ! (or =).

You can always do RegExp.escape(str, /^[:!=]/g) if you like. :)

I could be persuaded otherwise, of course. I think if you use my goal (something like "the semantics should be the same as if every character was replaced by its unicode escape") you'd have to add those characters, and also tweak escapeHelper to always use the unicode escape (at least if the escaped character was the first in the string).

Oh -- and I think - doesn't strictly need to be there using @bergus' goal? I threw it in because it makes character classes easier. But you could force them to use:

r = new RegExp('[' + RegExp.escape(chars, /-/g) + ']', 'g');

instead.

Maybe we should tweak @bergus' goal to mention character classes.

@bergus
Copy link
Contributor

bergus commented Jul 1, 2015

@cscott Ah, an implementation!

  • /^[DdWwSstrnvf0CcXxUuBb0-9]/ is scary :-) This needs to be extensible for syntax extensions, not sure how MyRegExpSubclass.escape might need to be done.
  • I don't understand '\\c' + String.fromCharCode(64 + cp). This doesn't seem to work with \n (code point 10), which yields "\cJ".
  • Does this script ever just return "\\"+x (+rest)? Do we really need decimal/hexadecimal/unicode escapes for everything? Or was return '\\' + cp + rest; meant to do this?
  • Should \\r, \\n, \\t etc be considered as escape sequences? Your script comment says "shortest possible".
  • I don't like passing a string for the second parameter either. A predicate callback function (to be invoked on each character) if at all :-)
  • Combination of flags: /gu makes sense, not sure whether passing any odd custom flags could do harm to the behaviour of the standard re expression?

@benjamingr

Why are : and ! not escaped by default in the first character by the way?

We consider usage of RegExp.escape where a :, ! or = could do harm in the first character as "silly" :-)

@benjamingr
Copy link
Collaborator Author

We consider usage of RegExp.escape where a :, ! or = could do harm in the first character as "silly" :-)

Don't you think earlier errors are worth it though?

I can look into finding data to see if people actually concat strings this way.

@cscott
Copy link

cscott commented Jul 1, 2015

@bergus:

  • Agreed about /^[DdWwSstrnvf0CcXxUuBb0-9]/ being scary and subclass-hostile. That's why I'd prefer just to use a single type of escapes (which would have to be unicode escapes). That reduces the burdens on subclasses -- as long as they don't break unicode escapes, they can use RegExp.escape with an appropriate additional second parameter (see below).
  • /\cJ/.test('\n') === true. Try it in your node console. No, I'd never used that before either.
  • I believe that escapeHelper('(') will return "\\(" for instance.
  • You're probably right about \r, \n, etc. But I really dislike the "shortest possible" (after trying to implement it). Let's forget I ever said that and throw out half the code in escapeHelper.
  • "Codepoint", not "character", please.
  • The flag I was most worried about was i, for example:
class SafeRegExp extends RegExp { ... }
SafeRegExp.escape = function(s, re) {
 return super.escape(s, re ? /x/g.alt(re) : /x/g); // case-sensitive
}
// what should this print?  Is the `i` infectious?  ignored?
console.log(SafeRegExp.escape('XY', /y/i));
// ideally that would print "X\\u0059" but the implementation is tricky.

@bergus
Copy link
Contributor

bergus commented Jul 1, 2015

@cscott

/\cJ/.test('\n') === true . Try it in your node console. No, I'd never used that before either.

Astonishing. Confusing. OK, it seems to somehow work in regex, but not in string literals.

I believe that escapeHelper('(') will return "\\(" for instance.

I get \40 currently. I think you really meant x or x[0] or String.fromCharacterCodePoint(cp) in that line, instead of cp.

I really dislike the "shortest possible" after trying to implement it

I can see :-) Either output should be acceptable imo.

The flag I was most worried about was i

Yup, though that seems to be solved OK. Maybe we should just loop the string manually and test the standard re and the user-supplied one separately, instead of combining them.
I'm mostly worried about non-standard flags and RegExp subclasses in the second parameter.

@cscott
Copy link

cscott commented Jul 1, 2015

Hm -- is there any reason we should add [\u1000-\uFFFF] to the set of escaped characters, in order the ensure that RegExp.escape still works "as expected" even if the user forgets to add the u flag? I'm currently thinking, "no, this isn't necessary, because well-formed UTF-16 will still match a well-formed UTF-16 literal even without the u flag" . Character classes go awry, but I think you're still screwed if you forget the u flag whether you write /[𝌆]/ or /[\u1D306]/, because those both desugar to 2-character sequences.

And I think we should use the standard \uXXXX escapes instead of the \u{XXXX} escapes which only work if you specify the u flag (it would be one thing if \u{XXXX} threw an exception without the u flag, but instead it silently matches XXXX repetitions of u, which isn't the right thing at all).

See https://mathiasbynens.be/notes/es6-unicode-regex for more details. So I think the above implementation is okay, but I'm mentioning the issues here so others can sanity-check my logic. (Note that with the u flag \a is no longer valid in a regex, buttressing my argument that we should only use unicode escapes in the output of RegExp.escape instead of trying to use the shortest possible escape sequence.)

EDIT: I added \u{....} for unicode surrogates to escapeHelper in the implementation above.

@cscott
Copy link

cscott commented Jul 1, 2015

@bergus yep, you're right about the \( bug. I just edited the original comment to fix that, thanks.

Maybe we should just loop the string manually and test the standard re and the user-supplied one separately, instead of combining them. I'm mostly worried about non-standard flags and RegExp subclasses in the second parameter.

I think that implementation would be reasonable for a polyfill. It takes a little care, since without the g option, the lastIndex property isn't set correctly, and I think you'd have to be careful to make ^ work right. You'd probably have to force the g and y flags, even if you are executing each separately.

But as a matter of spec, it wouldn't be great if the Nth subclass of RegExp had to scan through the string N times looking for each component regexp. And you really want to be able to combine the regexps using something like RegExp.alt to implement subclasses. (Really, most of the corner cases here get pushed into the implementation of RegExp.alt: hjow does it deal with subclasses, etc.)

I think this is one of the things it would be easier to write correctly in the spec than implement as a javascript polyfill, since in the spec you have access to the RegExp internals and aren't limited to the methods exposed to user code.

@benjamingr
Copy link
Collaborator Author

Everyone, please see https:/benjamingr/RegExp.escape/blob/master/EscapedChars.md on what characters we're escaping in each proposal.

Comments appreciated here (or in separate issues about specific characters). LGTMs (and DLGTMs) also appreciated.

@anba
Copy link

anba commented Jul 6, 2015

From this comment (#35):

If it's required (or desirable) to prevent RegExp('\\u004' + RegExp.escape('A')) expanding to
RegExp('\\u004A'), then you probably also want to make sure RegExp('\\c' + RegExp.escape('G')) does not expand to RegExp('\\cG'). That means not only 0-9A-Fa-f, but
actually 0-9A-Za-z_ needs to be escaped.

Notes:

  • /\c/ is not handled as an IdentityEscape for c, but instead matches the string "\\c".
  • ControlLetter (cf. 21.2.1 Patterns) includes only A-Za-z, the extended range 0-9A-Za-z_ is needed for web-browser compatibility in CharacterClass. For example: /\c_/.test("\u001f") === false, but /[\c_]/.test("\u001f") === true.

@benjamingr
Copy link
Collaborator Author

To be completely honest I'm not sure what _ControlLetter_s are for and how
I can match one. Would you mind elaborating what these are for?

On Mon, Jul 6, 2015 at 5:44 PM, André Bargull [email protected]
wrote:

From this comment
#35 (comment)
(#35 #35):

If it's required (or desirable) to prevent RegExp('\u004' +
RegExp.escape('A')) expanding to
RegExp('\u004A'), then you probably also want to make sure RegExp('\c'
+
RegExp.escape('G')) does not expand to RegExp('\cG'). That means not
only 0-9A-Fa-f, but
actually 0-9A-Za-z_ needs to be escaped.

Notes:

  • /\c/ is not handled as an IdentityEscape for c, but instead matches
    the string "\c".
  • ControlLetter (cf. 21.2.1 Patterns
    http://ecma-international.org/ecma-262/6.0/#sec-patterns) includes
    only A-Za-z, the extended range 0-9A-Za-z_ is needed for web-browser
    compatibility in CharacterClass. For example: /\c_/.test("\u001f")
    === false, but /[\c_]/.test("\u001f") === true.


Reply to this email directly or view it on GitHub
#29 (comment)
.

@anba
Copy link

anba commented Jul 6, 2015

Control letter -> ASCII characters below U+0020.
/\cA/.test("\u0001") === true, /\cB/.test("\u0002") === true, etc.

@benjamingr
Copy link
Collaborator Author

@anba thanks. If you see anything else we missed I'd love to hear it. I was unable to find usages of this on GH.

@jbnicolai
Copy link

Not that these codepoints are not 'letters', which is why the wiki refers to them as control characters ;-)

@nightwing
Copy link

According to http://ecma-international.org/ecma-262/6.0/#sec-forbidden-extensions if u flag is specified any of [A-Za-z] cannot be used as an escape for itself, and /\c/u or /\u123/u must throw an error. So it makes sense to treat all partially escaped things as "silly".

@benjamingr
Copy link
Collaborator Author

This issue is to be decided in the next TC meeting. Will update.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests