Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regarding EscapeRegExpPattern.md #26

Closed
allenwb opened this issue Jun 19, 2015 · 8 comments
Closed

Regarding EscapeRegExpPattern.md #26

allenwb opened this issue Jun 19, 2015 · 8 comments

Comments

@allenwb
Copy link
Member

allenwb commented Jun 19, 2015

In ES6 EscapeRegExpPattern is defined in 21.2.3.2.4. In ES5.1 it wasn't a named abstract operation but its semantics for escaping and setting the source property was specified in 15.10.4.1.

In both ES5.1 15.10.6.4 and Es6 21.2.5.14 RegExp.prototype.toString is specified to use the value of the source property. Both the ES5.1 and ES6 specs for include a note stating that the value returned should be in the form of a RegularExpressionLiteral that would evaluate to a RegExp object that would have the same matching behavior as the original object.

@benjamingr
Copy link
Collaborator

cc @EladRK

@benjamingr
Copy link
Collaborator

Thanks @allenwb , I'll keep this short since you're a busy guy:

Correct me if I'm wrong, EscapeRegExpPattern (as the name implies) takes a pattern and escapes it so that it can be represented as a string.

What RegExp.escape does is take a string and escapes it so it can be literally represented as a pattern.

While the two are related operations - I don't see how .escape can benefit from using EscapeRegExpString. Am I missing anything here?


I did not write EscapeRegExpPattern.md and it's WiP but I pinged the author and I'm sure he'll appreciate being pointed to the RegExp constructor for the source property. When he'll finish it (and a review of what it does and how/if implemetnations) we'll update the readme.

@allenwb
Copy link
Member Author

allenwb commented Jun 19, 2015

@benjamingr
I was just have similar thoughts, about the differences between EscapeRegExpString and RegExp.escape -- they are transformation that are going in different directions and so it may make sense for them to have different escaping rules. If so, we should try to have words in the spec. that make it clear why they are different. If we found this confusing, it's a good bet that future spec. readers will also find it confusing.

So, let's drill a bit more into the possible use cases for RegExp.escape and what they imply about what needs to be escaped. It seems to me that there are two ways the programmer are likely to use RegExp.escape.

  1. Use it to generate a string that will be passed as the first argument to the RegExp constructor.
    The string may or may not be concatenated with additional Pattern syntax before passing it to the constructor. In this case, / and LineTerminator characters don't need to be escaped because the constructor will treat them as _PatternCharacter_s.
  2. Use it to generate text that can be parsed as a RegularExpressionLiteral. The result string might be passed to eval but this technique might also be used by a transpiler or somebody who is dynamically generating JS code to feed into the module loader API. For this use case / and any LineTerimantor characters need to be escaped as the unescaped characters can not validly occurs in a RegularExpressionLiteral.

These are both valid use cases and a programmer with either problem might reasonably expect that RegExp.escape should be useful.

If / and LineTerminator characters are not escaped, then RegExp.escape is useless (by itself) for the second use case. But escaping / and LineTerminator characters does not have any detrimental effect upon the first use case.

To me, this makes it clear that the more broadly useful and less error prone semantics is to always escape / and LineTerminator characters.

@benjamingr
Copy link
Collaborator

@allenwb

they are transformations that are going in different directions and so it may make sense for them to have different escaping rules.

Agreed.

If so, we should try to have words in the spec. that make it clear why they are different. If we found this confusing, it's a good bet that future spec. readers will also find it confusing.

Definitely, I think EscapeRegExpString was named in a way that was fine when it was named but is confusing in light of RegExp.escape, I will add a note in the RegExp.escape specification explaining the relationship between the two.

  1. Use it to generate a string that will be passed as the first argument to the RegExp constructor. The string may or may not be concatenated with additional Pattern syntax before passing it to the constructor. In this case, / and LineTerminator characters don't need to be escaped because the constructor will treat them as PatternCharacters.

Agreed.

  1. Use it to generate text that can be parsed as a RegularExpressionLiteral. The result string might be passed to eval but this technique might also be used by a transpiler or somebody who is dynamically generating JS code to feed into the module loader API. For this use case / and any LineTerimantor characters need to be escaped as the unescaped characters can not validly occurs in a RegularExpressionLiteral.

This is an interesting case and I've spent some time looking into it in the past few days. I have scraped hundreds of thousands of code bases (10k npm most popular and 10k most dependent on packages, GH search result for various matches, top websites in alexa ranking, code and symbol search. Virtually no one is doing eval on regex in npm or websites, there are less than 10 relevant repos on GH that turned up by searches. The results are (alongside most of the code used to obtain them) in the /data folder.

I've consulted a computer engineering professor about it and they seemed content with the methodology.

These are both valid use cases and a programmer with either problem might reasonably expect that RegExp.escape should be useful.

These are use cases programmers don't actually do in RegExp. Other languages (even those with regexp delimiters like PHP, that don't need eval to need them) do not escape /. The only language that escapes whitespace is C# and that's only because it has a special "ignore whitespace" mode.

So practically, unless I'm missing something in the data, the second use case does not actually exist in the wild. If you have data or a code base that indicates otherwise I'm all up for escaping / and whitespace.

But escaping / and LineTerminator characters does not have any detrimental effect upon the first use case.

It makes the resulting string less readable and longer. Assuming the longer thing (more memory) isn't actually a problem - the readability is something that got brought up at least twice before in programming languages.

Perl switched its syntax because of it at one point, Python first changed their processing of _ because of it and as Martijn pointed out are changing their syntax again in their new regular expressions module to only escape an even smaller subset. There is a link to the Python thread from today's meeting but to make things easier here it is.

To be fair, I'm very amendable to adding - / and LineTerminator to the spec - they're not really expensive to add, I just need convincing data.

@allenwb
Copy link
Member Author

allenwb commented Jun 19, 2015

@benjamingr

Definitely, I think EscapeRegExpString was named in a way that was fine when it was named but is confusing in light of RegExp.escape, I will add a note in the RegExp.escape specification explaining the relationship between the two.

And, if necessary we can rename EscapeRegExpString

Virtually no one is doing eval on regex in npm or websites, there are less than 10 relevant repos on GH that turned up by searches. The results are (alongside most of the code used to obtain them) in the /data folder.

I've consulted a computer engineering professor about it and they seemed content with the methodology.

In other words you have proven that some people indeed do exactly what I've described. Concatenate /'s around a string and pass it to eval. So, it's quite reasonable to ask what such people would expect when they encounter the new feature RegExp.escape.

So practically, unless I'm missing something in the data, the second use case does not actually exist in the wild. If you have data or a code base that indicates otherwise I'm all up for escaping / and whitespace.

New features need to thought about from the perspective of the the future as well as the past. People do write source-to-source translators in JS and are likely to increasingly do so. I mentioned dynamic module generation. Another possibility that seems increasingly likely is such translations taking place within template string tag handlers.

But escaping / and LineTerminator characters does not have any detrimental effect upon the first use case.

It makes the resulting string less readable and longer. Assuming the longer thing (more memory) isn't actually a problem - the readability is something that got brought up at least twice before in programming languages.

Presumably most strings that are going to be escaped are intented to be mechanically processed. What are use cases where readability is of primary importance. In particular, the escaping under discussion would only occur if / or a LineTerminator character actually occurred in the string to be escaped. How would you expect the resulting string to appear if they weren't escape.

I think we can agree that the longer string is not a problem; so mentioning it doesn't contribute to discussion.

Perl switched its syntax because of it at one point, Python first changed their processing of _ because of it and as Martijn pointed out are changing their syntax again in their new regular expressions module to only escape an even smaller subset. There is a link to the Python thread from today's meeting but to make things easier here it is.

So, Python's 3.0 re.escape() does escape / and line terminators. I don't know anything about their new regular expression module work. Are there any indications that / and line terminators are among the characters they are no longer going to escape? Those are the only characters that a relevant to this discussion.

To be fair, I'm very amendable to adding - / and LineTerminator to the spec - they're not really expensive to add, I just need convincing data.

I don't think you need data. Not all design issues can be settled using data. It seems pretty clear that RegExp.escape covers a wider range of use cases if it escapes / and line terminators . And I've haven't heard any thing that convinces me that doing so is at all detrimental.

Personally, I was unsure about whether this additional escaping was necessary or desirable when I raised the question. The discussion has convinced me that it is, at least, desirable.

@benjamingr
Copy link
Collaborator

And, if necessary we can rename EscapeRegExpString

What about GetPatternStringRepresentation or GetEscapedStringForPattern or something like that?

In other words you have proven that some people indeed do exactly what I've described. Concatenate /'s around a string and pass it to eval. So, it's quite reasonable to ask what such people would expect when they encounter the new feature RegExp.escape.

I don't understand, there are less than 10 usage examples of the pattern, and all of them are wrong in some other way too like doing eval("x = x.test(/" + regex +"/"). There are a ton of people using the RegExp constructor. I could even PR every single code base that does this in a day. Why would we want to support esoteric usage?

Just to make it clear, by not supporting it we're not breaking anyone's code, we're simply not supporting a use case that doesn't actually exist in code bases on the web.

I'm not sure why we would want to support passing it to eval any more than we want to support passing Array#toString to eval as an array or Date#toString to eval. Passing to eval just sounds like such an esoteric use case and even languages that have the delimiters (/) as part of the regexp string itself do not escape it.

Again, if you can show people actually do this - this changes the picture completely.

New features need to thought about from the perspective of the the future as well as the past. People do write source-to-source translators in JS and are likely to increasingly do so. I mentioned dynamic module generation. Another possibility that seems increasingly likely is such translations taking place within template string tag handlers.

I have not found a source-to-source translator that does this (and I looked), if you find a counter example I'm all ears.

Presumably most strings that are going to be escaped are intended to be mechanically processed. What are use cases where readability is of primary importance. In particular, the escaping under discussion would only occur if / or a LineTerminator character actually occurred in the string to be escaped. How would you expect the resulting string to appear if they weren't escape.

Well, in Python people complained about readability, but even if they are mechanically processed they are still manually debugged. An escaped like "Hello There Sir/Madan how was your day" (presumably for text search) would become with the current proposal "Hello There Sir/Madan how was your day" but if we escape whitespace it becomes "Hello\\ There\\ Sir\\/Madan\\ how\\ was\\ your\\ day" - this is a simple example that is easy to contrive to more complicated regular expressions.

So, Python's 3.0 re.escape() does escape / and line terminators. I don't know anything about their new regular expression module work. Are there any indications that / and line terminators are among the characters they are no longer going to escape? Those are the only characters that a relevant to this discussion.

Python's old re.escape escapes pretty much everything. People have complained about the escape set being too large and they made explicit opt outs (_ is no longer escaped, data in other_languages doc). They are changing it as Martijn Pieters pointed out to escape a much stricter subset in particular neither / nor whitespace are escaped.

I don't think you need data. Not all design issues can be settled using data. It seems pretty clear that RegExp.escape covers a wider range of use cases if it escapes / and line terminators . And I've haven't heard any thing that convinces me that doing so is at all detrimental.

Not all design issues can be settled using data but in this case the data tells us:

  • The use case for escaping / and whitespace, as we understand it (code gen and eval) is extremely rare and esoteric in real JavaScript code.
  • Other languages (Python, Perl) that escaped / no longer do so because people have complained about it.
  • No other language now escapes whitespace except C#, C# does it because of a special ("ignore whitespace") regexp mode they have. This is for readability.

Personally, I was unsure about whether this additional escaping was necessary or desirable when I raised the question. The discussion has convinced me that it is, at least, desirable.

I don't think it's as clear. I still need convincing of an actual use case to justify the readability impact. Debugging generated regular expressions is something I've had to do several times and I did not enjoy it one bit so anything that helps with that is a win IMO.

I'm also considering allowing escaping more things - see #27

@benjamingr
Copy link
Collaborator

If we want to drop the readability guarantees I'd also like to:

  • Escape - (for sets).
  • Escape numbers to their hex conterparts (to prevent mixing with selection groups).
  • Escape context sensitive characters for lookahead/lookbehind.

This would make the output less readable but safe in a context sensitive environment.


As a side note, no one is using the escape polyfill (or _.escapeRegExp from lodash/underscore) with eval.

@benjamingr
Copy link
Collaborator

Superseding this with #29

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants