LUCENE-9445 Add support for case insensitive regex searches in QueryParser #1708

markharwood · 2020-07-31T14:05:48Z

This PR uses the standard /.../i regex syntax to denote case insensitive queries, exposing the underlying case insensitive regex support added in LUCENE-9386

This could be considered a breaking change if users had a regex immediately followed by the letter i but I imagine a case insensitive search would have been the intention of the searcher all along.

Jira issue: https://issues.apache.org/jira/browse/LUCENE-9445

jimczi

+1 to add the support.
I wonder if the parsing should be more strict and only matches if there is a separator or ends after the i ? That would be "less" breaking even though I think we should consider this change as breaking anyway. Maybe we can introduce it in 8x under a flag ? Is it what you had in mind ?

markharwood · 2020-08-03T16:42:45Z

I wonder if the parsing should be more strict and only matches if there is a separator or ends after the i ?

Yep - end-of-string or space should be required after the end of the regex. Maybe brackets too for Boolean logic?
You're right that this means a breaking change.
I just tested range queries ie num:[1 TO 5]i and that tolerates no whitespace between the closing ] and a search for the letter i so it's currently legal syntax (although weird and probably rare).

I hadn't considered adding a flag for 8.x. If we do I'd prefer to see it support the new behaviour by default - the rationale being that it is better to give an escape hatch to the few admins concerned about BWC for an edge case than continue the legacy of silent failures for potentially many regex searchers assuming /Foo/i was supported syntax.

markharwood · 2020-08-13T12:04:25Z

Progress update - I'm struggling a bit with how to make the parser stricter i.e. ensuring there's a space between /Foo/i and the next search term. I'd also need to allow for ( A OR /Foo/i) with the closing bracket instead of a space.
I'm not yet clear in JavaCC how to write that rule without also consuming the closing bracket. Some BWC issues to work through here too.

markharwood · 2020-08-18T10:01:48Z

I wonder if the parsing should be more strict and only matches if there is a separator or ends after the i ?

@jimczi What is the behaviour for a non-match?
Should /regex/iterm throw an error or be interpreted as /regex/ and iterm?

jimczi · 2020-08-18T11:11:04Z

Should /regex/iterm throw an error or be interpreted as /regex/ and iterm?

I'd prefer that we throw an error if any of the character attached to a regexp is not recognized.

markharwood · 2020-08-18T11:21:18Z

OK. @romseygeek suggested the BWC flag is called "allow_modifiers" and, if false, legacy behaviour is used ie there would be no errors for characters after trailing / and /regex/i is still interpreted as /regex/ and term i.

markharwood · 2020-08-20T14:13:25Z

@jimczi The TL/DR is I think it's going to be too hard to implement the stricter parsing logic.

I spoke with @romseygeek and we couldn't see a neat way that the string after the closing / could be processed. The parser doesn't allow us to write a handler that behaves differently depending on a flag:

Eager option - pass everything immediately after closing `/` to the regexp constructor logic.

When in strict mode , a simple non-i string is easy to detect and error on.
When not in strict mode the non-i string has to then be parsed as a regular clause (along with any boosts). It's not possible for the regex query-building logic to invoke the parser to do this.

Lazy option - pass only any trailing `i` to the regexp constructor logic

This simplifies the regex construction logic but pushes the problem elsewhere - in strict mode, the next clause the parser builds would have to double-check it wasn't immediately preceded by a regex with no whitespace to separate them and throw an error. The parsing framework does not provide such a context that would allow this logic.

Alternative option

We did consider forgetting about /Foo/i syntax and opting for the inline syntax commonly supported by regex parsers e.g. /(?i)Foo/ but this is less well known and more cryptic.

Summary

We felt that the existing lax implementation where /Foo/iphone is interpreted as /[Ff][Oo][Oo]/ OR phone is not so awful that it warrants the added complexity needed to make it work as we'd ideally want.
If there are any smart ideas we're open to them but it looks like a very messy problem.

…arser using the standard /.../i regex syntax

markharwood added the enhancement label Jul 31, 2020

markharwood requested a review from jimczi July 31, 2020 14:05

markharwood self-assigned this Jul 31, 2020

markharwood mentioned this pull request Jul 31, 2020

[KQL] Add regex support elastic/kibana#46855

Closed

jimczi reviewed Aug 3, 2020

View reviewed changes

markharwood force-pushed the fix/9445 branch from 4e1824f to bae40e6 Compare August 10, 2020 09:40

markharwood mentioned this pull request Aug 14, 2020

Option for case insensitive search at runtime elastic/elasticsearch#61162

Closed

7 tasks

markharwood added 2 commits August 25, 2020 09:49

LUCENE-9445 Add support for case insensitive regex searches in QueryP…

0c0446b

…arser using the standard /.../i regex syntax

Docs changes - package docs and CHANGES.txt

b10b65f

markharwood force-pushed the fix/9445 branch from bae40e6 to b10b65f Compare August 25, 2020 09:41

reta mentioned this pull request Jun 21, 2022

[BUG] Uppercase Regex in String queries search opensearch-project/OpenSearch#3578

Closed

asfimport mentioned this pull request Sep 17, 2020

Expose new case insensitive RegExpQuery support in QueryParser [LUCENE-9445] apache/lucene#10485

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

LUCENE-9445 Add support for case insensitive regex searches in QueryParser #1708

LUCENE-9445 Add support for case insensitive regex searches in QueryParser #1708

markharwood commented Jul 31, 2020

jimczi left a comment

markharwood commented Aug 3, 2020 •

edited

Loading

markharwood commented Aug 13, 2020

markharwood commented Aug 18, 2020

jimczi commented Aug 18, 2020

markharwood commented Aug 18, 2020

markharwood commented Aug 20, 2020

LUCENE-9445 Add support for case insensitive regex searches in QueryParser #1708

Are you sure you want to change the base?

LUCENE-9445 Add support for case insensitive regex searches in QueryParser #1708

Conversation

markharwood commented Jul 31, 2020

jimczi left a comment

Choose a reason for hiding this comment

markharwood commented Aug 3, 2020 • edited Loading

markharwood commented Aug 13, 2020

markharwood commented Aug 18, 2020

jimczi commented Aug 18, 2020

markharwood commented Aug 18, 2020

markharwood commented Aug 20, 2020

Eager option - pass everything immediately after closing / to the regexp constructor logic.

Lazy option - pass only any trailing i to the regexp constructor logic

Alternative option

Summary

markharwood commented Aug 3, 2020 •

edited

Loading

Eager option - pass everything immediately after closing `/` to the regexp constructor logic.

Lazy option - pass only any trailing `i` to the regexp constructor logic