feat(expr): switch to fancy-regex crate & update the original version #12329

Merged
11 commits merged into main from xzhseh/feat-fancy-regex on Sep 16, 2023

Conversation

xzhseh
Contributor

@xzhseh xzhseh commented Sep 15, 2023

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

resolve #12119

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features Sqlsmith: Sql feature generation #7934).
  • My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
  • All checks passed in ./risedev check (or alias, ./risedev c)
  • My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)
  • My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)

Documentation

  • My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

Release note

The fancy-regex crate adds support for advanced regex features such as backreferences and positive/negative lookahead. The documentation may need to be updated accordingly.

@xzhseh
Contributor Author

xzhseh commented Sep 15, 2023

In fancy-regex, the pattern is parsed into an ExprTree when RegexBuilder builds the regex, so we do not need to specify the -i flag explicitly; instead, we integrate it into the input pattern.
We also support backreferences and positive/negative lookahead in the regex now, so the following query is valid:

query T
select regexp_replace('foobarbaz', 'a(?=r)', 'X');
----
foobXrbaz
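
To make the lookahead semantics in the query above concrete, here is a minimal sketch in plain Rust that mimics `a(?=r)` by hand. It deliberately does not call fancy-regex; the function name `replace_a_before_r` is hypothetical and only illustrates that the `r` is checked but never consumed, and that only the first match is replaced.

```rust
/// Replace the first `a` that is immediately followed by `r`.
/// Hand-written stand-in for the pattern `a(?=r)`; hypothetical helper,
/// not part of fancy-regex or RisingWave.
fn replace_a_before_r(input: &str, replacement: &str) -> String {
    let bytes = input.as_bytes();
    for i in 0..bytes.len() {
        // Match `a`, then look ahead one byte without adding it to the match.
        if bytes[i] == b'a' && bytes.get(i + 1) == Some(&b'r') {
            let mut out = String::with_capacity(input.len() + replacement.len());
            out.push_str(&input[..i]);
            out.push_str(replacement);
            // The `r` at `i + 1` is kept: it was asserted, not matched.
            out.push_str(&input[i + 1..]);
            return out;
        }
    }
    // No `a` followed by `r`: return the input unchanged.
    input.to_string()
}
```

Calling `replace_a_before_r("foobarbaz", "X")` yields `foobXrbaz`, the same result as the query: the `a` in `bar` is replaced, the `a` in `baz` is not.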

@xzhseh
Contributor Author

xzhseh commented Sep 15, 2023

Part of Regex::new_options is shown below:

fn new_options(options: RegexOptions) -> Result<Regex> {
    let raw_tree = Expr::parse_tree(&options.pattern)?;

    // wrapper to search for re at arbitrary start position,
    // and to capture the match bounds
    let tree = ExprTree {
        expr: Expr::Concat(vec![
            Expr::Repeat {
                child: Box::new(Expr::Any { newline: true }),
                lo: 0,
                hi: usize::MAX,
                greedy: false,
            },
            Expr::Group(Box::new(raw_tree.expr)),
        ]),
        ..raw_tree
    };
    ...
}
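
As a rough illustration of what the wrapper above builds, the sketch below models the tree with a simplified, hypothetical `Node` enum (not the real fancy-regex `Expr` type) and renders it back to pattern syntax. Under that assumption, the wrapper roughly corresponds to prefixing the user pattern with a lazy "any character, including newline" repeat and putting the pattern in a capturing group.

```rust
/// Simplified, hypothetical stand-in for the parsed expression tree.
enum Node {
    /// Any character; `newline: true` means `\n` is matched too.
    Any { newline: bool },
    /// `child` repeated between `lo` and `hi` times.
    Repeat { child: Box<Node>, lo: usize, hi: usize, greedy: bool },
    /// Capturing group.
    Group(Box<Node>),
    /// Sequence of sub-expressions.
    Concat(Vec<Node>),
    /// The user pattern, kept as opaque text in this sketch.
    Raw(String),
}

/// Render a node back into regex-like syntax.
fn render(node: &Node) -> String {
    match node {
        Node::Any { newline: true } => "(?s:.)".to_string(),
        Node::Any { newline: false } => ".".to_string(),
        Node::Repeat { child, lo, hi, greedy } => {
            let quantifier = if *lo == 0 && *hi == usize::MAX {
                "*".to_string()
            } else {
                format!("{{{lo},{hi}}}")
            };
            let lazy = if *greedy { "" } else { "?" };
            format!("{}{}{}", render(child), quantifier, lazy)
        }
        Node::Group(inner) => format!("({})", render(inner)),
        Node::Concat(items) => items.iter().map(render).collect(),
        Node::Raw(text) => text.clone(),
    }
}

/// Build the same shape as the wrapper: lazy any-char repeat, then the grouped pattern.
fn wrap_for_search(pattern: &str) -> Node {
    Node::Concat(vec![
        Node::Repeat {
            child: Box::new(Node::Any { newline: true }),
            lo: 0,
            hi: usize::MAX,
            greedy: false,
        },
        Node::Group(Box::new(Node::Raw(pattern.to_string()))),
    ])
}
```

For example, `render(&wrap_for_search("a(?=r)"))` produces `(?s:.)*?(a(?=r))`: the lazy prefix lets the match start at an arbitrary position, and the group records where the actual match begins and ends.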

@xzhseh xzhseh self-assigned this Sep 15, 2023
@codecov

codecov bot commented Sep 15, 2023

Codecov Report

Merging #12329 (591ac5d) into main (0032145) will decrease coverage by 0.01%.
The diff coverage is 0.00%.

@@            Coverage Diff             @@
##             main   #12329      +/-   ##
==========================================
- Coverage   69.86%   69.86%   -0.01%     
==========================================
  Files        1417     1417              
  Lines      235501   235505       +4     
==========================================
- Hits       164541   164536       -5     
- Misses      70960    70969       +9     
Flag Coverage Δ
rust 69.86% <0.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Changed Coverage Δ
src/expr/src/table_function/regexp_matches.rs 32.00% <0.00%> (-1.34%) ⬇️
src/expr/src/vector_op/regexp.rs 17.24% <0.00%> (-0.19%) ⬇️

... and 4 files with indirect coverage changes

@TennyZhuang
Contributor

Are there any inputs that can make the expression panic?

Contributor

@TennyZhuang TennyZhuang left a comment

Test cases LGTM

Contributor

@wangrunji0408 wangrunji0408 left a comment

LGTM

@xzhseh
Contributor Author

xzhseh commented Sep 15, 2023

> Are there any inputs that can make the expression panic?

Yes. For example, if the input pattern cannot be parsed, RegexBuilder::build() returns an Error.
The current implementation simply calls unwrap() on the result, which panics in this case.
We could add a wrapper to handle this if necessary.

Below are all the Error variants that may be triggered by the backtracking regex approach.

/// An error as the result of parsing, compiling or running a regex.
#[derive(Debug)]
pub enum Error {
    /// An error as a result of parsing a regex pattern, with the position where the error occurred
    ParseError(ParseErrorPosition, ParseError),
    /// An error as a result of compiling a regex
    CompileError(CompileError),
    /// An error as a result of running a regex
    RuntimeError(RuntimeError),

    /// This enum may grow additional variants, so this makes sure clients don't count on exhaustive
    /// matching. Otherwise, adding a new variant could break existing code.
    #[doc(hidden)]
    __Nonexhaustive,
}
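
One possible shape for such a wrapper is sketched below. Everything here is an assumption for illustration: `PatternError` is a local stand-in mirroring the three error kinds above, `build_pattern` is a toy replacement for the fallible build step (it only checks parenthesis balance and ignores escapes), and `compile_or_report` is a hypothetical name. The point is only to show returning an error to the caller instead of calling unwrap().

```rust
use std::fmt;

/// Hypothetical stand-in mirroring the three error kinds listed above.
#[allow(dead_code)]
#[derive(Debug)]
enum PatternError {
    Parse(usize, String),
    Compile(String),
    Runtime(String),
}

impl fmt::Display for PatternError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            PatternError::Parse(pos, msg) => write!(f, "parse error at position {pos}: {msg}"),
            PatternError::Compile(msg) => write!(f, "compile error: {msg}"),
            PatternError::Runtime(msg) => write!(f, "runtime error: {msg}"),
        }
    }
}

/// Toy stand-in for the fallible build step: only checks that parentheses
/// are balanced (escaped parentheses are not handled in this sketch).
fn build_pattern(pattern: &str) -> Result<String, PatternError> {
    let mut depth = 0usize;
    for (pos, ch) in pattern.char_indices() {
        match ch {
            '(' => depth += 1,
            ')' => {
                if depth == 0 {
                    return Err(PatternError::Parse(pos, "unmatched ')'".to_string()));
                }
                depth -= 1;
            }
            _ => {}
        }
    }
    if depth != 0 {
        return Err(PatternError::Parse(pattern.len(), "unclosed '('".to_string()));
    }
    Ok(pattern.to_string())
}

/// Wrapper: turn a build failure into a user-facing error instead of panicking.
fn compile_or_report(pattern: &str) -> Result<String, String> {
    build_pattern(pattern).map_err(|e| format!("invalid regular expression: {e}"))
}
```

With this shape, a bad pattern such as `a(b` surfaces as an `Err` that the SQL layer could report to the user, while a valid pattern such as `a(?=r)` passes through unchanged.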

@xzhseh xzhseh added this pull request to the merge queue Sep 16, 2023
Merged via the queue into main with commit 31fdc26 Sep 16, 2023
27 of 28 checks passed
@xzhseh xzhseh deleted the xzhseh/feat-fancy-regex branch September 16, 2023 01:03
Little-Wallace added a commit that referenced this pull request Sep 18, 2023
commit c82fc9c
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Mon Sep 18 08:37:33 2023 +0000

    chore(deps): Bump chrono from 0.4.30 to 0.4.31 (#12359)

    Signed-off-by: dependabot[bot] <[email protected]>
    Signed-off-by: Runji Wang <[email protected]>
    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
    Co-authored-by: Runji Wang <[email protected]>
    Co-authored-by: TennyZhuang <[email protected]>

commit cbdc1ac
Author: Huangjw <[email protected]>
Date:   Mon Sep 18 16:22:35 2023 +0800

    chore(ci): move release jobs to main-cron pipeline (#12339)

commit b37a19c
Author: Yuhao Su <[email protected]>
Date:   Mon Sep 18 16:18:01 2023 +0800

    feat(dashboard): add memory profiling (#12052)

commit 71d8170
Author: TennyZhuang <[email protected]>
Date:   Mon Sep 18 15:58:26 2023 +0800

    refactor(expr): allow defining functions in frontend (#12287)

    Signed-off-by: TennyZhuang <[email protected]>
    Co-authored-by: zwang28 <[email protected]>
    Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

commit cedaec9
Author: Dylan <[email protected]>
Date:   Mon Sep 18 15:54:10 2023 +0800

    feat(optimizer): support agg group by simplify rule (#12349)

commit 71d9b0b
Author: Noel Kwan <[email protected]>
Date:   Mon Sep 18 15:32:00 2023 +0800

    feat(meta): update StreamJob status on finish (#12342)

commit 784fe56
Author: zwang28 <[email protected]>
Date:   Mon Sep 18 14:47:49 2023 +0800

    fix(backup): ensure correct delta log order (#12371)

commit 711ecd5
Author: congyi wang <[email protected]>
Date:   Mon Sep 18 14:11:24 2023 +0800

    feat(state_table): add iterator sub range under a certain pk prefix (#12251)

commit 1877aed
Author: xiangjinwu <[email protected]>
Date:   Mon Sep 18 13:49:15 2023 +0800

    refactor(sink): impl SinkFormatter for AppendOnly and Upsert (#12321)

commit f304ed2
Author: xxchan <[email protected]>
Date:   Sun Sep 17 20:20:17 2023 +0800

    revert: Revert "chore: add platforms to hakari (#12333)" (#12363)

commit a975d93
Author: Bohan Zhang <[email protected]>
Date:   Sun Sep 17 19:04:24 2023 +0800

    fix: handle kafka sink message timeout error (#12350)

commit 8ef74ad
Author: Runji Wang <[email protected]>
Date:   Sat Sep 16 12:16:02 2023 +0800

    fix(udf): handle visibility of input chunks in UDTF (#12357)

    Signed-off-by: Runji Wang <[email protected]>

commit 31fdc26
Author: Xu <[email protected]>
Date:   Fri Sep 15 21:01:14 2023 -0400

    feat(expr): switch to `fancy-regex` crate & update the original version (#12329)

    Co-authored-by: xzhseh <[email protected]>

commit 0032145
Author: Runji Wang <[email protected]>
Date:   Fri Sep 15 16:57:25 2023 +0800

    refactor(expr): support variadic function in `#[function]` macro (#12178)

    Signed-off-by: Runji Wang <[email protected]>

commit 467ba4b
Author: stonepage <[email protected]>
Date:   Fri Sep 15 16:28:13 2023 +0800

    fix: stream backfill executor use correct schema (#12314)

    Co-authored-by: Noel Kwan <[email protected]>

commit c443197
Author: Dylan <[email protected]>
Date:   Fri Sep 15 16:22:13 2023 +0800

    feat(optimizer): support correlated column in order by (#12341)

commit 8a36ca3
Author: Noel Kwan <[email protected]>
Date:   Fri Sep 15 16:11:03 2023 +0800

    feat(meta): Add `creating_status` field for stream jobs (#12330)

commit bf5b14e
Author: zwang28 <[email protected]>
Date:   Fri Sep 15 16:06:17 2023 +0800

    chore: lift decoding message size limit for ddl client (#12340)

commit c0060b2
Author: zwang28 <[email protected]>
Date:   Fri Sep 15 15:32:14 2023 +0800

    feat(meta): add hummock config relevant tables to rw_catalog (#12337)

commit 59bb645
Author: xxchan <[email protected]>
Date:   Fri Sep 15 14:54:54 2023 +0800

    chore: add platforms to hakari (#12333)

    Signed-off-by: Runji Wang <[email protected]>
    Co-authored-by: Runji Wang <[email protected]>

commit 7baa27f
Author: Bugen Zhao <[email protected]>
Date:   Fri Sep 15 14:00:14 2023 +0800

    chore: split full debug info for release build (#12255)

    Signed-off-by: Bugen Zhao <[email protected]>

commit a99e6f3
Author: Richard Chien <[email protected]>
Date:   Fri Sep 15 13:58:19 2023 +0800

    fix(stream): fix pk indices of GroupTopN executors (#12304)

    Signed-off-by: Richard Chien <[email protected]>

commit 43c010e
Author: Croxx <[email protected]>
Date:   Fri Sep 15 11:59:41 2023 +0800

    chore: fix comment and metrics (#12331)

    Signed-off-by: MrCroxx <[email protected]>

commit 214118b
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Fri Sep 15 10:03:14 2023 +0800

    chore(deps): Bump serde_json from 1.0.106 to 1.0.107 (#12322)

    Signed-off-by: dependabot[bot] <[email protected]>
    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

commit 41ebb2a
Author: Xu <[email protected]>
Date:   Thu Sep 14 22:02:08 2023 -0400

    fix(regexp): substraction overflow when incorrectly speicifying `start` (#12325)

commit a566cfe
Author: Xu <[email protected]>
Date:   Thu Sep 14 12:58:35 2023 -0400

    feat(expr): add `array_sum` (#12162)

    Signed-off-by: Runji Wang <[email protected]>
    Co-authored-by: Runji Wang <[email protected]>

commit 28bbf10
Author: Croxx <[email protected]>
Date:   Fri Sep 15 00:40:27 2023 +0800

    fix(ci): exclude tikv-jemalloc-sys in hakari check (#12320)

    Signed-off-by: MrCroxx <[email protected]>

commit 5aa5a47
Author: zwang28 <[email protected]>
Date:   Thu Sep 14 21:02:01 2023 +0800

    feat(meta): add hummock version relevant tables to rw_catalog (#12309)

commit a740364
Author: Huangjw <[email protected]>
Date:   Thu Sep 14 19:11:04 2023 +0800

    chore(ci): install locales in prebuilt image (#12311)

    Signed-off-by: Bugen Zhao <[email protected]>
    Co-authored-by: Bugen Zhao <[email protected]>

commit 0e72056
Author: StrikeW <[email protected]>
Date:   Thu Sep 14 18:42:34 2023 +0800

    refactor(jdbc-sink): execute statements in batch and set isolation level to RC (#12250)

commit 827ed5e
Author: Dylan <[email protected]>
Date:   Thu Sep 14 17:31:41 2023 +0800

    refactor(connector): migrate cdc source metric from connector to compute (#12283)

commit a934185
Author: Dylan <[email protected]>
Date:   Thu Sep 14 17:31:04 2023 +0800

    fix(optimizer): relax scan predicate pull up mapping inverse restriction (#12308)

commit db0c099
Author: Dylan <[email protected]>
Date:   Thu Sep 14 17:30:28 2023 +0800

    feat(stream): handling watermark in temporal join (#12302)

commit 1ecea63
Author: Bugen Zhao <[email protected]>
Date:   Thu Sep 14 16:43:14 2023 +0800

    refactor(risedev): split the steps for building and running playground (#12279)

    Signed-off-by: Bugen Zhao <[email protected]>
    Co-authored-by: xxchan <[email protected]>

commit ae4b1f8
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Thu Sep 14 08:41:29 2023 +0000

    chore(deps): Bump clap from 4.4.2 to 4.4.3 (#12245)

    Signed-off-by: dependabot[bot] <[email protected]>
    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
    Co-authored-by: Bugen Zhao <[email protected]>

commit 7ca370a
Author: Croxx <[email protected]>
Date:   Thu Sep 14 16:24:19 2023 +0800

    feat(refill): fetch whole sst file when refilling (#12265)

    Signed-off-by: MrCroxx <[email protected]>

commit ec129b6
Author: Yuhao Su <[email protected]>
Date:   Thu Sep 14 16:04:37 2023 +0800

    chore: use cfg! to instead of #cfg[] for jemalloc control policy (#12307)

commit 9814af8
Author: Runji Wang <[email protected]>
Date:   Thu Sep 14 14:45:14 2023 +0800

    feat(expr): add `pg_sleep` function (#12294)

    Signed-off-by: Runji Wang <[email protected]>

commit 4525e67
Author: Noel Kwan <[email protected]>
Date:   Thu Sep 14 14:38:03 2023 +0800

    feat(stream): support source throttling (#12295)

commit 5ffd58d
Author: Dylan <[email protected]>
Date:   Thu Sep 14 14:35:03 2023 +0800

    refactor(connector): replace validate source rpc with jni (#12270)

commit 888f2dd
Author: Eric Fu <[email protected]>
Date:   Thu Sep 14 14:32:59 2023 +0800

    fix: panic when dumping memory profile (#12276)

Signed-off-by: Little-Wallace <[email protected]>
Development

Successfully merging this pull request may close these issues.

expr: switch to fancy-regex crate
3 participants