Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Folllow up: Support for python "re" module for doing regex in jinja templates #2851

Merged
merged 4 commits into from
Oct 26, 2020
Merged

Conversation

hochoy
Copy link
Contributor

@hochoy hochoy commented Oct 26, 2020

resolves #1755

This is a follow-up PR for the same exact feature posted by @whisperstream at #1755 . Descriptions below will almost exactly match. My first attempt at contributing to dbt 😄 ! Please do not hesitate to point out any possible improvements.

Description

Describe the feature
"I have a case where I'd like to be able to do a regex in a jinja template for matching a particular set of values. It would be handier to have a access to python's re module than having to use normal string manipulation and matching" ~ @whisperstream

Describe alternatives you've considered
Similar to the original issue, I have used the following to work around not having RegEx

  1. for loops
  2. string replacement and splitting + array indexing
  3. IF and NOT statements

Additional context
"This change is not database specific, it would apply to all jinja templates in the same way that datetime and pytz modules are made available" ~ @whisperstream

Additionally, I checked the list of exported functions from the native python re module and they do not introduce any unsafe behaviour with respect to database connections, exposed secrets, etc:
https:/python/cpython/blob/83c86cf54b36a7325f615f5adf22b28e48f0e72d/Lib/re.py#L135-L141

__all__ = [
    "match", "fullmatch", "search", "sub", "subn", "split",
    "findall", "finditer", "compile", "purge", "template", "escape",
    "error", "Pattern", "Match", "A", "I", "L", "M", "S", "X", "U",
    "ASCII", "IGNORECASE", "LOCALE", "MULTILINE", "DOTALL", "VERBOSE",
    "UNICODE",
]

If there is a concern that future versions of re will introduce unsafe behaviour, I can close this and create another PR with explicit re imports instead of __all__.

Who will this benefit?
"Anybody who wants to do more complex string manipulation and string checking in a more straight-forward way than is possible today" ~ @whisperstream

For us, this comes up often when an input array of strings to a macro is "dirty" and the strings need to be cleaned/filtered via RegEx. Another case if we are looking for certain string occurrences that might decide the logical IF, ELSE flow of our macros.

Checklist

  • I have signed the CLA
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • I have updated the CHANGELOG.md and added information about my change to the "dbt next" section.

@cla-bot cla-bot bot added the cla:yes label Oct 26, 2020
@hochoy
Copy link
Contributor Author

hochoy commented Oct 26, 2020

First time contributing a PR! Assuming this is accepted, if I push a commit for the Changelog and dbt next sections, how do I re-trigger the circleci tests? Does the test re-run on commit or should I create a new PR? Thanks!

@jtcohen6
Copy link
Contributor

Thanks for this PR, @hochoy, and the detailed description! I'm totally in favor of pulling in regex functionality to the Jinja layer. It's also a great help during dbt development, even if we end up rewriting the trickiest bits in python.

Assuming this is accepted, if I push a commit for the Changelog and dbt next sections, how do I re-trigger the circleci tests?

Any commit you add to this open PR will automatically re-trigger the Circle CI tests.

@drewbanin Do you have any concerns about pulling in re.__all__? I'm remembering Jake's comment from #1997, where we discussed wanting explicit yeslists:

I agree that lots of re should be available (search, match, findall, sub, and the constants seems like the minimum)

I think that's most of what we'd want, anyway, but I also don't have specific reasons to oppose the rest.

@drewbanin
Copy link
Contributor

I think adding the entire regex module is probably appropriate - @iktl any opposition on your end? Thanks for opening this PR @hochoy!

@iktl
Copy link

iktl commented Oct 26, 2020

I think adding the entire regex module is probably appropriate - @iktl any opposition on your end? Thanks for opening this PR @hochoy!

I don't see any reason not to 👍 — no opposition here

@hochoy
Copy link
Contributor Author

hochoy commented Oct 26, 2020

Thank you for the feedback! I hope this will speed up the (already fast) workflow for all dbt devs.

Copy link
Contributor

@jtcohen6 jtcohen6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution @hochoy!

@jtcohen6 jtcohen6 merged commit 9728152 into dbt-labs:dev/kiyoshi-kuromiya Oct 26, 2020
@kdutta-c
Copy link

Do I have to install this as a package? If so, can you tell me how?

@dataders
Copy link
Contributor

dataders commented Sep 17, 2021

Maybe there's a doc out there somewhere, but I'm also confused as to how to use the regex functions inside of dbt-jinja. I wrote a SO question that I'll answer myself once I find out.

@jtcohen6
Copy link
Contributor

No need to install additional packages. You can access methods from modules.re right in the Jinja context, such as match:

https://docs.getdbt.com/reference/dbt-jinja-functions/modules#re

@kdutta-c
Copy link

dbt comes with re. I may have overthought about this because in Python we import re
I did something like this
{% set date_format = '^(\d{4}\-\d{2}\-\d{2})' %}
and then...
{% set date_check = modules.re.match(date_format,blb, modules.re.IGNORECASE) %}

@dataders
Copy link
Contributor

You can access methods from modules.re

beautiful thanks! for reference I was trying to filter an Agate table as given in the Agate cookbook this:

new_table = table.where(lambda row: re.match('^C', str(row['state'])))

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support for python "re" module for doing regex in jinja templates
6 participants