Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

URL encode dataset names to support multibyte characters #1198

Merged
merged 7 commits into from
Sep 26, 2024

Conversation

t0momi219
Copy link
Contributor

@t0momi219 t0momi219 commented Sep 7, 2024

Description

In projects containing models with names like the following, dataset creation fails, and an error occurs during execution.

└── dbt
    └── my_project
        └── models
            ├── 日本語名モデル.sql
            └── 日本語名モデル.yml
[2024-09-07T04:10:58.435+0000] {local.py:455} DEBUG - URIs to be converted to Dataset: ['snowflake://***.ap-northeast-1.aws/TEST_DB.TEST_SCHEMA.日本語名モデル']
[2024-09-07T04:10:58.437+0000] {taskinstance.py:3301} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.12/site-packages/cosmos/operators/local.py", line 459, in get_datasets
    datasets = [Dataset(uri) for uri in uris]
                ^^^^^^^^^^^^
  File "<attrs generated init airflow.datasets.Dataset>", line 3, in __init__
    _setattr('uri', __attr_converter_uri(uri))
                    ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/datasets/__init__.py", line 78, in _sanitize_uri
    raise ValueError("Dataset URI must only consist of ASCII characters")
ValueError: Dataset URI must only consist of ASCII characters

To support model names with multibyte characters, it might be good to URL encode the names.

Related Issue(s)

closes: #1197

Breaking Change?

Checklist

  • I have made corresponding changes to the documentation (if required)
  • I have added tests that prove my fix is effective or that my feature works

@dosubot dosubot bot added the size:XS This PR changes 0-9 lines, ignoring generated files. label Sep 7, 2024
Copy link

netlify bot commented Sep 7, 2024

Deploy Preview for sunny-pastelito-5ecb04 canceled.

Name Link
🔨 Latest commit 10d1eae
🔍 Latest deploy log https://app.netlify.com/sites/sunny-pastelito-5ecb04/deploys/66dc04010bb5aa0008a74e6c

Copy link

netlify bot commented Sep 7, 2024

Deploy Preview for sunny-pastelito-5ecb04 ready!

Name Link
🔨 Latest commit 3b6ac62
🔍 Latest deploy log https://app.netlify.com/sites/sunny-pastelito-5ecb04/deploys/66f56d5aaf48540008404054
😎 Deploy Preview https://deploy-preview-1198--sunny-pastelito-5ecb04.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@dosubot dosubot bot added area:datasets Related to the Airflow datasets feature/module area:execution Related to the execution environment/mode, like Docker, Kubernetes, Local, VirtualEnv, etc execution:local Related to Local execution environment labels Sep 7, 2024
Copy link

codecov bot commented Sep 21, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 95.95%. Comparing base (11de5ba) to head (3b6ac62).
Report is 5 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1198   +/-   ##
=======================================
  Coverage   95.95%   95.95%           
=======================================
  Files          64       64           
  Lines        3656     3657    +1     
=======================================
+ Hits         3508     3509    +1     
  Misses        148      148           
Flag Coverage Δ
95.95% <100.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Collaborator

@tatiana tatiana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @t0momi219 this contribution is very valuable, thank you for working on this!

Please, as suggested by @pankajkoti , could you add a test case?
It would be great if we could have a test that didn't pass this change, when calling get_datasets, and that it passes after this change. This would help make sure future changes in Cosmos won't undo this fix.
The right place is probably: https:/astronomer/astronomer-cosmos/blob/main/tests/operators/test_local.py to confirm

Also, in the PR description, you refer the original issue #1197. Could you add before it one of the following keywords, so Github will automatically close the issue once the PR is merged?

close
closes
closed
fix
fixes
fixed
resolve
resolves
resolved

Once we have the test, we can merge this change. If we can do this during this week, we can release it as part of Cosmos 1.7.

@t0momi219
Copy link
Contributor Author

@pankajkoti @tatiana
Thank you for your review. I plan to implement a test by placing a model with a multibyte character name in the dev/dags/dbt/simple/ project. If there is a more appropriate directory where this model should be placed, please let me know.
I will write the test by tomorrow at the latest.
Thanks!

@tatiana
Copy link
Collaborator

tatiana commented Sep 24, 2024

Hi @t0momi219 , thank you very much! Sounds like a good plan!

@dosubot dosubot bot added size:M This PR changes 30-99 lines, ignoring generated files. and removed size:XS This PR changes 0-9 lines, ignoring generated files. labels Sep 24, 2024
docs/contributing.rst Outdated Show resolved Hide resolved
Copy link
Collaborator

@tatiana tatiana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much, @t0momi219 , for identifying, reporting and fixing this issue. I'm sure many members of the community will be able to benefit from this.

Thanks also for adding the missing test.

I made one last comment inline, if you could address it before we merge this PR, it would be awesome! Thank you very much.

@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Sep 26, 2024
Copy link
Contributor

@pankajkoti pankajkoti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @t0momi219 . Will keep an eye on the CI run & once it succeeds, I will go ahead to merge it

@pankajkoti pankajkoti merged commit e0a9fd3 into astronomer:main Sep 26, 2024
68 checks passed
slords pushed a commit to slords/astronomer-cosmos that referenced this pull request Sep 26, 2024
…1198)

In projects containing models with names like the following, dataset
creation fails, and an error occurs during execution.
```txt
└── dbt
    └── my_project
        └── models
            ├── 日本語名モデル.sql
            └── 日本語名モデル.yml

```

```
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/datasets/__init__.py", line 78, in _sanitize_uri
    raise ValueError("Dataset URI must only consist of ASCII characters")
ValueError: Dataset URI must only consist of ASCII characters
```

To support model names with multibyte characters, it might be good to
URL encode the names.

closes: astronomer#1197

Co-authored-by: Tatiana Al-Chueyr <[email protected]>
@tatiana tatiana mentioned this pull request Oct 2, 2024
tatiana added a commit that referenced this pull request Oct 4, 2024
New Features

* Introduction of experimental support to run dbt BQ models using Airflow deferrable operators by @pankajkoti @pankajastro @tatiana in #1224 #1230.
  This is a first step in this journey and we would really appreciate feedback from the community.

  For more information, check the documentation: https://astronomer.github.io/astronomer-cosmos/getting_started/execution-modes.html#airflow-async-experimental

  This work has been inspired by the talk "Airflow at Monzo: Evolving our data platform as the bank scales" by
  @jonathanrainer @ed-sparkes given at Airflow Summit 2023: https://airflowsummit.org/sessions/2023/airflow-at-monzo-evolving-our-data-platform-as-the-bank-scales/.

* Support using ``DatasetAlias`` and fix orphaning unreferenced dataset by @tatiana in #1217 #1240

  Documentation: https://astronomer.github.io/astronomer-cosmos/configuration/scheduling.html#data-aware-scheduling

* Add GCP_CLOUD_RUN_JOB execution mode by @ags-de #1153

  Learn more about it: https://astronomer.github.io/astronomer-cosmos/getting_started/gcp-cloud-run-job.html

Enhancements

* Create single virtualenv when ``DbtVirtualenvBaseOperator`` has ``virtualenv_dir=None`` and ``is_virtualenv_dir_temporary=True`` by @kesompochy in #1200
* Consistently handle build and imports in ``cosmos/__init__.py`` by @tatiana in #1215
* Add enum constants to init for direct import by @fabiomx in #1184

Bug fixes

* URL encode dataset names to support multibyte characters by @t0momi219 in #1198
* Fix invalid argument (``full_refresh``) passed to DbtTestAwsEksOperator (and others) by @johnhoran in #1175
* Fix ``printer_width`` arg type in ``DbtProfileConfigVars`` by @jessicaschueler in #1191
* Fix task owner fallback by @jmaicher in #1195

Docs

* Add scarf to readme and docs for website analytics by @cmarteepants in #1221
* Add ``virtualenv_dir`` param to ``ExecutionConfig`` docs by @pankajkoti in #1173
* Give credits to @LennartKloppenburg in CHANGELOG.rst by @tatiana #1174
* Refactor docs for async mode execution by @pankajkoti in #1241

Others

* Remove PR branch added for testing a change in CI in #1224 by @pankajkoti in #1233
* Fix CI wrt broken coverage upload artifact @pankajkoti in #1210
* Fix CI issues - Upgrade actions/upload-artifact & actions/download-artifact to v4 and set min version for packaging by @pankajkoti in #1208
* Resolve CI failures for Apache Airflow 2.7 jobs by @pankajkoti in #1182
* CI: Update GCP manifest file path based on new secret update by @pankajkoti in #1237
* Pre-commit hook updates in #1176 #1186, #1186, #1201, #1219, #1231
tatiana added a commit that referenced this pull request Oct 4, 2024
New Features

* Introduction of experimental support to run dbt BQ models using Airflow deferrable operators by @pankajkoti @pankajastro @tatiana in #1224 #1230.
  This is a first step in this journey and we would really appreciate feedback from the community.

  For more information, check the documentation: https://astronomer.github.io/astronomer-cosmos/getting_started/execution-modes.html#airflow-async-experimental

  This work has been inspired by the talk "Airflow at Monzo: Evolving our data platform as the bank scales" by
  @jonathanrainer @ed-sparkes given at Airflow Summit 2023: https://airflowsummit.org/sessions/2023/airflow-at-monzo-evolving-our-data-platform-as-the-bank-scales/.

* Support using ``DatasetAlias`` and fix orphaning unreferenced dataset by @tatiana in #1217 #1240

  Documentation: https://astronomer.github.io/astronomer-cosmos/configuration/scheduling.html#data-aware-scheduling

* Add GCP_CLOUD_RUN_JOB execution mode by @ags-de #1153

  Learn more about it: https://astronomer.github.io/astronomer-cosmos/getting_started/gcp-cloud-run-job.html

Enhancements

* Create single virtualenv when ``DbtVirtualenvBaseOperator`` has ``virtualenv_dir=None`` and ``is_virtualenv_dir_temporary=True`` by @kesompochy in #1200
* Consistently handle build and imports in ``cosmos/__init__.py`` by @tatiana in #1215
* Add enum constants to init for direct import by @fabiomx in #1184

Bug fixes

* URL encode dataset names to support multibyte characters by @t0momi219 in #1198
* Fix invalid argument (``full_refresh``) passed to DbtTestAwsEksOperator (and others) by @johnhoran in #1175
* Fix ``printer_width`` arg type in ``DbtProfileConfigVars`` by @jessicaschueler in #1191
* Fix task owner fallback by @jmaicher in #1195

Docs

* Add scarf to readme and docs for website analytics by @cmarteepants in #1221
* Add ``virtualenv_dir`` param to ``ExecutionConfig`` docs by @pankajkoti in #1173
* Give credits to @LennartKloppenburg in CHANGELOG.rst by @tatiana #1174
* Refactor docs for async mode execution by @pankajkoti in #1241

Others

* Remove PR branch added for testing a change in CI in #1224 by @pankajkoti in #1233
* Fix CI wrt broken coverage upload artifact @pankajkoti in #1210
* Fix CI issues - Upgrade actions/upload-artifact & actions/download-artifact to v4 and set min version for packaging by @pankajkoti in #1208
* Resolve CI failures for Apache Airflow 2.7 jobs by @pankajkoti in #1182
* CI: Update GCP manifest file path based on new secret update by @pankajkoti in #1237
* Pre-commit hook updates in #1176 #1186, #1186, #1201, #1219, #1231
tatiana added a commit that referenced this pull request Oct 4, 2024
**New Features**

* Support using ``DatasetAlias`` and fix orphaning unreferenced dataset
by @tatiana in #1217 #1240

Documentation:
https://astronomer.github.io/astronomer-cosmos/configuration/scheduling.html#data-aware-scheduling

* Add GCP_CLOUD_RUN_JOB execution mode by @ags-de #1153

Learn more about it:
https://astronomer.github.io/astronomer-cosmos/getting_started/gcp-cloud-run-job.html

* Introduction of experimental support to run dbt BQ models using
Airflow deferrable operators by @pankajkoti @pankajastro @tatiana in
#1224 #1230.

This is the first step in the journey of running dbt resources with
native Airflow, and we would appreciate feedback from the community.

For more information, check the documentation:
https://astronomer.github.io/astronomer-cosmos/getting_started/execution-modes.html#airflow-async-experimental

This work has been inspired by the talk "Airflow at Monzo: Evolving our
data platform as the bank scales" by
@jonathanrainer @ed-sparkes given at Airflow Summit 2023:
https://airflowsummit.org/sessions/2023/airflow-at-monzo-evolving-our-data-platform-as-the-bank-scales/.


**Enhancements**

* Create single virtualenv when ``DbtVirtualenvBaseOperator`` has
``virtualenv_dir=None`` and ``is_virtualenv_dir_temporary=True`` by
@kesompochy in #1200
* Consistently handle build and imports in ``cosmos/__init__.py`` by
@tatiana in #1215
* Add enum constants to init for direct import by @fabiomx in #1184

**Bug fixes**

* URL encode dataset names to support multibyte characters by @t0momi219
in #1198
* Fix invalid argument (``full_refresh``) passed to
DbtTestAwsEksOperator (and others) by @johnhoran in #1175
* Fix ``printer_width`` arg type in ``DbtProfileConfigVars`` by
@jessicaschueler in #1191
* Fix task owner fallback by @jmaicher in #1195

**Docs**

* Add scarf to readme and docs for website analytics by @cmarteepants in
#1221
* Add ``virtualenv_dir`` param to ``ExecutionConfig`` docs by
@pankajkoti in #1173
* Give credits to @LennartKloppenburg in CHANGELOG.rst by @tatiana #1174
* Refactor docs for async mode execution by @pankajkoti in #1241

Others

* Remove PR branch added for testing a change in CI in #1224 by
@pankajkoti in #1233
* Fix CI wrt broken coverage upload artifact @pankajkoti in #1210
* Fix CI issues - Upgrade actions/upload-artifact &
actions/download-artifact to v4 and set min version for packaging by
@pankajkoti in #1208
* Resolve CI failures for Apache Airflow 2.7 jobs by @pankajkoti in
#1182
* CI: Update GCP manifest file path based on new secret update by
@pankajkoti in #1237
* Pre-commit hook updates in #1176 #1186, #1186, #1201, #1219, #1231

---------

Co-authored-by: Pankaj Koti <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:datasets Related to the Airflow datasets feature/module area:execution Related to the execution environment/mode, like Docker, Kubernetes, Local, VirtualEnv, etc execution:local Related to Local execution environment lgtm This PR has been approved by a maintainer size:M This PR changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug] Cannot create datasets in projects with model names containing multibyte characters
3 participants