[Bug] DbtVirtualenvBaseOperator uses system-wide dbt instead of virtualenv-specific in v1.7.0 #1246

Open
1 task done
kesompochy opened this issue Oct 8, 2024 · 1 comment · May be fixed by #1252
Labels
area:execution Related to the execution environment/mode, like Docker, Kubernetes, Local, VirtualEnv, etc bug Something isn't working execution:virtualenv Related to Virtualenv execution environment triage-needed Items need to be reviewed / assigned to milestone

Comments

@kesompochy
Contributor

Astronomer Cosmos Version

Other Astronomer Cosmos version (please specify below)

If "Other Astronomer Cosmos version" selected, which one?

1.7.0

dbt-core version

1.8.7

Versions of dbt adapters

No response

LoadMode

DBT_LS_MANIFEST

ExecutionMode

VIRTUALENV

InvocationMode

SUBPROCESS

airflow version

2.4.6

Operating System

Google Cloud Composer (Linux-based)

If you think it's a UI issue, what browsers are you seeing the problem on?

No response

Deployment

Google Cloud Composer

Deployment details

No response

What happened?

In version 1.7.0, the dbt command is being executed using the local node's Python path instead of the virtualenv path. This causes the operator to use the system-wide dbt installation rather than the one installed in the virtualenv.

The operator should use the Python path from the created virtualenv to execute dbt commands, ensuring that the correct (virtualenv-specific) version of dbt is used.
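The expected behavior can be sketched as follows. This is a minimal illustration, not the Cosmos implementation: the helper name build_dbt_command, the fallback argument, and the path /tmp/venv/bin/python are hypothetical, and it assumes the dbt executable sits next to the virtualenv's Python binary.

```python
from pathlib import PurePosixPath
from typing import List, Optional


def build_dbt_command(
    py_bin: Optional[str], dbt_args: List[str], fallback: str = "dbt"
) -> List[str]:
    """Hypothetical helper: pick the dbt binary that lives beside py_bin."""
    if py_bin is None:
        # No virtualenv interpreter is known, so the system-wide dbt is used.
        # This is the situation described in this issue.
        return [fallback, *dbt_args]
    # Assumption: the virtualenv installs dbt in the same bin/ directory
    # as its Python interpreter.
    dbt_bin = PurePosixPath(py_bin).parent / "dbt"
    return [str(dbt_bin), *dbt_args]
```

With py_bin set, the command starts with the virtualenv's dbt; with py_bin left as None, it falls back to whatever dbt the worker resolves by default.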

Relevant log output

...
[2024-10-08, 04:33:22 UTC] {virtualenv.py:86} INFO - Trying to run the command:
 ['/opt/python3.8/bin/dbt', 'deps', '--project-dir', '/tmp/tmpmykndlmp', '--profiles-dir', '/tmp/cosmos/profile/9285b6cf4e81fba356567ec2da8448788576db0af33ebb38689f0fbf0760e6e8', '--profile', 'my-profile', '--target', 'test']
...
[2024-10-08, 04:33:44 UTC] {subprocess.py:90} INFO - 04:33:44  Running with dbt=1.5.4
...

How to reproduce

  1. Set up a DAG using the DbtVirtualenvBaseOperator in Cloud Composer.
  2. Run the DAG and observe the task logs.
  3. In the logs, you should see that the dbt command is being executed with the system Python path instead of the virtualenv path.

Anything else :)?

This bug was introduced by #1200, which I created. I apologize for this bug.
I noticed that self._py_bin is None inside the self.run_subprocess method, so the dbt command is not rewritten to use the virtualenv binary during task execution.

It appears that this issue is caused by self being bound to a different instance within the self.run_subprocess method. I added the following logger at various points to confirm that the instance id differs at the time of run_subprocess:

self.log.info("method called on instance %s", id(self))

The logs show:

[2024-10-08, 05:24:15 UTC] {virtualenv.py:145} INFO - execute called on instance 275939531936
[2024-10-08, 05:24:15 UTC] {virtualenv.py:158} INFO - _prepare_virtualenv called on instance 275939531936 
[2024-10-08, 05:26:20 UTC] {virtualenv.py:90} INFO - run_subprocess called on instance 275940314992

I'm wondering if this might be related to Airflow's DAG pickling.

I believe this may have been the direct cause of the bug reported in #958. When I investigated in version 1.5.0, I also found that properties were disappearing within the run_subprocess method.
To address this, I plan to make invoke_dbt a property so that it always resolves to the run_subprocess method of the currently active instance, rather than assigning an instance method to invoke_dbt directly in __init__.
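The difference between the two approaches can be reproduced outside Airflow. The sketch below is only an illustration under stated assumptions: the class names EagerOperator and LazyOperator are hypothetical stand-ins, and copy.copy is used as one plausible way a second instance could appear; whether Airflow copies or pickles the operator on this code path is not confirmed here.

```python
import copy


class EagerOperator:
    """Hypothetical stand-in: binds invoke_dbt once, in __init__."""

    def __init__(self):
        self._py_bin = None
        # The bound method is stored in the instance __dict__ and stays
        # tied to *this* object, even if the object is shallow-copied.
        self.invoke_dbt = self.run_subprocess

    def run_subprocess(self):
        return id(self), self._py_bin


class LazyOperator:
    """Hypothetical stand-in: resolves invoke_dbt on every access."""

    def __init__(self):
        self._py_bin = None

    @property
    def invoke_dbt(self):
        # Looked up at call time, so it is bound to the current instance.
        return self.run_subprocess

    def run_subprocess(self):
        return id(self), self._py_bin


def run_on_copy(operator_cls):
    """Copy an operator, set _py_bin on the copy, then invoke dbt on it."""
    original = operator_cls()
    clone = copy.copy(original)  # assumption: a shallow copy is made
    clone._py_bin = "/tmp/venv/bin/python"  # hypothetical virtualenv path
    seen_id, seen_py_bin = clone.invoke_dbt()
    return seen_id == id(clone), seen_py_bin
```

Calling run_on_copy(EagerOperator) shows run_subprocess running on the original instance, where _py_bin is still None, which matches the differing instance ids in the logs above. Calling run_on_copy(LazyOperator) runs it on the copy and sees the virtualenv path.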

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Contact Details

No response

@kesompochy kesompochy added bug Something isn't working triage-needed Items need to be reviewed / assigned to milestone labels Oct 8, 2024

dosubot bot commented Oct 8, 2024

Hey, @kesompochy! I'm here to help you with any bugs, questions, or contributions you have. Let's work together to resolve your issue.

The issue you're experiencing with the DbtVirtualenvBaseOperator in version 1.7.0 seems to be related to instance binding, where self._py_bin becomes None during the run_subprocess method. This is likely due to Airflow's DAG pickling process, which can cause method references to lose their context.

To address this, you can modify the invoke_dbt method to be treated as a property. This ensures that the correct instance is referenced during execution, maintaining the integrity of instance-specific attributes like self._py_bin.

Here's a potential approach:

  1. Change the invoke_dbt method to a property to ensure it always references the current instance.
  2. Verify that self._py_bin is correctly set in the _prepare_virtualenv method and persists through execution.

This should help maintain the correct virtual environment path during subprocess execution. If you're planning to submit a PR, this change could effectively resolve the issue.


@dosubot dosubot bot added area:execution Related to the execution environment/mode, like Docker, Kubernetes, Local, VirtualEnv, etc execution:virtualenv Related to Virtualenv execution environment labels Oct 8, 2024