Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check pandas before import #229

Merged
merged 5 commits into from
Aug 24, 2022
Merged

Check pandas before import #229

merged 5 commits into from
Aug 24, 2022

Conversation

ChenyuLInx
Copy link
Contributor

@ChenyuLInx ChenyuLInx commented Aug 9, 2022

resolves #228

Description

Add pandas as the default packages for dbt-snowflake

Checklist

  • I have signed the CLA
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR: There's no way to actually test this since we already signed the Anaconda agreements
  • I have run changie new to create a changelog entry

@cla-bot cla-bot bot added the cla:yes label Aug 9, 2022
@github-actions
Copy link
Contributor

github-actions bot commented Aug 9, 2022

Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the dbt-snowflake contributing guide.

@ChenyuLInx ChenyuLInx marked this pull request as ready for review August 9, 2022 21:29
Copy link
Contributor

@jtcohen6 jtcohen6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 code looks cleaner

Another approach we could take: Adding some if pandas logic within our "materialization" code, to detect if it's installed before trying to call it. Two reasons:

  • Requiring pandas for all Python models means that enabling Anaconda packages / accepting third-party terms is a prerequisite for using Snowpark Python at all (Snowpark setup third-party terms docs.getdbt.com#1848)
  • Could loading pandas in every stored proc risk slowing down its runtime?

Basic idea: this code

# we have to make sure pandas is imported
import pandas
if isinstance(df, pandas.core.frame.DataFrame):
# session.write_pandas does not have overwrite function
df = session.createDataFrame(df)

would become something like:

try:
    import pandas
else:
    pandas = None

if pandas and isinstance(df, pandas.core.frame.DataFrame): 
     # session.write_pandas does not have overwrite function 
     df = session.createDataFrame(df)

dbt/adapters/snowflake/impl.py Outdated Show resolved Hide resolved
Co-authored-by: Jeremy Cohen <[email protected]>
@ChenyuLInx
Copy link
Contributor Author

ChenyuLInx commented Aug 16, 2022

@jtcohen6 As I looked into this more, I realized that the reason we don't run into this issue without specify pandas was because pandas is required(EDIT actually optional requirement) for snowflake-snowpark package. So I think the other approach makes sense!

Product call and we will make it happen.

Screen Shot 2022-08-15 at 5 14 01 PM

-

@jtcohen6
Copy link
Contributor

@ChenyuLInx Sounds good! Let's do the other approach then: check to see whether pandas is installed, and then proceed accordingly.

Only question: It's not totally clear to me if a user could still call .to_pandas() on a Snowpark data frame, even if pandas isn't loaded into the environment. I think this would fail:

def model(dbt, session):
    df = dbt.ref('some_model')
    return df.to_pandas()

And so the user would need to configure it as:

def model(dbt, session):
    dbt.config(packages = ['pandas'])
    df = dbt.ref('some_model')
    return df.to_pandas()

@ChenyuLInx
Copy link
Contributor Author

ChenyuLInx commented Aug 19, 2022

@jtcohen6 Looking though the Slack chat again of how this issue is created, here's my current theory.

  1. For snowpark python run, you cannot use any third party library before accept Anaconda Terms, but you can still use snowflake-snowpark package, and when you specify it as a package, snowflake will install it without including the optional pandas dependency. Thus user would run into the pandas not available error
  2. Once you accept the Anaconda term, when you specify snowflake-snowpark as a package, snowflake will automatically install it with the optional pandas dependency. And that's why we never run into an issue with it.
  3. to_pandas probably will not work if you don't accept the Anaconda term

So the current solution of checking whether pandas is available should work just fine.

  • before accepting the term, your pure snowpark python code works fine
  • after accepting the term, you don't need to add pandas to every environment, it will be automatically added.

Might worth confirm with the snowpark team about it since both our testing and production account have accepted the term and there's no way I can test it.

@ChenyuLInx ChenyuLInx changed the title add pandas as a defualt package Check pandas before import Aug 19, 2022
@@ -180,7 +180,13 @@ def submit_python_job(self, parsed_model: dict, compiled_code: str):
database = getattr(parsed_model, "database", self.config.credentials.database)
identifier = parsed_model["alias"]
proc_name = f"{database}.{schema}.{identifier}__dbt_sp"
packages = ["snowflake-snowpark-python"] + parsed_model["config"].get("packages", [])
packages = parsed_model["config"].get("packages", [])
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This make sure that if user also specify snowflake-snowpark-python in the config, we will still run everything well

@ChenyuLInx ChenyuLInx merged commit a7b312d into main Aug 24, 2022
@ChenyuLInx ChenyuLInx deleted the snowpark_pandas branch August 24, 2022 16:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[CT-1017] No module named 'pandas' for python models
3 participants