Check pandas before import #229

ChenyuLInx · 2022-08-09T21:27:47Z

resolves #228

Description

Add pandas as the default packages for dbt-snowflake

Checklist

I have signed the CLA
I have run this code in development and it appears to resolve the stated issue
This PR includes tests, or tests are not required/relevant for this PR: There's no way to actually test this since we already signed the Anaconda agreements
I have run changie new to create a changelog entry

github-actions · 2022-08-09T21:28:06Z

Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the dbt-snowflake contributing guide.

jtcohen6

👍 code looks cleaner

Another approach we could take: Adding some if pandas logic within our "materialization" code, to detect if it's installed before trying to call it. Two reasons:

Requiring pandas for all Python models means that enabling Anaconda packages / accepting third-party terms is a prerequisite for using Snowpark Python at all (Snowpark setup third-party terms docs.getdbt.com#1848)
Could loading pandas in every stored proc risk slowing down its runtime?

Basic idea: this code

dbt-snowflake/dbt/include/snowflake/macros/materializations/table.sql

Lines 44 to 48 in 340c3fa

 # we have to make sure pandas is imported 

 import pandas 

 if isinstance(df, pandas.core.frame.DataFrame): 

 # session.write_pandas does not have overwrite function 

 df = session.createDataFrame(df)

would become something like:

try:
    import pandas
else:
    pandas = None

if pandas and isinstance(df, pandas.core.frame.DataFrame): 
     # session.write_pandas does not have overwrite function 
     df = session.createDataFrame(df)

dbt/adapters/snowflake/impl.py

Co-authored-by: Jeremy Cohen <[email protected]>

ChenyuLInx · 2022-08-16T00:16:54Z

@jtcohen6 As I looked into this more, I realized that the reason we don't run into this issue without specify pandas was because pandas is required(EDIT actually optional requirement) for snowflake-snowpark package. So I think the other approach makes sense!

Product call and we will make it happen.

-

jtcohen6 · 2022-08-17T16:59:20Z

@ChenyuLInx Sounds good! Let's do the other approach then: check to see whether pandas is installed, and then proceed accordingly.

Only question: It's not totally clear to me if a user could still call .to_pandas() on a Snowpark data frame, even if pandas isn't loaded into the environment. I think this would fail:

def model(dbt, session):
    df = dbt.ref('some_model')
    return df.to_pandas()

And so the user would need to configure it as:

def model(dbt, session):
    dbt.config(packages = ['pandas'])
    df = dbt.ref('some_model')
    return df.to_pandas()

ChenyuLInx · 2022-08-19T23:15:23Z

@jtcohen6 Looking though the Slack chat again of how this issue is created, here's my current theory.

For snowpark python run, you cannot use any third party library before accept Anaconda Terms, but you can still use snowflake-snowpark package, and when you specify it as a package, snowflake will install it without including the optional pandas dependency. Thus user would run into the pandas not available error
Once you accept the Anaconda term, when you specify snowflake-snowpark as a package, snowflake will automatically install it with the optional pandas dependency. And that's why we never run into an issue with it.
to_pandas probably will not work if you don't accept the Anaconda term

So the current solution of checking whether pandas is available should work just fine.

before accepting the term, your pure snowpark python code works fine
after accepting the term, you don't need to add pandas to every environment, it will be automatically added.

Might worth confirm with the snowpark team about it since both our testing and production account have accepted the term and there's no way I can test it.

ChenyuLInx · 2022-08-23T21:37:41Z

dbt/adapters/snowflake/impl.py

@@ -180,7 +180,13 @@ def submit_python_job(self, parsed_model: dict, compiled_code: str):
 database = getattr(parsed_model, "database", self.config.credentials.database)
 identifier = parsed_model["alias"]
 proc_name = f"{database}.{schema}.{identifier}__dbt_sp"
- packages = ["snowflake-snowpark-python"] + parsed_model["config"].get("packages", [])
+ packages = parsed_model["config"].get("packages", [])


This make sure that if user also specify snowflake-snowpark-python in the config, we will still run everything well

add pandas as a defualt package

b50858b

cla-bot bot added the cla:yes label Aug 9, 2022

add changelog

ba694c9

ChenyuLInx marked this pull request as ready for review August 9, 2022 21:29

jtcohen6 reviewed Aug 10, 2022

View reviewed changes

dbt/adapters/snowflake/impl.py Outdated Show resolved Hide resolved

jtcohen6 mentioned this pull request Aug 10, 2022

Snowpark setup third-party terms dbt-labs/docs.getdbt.com#1848

Merged

fixing typo

16cd5c3

Co-authored-by: Jeremy Cohen <[email protected]>

check pandas before import

1d1808d

ChenyuLInx changed the title ~~add pandas as a defualt package~~ Check pandas before import Aug 19, 2022

ChenyuLInx requested review from McKnight-42, VersusFacit, jtcohen6, gshank and iknox-fa August 22, 2022 16:28

ChenyuLInx commented Aug 23, 2022

View reviewed changes

gshank approved these changes Aug 23, 2022

View reviewed changes

Merge branch 'main' into snowpark_pandas

f8e7495

ChenyuLInx merged commit a7b312d into main Aug 24, 2022

ChenyuLInx deleted the snowpark_pandas branch August 24, 2022 16:07

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Check pandas before import #229

Check pandas before import #229

ChenyuLInx commented Aug 9, 2022 •

edited

Loading

github-actions bot commented Aug 9, 2022

jtcohen6 left a comment

ChenyuLInx commented Aug 16, 2022 •

edited

Loading

jtcohen6 commented Aug 17, 2022

ChenyuLInx commented Aug 19, 2022 •

edited

Loading

ChenyuLInx Aug 23, 2022

	# we have to make sure pandas is imported
	import pandas
	if isinstance(df, pandas.core.frame.DataFrame):
	# session.write_pandas does not have overwrite function
	df = session.createDataFrame(df)

Check pandas before import #229

Check pandas before import #229

Conversation

ChenyuLInx commented Aug 9, 2022 • edited Loading

Description

Checklist

github-actions bot commented Aug 9, 2022

jtcohen6 left a comment

Choose a reason for hiding this comment

ChenyuLInx commented Aug 16, 2022 • edited Loading

jtcohen6 commented Aug 17, 2022

ChenyuLInx commented Aug 19, 2022 • edited Loading

ChenyuLInx Aug 23, 2022

Choose a reason for hiding this comment

ChenyuLInx commented Aug 9, 2022 •

edited

Loading

ChenyuLInx commented Aug 16, 2022 •

edited

Loading

ChenyuLInx commented Aug 19, 2022 •

edited

Loading