kedro-datasets: allow users to choose between Databricks and Spark sessions #700

Open

michal-mmm opened this issue May 27, 2024 · 7 comments · May be fixed by #862

Comments

@michal-mmm
Contributor

Description

Using databricks-connect to initialize Spark sessions isn't always optimal. This is a problem for users who do not wish to use databricks-connect, or who prefer spark-connect or a regular Spark session.

Context

  • This change is important because databricks-connect can sometimes cause issues, such as those described in this community discussion. These issues can disrupt workflows and create unnecessary complications.
  • Providing an option to choose between databricks-connect and a regular Spark session would allow users to avoid these issues and use their preferred method of Spark session initialization.
  • The _get_spark() function is used by several datasets to obtain a session; a simplified sketch of its current behaviour follows this list.
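For reference, a simplified sketch (illustrative, not the verbatim kedro-datasets source) of how _get_spark() currently picks a session:

```python
# Simplified sketch of the current _get_spark() behaviour in kedro-datasets
# (illustrative, not the verbatim source): databricks-connect wins whenever it
# is importable, with no way for the user to opt out.
from pyspark.sql import SparkSession


def _get_spark():
    try:
        # databricks-connect >= 13 ships the DatabricksSession builder; if the
        # package is installed, the session is always created through it.
        from databricks.connect import DatabricksSession

        return DatabricksSession.builder.getOrCreate()
    except ImportError:
        # Plain PySpark fallback when databricks-connect is not installed.
        return SparkSession.builder.getOrCreate()
```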

Possible Alternatives

  • Modify the _get_spark() function to let users specify their preference for databricks-connect, spark-connect, or a regular Spark session through configuration settings (see the sketch after this list).
  • Another alternative could be to provide separate functions for initializing a Databricks session and a regular Spark session, and allow users to explicitly choose which function to call.
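A minimal sketch of the first alternative, assuming a hypothetical spark_mode argument (the name and values are illustrative, not an existing kedro-datasets API):

```python
# Hypothetical sketch: spark_mode is an illustrative parameter, not an
# existing kedro-datasets option.
import os

from pyspark.sql import SparkSession


def _get_spark(spark_mode: str = "auto"):
    if spark_mode == "databricks-connect":
        from databricks.connect import DatabricksSession

        return DatabricksSession.builder.getOrCreate()
    if spark_mode == "spark-connect":
        # Assumes a Spark Connect server reachable via the SPARK_REMOTE URL.
        return SparkSession.builder.remote(os.environ["SPARK_REMOTE"]).getOrCreate()
    if spark_mode == "spark":
        return SparkSession.builder.getOrCreate()
    # "auto" keeps the current behaviour of preferring databricks-connect.
    try:
        from databricks.connect import DatabricksSession

        return DatabricksSession.builder.getOrCreate()
    except ImportError:
        return SparkSession.builder.getOrCreate()
```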
@MigQ2
Contributor

MigQ2 commented Sep 14, 2024

It's also worth considering that with databricks-connect>=13.3 the Spark client is no longer a singleton, and different connections with different settings can be started from a single Python interpreter.

The whole _get_spark() method and the pyspark integration docs were designed assuming Spark was a singleton.

I haven't tested this thoroughly, but I suspect that with the current implementation no Spark configuration can be applied, since Spark is initialized independently and without arguments for each dataset.

If this development goes ahead, it would be nice to also take this into account and design the Spark integration so that Spark config can be used when running with databricks-connect.
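An untested sketch of the non-singleton behaviour described above, with two connections pointing at different clusters from one interpreter (host, token, and cluster IDs are placeholders):

```python
# Untested sketch: with databricks-connect >= 13.3 the client is no longer a
# singleton, so two sessions with different settings can coexist.
# Host, token, and cluster IDs below are placeholders.
from databricks.connect import DatabricksSession

spark_a = DatabricksSession.builder.remote(
    host="https://my-workspace.cloud.databricks.com",
    token="dapi-placeholder",
    cluster_id="0000-000000-cluster-a",
).getOrCreate()

spark_b = DatabricksSession.builder.remote(
    host="https://my-workspace.cloud.databricks.com",
    token="dapi-placeholder",
    cluster_id="0000-000000-cluster-b",
).getOrCreate()
```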

@noklam
Contributor

noklam commented Oct 1, 2024

We need to consider not just the latest databricks-connect but also the older versions. It's natural that some libraries move faster than others. If at some point it's not possible to keep them together, we can break the dataset out or keep a temporary custom dataset until the other libraries catch up.

For now, @MigQ2, can you provide some pointers and a snippet showing what a Spark connection looks like with databricks-connect?

@MigQ2
Contributor

MigQ2 commented Oct 1, 2024

I'm not sure I fully get what you mean, @noklam.

In the older versions (databricks-connect<13) you instantiate Spark using spark = SparkSession.builder.getOrCreate(), and I think the current _get_spark() function works fine with it.

In the newer versions (databricks-connect>=13) you instead do spark = DatabricksSession.builder.getOrCreate(), and this also works fine currently.
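For reference, the two patterns side by side:

```python
# databricks-connect < 13: the session comes from plain PySpark.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# databricks-connect >= 13: the session comes from the databricks module instead.
from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.getOrCreate()
```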

I have opened #861 to tackle the problem @michal-mmm initially mentioned.

Providing a way to pass Spark options when using databricks-connect is still not implemented. Shall we track that in a separate issue?

@noklam
Contributor

noklam commented Oct 1, 2024

@MigQ2 Thanks for opening the PR.

This is a problem for users who do not wish to use databricks-connect, or who prefer spark-connect or a regular Spark session.

Can you explain how this resolves the issue? From my understanding, the PR tries to avoid initializing databricks-connect for older versions. This issue seems to suggest that sometimes people still want to use pure Spark even if databricks-connect is available. Did I miss something?

@MigQ2
Contributor

MigQ2 commented Oct 1, 2024

Yeah @noklam, you're right; it's probably worth taking a more general approach at once.

I can think of 2 possible implementations; let me know what you think and I can try to implement it:

  1. Let the user decide which framework to use: spark, spark-connect, or databricks-connect. This could be configured in each SparkDataset definition or globally (with an environment variable?). We can leave the current logic as the default when not explicitly provided (see the sketch after this list).

  2. Create a SparkInitializer class that users can override with their custom logic, and use that in all datasets that need Spark. We could pass the class as a parameter to the dataset, like the current type config in the catalog. It would also be nice to define a way to globally override the default.
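As an illustration of option 1, a hypothetical spark_mode parameter on SparkDataset (the parameter does not exist today; the filepath and values are placeholders):

```python
# Hypothetical illustration of option 1: spark_mode is not an existing
# SparkDataset parameter, and the filepath is a placeholder.
from kedro_datasets.spark import SparkDataset

dataset = SparkDataset(
    filepath="s3://my-bucket/data/table",
    file_format="parquet",
    spark_mode="spark-connect",  # or "spark" / "databricks-connect"
)
```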

@noklam
Contributor

noklam commented Oct 1, 2024

I would prefer 1 if it achieves the same thing without introducing another class.

For a global default, I would imagine users doing this via dataset factories to set their preferred Spark defaults, instead of doing it via the Spark dataset.

Thanks for taking a stab at this quickly. Again, I will be spending more time this week on PR reviews, so if you are able to come up with a PR soon, I can review it quickly.

@MigQ2
Contributor

MigQ2 commented Oct 1, 2024

Hi @noklam, after some research I have come up with #862 (it's still WIP; let's discuss whether we like the pattern and then I can finish the implementation for all datasets).

Some relevant comments:

  1. It seems we don't need to do anything new to support Spark Connect. If the user sets the SPARK_REMOTE environment variable, they should be able to use the current _get_spark() to get a remote Spark Connect session (see the sketch after this list). I haven't tested it, though, as I don't have a non-Databricks Spark setup.

  2. This means we would clutter all Spark datasets with a new spark_mode parameter just to be able to use "normal Spark" and a remote session in the same Kedro run. Why would someone install databricks-connect and then connect to Spark in the traditional way? Is there really a use case for this? I don't think it is common enough to justify an extra parameter in every Spark dataset's __init__.
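An untested sketch of point 1 (placeholder Spark Connect URL; assumes pyspark >= 3.4 with the connect extras installed):

```python
# Untested sketch of point 1: with SPARK_REMOTE set before the session is
# built, SparkSession.builder.getOrCreate() should return a remote
# Spark Connect session, so the existing _get_spark() works unchanged.
# The URL is a placeholder.
import os

os.environ["SPARK_REMOTE"] = "sc://spark-connect.example.com:15002"

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
```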

Could some of the people who liked this issue explain their setup and whether this would solve their concerns? @michal-mmm @zerodarkzone @filipeo2-mck

What do you think? Shall I implement this for all datasets, or should we rather just update the documentation to be clearer, close #862, and leave the code as-is?
