
Fixes bugs in pyspark @with_columns #730

Merged 5 commits into main on Mar 4, 2024

Conversation

@skrawcz (Collaborator) commented Mar 2, 2024

  1. Configuration linkage was not exposed, so this adds a parameter that lets users specify which configuration is required.
  2. There was a bug in the UDF logic, because `all()` over an empty iterable evaluates to True. That is now fixed.
  3. The `select` in append mode appended all possible columns, not only what was listed in the `select`. This changes that behavior by dropping "outputs" from the subdag that aren't in the select. We do it this way because we don't know what's in the original dataframe.
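The `all(empty)` pitfall in point 2 can be reproduced in plain Python. This is a minimal sketch: the function names are hypothetical and the guarded check only mirrors the shape of the fix, not the actual Hamilton code:

```python
def columns_are_all_pandas(annotations: dict) -> bool:
    # Bug: all() over an empty iterable is vacuously True, so a function
    # with no annotated parameters would be misclassified.
    return all(t == "pandas" for t in annotations.values())


def columns_are_all_pandas_fixed(annotations: dict) -> bool:
    # Fix: require at least one annotation before trusting all().
    return len(annotations) > 0 and all(
        t == "pandas" for t in annotations.values()
    )


print(all([]))                           # vacuous truth: True
print(columns_are_all_pandas({}))        # buggy: True
print(columns_are_all_pandas_fixed({}))  # fixed: False
```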

Changes

  • h_spark
  • LLM workflow example using the PDF summarizer

How I tested this

  • runs locally

Notes

Checklist

  • PR has an informative and human-readable title (this will be pulled into the release notes)
  • Changes are limited to a single goal (no scope creep)
  • Code passed the pre-commit check & code is left cleaner/nicer than when first encountered.
  • Any change in functionality is tested
  • New functions are documented (with a description, list of inputs, and expected output)
  • Placeholder code is flagged / future TODOs are captured in comments
  • Project documentation has been updated if adding/changing functionality.

Ellipsis 🚀 This PR description was created by Ellipsis for commit 02120ae.

Summary:

This PR enhances the pyspark `@with_columns` decorator, fixes a bug in the UDF logic, modifies the `select` behavior in append mode, updates the example workflows, and adds new files in the pdf_summarizer/run_on_spark directory.

Key points:

  • Enhanced the pyspark `@with_columns` decorator by adding a parameter for required configurations.
  • Fixed a bug in the UDF logic where `all(empty)` was incorrectly evaluating to True.
  • Modified the `select` behavior in append mode to only append the specified columns.
  • Updated the example workflows to demonstrate the new features and fixes.
  • Added new files `run_with_columns.py` and `spark_pdf_pipeline.py` in the pdf_summarizer/run_on_spark directory.
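The append-mode `select` fix can be illustrated with a plain-Python analogue. This is a sketch with hypothetical names; the real implementation operates on Hamilton subdag nodes and a Spark dataframe, not on lists:

```python
def append_selected(original_columns, subdag_outputs, select):
    """Append only the selected subdag outputs to the dataframe's columns.

    Because we don't know what's in the original dataframe, we drop the
    subdag outputs that aren't in `select` rather than trying to compute
    the final column set up front.
    """
    dropped = [c for c in subdag_outputs if c not in select]
    kept = [c for c in subdag_outputs if c in select]
    return list(original_columns) + kept, dropped


cols, dropped = append_selected(["a", "b"], ["x", "y", "z"], select=["y"])
print(cols)     # ['a', 'b', 'y']  -- only the selected column is appended
print(dropped)  # ['x', 'z']       -- intermediate outputs are dropped
```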

Generated with ❤️ by ellipsis.dev

@ellipsis-dev (Contributor) bot left a comment

👍 Looks good to me!

  • Reviewed the entire pull request up to 5fdc21c
  • Looked at 319 lines of code in 4 files
  • Took 1 minute and 10 seconds to review
More info
  • Skipped 1 files when reviewing.
  • Skipped posting 1 additional comments because they didn't meet confidence threshold of 50%.
1. hamilton/plugins/h_spark.py:396:
  • Assessed confidence : 50%
  • Comment:
    The changes in this file seem to address the issues mentioned in the PR description. However, it would be beneficial to have tests to ensure these changes don't introduce new bugs.
  • Reasoning:
    The changes in the h_spark.py file seem to be addressing the issues mentioned in the PR description. The changes include adding a parameter to specify required configuration, fixing a bug in the UDF logic, and changing the behavior of the 'select' in append mode to only append the columns specified in the 'select'. The changes seem logical and there are no clear violations of best practices. However, without running the code and tests, it's hard to be certain that these changes don't introduce new bugs.

Workflow ID: wflow_R8HypYCML8A27Dd9



@skrawcz force-pushed the with_columns_pdf_spark_example branch from 5fdc21c to ddd38df on March 2, 2024 at 06:37
@ellipsis-dev (Contributor) bot left a comment

👍 Looks good to me!

  • Performed an incremental review on ddd38df
  • Looked at 359 lines of code in 5 files
  • Took 2 minutes and 23 seconds to review
More info
  • Skipped 0 files when reviewing.
  • Skipped posting 3 additional comments because they didn't meet confidence threshold of 50%.
1. plugin_tests/h_spark/test_h_spark.py:611:
  • Assessed confidence : 50%
  • Comment:
    Consider renaming the test case 'test_with_columns_generate_nodes_select_append_mode' to 'test_with_columns_generate_nodes_append_mode_with_select' for better clarity.
  • Reasoning:
    The test case 'test_with_columns_generate_nodes_select_append_mode' is checking if the function 'with_columns' correctly generates nodes when the mode is 'append' and a 'select' parameter is provided. The test case is passing and the implementation seems correct. However, the test case name could be more descriptive to indicate that it's testing the 'append' mode with a 'select' parameter.
2. plugin_tests/h_spark/test_h_spark.py:634:
  • Assessed confidence : 50%
  • Comment:
    Consider renaming the test case 'test_with_columns_generate_nodes_select_mode_select' to 'test_with_columns_generate_nodes_select_mode_with_select' for better clarity.
  • Reasoning:
    The test case 'test_with_columns_generate_nodes_select_mode_select' is checking if the function 'with_columns' correctly generates nodes when the mode is 'select' and a 'select' parameter is provided. The test case is passing and the implementation seems correct. However, the test case name could be more descriptive to indicate that it's testing the 'select' mode with a 'select' parameter.
3. hamilton/plugins/h_spark.py:396:
  • Assessed confidence : 80%
  • Grade: 0%
  • Comment:
    Consider adding a test case for the function 'create_drop_node' to ensure its correct functionality.
  • Reasoning:
    The function 'create_drop_node' is being used in 'generate_nodes' function in 'h_spark.py' file. However, there is no direct test case for this function. It would be good to have a separate test case for this function to ensure its correct functionality.

Workflow ID: wflow_BKH1gyTNBqwBmIaW



1. Configuration linkage was not exposed. So adding a parameter for people to specify what configuration is required.
2. There was a bug in the UDF logic, because `all(empty)` evaluates to True. So fixed that.
3. The `select` in append mode appended all possible columns. It didn't only append what was in the `select`. This changes that behavior by dropping "outputs" from the subdag that aren't in the select. We do it this way because we don't know what's in the original dataframe.
4. Adds test to check for the new logic
@skrawcz force-pushed the with_columns_pdf_spark_example branch from ddd38df to 81fefd8 on March 2, 2024 at 06:41
@ellipsis-dev (Contributor) bot left a comment

❌ Changes requested.

  • Performed an incremental review on 81fefd8
  • Looked at 376 lines of code in 6 files
  • Took 3 minutes and 39 seconds to review
More info
  • Skipped 0 files when reviewing.
  • Skipped posting 3 additional comments because they didn't meet confidence threshold of 50%.
1. hamilton/plugins/h_spark.py:396:
  • Assessed confidence : 0%
  • Comment:
    The fix for the all(empty) bug is correct. The all function returns True for an empty iterable in Python, which is not the desired behavior in this case. The added condition len(pandas_annotation.values()) > 0 ensures that the iterable is not empty before calling all.
2. examples/LLM_Workflows/pdf_summarizer/run_on_spark/README.md:1:
  • Assessed confidence : 0%
  • Comment:
    The changes in the README.md file provide clear instructions on how to run the PDF summarizer on Spark using two approaches. The instructions are detailed and include steps for installing dependencies, running the code, and setting the OPENAI_API_KEY in the environment. The addition of diagrams to illustrate the dataflow is also a good practice.
3. tests/resources/spark/a_dag.py:9:
  • Assessed confidence : 80%
  • Grade: 0%
  • Comment:
    The function c is supposed to return a float but the operation a * 3.5 + b could potentially return an integer if a and b are both 0. This could lead to unexpected behavior. Consider ensuring that the return type is always a float. For example:
return float(a * 3.5 + b)
  • Reasoning:
    The function c in a_dag.py is supposed to return a float but the operation a * 3.5 + b could potentially return an integer if a and b are both 0. This could lead to unexpected behavior.

Workflow ID: wflow_XYKlW9pLFqa2xB81



@elijahbenizzy (Collaborator) left a comment

Looks good -- made a few suggestions.

Inline review threads (resolved): hamilton/plugins/h_spark.py (3 threads, one outdated), plugin_tests/h_spark/test_h_spark.py
ellipsis-dev bot (Contributor) commented Mar 4, 2024

Skipped PR review on 59c28ca because no changed files had a supported extension. If you think this was in error, please contact us and we'll fix it right away.



@ellipsis-dev (Contributor) bot left a comment

👍 Looks good to me!

  • Performed an incremental review on 2e06300
  • Looked at 77 lines of code in 2 files
  • Took 1 minute and 29 seconds to review
More info
  • Skipped 0 files when reviewing.
  • Skipped posting 1 additional comments because they didn't meet confidence threshold of 50%.
1. examples/LLM_Workflows/scraping_and_chunking/spark/spark_pipeline.py:27:
  • Assessed confidence : 50%
  • Comment:
    The PR description mentions that a parameter was added to specify required configurations in the pyspark @with_columns decorator. However, I don't see any changes related to this in the diff. Please confirm if the changes are in a different file or if they were accidentally omitted.
  • Reasoning:
    The PR description mentions that a parameter was added to specify required configurations in the pyspark @with_columns decorator. However, I don't see any changes related to this in the diff. It's possible that the changes are in a file that's not included in the diff. I'll need to check the entire codebase to confirm this.

Workflow ID: wflow_qGyTljvPQXrGawaZ



@skrawcz (Collaborator, Author) commented Mar 4, 2024

TODO:

  • redo commits
  • make sure comments/docs reflect what we want to show people

@ellipsis-dev (Contributor) bot left a comment

👍 Looks good to me!

  • Performed an incremental review on f47f0f6
  • Looked at 11 lines of code in 1 files
  • Took 1 minute and 57 seconds to review
More info
  • Skipped 0 files when reviewing.
  • Skipped posting 1 additional comments because they didn't meet confidence threshold of 50%.
1. /examples/LLM_Workflows/README.md:8:
  • Assessed confidence : 0%
  • Comment:
    The changes in the README file are clear and concise. They provide additional information about the examples and their alternative implementations.
  • Reasoning:
    The changes in the README file are clear and concise. They provide additional information about the examples and their alternative implementations. No issues found here.

Workflow ID: wflow_BQH86UZHConQymdH



@ellipsis-dev (Contributor) bot left a comment

👍 Looks good to me!

  • Performed an incremental review on 6ded5f1
  • Looked at 68 lines of code in 1 files
  • Took 2 minutes and 27 seconds to review
More info
  • Skipped 0 files when reviewing.
  • Skipped posting 1 additional comments because they didn't meet confidence threshold of 50%.
1. /examples/LLM_Workflows/scraping_and_chunking/spark/README.md:57:
  • Assessed confidence : 0%
  • Comment:
    The updates to the README file are clear and informative. Good job!
  • Reasoning:
    The README file has been updated to reflect the changes made in the PR. It provides a clear explanation of the changes, why they were made, and how to use the updated code. There are no issues with this file.

Workflow ID: wflow_iJIHQ6gZL7nFZCxx



ellipsis-dev bot (Contributor) commented Mar 4, 2024

Skipped PR review on 61582b6 because no changed files had a supported extension. If you think this was in error, please contact us and we'll fix it right away.



skrawcz and others added 3 commits March 4, 2024 12:18
This is close to the original, but we need to specify
the set of Spark-safe functions. This is okay-ish I think.
It just also requires that we include the pipeline in the
larger DAG too... (+1 squashed commit)
Squashed commits:
[fe08646] Adds code to do document chunking with pyspark

with_columns makes some assumptions:
(a) that all functions in the with_columns subdag take in the dataframe
(b) that all intermediate functions in the subdag need to become columns in the dataframe
(c) it doesn't allow you to wire through inputs to the subdag

We'll need to improve upon some of these to make using
Spark a real drop-in replacement.

The most annoying part right now is not being able to wire through
inputs to the subdag from other functions in the DAG. That's why
the html_chunker and text_chunker are outside Hamilton right now.
Adds select decorator convenience function -- that way it's clearer
if that mode is wanted.
Adds some more comments to explain some options better.
That way it’s easier to figure out what’s going on.

Adds symlink in spark to llm example (+6 squashed commits)
Squashed commits:
[f47f0f6] Tells people to look into examples more for ray and pyspark.
[6ded5f1] Adds more documentation to the spark chunking example

So that it's clearer what and why.
[2e06300] Pre-commit fix
[59c28ca] Removes unneeded file
[d3b9123] address comments left by @skrawcz on #732 (Adds code to do document chunking with pyspark);
[be79aca] address comments left by @skrawcz on #732 (Adds code to do document chunking with pyspark);
@skrawcz force-pushed the with_columns_pdf_spark_example branch from 61582b6 to 6a572fc on March 4, 2024 at 20:19
@ellipsis-dev (Contributor) bot left a comment

👍 Looks good to me!

  • Performed an incremental review on 6a572fc
  • Looked at 115 lines of code in 4 files
  • Took 1 minute and 45 seconds to review
More info
  • Skipped 2 files when reviewing.
  • Skipped posting 2 additional comments because they didn't meet confidence threshold of 50%.
1. /examples/LLM_Workflows/scraping_and_chunking/spark/doc_pipeline.py:25:
  • Assessed confidence : 50%
  • Comment:
    Consider including the original error message in the exception to help with debugging. For example:
raise Exception(f'Failed to get URL: {url}. Original error: {str(e)}')
  • Reasoning:
    The PR author has added exception handling for network requests, which is a good practice. However, the error message in the exception could be more informative. Instead of just saying 'Failed to get URL', it could include the original error message to help with debugging.
2. /examples/LLM_Workflows/scraping_and_chunking/spark/spark_pipeline.py:30:
  • Assessed confidence : 50%
  • Comment:
    Consider including the original error message in the exception to help with debugging. For example:
raise RuntimeError(f'Failed to fetch sitemap from {sitemap_url}. Original error: {str(e)}')
  • Reasoning:
    The PR author has added exception handling for network requests, which is a good practice. However, the error message in the exception could be more informative. Instead of just saying 'Failed to fetch sitemap', it could include the original error message to help with debugging.

Workflow ID: wflow_8Dtkx1sailRJdo0v



To try to steer it to be more useful.
@ellipsis-dev (Contributor) bot left a comment

👍 Looks good to me!

  • Performed an incremental review on 9341d31
  • Looked at 26 lines of code in 1 files
  • Took 1 minute and 56 seconds to review
More info
  • Skipped 0 files when reviewing.
  • Skipped posting 1 additional comments because they didn't meet confidence threshold of 50%.
1. ellipsis.yaml:1:
  • Assessed confidence : 50%
  • Comment:
    This new file 'ellipsis.yaml' is not mentioned in the PR description. Could you please clarify if it's intended to be part of this PR? If so, please update the PR description to include information about this file and its purpose.
  • Reasoning:
    The PR author has added a new file 'ellipsis.yaml'. This file seems to contain configuration for the Ellipsis tool, which is used for automated PR reviews. The rules defined in this file align with the principles provided for this review task. However, this file is not mentioned in the PR description, and it's unclear whether it's intended to be part of this PR or if it was added by mistake.

Workflow ID: wflow_XFF6KDS5tePXraMC



@elijahbenizzy elijahbenizzy self-requested a review March 4, 2024 21:57
@elijahbenizzy (Collaborator) left a comment

Looks good

@skrawcz skrawcz merged commit 02120ae into main Mar 4, 2024
23 checks passed
@skrawcz skrawcz deleted the with_columns_pdf_spark_example branch March 4, 2024 22:34
@ellipsis-dev (Contributor) bot left a comment

👍 Looks good to me!

  • Performed an incremental review on 02120ae
  • Looked at 26 lines of code in 1 files
  • Took 1 minute and 35 seconds to review
More info
  • Skipped 0 files when reviewing.
  • Skipped posting 1 additional comments because they didn't meet confidence threshold of 50%.
1. ellipsis.yaml:20:
  • Assessed confidence : 0%
  • Comment:
    The addition of the ellipsis.yaml file is a good practice as it provides a clear set of rules and guidelines for contributors to follow. This will help maintain the quality and consistency of the codebase. However, it's important to ensure that these rules are communicated and enforced effectively.

Workflow ID: wflow_E2C7owIdPQ8Lv9hg


