
Fixes bugs in pyspark @with_columns #730

Merged 5 commits into main on Mar 4, 2024

Conversation

@skrawcz (Collaborator) commented Mar 2, 2024

  1. Configuration linkage was not exposed, so this adds a parameter that lets users specify which configuration is required.
  2. There was a bug in the UDF logic, because `all()` over an empty iterable evaluates to True. That is now fixed.
  3. The `select` in append mode appended all possible columns, not only what was listed in the `select`. This changes that behavior by dropping "outputs" from the subdag that aren't in the select. We do it this way because we don't know what's in the original dataframe.
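The `all(empty)` pitfall in point 2 can be reproduced in plain Python. This is a minimal sketch: the function names are hypothetical and the guarded check only mirrors the shape of the fix, not the actual Hamilton code:

```python
def columns_are_all_pandas(annotations: dict) -> bool:
    # Bug: all() over an empty iterable is vacuously True, so a function
    # with no annotated parameters would be misclassified.
    return all(t == "pandas" for t in annotations.values())


def columns_are_all_pandas_fixed(annotations: dict) -> bool:
    # Fix: require at least one annotation before trusting all().
    return len(annotations) > 0 and all(
        t == "pandas" for t in annotations.values()
    )


print(all([]))                           # vacuous truth: True
print(columns_are_all_pandas({}))        # buggy: True
print(columns_are_all_pandas_fixed({}))  # fixed: False
```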

Changes

  • h_spark
  • LLM workflow example using the PDF summarizer

How I tested this

  • runs locally

Notes

Checklist

  • PR has an informative and human-readable title (this will be pulled into the release notes)
  • Changes are limited to a single goal (no scope creep)
  • Code passed the pre-commit check & code is left cleaner/nicer than when first encountered.
  • Any change in functionality is tested
  • New functions are documented (with a description, list of inputs, and expected output)
  • Placeholder code is flagged / future TODOs are captured in comments
  • Project documentation has been updated if adding/changing functionality.

Ellipsis 🚀 This PR description was created by Ellipsis for commit 02120ae.

Summary:

This PR enhances the pyspark `@with_columns` decorator, fixes a bug in the UDF logic, modifies the `select` behavior in append mode, updates the example workflows, and adds new files in the pdf_summarizer/run_on_spark directory.

Key points:

  • Enhanced the pyspark `@with_columns` decorator by adding a parameter for required configurations.
  • Fixed a bug in the UDF logic where `all(empty)` was incorrectly evaluating to True.
  • Modified the `select` behavior in append mode to only append the specified columns.
  • Updated the example workflows to demonstrate the new features and fixes.
  • Added new files `run_with_columns.py` and `spark_pdf_pipeline.py` in the pdf_summarizer/run_on_spark directory.
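The append-mode `select` fix can be illustrated with a plain-Python analogue. This is a sketch with hypothetical names; the real implementation operates on Hamilton subdag nodes and a Spark dataframe, not on lists:

```python
def append_selected(original_columns, subdag_outputs, select):
    """Append only the selected subdag outputs to the dataframe's columns.

    Because we don't know what's in the original dataframe, we drop the
    subdag outputs that aren't in `select` rather than trying to compute
    the final column set up front.
    """
    dropped = [c for c in subdag_outputs if c not in select]
    kept = [c for c in subdag_outputs if c in select]
    return list(original_columns) + kept, dropped


cols, dropped = append_selected(["a", "b"], ["x", "y", "z"], select=["y"])
print(cols)     # ['a', 'b', 'y']  -- only the selected column is appended
print(dropped)  # ['x', 'z']       -- intermediate outputs are dropped
```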

Generated with ❤️ by ellipsis.dev

@ellipsis-dev (Contributor) bot left a comment

👍 Looks good to me!

  • Reviewed the entire pull request up to 5fdc21c
  • Looked at 319 lines of code in 4 files
  • Took 1 minute and 10 seconds to review
More info
  • Skipped 1 files when reviewing.
  • Skipped posting 1 additional comments because they didn't meet confidence threshold of 50%.
1. hamilton/plugins/h_spark.py:396:
  • Assessed confidence : 50%
  • Comment:
    The changes in this file seem to address the issues mentioned in the PR description. However, it would be beneficial to have tests to ensure these changes don't introduce new bugs.
  • Reasoning:
    The changes in the h_spark.py file seem to be addressing the issues mentioned in the PR description. The changes include adding a parameter to specify required configuration, fixing a bug in the UDF logic, and changing the behavior of the 'select' in append mode to only append the columns specified in the 'select'. The changes seem logical and there are no clear violations of best practices. However, without running the code and tests, it's hard to be certain that these changes don't introduce new bugs.

Workflow ID: wflow_R8HypYCML8A27Dd9



@skrawcz force-pushed the with_columns_pdf_spark_example branch from 5fdc21c to ddd38df on March 2, 2024 at 06:37
@ellipsis-dev (Contributor) bot left a comment

👍 Looks good to me!

  • Performed an incremental review on ddd38df
  • Looked at 359 lines of code in 5 files
  • Took 2 minutes and 23 seconds to review
More info
  • Skipped 0 files when reviewing.
  • Skipped posting 3 additional comments because they didn't meet confidence threshold of 50%.
1. plugin_tests/h_spark/test_h_spark.py:611:
  • Assessed confidence : 50%
  • Comment:
    Consider renaming the test case 'test_with_columns_generate_nodes_select_append_mode' to 'test_with_columns_generate_nodes_append_mode_with_select' for better clarity.
  • Reasoning:
    The test case 'test_with_columns_generate_nodes_select_append_mode' is checking if the function 'with_columns' correctly generates nodes when the mode is 'append' and a 'select' parameter is provided. The test case is passing and the implementation seems correct. However, the test case name could be more descriptive to indicate that it's testing the 'append' mode with a 'select' parameter.
2. plugin_tests/h_spark/test_h_spark.py:634:
  • Assessed confidence : 50%
  • Comment:
    Consider renaming the test case 'test_with_columns_generate_nodes_select_mode_select' to 'test_with_columns_generate_nodes_select_mode_with_select' for better clarity.
  • Reasoning:
    The test case 'test_with_columns_generate_nodes_select_mode_select' is checking if the function 'with_columns' correctly generates nodes when the mode is 'select' and a 'select' parameter is provided. The test case is passing and the implementation seems correct. However, the test case name could be more descriptive to indicate that it's testing the 'select' mode with a 'select' parameter.
3. hamilton/plugins/h_spark.py:396:
  • Assessed confidence : 80%
  • Grade: 0%
  • Comment:
    Consider adding a test case for the function 'create_drop_node' to ensure its correct functionality.
  • Reasoning:
    The function 'create_drop_node' is being used in 'generate_nodes' function in 'h_spark.py' file. However, there is no direct test case for this function. It would be good to have a separate test case for this function to ensure its correct functionality.

Workflow ID: wflow_BKH1gyTNBqwBmIaW



1. Configuration linkage was not exposed. So adding a parameter for people to specify what configuration is required.
2. There was a bug in the UDF logic, because `all(empty)` evaluates to True. So fixed that.
3. The `select` in append mode appended all possible columns. It didn't only append what was in the `select`. This changes that behavior by dropping "outputs" from the subdag that aren't in the select. We do it this way because we don't know what's in the original dataframe.
4. Adds test to check for the new logic
@skrawcz force-pushed the with_columns_pdf_spark_example branch from ddd38df to 81fefd8 on March 2, 2024 at 06:41
@ellipsis-dev (Contributor) bot left a comment

❌ Changes requested.

  • Performed an incremental review on 81fefd8
  • Looked at 376 lines of code in 6 files
  • Took 3 minutes and 39 seconds to review
More info
  • Skipped 0 files when reviewing.
  • Skipped posting 3 additional comments because they didn't meet confidence threshold of 50%.
1. hamilton/plugins/h_spark.py:396:
  • Assessed confidence : 0%
  • Comment:
    The fix for the all(empty) bug is correct. The all function returns True for an empty iterable in Python, which is not the desired behavior in this case. The added condition len(pandas_annotation.values()) > 0 ensures that the iterable is not empty before calling all.
2. examples/LLM_Workflows/pdf_summarizer/run_on_spark/README.md:1:
  • Assessed confidence : 0%
  • Comment:
    The changes in the README.md file provide clear instructions on how to run the PDF summarizer on Spark using two approaches. The instructions are detailed and include steps for installing dependencies, running the code, and setting the OPENAI_API_KEY in the environment. The addition of diagrams to illustrate the dataflow is also a good practice.
3. tests/resources/spark/a_dag.py:9:
  • Assessed confidence : 80%
  • Grade: 0%
  • Comment:
    The function c is supposed to return a float but the operation a * 3.5 + b could potentially return an integer if a and b are both 0. This could lead to unexpected behavior. Consider ensuring that the return type is always a float. For example:
return float(a * 3.5 + b)
  • Reasoning:
    The function c in a_dag.py is supposed to return a float but the operation a * 3.5 + b could potentially return an integer if a and b are both 0. This could lead to unexpected behavior.

Workflow ID: wflow_XYKlW9pLFqa2xB81



@elijahbenizzy (Collaborator) left a comment

Looks good -- made a few suggestions.

Inline review threads (resolved): hamilton/plugins/h_spark.py (3 threads, one outdated), plugin_tests/h_spark/test_h_spark.py
ellipsis-dev bot (Contributor) commented Mar 4, 2024

Skipped PR review on 59c28ca because no changed files had a supported extension. If you think this was in error, please contact us and we'll fix it right away.



@ellipsis-dev (Contributor) bot left a comment

👍 Looks good to me!

  • Performed an incremental review on 2e06300
  • Looked at 77 lines of code in 2 files
  • Took 1 minute and 29 seconds to review
More info
  • Skipped 0 files when reviewing.
  • Skipped posting 1 additional comments because they didn't meet confidence threshold of 50%.
1. examples/LLM_Workflows/scraping_and_chunking/spark/spark_pipeline.py:27:
  • Assessed confidence : 50%
  • Comment:
    The PR description mentions that a parameter was added to specify required configurations in the pyspark @with_columns decorator. However, I don't see any changes related to this in the diff. Please confirm if the changes are in a different file or if they were accidentally omitted.
  • Reasoning:
    The PR description mentions that a parameter was added to specify required configurations in the pyspark @with_columns decorator. However, I don't see any changes related to this in the diff. It's possible that the changes are in a file that's not included in the diff. I'll need to check the entire codebase to confirm this.

Workflow ID: wflow_qGyTljvPQXrGawaZ



@skrawcz (Collaborator, Author) commented Mar 4, 2024

TODO:

  • redo commits
  • make sure comments/docs reflect what we want to show people

@ellipsis-dev (Contributor) bot left a comment

👍 Looks good to me!

  • Performed an incremental review on f47f0f6
  • Looked at 11 lines of code in 1 files
  • Took 1 minute and 57 seconds to review
More info
  • Skipped 0 files when reviewing.
  • Skipped posting 1 additional comments because they didn't meet confidence threshold of 50%.
1. /examples/LLM_Workflows/README.md:8:
  • Assessed confidence : 0%
  • Comment:
    The changes in the README file are clear and concise. They provide additional information about the examples and their alternative implementations.
  • Reasoning:
    The changes in the README file are clear and concise. They provide additional information about the examples and their alternative implementations. No issues found here.

Workflow ID: wflow_BQH86UZHConQymdH



@ellipsis-dev (Contributor) bot left a comment

👍 Looks good to me!

  • Performed an incremental review on 6ded5f1
  • Looked at 68 lines of code in 1 files
  • Took 2 minutes and 27 seconds to review
More info
  • Skipped 0 files when reviewing.
  • Skipped posting 1 additional comments because they didn't meet confidence threshold of 50%.
1. /examples/LLM_Workflows/scraping_and_chunking/spark/README.md:57:
  • Assessed confidence : 0%
  • Comment:
    The updates to the README file are clear and informative. Good job!
  • Reasoning:
    The README file has been updated to reflect the changes made in the PR. It provides a clear explanation of the changes, why they were made, and how to use the updated code. There are no issues with this file.

Workflow ID: wflow_iJIHQ6gZL7nFZCxx



ellipsis-dev bot (Contributor) commented Mar 4, 2024

Skipped PR review on 61582b6 because no changed files had a supported extension. If you think this was in error, please contact us and we'll fix it right away.



skrawcz and others added 3 commits March 4, 2024 12:18
This is close to the original, but we need to specify
the set of Spark-safe functions. This is okay-ish I think.
It just also requires that we include the pipeline in the
larger DAG too... (+1 squashed commit)
Squashed commits:
[fe08646] Adds code to do document chunking with pyspark

with_columns makes some assumptions:
(a) that all functions in the with_columns subdag take in the dataframe
(b) that all intermediate functions in the subdag need to become columns in the dataframe
(c) it doesn't allow you to wire through inputs to the subdag

We'll need to improve upon some of these to make using
Spark a real drop-in replacement.

The most annoying part right now is not being able to wire through
inputs to the subdag from other functions in the DAG. That's why
the html_chunker and text_chunker are outside Hamilton right now.
Adds select decorator convenience function -- that way it's clearer
if that mode is wanted.
Adds some more comments to explain some options better.
That way it’s easier to figure out what’s going on.

Adds symlink in spark to llm example (+6 squashed commits)
Squashed commits:
[f47f0f6] Tells people to look into examples more for ray and pyspark.
[6ded5f1] Adds more documentation to the spark chunking example

So that it's clearer what and why.
[2e06300] Pre-commit fix
[59c28ca] Removes unneeded file
[d3b9123] address comments left by @skrawcz on #732 (Adds code to do document chunking with pyspark);
[be79aca] address comments left by @skrawcz on #732 (Adds code to do document chunking with pyspark);
@skrawcz force-pushed the with_columns_pdf_spark_example branch from 61582b6 to 6a572fc on March 4, 2024 at 20:19
@ellipsis-dev (Contributor) bot left a comment

👍 Looks good to me!

  • Performed an incremental review on 6a572fc
  • Looked at 115 lines of code in 4 files
  • Took 1 minute and 45 seconds to review
More info
  • Skipped 2 files when reviewing.
  • Skipped posting 2 additional comments because they didn't meet confidence threshold of 50%.
1. /examples/LLM_Workflows/scraping_and_chunking/spark/doc_pipeline.py:25:
  • Assessed confidence : 50%
  • Comment:
    Consider including the original error message in the exception to help with debugging. For example:
raise Exception(f'Failed to get URL: {url}. Original error: {str(e)}')
  • Reasoning:
    The PR author has added exception handling for network requests, which is a good practice. However, the error message in the exception could be more informative. Instead of just saying 'Failed to get URL', it could include the original error message to help with debugging.
2. /examples/LLM_Workflows/scraping_and_chunking/spark/spark_pipeline.py:30:
  • Assessed confidence : 50%
  • Comment:
    Consider including the original error message in the exception to help with debugging. For example:
raise RuntimeError(f'Failed to fetch sitemap from {sitemap_url}. Original error: {str(e)}')
  • Reasoning:
    The PR author has added exception handling for network requests, which is a good practice. However, the error message in the exception could be more informative. Instead of just saying 'Failed to fetch sitemap', it could include the original error message to help with debugging.

Workflow ID: wflow_8Dtkx1sailRJdo0v



To try to steer it to be more useful.
@ellipsis-dev (Contributor) bot left a comment

👍 Looks good to me!

  • Performed an incremental review on 9341d31
  • Looked at 26 lines of code in 1 files
  • Took 1 minute and 56 seconds to review
More info
  • Skipped 0 files when reviewing.
  • Skipped posting 1 additional comments because they didn't meet confidence threshold of 50%.
1. ellipsis.yaml:1:
  • Assessed confidence : 50%
  • Comment:
    This new file 'ellipsis.yaml' is not mentioned in the PR description. Could you please clarify if it's intended to be part of this PR? If so, please update the PR description to include information about this file and its purpose.
  • Reasoning:
    The PR author has added a new file 'ellipsis.yaml'. This file seems to contain configuration for the Ellipsis tool, which is used for automated PR reviews. The rules defined in this file align with the principles provided for this review task. However, this file is not mentioned in the PR description, and it's unclear whether it's intended to be part of this PR or if it was added by mistake.

Workflow ID: wflow_XFF6KDS5tePXraMC



@elijahbenizzy elijahbenizzy self-requested a review March 4, 2024 21:57
@elijahbenizzy (Collaborator) left a comment

Looks good

@skrawcz skrawcz merged commit 02120ae into main Mar 4, 2024
23 checks passed
@skrawcz skrawcz deleted the with_columns_pdf_spark_example branch March 4, 2024 22:34
@ellipsis-dev (Contributor) bot left a comment

👍 Looks good to me!

  • Performed an incremental review on 02120ae
  • Looked at 26 lines of code in 1 files
  • Took 1 minute and 35 seconds to review
More info
  • Skipped 0 files when reviewing.
  • Skipped posting 1 additional comments because they didn't meet confidence threshold of 50%.
1. ellipsis.yaml:20:
  • Assessed confidence : 0%
  • Comment:
    The addition of the ellipsis.yaml file is a good practice as it provides a clear set of rules and guidelines for contributors to follow. This will help maintain the quality and consistency of the codebase. However, it's important to ensure that these rules are communicated and enforced effectively.

Workflow ID: wflow_E2C7owIdPQ8Lv9hg


