
Add allow_nan_equality option to assert_approx_df_equality #29

Draft: wants to merge 12 commits into base: main

Conversation

@mitches-got-glitches (Contributor)

This pull request solves the issue by making a fairly big change to the API. Rather than having two assertion functions each for DataFrame and column comparison, there is now just one of each. Both assertion functions take an optional precision parameter, which means assert_approx_df_equality and assert_approx_column_equality can be removed.
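
To illustrate the idea (a minimal sketch with assumed names, not the actual code in this PR), the single option-driven comparison boils down to something like:

import math

# Sketch only: one generic cell comparison driven by options, instead of
# separate exact and approximate assertion functions.
def values_equal(x, y, precision=None, allow_nan_equality=False):
    both_floats = isinstance(x, float) and isinstance(y, float)
    if allow_nan_equality and both_floats and math.isnan(x) and math.isnan(y):
        return True
    if precision is not None and both_floats:
        return abs(x - y) < precision
    return x == y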

In addition to solving the ticket, I feel like I made a number of improvements to the code base:

  • Added some useful spacing
  • Improved the type hinting
  • Documented the functions with docstrings using the numpy docstring convention (see the sketch after this list)
  • Added CHANGELOG.md to be updated with each release
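
For reference, the numpy docstring convention looks like this (hypothetical signature, shown for illustration only, not the PR's exact code):

def assert_df_equality(df1, df2, precision=None, allow_nan_equality=False):
    """Assert that two PySpark DataFrames are equal.

    Parameters
    ----------
    df1, df2 : pyspark.sql.DataFrame
        The DataFrames to compare.
    precision : float, optional
        Absolute tolerance for comparing floating point values. If None,
        values are compared exactly.
    allow_nan_equality : bool, default False
        If True, two NaN values are considered equal.

    Raises
    ------
    RowsNotEqualError
        If any pair of corresponding rows is not equal.
    """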

Actions for @MrPowers:

  • Review and confirm happy with API changes
  • Review renaming of exception to RowsNotEqualError and consider any implications
  • Update the screenshots in the README.md to reflect the new changes

When making a new release:

  • Bump to a new minor release, and use any other means to notify users of API change
  • Replace "Unreleased" in the CHANGELOG.md with the new version number and the date released.
  • Add the Changelog notes to the GitHub Release Notes once the release is tagged.

Let me know what you think!

Closes #28

The objective of the issue was to make allow_nan_equality available to
assert_approx_df_equality, but on reviewing the code it seemed like a
more efficient plan to include the precision option in
assert_df_equality and make it optional. This leaves only one set of
functions to maintain.

I've moved the single assert_rows_equality function into row_comparer
along with the custom exception which I've renamed. I've made one
generic equality function which takes options. All the tests have been
updated although there could be scope for a few more new tests - I
wanted to consult on that first.

This is a fairly large change to the API for which a major version
update would be recommended.
Changed the section on assert approx df equality.
I've removed assert_approx_column_equality from the API.
@mitches-got-glitches marked this pull request as ready for review April 27, 2021 08:50
@mitches-got-glitches marked this pull request as draft April 27, 2021 08:50
@MrPowers (Owner) left a comment

Is there some part where you call dtypes or schema to figure out which columns are floating / decimal point (and need approximate comparison) and which columns need exact equality comparison?

This refactoring looks promising. I need to understand more about whether it causes any Python version limitations, backwards incompatibility issues, or performance drains compared to what we currently have. It's certainly more elegant, but this lib is already in production for a variety of users and I want to make sure it's not going to be a breaking change for anyone. Thanks!

chispa/number_helpers.py

both_floats = (isinstance(x, float) & isinstance(y, float))
@MrPowers (Owner)

I don't think we need to do this cause we know the types from the schema.

@mitches-got-glitches (Contributor, Author)

Hi @MrPowers, can you have a look at my draft pull request here to see if I'm on the right lines with your 3rd objective?

To summarise, rather than the following line in check_equal:

both_floats = (isinstance(x, float) & isinstance(y, float))

I'm doing this:

is_float_type = (dtype_name in ['float', 'double', 'decimal'])

Where dtype_name gets passed into the function, and the list of dtypes is created with the following line:

dtypes = [field.dataType.typeName() for field in df1.schema]

Because we've already compared the schemas, we know that they're equal so we just use df1.schema.

Will using isinstance really have a noticeable effect on the speed of the tests? I feel like it might be the safer option.
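
Put together (a sketch with assumed plumbing, not the exact code in the PR), the schema-driven check looks like:

# Type names per pyspark.sql.types.DataType.typeName().
FLOAT_TYPE_NAMES = {'float', 'double', 'decimal'}

def check_equal(x, y, dtype_name, precision=None):
    is_float_type = dtype_name in FLOAT_TYPE_NAMES
    if precision is not None and is_float_type and x is not None and y is not None:
        return abs(x - y) < precision
    return x == y

# Computed once per DataFrame, since the schemas are already known to match:
# dtypes = [field.dataType.typeName() for field in df1.schema]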

@mitches-got-glitches (Contributor, Author)

I just did some further testing too, and my current solution would not work for pyspark columns with DecimalType, as the python type is decimal.Decimal not float. I'd need to change that if the original implementation is selected.
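
A quick illustration of the mismatch:

from decimal import Decimal

# Values collected from a DecimalType column arrive as decimal.Decimal,
# so an isinstance(x, float) check quietly skips them.
x = Decimal('1.05')
print(isinstance(x, float))    # False
print(isinstance(x, Decimal))  # True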

@mitches-got-glitches (Contributor, Author) commented May 17, 2021

Any time to give me a steer this week, @MrPowers? As in, point me in the right direction for the solution (sorry, I realised 'giving a steer' might be a localised phrase). Where are you from, anyway?

@MrPowers (Owner)

@mitches-got-glitches - thanks for following up and sorry for the delay.

Here are my main objectives with this PR:

  • Get your new functionality included in the lib cause it's clearly an improvement
  • No backwards incompatible changes (this lib follows semantic versioning, so we could technically make backwards incompatible changes cause we haven't made a major release yet, but I'd rather just avoid breaking changes, like Spark does)
  • Keep these functions as fast as possible (cause Spark tests can get really slow). Don't want to unnecessarily do type checking and slow functions down.
  • Support all the Python versions that we should reasonably be supporting. @nchammas - do you know the Python versions that a modern Spark app should be supporting? I think supporting > 2.7 is too lenient cause Python 2 is EOL.

Looking forward to driving this towards completion & appreciate your persistence! BTW - I'm a Python noob, so you're teaching me a lot!

@nchammas

@nchammas - do you know the Python versions that a modern Spark app should be supporting? I think supporting > 2.7 is too lenient cause Python 2 is EOL.

Python 3.6+ is what's currently supported. Spark 3.1 dropped support for Python 2.7, 3.4, and 3.5.

@mitches-got-glitches (Contributor, Author)

@nchammas - do you know the Python versions that a modern Spark app should be supporting? I think supporting > 2.7 is too lenient cause Python 2 is EOL.

Python 3.6+ is what's currently supported. Spark 3.1 dropped support for Python 2.7, 3.4, and 3.5.

Since the Python typing module is available in Python >=3.5, does this resolve your earlier comment, @MrPowers?

@MrPowers (Owner)

Looks like Spark 2.4.5 technically supports Python 3.4+ and I'd still like to support the Spark 2 users for a bit longer. That said, it seems reasonable to set the Python version as >=3.5 for this project. Python 3.5 was released in September 2015 and I think most users will at least be using Python 3.5, even the ones that are still using Spark 2.

Sidenote, when chispa eventually drops support for PySpark 2, we should do a major release (i.e. bump chispa to 1.x).

I will bump the Python version in the project now and add a section to explain this to users.

@mitches-got-glitches (Contributor, Author)

To add another data point, my team is currently using Python 3.6.8 and Spark 2.4.

@MrPowers (Owner)

@mitches-got-glitches - great, glad we're still supporting PySpark 2.4 😉

Updated the required Python version.

So yes, the typing module question is resolved. We're requiring Python 3.5+ now, so using typing should be just fine 😃

@MrPowers (Owner)

@mitches-got-glitches - BTW, see here for the chispa download statistics. If this lib didn't have any users yet, then I wouldn't really care about breaking API changes. Since there are a lot of users, that's why I'm also focusing on backwards compatibility. A blessing and a curse I suppose 😉

@mitches-got-glitches (Contributor, Author) commented May 12, 2021

Yeah, those are some good numbers! I think what you've set out to do is exactly right with the change to the readme and intention to launch 1.x. Users can always open a ticket for backwards compatibility if it's really necessary, or even create their own fork. But I agree, try not to get on their bad side ha.

So hopefully, once the thread above is resolved, all that will be required is an update to the screenshots in the README, if you're happy with everything else, @MrPowers?

@MrPowers (Owner)

@mitches-got-glitches - Thanks for following up and sorry I've been delayed responding. I've been working on my first Dask PR and it's been hard for me to get up-and-running on a completely new framework. I'm still a Python n00b.

My initial hesitation for merging this PR was that it wasn't 100% backwards compatible. Is that still the case? Or does the current PR introduce some backwards incompatible changes?

@mitches-got-glitches (Contributor, Author)

@mitches-got-glitches - Thanks for following up and sorry I've been delayed responding. I've been working on my first Dask PR and it's been hard for me to get up-and-running on a completely new framework. I'm still a Python n00b.

My initial hesitation for merging this PR was that it wasn't 100% backwards compatible. Is that still the case? Or does the current PR introduce some backwards incompatible changes?

Hello, sorry for the slow reply. I think in terms of backwards compatibility there were two things:

  1. The type hinting that requires Python 3.5+ (this was resolved in the comments above)
  2. Backwards compatibility for the API (retaining assert_approx_df_equality and the equivalent assert_approx_column_equality)

I believe this can be done by doing something like the following with a deprecation warning so users know to switch their code base over to assert_df_equality in future versions:

def assert_approx_df_equality(*args, **kwargs):
    warnings.warn("Use assert_df_equality instead.", DeprecationWarning)  # needs `import warnings`
    assert_df_equality(*args, **kwargs)

Does that look like a good idea?
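
For example (a hypothetical test, assuming pytest and a pair of existing DataFrame fixtures; none of this is in the PR yet), the deprecation could be pinned down like this:

import pytest

def test_assert_approx_df_equality_warns(df1, df2):
    # df1/df2 are assumed fixtures; pytest.warns fails if no warning is raised.
    with pytest.warns(DeprecationWarning):
        assert_approx_df_equality(df1, df2, precision=0.1)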

There are two more outstanding things I need your help on before this can come out of draft status:

  1. Firstly, an answer to the replies to your comment above.
  2. Updating the screenshots in the README.md to reflect the new changes

Are you able to support me with those @MrPowers ?
