DataFramesNotEqualError when dataframes appear identical #37
Comments
This issue is probably related to the |
I am curious why this same assert passes in another environment. Also, why does the assert pass when the precision is increased from 0.001 to 1.0? |
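On the precision question, a minimal pure-Python sketch may help. It assumes an approximate comparison that treats two floats as equal when their absolute difference is within the precision; this is an illustration of that idea, not chispa's actual implementation. The helper `rows_approx_equal` and the two sample rows are hypothetical, chosen so that the values differ by more than 0.001 but less than 1.0.

```python
import math


def rows_approx_equal(row1, row2, precision):
    """Compare two rows, allowing float columns to differ by up to `precision`.

    Hypothetical helper for illustration only; not part of chispa.
    """
    if len(row1) != len(row2):
        return False
    for a, b in zip(row1, row2):
        if isinstance(a, float) and isinstance(b, float):
            # Absolute tolerance only; NaN is never close to anything here.
            if not math.isclose(a, b, rel_tol=0.0, abs_tol=precision):
                return False
        elif a != b:
            return False
    return True


# Hypothetical rows: the third column differs by roughly 0.0086.
actual_row = ("POINT (2.55 2.25)", 2.0, 0.14142135623730964)
expected_row = ("POINT (2.55 2.25)", 2.0, 0.15)

fails_at_tight_precision = not rows_approx_equal(actual_row, expected_row, 0.001)
passes_at_loose_precision = rows_approx_equal(actual_row, expected_row, 1.0)
```

Under this assumption, a precision of 1.0 accepts any float difference up to 1.0, so a passing test at that setting only shows that the values are within 1.0 of each other, not that the dataframes are identical.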
I see this error intermittently. For example, in a notebook cell I can run the assert and it passes, then re-run it and it fails. I switched from assert_approx_df_equality to assert_df_equality to take advantage of allow_nan_equality=True, ignore_row_order=True, and ignore_column_order=True. I now notice that the error message shows dataframes that differ from the output of the two show commands I run before the assert.
point_data = [
('POINT (2.5 1.5)', 1.0),
('POINT (2.55 2.25)', 2.0),
('POINT (4.75 2.5)', 3.0),
('POINT EMPTY', 4.0),
(None, 5.0)
]
point_df = (spark.createDataFrame(point_data, ["wkt", "point_id"])
            .selectExpr("ST_FromText(wkt) SHAPE", "point_id")
            .withMeta("POINT", 4326))
poly_data = [
('POLYGON ((0.5 0.5, 3.5 3.5, 1.75 2.75, 0.5 0.5))', 1.0),
('POLYGON ((1.5 3.5, 4.0 1.0, 3.0 3.0, 1.5 3.5))', 2.0),
('POLYGON ((5.25 0.5, 5.25 4.5, 5.26 4.5, 5.26 0.5, 5.25 0.5))', 3.0),
('POLYGON EMPTY', 4.0),
(None, 5.0)
]
poly_df = (spark.createDataFrame(poly_data, ["wkt", "poly_id"])
           .selectExpr("ST_FromText(wkt) SHAPE", "poly_id")
           .withMeta("POLYGON", 4326))
actual_df = (...create my dataframe...)
expected_data = [
('POINT (2.5 1.5)', 1.0, 1.0, 0.7071067811865476, 2.0, 2.0, False),
('POINT (2.55 2.25)', 2.0, 2.0, 0.14142135623730964, 2.65, 2.35, False),
('POINT (4.75 2.5)', 3.0, 3.0, 0.5, 5.25, 2.5, False),
('POINT EMPTY', 4.0, None, -999.0, float('nan'), float('nan'), False)
]
expected_df = (spark.createDataFrame(expected_data, ["wkt", "point_id", "poly_id", "distance", "X", "Y", "isOnRight"])).sort("point_id")
actual_df.show()
expected_df.show()
#TODO: why does this fail intermittently when dataframes appear equal?
assert_df_equality(actual_df, expected_df, ignore_nullable=True, allow_nan_equality=True, ignore_row_order=True, ignore_column_order=True)
The result of the show commands:
|
I discovered that, with those inputs, the function I am testing is non-deterministic, so the results in actual_df changed on each run. That explains why the output of .show() differs from the exception message. After correcting the inputs I still see an inexplicable DataFramesNotEqual error. I have simplified the example to rule out NaN comparison issues:
|
I discovered another non-deterministic result of the function I'm testing. I'm going to close this issue at this point and thank chispa for helping me catch this behavior. |
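For readers who hit the same symptom, the mismatch between the .show() output and the exception message follows from lazy evaluation: each action on a Spark DataFrame re-runs the computation, so a non-deterministic function can produce different rows for the show and for the assert. The sketch below is a pure-Python analogy rather than Spark code; `LazyFrame`, `materialize`, and `unstable_rows` are hypothetical names, and a counter stands in for the non-determinism so the behavior is reproducible.

```python
import itertools


class LazyFrame:
    """Toy stand-in for a lazily evaluated dataframe (not a Spark class)."""

    def __init__(self, compute):
        self._compute = compute
        self._rows = None

    def collect(self):
        # Without materialized rows, every action recomputes from scratch.
        if self._rows is None:
            return self._compute()
        return self._rows

    def materialize(self):
        # Evaluate once and pin the rows so later actions see the same data.
        self._rows = self._compute()
        return self


_counter = itertools.count()


def unstable_rows():
    # Hypothetical non-deterministic computation: the first value changes
    # on every evaluation.
    n = next(_counter)
    return [(1.0, 0.5 + n), (2.0, 0.25)]


lazy = LazyFrame(unstable_rows)
shown = lazy.collect()      # analogous to what .show() prints
compared = lazy.collect()   # analogous to what the assert evaluates
recomputed_differs = shown != compared

pinned = LazyFrame(unstable_rows).materialize()
pinned_stable = pinned.collect() == pinned.collect()
```

In a real test, one way to apply the same idea is to collect the rows of actual_df once and inspect that single result, which makes it easier to tell a comparison problem from a non-deterministic function under test.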
I have two dataframes that appear identical, but assert_approx_df_equality is throwing a DataFramesNotEqual error. There may be an intermittent issue going on, because this code passed on the development cluster but failed in the test pipeline. Also, changing the precision from 0.001 to 1.0 allows the test to pass, although I don't see any differences between the actual and expected output.
the output of the show commands:
The exception shows the last three rows are different though I can't spot the differences.
and the two schemas compared: