DataFramesNotEqualError when dataframes appear identical #37

Closed
bweisberg opened this issue Dec 17, 2021 · 6 comments

Comments

@bweisberg

I have two dataframes that appear identical, but assert_approx_df_equality is throwing a DataFramesNotEqualError. Something intermittent may be going on, because this code passed on the development cluster but failed in the test pipeline. Also, changing the precision from 0.001 to 1.0 allows the test to pass, although I don't see any differences between the actual and expected output.

from chispa.dataframe_comparer import assert_approx_df_equality

actual_df = ...create the dataframe with my component...

expected_data = [
    ('POINT (2.5 1.5)', 1.0, 1.0, 0.7071067811865476, 2.0, 2.0, False),
    ('POINT (2.55 2.25)', 2.0, 2.0, 0.14142135623730964, 2.65, 2.35, False),
    ('POINT (4.75 2.5)', 3.0, 3.0, 0.5, 5.25, 2.5, False),
    ('POINT EMPTY', 4.0, None, -999.0, float('nan'), float('nan'), False),
]
expected_df = spark.createDataFrame(
    expected_data,
    ["wkt", "point_id", "poly_id", "distance", "X", "Y", "isOnRight"],
).sort("point_id")

actual_df.show()
expected_df.show()

assert_approx_df_equality(actual_df, expected_df, 0.001, ignore_nullable=True)

The output of the show commands:

+-----------------+--------+-------+-------------------+----+----+---------+
|              wkt|point_id|poly_id|           distance|   X|   Y|isOnRight|
+-----------------+--------+-------+-------------------+----+----+---------+
|  POINT (2.5 1.5)|     1.0|    1.0| 0.7071067811865476| 2.0| 2.0|    false|
|POINT (2.55 2.25)|     2.0|    2.0|0.14142135623730964|2.65|2.35|    false|
| POINT (4.75 2.5)|     3.0|    3.0|                0.5|5.25| 2.5|    false|
|      POINT EMPTY|     4.0|   null|             -999.0| NaN| NaN|    false|
+-----------------+--------+-------+-------------------+----+----+---------+

+-----------------+--------+-------+-------------------+----+----+---------+
|              wkt|point_id|poly_id|           distance|   X|   Y|isOnRight|
+-----------------+--------+-------+-------------------+----+----+---------+
|  POINT (2.5 1.5)|     1.0|    1.0| 0.7071067811865476| 2.0| 2.0|    false|
|POINT (2.55 2.25)|     2.0|    2.0|0.14142135623730964|2.65|2.35|    false|
| POINT (4.75 2.5)|     3.0|    3.0|                0.5|5.25| 2.5|    false|
|      POINT EMPTY|     4.0|   null|             -999.0| NaN| NaN|    false|
+-----------------+--------+-------+-------------------+----+----+---------+

The exception shows the last three rows are different though I can't spot the differences.

DataFramesNotEqualError                   Traceback (most recent call last)
<command-340851985589312> in <module>
     50 expected_df.show()
     51 
---> 52 assert_approx_df_equality(actual_df, expected_df, 0.001, ignore_nullable=True)

/databricks/python/lib/python3.7/site-packages/chispa/dataframe_comparer.py in assert_approx_df_equality(df1, df2, precision, ignore_nullable)
     38 def assert_approx_df_equality(df1, df2, precision, ignore_nullable=False):
     39     assert_schema_equality(df1.schema, df2.schema, ignore_nullable)
---> 40     assert_generic_rows_equality(df1, df2, are_rows_approx_equal, [precision])
     41 
     42 

/databricks/python/lib/python3.7/site-packages/chispa/dataframe_comparer.py in assert_generic_rows_equality(df1, df2, row_equality_fun, row_equality_fun_args)
     62             t.add_row([r1, r2])
     63     if allRowsEqual == False:
---> 64         raise DataFramesNotEqualError("\n" + t.get_string())
     65 
     66 

DataFramesNotEqualError: 
+------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------+
|                                                          df1                                                           |                                                          df2                                                           |
+------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------+
|   Row(wkt='POINT (2.5 1.5)', point_id=1.0, poly_id=2.0, distance=0.7071067811865476, X=3.0, Y=2.0, isOnRight=False)    |   Row(wkt='POINT (2.5 1.5)', point_id=1.0, poly_id=1.0, distance=0.7071067811865476, X=2.0, Y=2.0, isOnRight=False)    |
| Row(wkt='POINT (2.55 2.25)', point_id=2.0, poly_id=2.0, distance=0.14142135623730964, X=2.65, Y=2.35, isOnRight=False) | Row(wkt='POINT (2.55 2.25)', point_id=2.0, poly_id=2.0, distance=0.14142135623730964, X=2.65, Y=2.35, isOnRight=False) |
|          Row(wkt='POINT (4.75 2.5)', point_id=3.0, poly_id=3.0, distance=0.5, X=5.25, Y=2.5, isOnRight=False)          |          Row(wkt='POINT (4.75 2.5)', point_id=3.0, poly_id=3.0, distance=0.5, X=5.25, Y=2.5, isOnRight=False)          |
|           Row(wkt='POINT EMPTY', point_id=4.0, poly_id=None, distance=-999.0, X=nan, Y=nan, isOnRight=False)           |           Row(wkt='POINT EMPTY', point_id=4.0, poly_id=None, distance=-999.0, X=nan, Y=nan, isOnRight=False)           |
+------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------+

and the two schemas compared:

root
 |-- wkt: string (nullable = true)
 |-- point_id: double (nullable = true)
 |-- poly_id: double (nullable = true)
 |-- distance: double (nullable = true)
 |-- X: double (nullable = true)
 |-- Y: double (nullable = true)
 |-- isOnRight: boolean (nullable = true)

root
 |-- wkt: string (nullable = true)
 |-- point_id: double (nullable = true)
 |-- poly_id: double (nullable = true)
 |-- distance: double (nullable = true)
 |-- X: double (nullable = true)
 |-- Y: double (nullable = true)
 |-- isOnRight: boolean (nullable = true)
@bweisberg
Author

This issue is probably related to the NaN values. I see now that assert_df_equality has an allow_nan_equality=True option. Is there an allow_nan_equality option for assert_approx_df_equality?
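For context on why NaN values can trip an equality check: under IEEE 754, NaN compares unequal to everything, including itself, so a plain `==` on two values that are both `float('nan')` reports a difference. The sketch below is plain Python, not chispa's code; `approx_equal` is a hypothetical helper that optionally treats two NaNs as equal.

```python
import math

def approx_equal(a, b, precision, allow_nan_equality=True):
    """Hypothetical helper: compare two floats within an absolute tolerance."""
    if math.isnan(a) and math.isnan(b):
        # Two NaNs only count as equal when the caller opts in.
        return allow_nan_equality
    # If only one side is NaN, the subtraction yields NaN and this is False.
    return abs(a - b) <= precision

nan = float("nan")
print(nan == nan)                       # False
print(approx_equal(nan, nan, 0.001))    # True
print(approx_equal(2.65, 2.65, 0.001))  # True
```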

@bweisberg
Author

Duplicate of #28 and #29.

@bweisberg
Author

I am curious why this same assert passes in another environment. Also, why does the test pass when the precision is increased from 0.001 to 1.0?
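One possible explanation, judging only from the first row of the exception output above: `poly_id` is 2.0 versus 1.0 and `X` is 3.0 versus 2.0, so both columns differ by exactly 1.0. A sketch, assuming the approximate comparison accepts a pair when the absolute difference does not exceed the precision (an assumption about the form of the check, not a confirmed detail of chispa); `rows_within` is a hypothetical helper.

```python
# Float columns of the first row as printed in the exception (df1 vs df2).
df1_row = {"point_id": 1.0, "poly_id": 2.0, "distance": 0.7071067811865476, "X": 3.0, "Y": 2.0}
df2_row = {"point_id": 1.0, "poly_id": 1.0, "distance": 0.7071067811865476, "X": 2.0, "Y": 2.0}

def rows_within(r1, r2, precision):
    """Assumed rule: every column differs by at most `precision`."""
    return all(abs(r1[k] - r2[k]) <= precision for k in r1)

print(rows_within(df1_row, df2_row, 0.001))  # False
print(rows_within(df1_row, df2_row, 1.0))    # True
```

If that is what happened, a precision of 1.0 hides a real difference in the data rather than fixing the test.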

@bweisberg
Author

I see this error intermittently: for example, in a notebook cell I can run it and it passes, then re-run it and it fails. I switched from assert_approx_df_equality to assert_df_equality to take advantage of allow_nan_equality=True, ignore_row_order=True, and ignore_column_order=True. I now notice that the dataframes shown in the error message differ from the output of the two show commands I run before the assert.

point_data = [ 
        ('POINT (2.5 1.5)', 1.0),
        ('POINT (2.55 2.25)', 2.0),
        ('POINT (4.75 2.5)', 3.0),
        ('POINT EMPTY', 4.0),
        (None, 5.0)
     ]

point_df = (spark.createDataFrame(point_data, ["wkt", "point_id"])
     .selectExpr("ST_FromText(wkt) SHAPE", "point_id")
        .withMeta("POINT", 4326))

poly_data = [ 
        ('POLYGON ((0.5 0.5, 3.5 3.5, 1.75 2.75, 0.5 0.5))', 1.0),
        ('POLYGON ((1.5 3.5, 4.0 1.0, 3.0 3.0, 1.5 3.5))', 2.0),
        ('POLYGON ((5.25 0.5, 5.25 4.5, 5.26 4.5, 5.26 0.5, 5.25 0.5))', 3.0),
        ('POLYGON EMPTY', 4.0),
        (None, 5.0)
     ]

poly_df = (spark.createDataFrame(poly_data, ["wkt", "poly_id"])
    .selectExpr("ST_FromText(wkt) SHAPE", "poly_id")
    .withMeta("POLYGON", 4326))

actual_df = (...create my dataframe...)

expected_data = [ 
        ('POINT (2.5 1.5)', 1.0, 1.0, 0.7071067811865476, 2.0, 2.0, False),
        ('POINT (2.55 2.25)', 2.0, 2.0, 0.14142135623730964, 2.65, 2.35, False),
        ('POINT (4.75 2.5)', 3.0, 3.0, 0.5, 5.25, 2.5, False),
        ('POINT EMPTY', 4.0, None, -999.0, float('nan'), float('nan'), False)
     ]
expected_df = (spark.createDataFrame(expected_data, ["wkt", "point_id", "poly_id", "distance", "X", "Y", "isOnRight"])).sort("point_id")


actual_df.show()
expected_df.show()
#TODO: why does this fail intermittently when dataframes appear equal?
assert_df_equality(actual_df, expected_df, ignore_nullable=True, allow_nan_equality=True, ignore_row_order=True, ignore_column_order=True)

The result of the show commands, followed by the exception:

+-----------------+--------+-------+-------------------+----+----+---------+
|              wkt|point_id|poly_id|           distance|   X|   Y|isOnRight|
+-----------------+--------+-------+-------------------+----+----+---------+
|  POINT (2.5 1.5)|     1.0|    1.0| 0.7071067811865476| 2.0| 2.0|    false|
|POINT (2.55 2.25)|     2.0|    2.0|0.14142135623730964|2.65|2.35|    false|
| POINT (4.75 2.5)|     3.0|    3.0|                0.5|5.25| 2.5|    false|
|      POINT EMPTY|     4.0|   null|             -999.0| NaN| NaN|    false|
+-----------------+--------+-------+-------------------+----+----+---------+

+-----------------+--------+-------+-------------------+----+----+---------+
|              wkt|point_id|poly_id|           distance|   X|   Y|isOnRight|
+-----------------+--------+-------+-------------------+----+----+---------+
|  POINT (2.5 1.5)|     1.0|    1.0| 0.7071067811865476| 2.0| 2.0|    false|
|POINT (2.55 2.25)|     2.0|    2.0|0.14142135623730964|2.65|2.35|    false|
| POINT (4.75 2.5)|     3.0|    3.0|                0.5|5.25| 2.5|    false|
|      POINT EMPTY|     4.0|   null|             -999.0| NaN| NaN|    false|
+-----------------+--------+-------+-------------------+----+----+---------+

DataFramesNotEqualError: 
+------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------+
|                                                          df1                                                           |                                                          df2                                                           |
+------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------+
| Row(X=2.65, Y=2.35, distance=0.14142135623730964, isOnRight=False, point_id=2.0, poly_id=2.0, wkt='POINT (2.55 2.25)') |   Row(X=2.0, Y=2.0, distance=0.7071067811865476, isOnRight=False, point_id=1.0, poly_id=1.0, wkt='POINT (2.5 1.5)')    |
|   Row(X=3.0, Y=2.0, distance=0.7071067811865476, isOnRight=False, point_id=1.0, poly_id=2.0, wkt='POINT (2.5 1.5)')    | Row(X=2.65, Y=2.35, distance=0.14142135623730964, isOnRight=False, point_id=2.0, poly_id=2.0, wkt='POINT (2.55 2.25)') |
|          Row(X=5.25, Y=2.5, distance=0.5, isOnRight=False, point_id=3.0, poly_id=3.0, wkt='POINT (4.75 2.5)')          |          Row(X=5.25, Y=2.5, distance=0.5, isOnRight=False, point_id=3.0, poly_id=3.0, wkt='POINT (4.75 2.5)')          |
|           Row(X=nan, Y=nan, distance=-999.0, isOnRight=False, point_id=4.0, poly_id=None, wkt='POINT EMPTY')           |           Row(X=nan, Y=nan, distance=-999.0, isOnRight=False, point_id=4.0, poly_id=None, wkt='POINT EMPTY')           |
+------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------+
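One thing that can produce this kind of mismatch: Spark evaluates DataFrames lazily, so `show()` and the collection done inside the assertion each run the computation again, and a non-deterministic step can return different rows each time. The class below is a hypothetical pure-Python stand-in for that behavior, not Spark code, and the alternating outcomes are made up for illustration.

```python
import itertools

class LazyFrame:
    """Hypothetical stand-in for a lazily evaluated DataFrame."""

    def __init__(self, compute):
        self._compute = compute

    def collect(self):
        # Every action re-runs the computation from scratch.
        return self._compute()

# Simulate a step whose result flips between runs (for example an arbitrary tie-break).
outcomes = itertools.cycle([
    [("POINT (2.5 1.5)", 1.0)],
    [("POINT (2.5 1.5)", 2.0)],
])
actual = LazyFrame(lambda: next(outcomes))

shown = actual.collect()     # what show() would display
asserted = actual.collect()  # what the assertion would compare
print(shown == asserted)     # False
```

Collecting the rows once and using that single snapshot for both printing and comparing keeps the two views consistent while debugging.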

bweisberg reopened this Dec 22, 2021
@bweisberg
Author

I discovered that, with those inputs, the function I am testing is non-deterministic, so the results in actual_df were changing on each run. That explains why the output of .show() differs from the exception message. After correcting the inputs I still see an inexplicable DataFramesNotEqualError. I have simplified the example to rule out NaN comparison issues:


point_data = [ 
        ('POINT (2.5 1.75)', 1.0),
        ('POINT (2.55 2.25)', 2.0),
        ('POINT (4.75 2.5)', 3.0),
        (None, 5.0)
     ]

point_df = (spark.createDataFrame(point_data, ["wkt", "point_id"])
     .selectExpr("ST_FromText(wkt) SHAPE", "point_id")
        .withMeta("POINT", 4326))

poly_data = [ 
        ('POLYGON ((0.5 0.5, 3.5 3.5, 1.75 2.75, 0.5 0.5))', 1.0),
        ('POLYGON ((1.5 3.5, 4.0 1.0, 3.0 3.0, 1.5 3.5))', 2.0),
        ('POLYGON ((5.25 0.5, 5.25 4.5, 5.26 4.5, 5.26 0.5, 5.25 0.5))', 3.0),
        ('POLYGON EMPTY', 4.0),
        (None, 5.0)
     ]

poly_df = (spark.createDataFrame(poly_data, ["wkt", "poly_id"])
    .selectExpr("ST_FromText(wkt) SHAPE", "poly_id")
    .withMeta("POLYGON", 4326))

actual_df = (...call the function with inputs that produce deterministic results...)

expected_data = [ 
        ('POINT (2.5 1.75)', 1.0, 1.0, 0.5303300858899106, 2.125, 2.125, False),
        ('POINT (2.55 2.25)', 2.0, 2.0, 0.14142135623730964, 2.65, 2.35, False),
        ('POINT (4.75 2.5)', 3.0, 3.0, 0.5, 5.25, 2.5, False)
     ]
expected_df = (spark.createDataFrame(expected_data, ["wkt", "point_id", "poly_id", "distance", "X", "Y", "isOnRight"])).sort("point_id")

assert_df_equality(actual_df, expected_df, ignore_nullable=True, allow_nan_equality=False, ignore_row_order=True, ignore_column_order=True)

/databricks/python/lib/python3.7/site-packages/chispa/dataframe_comparer.py in assert_df_equality(df1, df2, ignore_nullable, transforms, allow_nan_equality, ignore_column_order, ignore_row_order)
     25         assert_generic_rows_equality(df1, df2, are_rows_equal_enhanced, [True])
     26     else:
---> 27         assert_basic_rows_equality(df1, df2)
     28 
     29 

/databricks/python/lib/python3.7/site-packages/chispa/dataframe_comparer.py in assert_basic_rows_equality(df1, df2)
     76             else:
     77                 t.add_row([r1, r2])
---> 78         raise DataFramesNotEqualError("\n" + t.get_string())

DataFramesNotEqualError: 
+------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------+
|                                                          df1                                                           |                                                          df2                                                           |
+------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------+
| Row(X=2.65, Y=2.35, distance=0.14142135623730964, isOnRight=False, point_id=2.0, poly_id=2.0, wkt='POINT (2.55 2.25)') | Row(X=2.125, Y=2.125, distance=0.5303300858899106, isOnRight=False, point_id=1.0, poly_id=1.0, wkt='POINT (2.5 1.75)') |
| Row(X=2.875, Y=2.125, distance=0.5303300858899106, isOnRight=False, point_id=1.0, poly_id=2.0, wkt='POINT (2.5 1.75)') | Row(X=2.65, Y=2.35, distance=0.14142135623730964, isOnRight=False, point_id=2.0, poly_id=2.0, wkt='POINT (2.55 2.25)') |
|          Row(X=5.25, Y=2.5, distance=0.5, isOnRight=False, point_id=3.0, poly_id=3.0, wkt='POINT (4.75 2.5)')          |          Row(X=5.25, Y=2.5, distance=0.5, isOnRight=False, point_id=3.0, poly_id=3.0, wkt='POINT (4.75 2.5)')          |
+------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------+
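With `ignore_row_order=True` the two sides appear to be sorted before they are paired, so a single differing row can shift the pairing and make two rows look wrong, as in the table above. A small order-insensitive diff can narrow this down. This is a sketch in plain Python on tuples copied from the exception (in practice they might come from `df.collect()`); `diff_rows` is a hypothetical helper, not part of chispa, and it relies on exact equality, so it is not suitable for rows containing NaN.

```python
from collections import Counter

# Rows as (wkt, point_id, poly_id, distance, X, Y, isOnRight), copied from the exception.
actual = [
    ("POINT (2.55 2.25)", 2.0, 2.0, 0.14142135623730964, 2.65, 2.35, False),
    ("POINT (2.5 1.75)", 1.0, 2.0, 0.5303300858899106, 2.875, 2.125, False),
    ("POINT (4.75 2.5)", 3.0, 3.0, 0.5, 5.25, 2.5, False),
]
expected = [
    ("POINT (2.5 1.75)", 1.0, 1.0, 0.5303300858899106, 2.125, 2.125, False),
    ("POINT (2.55 2.25)", 2.0, 2.0, 0.14142135623730964, 2.65, 2.35, False),
    ("POINT (4.75 2.5)", 3.0, 3.0, 0.5, 5.25, 2.5, False),
]

def diff_rows(left, right):
    """Return (rows only in left, rows only in right), ignoring order."""
    l, r = Counter(left), Counter(right)
    return list((l - r).elements()), list((r - l).elements())

only_actual, only_expected = diff_rows(actual, expected)
print(only_actual)    # the single row for point_id 1.0 with poly_id 2.0
print(only_expected)  # the single row for point_id 1.0 with poly_id 1.0
```

Here only the row for point_id 1.0 differs: the actual side assigns it to poly_id 2.0 and the expected side to poly_id 1.0.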

@bweisberg
Author

I discovered another non-deterministic result of the function I'm testing. I'm going to close this issue at this point and thank chispa for helping me catch this behavior.
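The rows in the exceptions above hint at one way such non-determinism can arise, although the issue does not say what the cause was: for point (2.5, 1.75) the two reported locations, (2.125, 2.125) with poly_id 1.0 and (2.875, 2.125) with poly_id 2.0, are at the same distance from the point. A sketch, assuming the function under test picks the nearest polygon and that ties are otherwise broken arbitrarily; `pick_nearest` and its tie-break on `poly_id` are hypothetical.

```python
import math

point = (2.5, 1.75)
# (poly_id, X, Y) as reported on the two sides of the exception.
candidates = [(2.0, 2.875, 2.125), (1.0, 2.125, 2.125)]

def distance(p, x, y):
    """Euclidean distance from point p to (x, y)."""
    return math.hypot(p[0] - x, p[1] - y)

def pick_nearest(p, cands):
    """Hypothetical deterministic choice: smallest distance, then smallest poly_id."""
    return min(cands, key=lambda c: (distance(p, c[1], c[2]), c[0]))

d = [distance(point, x, y) for _, x, y in candidates]
print(d[0] == d[1])                        # True: an exact tie
print(pick_nearest(point, candidates)[0])  # 1.0
```

Adding an explicit secondary sort key like this is one way to make the result reproducible, so that an exact comparison in a test is meaningful.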
