
Implement Decimal support for Postgres backend #10216

Merged: 47 commits into develop from wip/gmt/10213-read-pg, Jul 2, 2024

Conversation

GregoryTravis
Contributor

Pull Request Description

Important Notes

Checklist

Please ensure that the following checklist has been satisfied before submitting the PR:

  • The documentation has been updated, if necessary.
  • Screenshots/screencasts have been attached, if there are any visual changes. For interactive or animated visual changes, a screencast is preferred.
  • All code follows the Scala, Java, TypeScript, and Rust style guides. In case you are using a language not listed above, follow the Rust style guide.
  • Unit tests have been written where possible.

@@ -29,7 +29,7 @@ make_aggregate_column : DB_Table -> Aggregate_Column -> Text -> Dialect -> (Any
 make_aggregate_column table aggregate as dialect infer_return_type problem_builder =
     is_non_empty_selector v = v.is_nothing.not && v.not_empty
     simple_aggregate op_kind columns =
-        expression = SQL_Expression.Operation op_kind (columns.map c->c.expression)
+        expression = dialect.cast_op_type op_kind columns (SQL_Expression.Operation op_kind (columns.map c->c.expression))
Contributor Author

I am adding the cast here, at the SQL level, instead of using DB_Column.cast. This seems like the right thing to me, because we really just wish that Postgres understood that a sum of integers is also an integer; since it doesn't, we're helping it out here.

And by doing it at this point, the regular type inference code will take care of the rest.
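
To illustrate what the cast works around (my own example, not from the PR) - Postgres reports the type of a SUM over int8 as numeric, even when every input is integral:

SELECT pg_typeof(SUM(x)) FROM (VALUES (1::int8), (2::int8)) AS t(x);
 pg_typeof
-----------
 numeric
(1 row)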

Member
@radeusgd radeusgd Jun 27, 2024

It seems like the right place to call into this hook.

I'm wondering if this is the right thing to do with the type change though.
While I understand we want to work with long and not BigInteger in the in-memory backend, in a remote Postgres Database I'm not sure if the big-integer arithmetic will be the bottleneck. It may be in some cases I guess...

The problem is - what if the result overflows? If the type of the Sum is the same as the type of the input, it is very easy to overflow it - just take two elements close to the maximum value representable in the input type and sum them.

I think that is exactly the reason why the Postgres DB has these rules for 'promoting' the aggregate type to a larger type - to be able to hold a sum. I imagine it still has its limitations - a sum of an int4 column is promoted to int8, so if we sum 2^32 values of max int4 we could get it to overflow - but I assume it is rather rare to sum that many values, so it's probably 'good enough'.

However as noted above - if we do not enlarge the type, we can get an overflow very easily with just 2 rows.

Probably the problem is that our columns are mostly int8 and so the only 'larger' type that we can promote to is the numeric (big-integer) type.

Maybe we should use the int4 type more if our data fits only 32-bits? Then the aggregate would become int8, still being a 'fast' number.
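
For reference, a quick check of those promotion rules in plain Postgres (my illustration, not part of the PR):

SELECT pg_typeof(SUM(a)) AS sum_of_int4, pg_typeof(SUM(b)) AS sum_of_int8
  FROM (VALUES (1::int4, 1::int8)) AS t(a, b);
 sum_of_int4 | sum_of_int8
-------------+-------------
 bigint      | numeric
(1 row)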

Member

If we still want to have this cast, we should at least ensure that we will not corrupt the data. We must add a test case where we have a column containing e.g. 2 rows with max int8 value, and add them together. Let's see what happens :)

I imagine if we get some kind of 'overflow' error - that will be OK. Ideally we should intercept this error, give the user a clear indication of what happened, and suggest that they can cast to a larger type (e.g. numeric) to get the result (paying the performance price).

I think if we got a value that is modulo-truncated - that would be unacceptable, as it would be silent data corruption. But looking at a simple example:

radeusgd=> SELECT pg_typeof(9223372036854775808);
 pg_typeof
-----------
 numeric
(1 row)

radeusgd=> SELECT CAST(9223372036854775808 AS int4);
ERROR:  integer out of range

I imagine we will likely just get an error, so it would be good to intercept it and handle it in a nice way for the user.
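
A quick sketch of what such a test would exercise in a plain Postgres session (my illustration, not the PR's test code):

-- SUM(int8) is promoted to numeric, so two max-int8 values sum without overflow:
SELECT SUM(x) FROM (VALUES (9223372036854775807::int8), (9223372036854775807::int8)) AS t(x);
-- returns 18446744073709551614

-- Forcing the result back down to int8 fails loudly rather than truncating silently:
SELECT CAST(SUM(x) AS int8) FROM (VALUES (9223372036854775807::int8), (9223372036854775807::int8)) AS t(x);
-- ERROR:  bigint out of range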

Member

Oops, I wrote this comment before reading Postgres_Dialect.enso, and I think I misunderstood the idea here.

Member

Okay, so do I understand correctly that the cast is needed to set the scale of the NUMERIC type to 0 so that it is interpreted as BigInteger and not BigDecimal?

I thought we were casting to int8, but apparently we are not; sorry for my confusing comments above...
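
To illustrate the distinction (my own example): a NUMERIC with an explicit scale of 0 can only hold integral values, so it can be mapped to BigInteger rather than BigDecimal:

SELECT CAST(12345.6 AS numeric);        -- keeps the fraction: 12345.6
SELECT CAST(12345.6 AS numeric(20, 0)); -- rounds to an integral value: 12346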

Member

Still, I guess it may be good to add such a test (summing 2 rows that contain MAX INT64 values). It will be a good test to verify that our implementation is correct and does not cause data corruption across the various DB backends (as I imagine how well the backends handle this may vary).

Contributor Author

Added this test.

@@ -130,6 +130,12 @@ type Redshift_Dialect
         _ = [approximate_result_type, infer_result_type_from_database_callback]
         column

+    ## PRIVATE
+       Add an extra cast to adjust the output type of certain operations with certain arguments.
Member

This seems very useful, I imagine I may want to use a hook like this in the Snowflake dialect as well to handle some weird edge cases.

I'd appreciate it if we could make this comment a bit more detailed, so it is clear how the hook is supposed to be used.

I'd like to know:

  • When is this method called? Is it when returning a result from any operation, or just some operations?
  • I imagine it's good to say that 'In most cases this method will just return the expression unchanged, it is used only to override the type in cases where the default one that the database uses is not what we want.'.

Contributor Author

Done

## PRIVATE
   Add an extra cast to adjust the output type of certain operations with certain arguments.
cast_op_type self (op_kind:Text) (args:(Vector Internal_Column)) (expression:SQL_Expression) =
    is_bigint ic = ic.sql_type_reference.get.typeid == Types.BIGINT
Member

Personally I prefer to call it int8, even though BIGINT is indeed Postgres's 'primary' name for this. It is just too easy to confuse it with Java's BigInteger (Postgres's NUMERIC).

Contributor Author

Done

Member
@radeusgd radeusgd left a comment

I like the idea of the cast_op_type hook. I'd just appreciate some more info about it, e.g. when it is called. Currently it only happens on aggregates, so if I want to use it in any other place, I'd have to remember to add the call to the hook. That is totally OK (no need to call the hook in yet-unused places), but let's make that clear.

I was a bit confused about how the type-change logic works, but if I understand correctly it seems all good.

@GregoryTravis
Contributor Author

Not ready to merge this yet -- I want to add many more tests checking for accidental Decimals.

@GregoryTravis GregoryTravis changed the title Read Decimal column from Postgres into in-memory table Implement Decimal type for Postgres backend Jun 28, 2024
@GregoryTravis GregoryTravis changed the title Implement Decimal type for Postgres backend Implement Decimal support for Postgres backend Jun 28, 2024
@GregoryTravis GregoryTravis marked this pull request as ready for review June 28, 2024 22:33
@GregoryTravis
Contributor Author

Ready to merge; return types verified.

@GregoryTravis GregoryTravis merged commit 48fb999 into develop Jul 2, 2024
36 checks passed
@GregoryTravis GregoryTravis deleted the wip/gmt/10213-read-pg branch July 2, 2024 19:01