Implement Decimal support for Postgres backend #10216

Merged Jul 2, 2024. 47 commits; the diff below shows changes from 40 commits.

Commits
All 47 commits authored by GregoryTravis:

3ad23f1  treat scale nothing as unspecifed (Jun 4, 2024)
48464af  cast to decimal (Jun 5, 2024)
6b694c7  float int biginteger (Jun 5, 2024)
7951326  conversion failure ints (Jun 5, 2024)
1e3da12  Merge branch 'develop' into wip/gmt/10162-d-col-conversions (Jun 6, 2024)
aba8d48  loss of decimal precision (Jun 6, 2024)
145f9c0  precision loss for mixed column to float (Jun 6, 2024)
8ce0dfe  mixed columns (Jun 6, 2024)
e41bf5b  loss of precision on inexact float conversion (Jun 6, 2024)
1200cce  cleanup, reuse (Jun 6, 2024)
8d94d5c  changelog (Jun 6, 2024)
88e6714  Merge branch 'develop' into wip/gmt/10162-d-col-conversions (Jun 7, 2024)
631820c  review (Jun 7, 2024)
d4ded95  no fits bd (Jun 7, 2024)
24b265d  no warning on 0.1 conversion (Jun 7, 2024)
3ac81b7  fmt (Jun 7, 2024)
a4b364b  big_decimal_fetcher (Jun 7, 2024)
2553533  Merge branch 'develop' into wip/gmt/10213-read-pg (Jun 7, 2024)
c20d830  default fetcher and statement setting (Jun 7, 2024)
0e6bd39  Merge branch 'develop' into wip/gmt/10213-read-pg (Jun 17, 2024)
5d7f770  round-trip d (Jun 17, 2024)
1f97af7  fix warning (Jun 17, 2024)
49e50b3  Merge branch 'develop' into wip/gmt/10213-read-pg (Jun 18, 2024)
c11e3f6  expr +10 (Jun 18, 2024)
b62dcea  double builder retype to bigdecimal (Jun 19, 2024)
32fe09b  Use BD fetcher for underspecified postgres numeric column, not inferr… (Jun 20, 2024)
643c231  Merge branch 'develop' into wip/gmt/10213-read-pg (Jun 21, 2024)
2dd3927  fix tests (Jun 21, 2024)
f66889e  fix test (Jun 21, 2024)
7e748e1  Merge branch 'develop' into wip/gmt/10213-read-pg (Jun 24, 2024)
60c494f  cast_op_type (Jun 25, 2024)
0eeba74  Merge branch 'develop' into wip/gmt/10213-read-pg (Jun 26, 2024)
b9afc85  no-ops for other dialects (Jun 26, 2024)
111fad1  Types (Jun 26, 2024)
4ae2902  sum + avg (Jun 26, 2024)
4c4bc5b  avg + sum test (Jun 26, 2024)
b4f7115  fix test (Jun 26, 2024)
4206905  update agg type inference test (Jun 26, 2024)
358f89f  Merge branch 'develop' into wip/gmt/10213-read-pg (Jun 27, 2024)
4815631  wip (Jun 27, 2024)
2c87d7a  is_int8, stddev (Jun 27, 2024)
bea018a  more doc, overflow check (Jun 27, 2024)
40f9b8d  fmt (Jun 27, 2024)
357935b  Merge branch 'develop' into wip/gmt/10213-read-pg (Jun 28, 2024)
8652753  finish round-trip test (Jun 28, 2024)
6e7524f  wip (Jun 28, 2024)
47dc410  Merge branch 'develop' into wip/gmt/10213-read-pg (Jul 1, 2024)
@@ -130,6 +130,12 @@ type Redshift_Dialect
_ = [approximate_result_type, infer_result_type_from_database_callback]
column

## PRIVATE
Add an extra cast to adjust the output type of certain operations with certain arguments.
Member:

This seems very useful, I imagine I may want to use a hook like this in the Snowflake dialect as well to handle some weird edge cases.

I'd appreciate it if we could make this comment a bit more detailed, to explain how it is supposed to be used.

I'd like to know:

  • when is this method called: is it when returning a result from any operation? or just some operations?
  • I imagine it's good to say that 'In most cases this method will just return the expression unchanged, it is used only to override the type in cases where the default one that the database uses is not what we want.'.

GregoryTravis (author):

Done

cast_op_type self (op_kind:Text) (args:(Vector Internal_Column)) (expression:SQL_Expression) =
_ = [op_kind, args]
expression

## PRIVATE
prepare_fetch_types_query : SQL_Expression -> Context -> SQL_Statement
prepare_fetch_types_query self expression context =
@@ -29,7 +29,7 @@ make_aggregate_column : DB_Table -> Aggregate_Column -> Text -> Dialect -> (Any
make_aggregate_column table aggregate as dialect infer_return_type problem_builder =
is_non_empty_selector v = v.is_nothing.not && v.not_empty
simple_aggregate op_kind columns =
expression = SQL_Expression.Operation op_kind (columns.map c->c.expression)
expression = dialect.cast_op_type op_kind columns (SQL_Expression.Operation op_kind (columns.map c->c.expression))
GregoryTravis (author):

I am adding the cast here, at the SQL level, instead of using DB_Column.cast. This seems like the right thing to me, because we really just wish that Postgres understood that a sum of integers is also an integer, and since it doesn't, we're helping it out here.

And by doing it at this point, the regular type inference code will take care of the rest.

radeusgd (Member), Jun 27, 2024:

It seems like the right place to call into this hook.

I'm wondering if this is the right thing to do with the type change though.
While I understand we want to work with long and not BigInteger in the in-memory backend, in a remote Postgres Database I'm not sure if the big-integer arithmetic will be the bottleneck. It may be in some cases I guess...

The problem is - what if the result overflows? If the type of the Sum is the same as the type of the input, it is very easy to overflow it - just take two elements close to the maximum value representable in the input type and sum them.

I think that is exactly the reason why the Postgres DB has these rules for 'promoting' the aggregate type to a larger type - to be able to hold a sum. I imagine it still has its limitations - the sum of an int4 column is promoted to int8, so if we sum 2^32 values of max int4 we could get it to overflow - but I assume it is rather rare to sum that many values, so it's probably 'good enough'.

However as noted above - if we do not enlarge the type, we can get an overflow very easily with just 2 rows.

Probably the problem is that our columns are mostly int8 and so the only 'larger' type that we can promote to is the numeric (big-integer) type.

Maybe we should use the int4 type more if our data fits only 32-bits? Then the aggregate would become int8, still being a 'fast' number.
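The overflow scenario described in this comment is easy to reproduce outside the database. The sketch below (class name hypothetical) contrasts wrapping long addition, checked addition, and arbitrary-precision addition - corresponding roughly to silent corruption, Postgres raising "bigint out of range", and Postgres's promotion of SUM(int8) to numeric:

```java
import java.math.BigInteger;

public class SumOverflow {
    // Silent wrap-around: this is the "modulo-truncated" corruption case.
    static long wrappingSum(long a, long b) {
        return a + b;
    }

    // Checked addition: throws ArithmeticException instead of corrupting data,
    // analogous to Postgres raising "bigint out of range".
    static long checkedSum(long a, long b) {
        return Math.addExact(a, b);
    }

    // Arbitrary-precision sum, like Postgres promoting SUM(int8) to numeric.
    static BigInteger exactSum(long a, long b) {
        return BigInteger.valueOf(a).add(BigInteger.valueOf(b));
    }

    public static void main(String[] args) {
        long max = Long.MAX_VALUE;
        System.out.println(wrappingSum(max, max)); // -2 (silent corruption)
        System.out.println(exactSum(max, max));    // 18446744073709551614
        try {
            checkedSum(max, max);
        } catch (ArithmeticException e) {
            System.out.println("overflow detected: " + e.getMessage());
        }
    }
}
```

Just two rows at the maximum representable value are enough to trigger the wrap, which is why the "sum two MAX INT64 rows" test suggested below is a good canary.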

Member:

If we still want to have this cast, we should at least ensure that we will not corrupt the data. We must add a test case where we have a column containing e.g. 2 rows with max int8 value, and add them together. Let's see what happens :)

I imagine if we get some kind of 'overflow' error - that will be OK. Ideally we should intercept this error and give the user a clear indication what happened and suggest that they can cast to a larger type (e.g. numeric) to get the result (paying the performance price).

I think if we'd get a value that is modulo-truncated - that would be unacceptable as it would be a silent data corruption. But looking at a simple example:

radeusgd=> SELECT pg_typeof(9223372036854775808);
 pg_typeof
-----------
 numeric
(1 row)

radeusgd=> SELECT CAST(9223372036854775808 AS int4);
ERROR:  integer out of range

I imagine we will likely just get an error, so it would just be good to intercept it and handle in a nice way for the user.

Member:

Oops, I wrote this comment before reading Postgres_Dialect.enso and I think I misunderstood the idea here.

Member:

Okay, so do I understand correctly that the cast is needed to set the scale of the NUMERIC type to 0 so that it is interpreted as BigInteger and not BigDecimal?

I thought that we are casting to int8 but apparently we are not, sorry for my confusing comments above...
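For reference, the scale distinction can be checked in plain Java, since JDBC hands Postgres NUMERIC values over as java.math.BigDecimal. A minimal sketch (class name hypothetical): a BigDecimal whose scale is zero or negative converts losslessly via toBigIntegerExact, while a positive scale with a fractional part throws:

```java
import java.math.BigDecimal;
import java.math.BigInteger;

public class ScaleZero {
    // A NUMERIC(p, 0) value maps cleanly onto BigInteger.
    static BigInteger asBigInteger(BigDecimal d) {
        return d.toBigIntegerExact(); // throws ArithmeticException if fractional
    }

    public static void main(String[] args) {
        System.out.println(new BigDecimal("12345").scale());       // 0
        System.out.println(asBigInteger(new BigDecimal("12345"))); // 12345
        // A negative scale also denotes an integer value: 123 * 10^1 = 1230.
        System.out.println(new BigDecimal(BigInteger.valueOf(123), -1)); // 1.23E+3
        try {
            asBigInteger(new BigDecimal("0.1"));
        } catch (ArithmeticException e) {
            System.out.println("not an integer: positive scale with a fraction");
        }
    }
}
```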

Member:

Still, I guess it may be good to add such a test (Summing 2 rows that contain MAX INT64 values). It will be a good test verifying that our implementation is correct and does not cause data corruption in all kinds of DB backends (as I imagine how well the backends handle this may vary).

GregoryTravis (author):

Added this test.

sql_type_ref = infer_return_type op_kind columns expression
Internal_Column.Value as sql_type_ref expression

@@ -97,6 +97,19 @@ big_integer_fetcher =
make_builder_from_java_object_builder java_builder
Column_Fetcher.Value fetch_value make_builder

## PRIVATE
big_decimal_fetcher : Column_Fetcher
big_decimal_fetcher =
fetch_value rs i =
big_decimal = rs.getBigDecimal i
if rs.wasNull then Nothing else
big_decimal
make_builder initial_size java_problem_aggregator =
_ = java_problem_aggregator
java_builder = Java_Exports.make_bigdecimal_builder initial_size
make_builder_from_java_object_builder java_builder
Column_Fetcher.Value fetch_value make_builder

## PRIVATE
text_fetcher : Value_Type -> Column_Fetcher
text_fetcher value_type =
@@ -178,14 +191,14 @@ default_fetcher_for_value_type value_type =
# We currently don't distinguish timestamps without a timezone on the Enso value side.
Value_Type.Date_Time has_timezone ->
if has_timezone then date_time_fetcher else local_date_time_fetcher
# If we can determine that scale = 0
## If we can determine that scale <= 0, we use BigIntegerBuilder.
Otherwise, we use InferredBuilder, since it's possible some values
will be BigDecimal.
Value_Type.Decimal _ scale ->
is_guaranteed_integer = scale.is_nothing.not && scale <= 0
case is_guaranteed_integer of
True -> big_integer_fetcher
# If we cannot guarantee that the column is integer, we will fall back to Float values, since there is no BigDecimal implementation yet.
# In another place this will trigger a Inexact_Type_Coercion warning.
False -> double_fetcher
False -> big_decimal_fetcher
_ -> fallback_fetcher
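The switch from the double fetcher to a BigDecimal fetcher matters because float8 keeps only about 15-17 significant decimal digits, so wide NUMERIC values do not survive a round trip through double. A quick Java check (class name hypothetical):

```java
import java.math.BigDecimal;

public class FetcherPrecision {
    // Simulates fetching a NUMERIC value as float8 and converting back.
    static BigDecimal roundTripThroughDouble(BigDecimal d) {
        return BigDecimal.valueOf(d.doubleValue());
    }

    public static void main(String[] args) {
        // 21 significant digits: more than a double can hold.
        BigDecimal exact = new BigDecimal("12345678901234567890.5");
        BigDecimal lossy = roundTripThroughDouble(exact);
        System.out.println(exact);
        System.out.println(lossy);
        System.out.println(lossy.compareTo(exact) == 0); // false: precision lost
    }
}
```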

## PRIVATE
@@ -39,6 +39,8 @@ import project.SQL_Type.SQL_Type
from project.Errors import SQL_Error, Unsupported_Database_Operation
from project.Internal.IR.Operation_Metadata import Date_Period_Metadata

polyglot java import java.sql.Types

## PRIVATE

The dialect of PostgreSQL databases.
@@ -195,6 +197,24 @@ type Postgres_Dialect
Illegal_State.Error "The type computed by our logic is Char, but the Database computed a non-text type ("+db_type.to_display_text+"). This should never happen and should be reported as a bug in the Database library."
False -> column

## PRIVATE
Add an extra cast to adjust the output type of certain operations with certain arguments.
cast_op_type self (op_kind:Text) (args:(Vector Internal_Column)) (expression:SQL_Expression) =
is_bigint ic = ic.sql_type_reference.get.typeid == Types.BIGINT
Member:

personally I prefer to call it int8 even though BIGINT is indeed Postgres's 'primary' name for this. It is just easy to confuse it with Java's BigInteger (Postgres's NUMERIC).

GregoryTravis (author):

Done

is_int ic =
typeid = ic.sql_type_reference.get.typeid
typeid == Types.SMALLINT || typeid == Types.INTEGER || typeid == Types.BIGINT

cast_to = case op_kind of
"SUM" ->
if is_bigint (args.at 0) then "numeric(1000,0)" else Nothing
"AVG" ->
if is_int (args.at 0) then "float8" else Nothing
_ -> Nothing

if cast_to.is_nothing then expression else
SQL_Expression.Operation "CAST" [expression, SQL_Expression.Literal cast_to]
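A rough Java transliteration of the hook's decision table above (class and method names hypothetical; the JDBC type ids come from java.sql.Types, as in the Enso code):

```java
import java.sql.Types;

public class CastOpType {
    static boolean isInt8(int typeId) {
        return typeId == Types.BIGINT;
    }

    static boolean isInt(int typeId) {
        return typeId == Types.SMALLINT || typeId == Types.INTEGER || typeId == Types.BIGINT;
    }

    // Returns the SQL type to CAST the aggregate result to, or null for "leave unchanged".
    static String castTargetFor(String opKind, int firstArgTypeId) {
        return switch (opKind) {
            // Postgres promotes SUM(int8) to NUMERIC with unconstrained scale;
            // forcing scale 0 keeps it interpreted as a big-integer, not big-decimal.
            case "SUM" -> isInt8(firstArgTypeId) ? "numeric(1000,0)" : null;
            // AVG over integer types yields NUMERIC; we prefer a plain double.
            case "AVG" -> isInt(firstArgTypeId) ? "float8" : null;
            default -> null;
        };
    }

    public static void main(String[] args) {
        System.out.println(castTargetFor("SUM", Types.BIGINT));  // numeric(1000,0)
        System.out.println(castTargetFor("AVG", Types.INTEGER)); // float8
        System.out.println(castTargetFor("MAX", Types.BIGINT));  // null
    }
}
```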

## PRIVATE
prepare_fetch_types_query : SQL_Expression -> Context -> SQL_Statement
prepare_fetch_types_query self expression context =
@@ -184,6 +184,12 @@ type SQLite_Dialect
new_sql_type_reference = SQL_Type_Reference.from_constant sql_type
Internal_Column.Value column.name new_sql_type_reference new_expression

## PRIVATE
Add an extra cast to adjust the output type of certain operations with certain arguments.
cast_op_type self (op_kind:Text) (args:(Vector Internal_Column)) (expression:SQL_Expression) =
_ = [op_kind, args]
expression

## PRIVATE
prepare_fetch_types_query : SQL_Expression -> Context -> SQL_Statement
prepare_fetch_types_query self expression context =
@@ -41,7 +41,8 @@ fill_hole_default stmt i type_hint value = case value of
big_decimal = NumericConverter.bigIntegerAsBigDecimal value
stmt.setBigDecimal i big_decimal
False -> stmt.setLong i value
_ : Float -> stmt.setDouble i value
_ : Decimal -> stmt.setBigDecimal i value.big_decimal
_ : Float -> stmt.setDouble i value
_ : Text -> stmt.setString i value
_ : Date_Time ->
has_timezone = case type_hint of
@@ -184,6 +184,12 @@ type Snowflake_Dialect
_ = [approximate_result_type, infer_result_type_from_database_callback]
column

## PRIVATE
Add an extra cast to adjust the output type of certain operations with certain arguments.
cast_op_type self (op_kind:Text) (args:(Vector Internal_Column)) (expression:SQL_Expression) =
_ = [op_kind, args]
expression

## PRIVATE
prepare_fetch_types_query : SQL_Expression -> Context -> SQL_Statement
prepare_fetch_types_query self expression context =
@@ -5,6 +5,7 @@ import project.Internal.Storage
import project.Value_Type.Bits
import project.Value_Type.Value_Type

polyglot java import org.enso.table.data.column.builder.BigDecimalBuilder
polyglot java import org.enso.table.data.column.builder.BigIntegerBuilder
polyglot java import org.enso.table.data.column.builder.BoolBuilder
polyglot java import org.enso.table.data.column.builder.DateBuilder
@@ -36,6 +37,11 @@ make_biginteger_builder : Integer -> ProblemAggregator -> BigIntegerBuilder
make_biginteger_builder initial_size java_problem_aggregator=(Missing_Argument.ensure_present "java_problem_aggregator") =
BigIntegerBuilder.new initial_size java_problem_aggregator

## PRIVATE
make_bigdecimal_builder : Integer -> BigDecimalBuilder
make_bigdecimal_builder initial_size =
BigDecimalBuilder.new initial_size

## PRIVATE
make_string_builder : Integer -> Value_Type -> StringBuilder
make_string_builder initial_size value_type=Value_Type.Char =
@@ -36,11 +36,6 @@ public void appendRawNoGrow(BigDecimal value) {
data[currentSize++] = value;
}

@Override
public void append(Object o) {
appendNoGrow(o);
}

@Override
public boolean accepts(Object o) {
return o instanceof BigDecimal;
@@ -1,12 +1,15 @@
package org.enso.table.data.column.builder;

import java.math.BigDecimal;
import java.math.BigInteger;
import java.util.Arrays;

import org.enso.base.polyglot.NumericConverter;
import org.enso.table.data.column.storage.Storage;
import org.enso.table.data.column.storage.numeric.AbstractLongStorage;
import org.enso.table.data.column.storage.numeric.BigIntegerStorage;
import org.enso.table.data.column.storage.type.AnyObjectType;
import org.enso.table.data.column.storage.type.BigDecimalType;
import org.enso.table.data.column.storage.type.BigIntegerType;
import org.enso.table.data.column.storage.type.FloatType;
import org.enso.table.data.column.storage.type.IntegerType;
@@ -37,7 +40,7 @@ public void retypeToMixed(Object[] items) {

@Override
public boolean canRetypeTo(StorageType type) {
return type instanceof FloatType || type instanceof AnyObjectType;
return type instanceof FloatType || type instanceof BigDecimalType || type instanceof AnyObjectType;
}

@Override
@@ -53,6 +56,16 @@ public TypedBuilder retypeTo(StorageType type) {
}
}
return res;
} else if (type instanceof BigDecimalType) {
BigDecimalBuilder res = new BigDecimalBuilder(currentSize);
for (int i = 0; i < currentSize; i++) {
if (data[i] == null) {
res.appendNulls(1);
} else {
res.appendNoGrow(data[i]);
}
}
return res;
} else if (type instanceof AnyObjectType) {
Object[] widenedData = Arrays.copyOf(data, data.length, Object[].class);
ObjectBuilder res = new MixedBuilder(widenedData);
@@ -11,6 +11,7 @@
import org.enso.table.data.column.storage.numeric.AbstractLongStorage;
import org.enso.table.data.column.storage.numeric.BigIntegerStorage;
import org.enso.table.data.column.storage.numeric.DoubleStorage;
import org.enso.table.data.column.storage.type.BigDecimalType;
import org.enso.table.data.column.storage.type.BigIntegerType;
import org.enso.table.data.column.storage.type.BooleanType;
import org.enso.table.data.column.storage.type.FloatType;
@@ -39,12 +40,26 @@ public void retypeToMixed(Object[] items) {

@Override
public boolean canRetypeTo(StorageType type) {
return false;
return type instanceof BigDecimalType;
}

@Override
public TypedBuilder retypeTo(StorageType type) {
throw new UnsupportedOperationException();
if (type instanceof BigDecimalType) {
BigDecimalBuilder res = new BigDecimalBuilder(currentSize);
for (int i = 0; i < currentSize; i++) {
if (isNothing.get(i)) {
res.appendNulls(1);
} else {
double d = Double.longBitsToDouble(data[i]);
BigDecimal bigDecimal = BigDecimal.valueOf(d);
res.appendNoGrow(bigDecimal);
}
}
return res;
} else {
throw new UnsupportedOperationException();
}
}
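The conversion above relies on BigDecimal.valueOf(double), which goes through Double.toString and produces the shortest decimal that rounds back to the same double; the new BigDecimal(double) constructor would instead expose the full binary expansion of the float. A small illustration of the difference (presumably the trade-off behind the "no warning on 0.1 conversion" commit):

```java
import java.math.BigDecimal;

public class DoubleToDecimal {
    public static void main(String[] args) {
        double d = 0.1;
        // Shortest decimal representation that round-trips through double.
        System.out.println(BigDecimal.valueOf(d)); // 0.1
        // Exact binary value actually stored in the double.
        System.out.println(new BigDecimal(d));
        // 0.1000000000000000055511151231257827021181583404541015625
        // The double -> BigDecimal -> double direction is lossless either way.
        System.out.println(BigDecimal.valueOf(d).doubleValue() == d); // true
    }
}
```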

@Override
@@ -9,6 +9,7 @@
import org.enso.base.polyglot.NumericConverter;
import org.enso.base.polyglot.Polyglot_Utils;
import org.enso.table.data.column.storage.Storage;
import org.enso.table.data.column.storage.type.BigDecimalType;
import org.enso.table.data.column.storage.type.BigIntegerType;
import org.enso.table.data.column.storage.type.BooleanType;
import org.enso.table.data.column.storage.type.DateTimeType;
@@ -138,9 +139,7 @@ private record RetypeInfo(Class<?> clazz, StorageType type) {}
new RetypeInfo(Long.class, IntegerType.INT_64),
new RetypeInfo(Double.class, FloatType.FLOAT_64),
new RetypeInfo(String.class, TextType.VARIABLE_LENGTH),
// TODO [RW] I think BigDecimals should not be coerced to floats, we should add Decimal
// support to in-memory tables at some point
// new RetypeInfo(BigDecimal.class, StorageType.FLOAT_64),
new RetypeInfo(BigDecimal.class, BigDecimalType.INSTANCE),
new RetypeInfo(LocalDate.class, DateType.INSTANCE),
new RetypeInfo(LocalTime.class, TimeOfDayType.INSTANCE),
new RetypeInfo(ZonedDateTime.class, DateTimeType.INSTANCE),
@@ -1044,13 +1044,13 @@ add_specs suite_builder setup =
result.to_vector.should_equal [0, 1, 3, 4, 0, -1, -3, -4]
result.name . should_equal "round([x])"

group_builder.specify "should allow round on a float column (to >0 decimal places)" <|
group_builder.specify "should allow round on a "+type.to_text+" column (to >0 decimal places)" <|
table = table_builder [["x", [0.51, 0.59, 3.51, 3.59, -0.51, -0.59, -3.51, -3.59]]]
result = table.at "x" . cast type . round 1
result.to_vector.should_equal [0.5, 0.6, 3.5, 3.6, -0.5, -0.6, -3.5, -3.6]
result.name . should_equal "round([x])"

group_builder.specify "should allow round on a float column (to <0 decimal places)" <|
group_builder.specify "should allow round on a "+type.to_text+" column (to <0 decimal places)" <|
table = table_builder [["x", [51.2, 59.3, 351.45, 359.11, -51.2, -59.3, -351.23, -359.69]]]
result = table.at "x" . cast type . round -1
result.to_vector.should_equal [50.0, 60.0, 350.0, 360.0, -50.0, -60.0, -350.0, -360.0]
@@ -1075,8 +1075,9 @@
result.name . should_equal "floor([x])"

test_floatlike Value_Type.Float
if setup.test_selection.supports_decimal_type then
test_floatlike Value_Type.Decimal
group_builder.specify "should allow round on a Decimal column" pending="https:/enso-org/enso/issues/10344" <|
if setup.test_selection.supports_decimal_type then
test_floatlike Value_Type.Decimal

group_builder.specify "should allow round on an int column" <|
table = table_builder [["x", [1, 9, 31, 39, -1, -9, -31, -39]]]
@@ -1114,7 +1115,7 @@
decimal_col.value_type.is_decimal . should_be_true
decimal_col2 = decimal_col + decimal_col*decimal_col
[(.floor), (.ceil), (.truncate), (x-> x.round 0), (x-> x.round 2)].each op->
op decimal_col2 . to_vector . should_equal [i1 + i1*i1 . to_float]
op decimal_col2 . to_vector . should_equal [i1 + i1*i1]

group_builder.specify "should allow Nothing/NULL" <|
table = table_builder [["x", [Nothing, 0.51, 0.59, 3.51, Nothing, 3.59, -0.51, -0.59, -3.51, -3.59]]]