
Support for Teradata's CLOB data type in row validations #1256

Open
karcot1 opened this issue Sep 5, 2024 · 3 comments
Assignees
Labels
priority: p3 Lowest priority. This can be something deprioritized or not time sensitive. type: feature request 'Nice-to-have' improvement, new feature or different behavior or design.

Comments


karcot1 commented Sep 5, 2024

Context: row validations
Source: Teradata
Target: BigQuery

Details:
Table in Teradata has a CLOB column which maps to a String column in BigQuery. When attempting row validations on this table, we are seeing the following error:

[Teradata Database] [Error 9881] Function 'TransUnicodeToUTF8' called with an invalid number or type of parameters

TransUnicodeToUTF8 cannot be applied to non-VARCHAR data types. Casting to VARCHAR normally happens for each column prior to this step, but this column is already a "string" field in Ibis, so the cast is not applied.

Adding support for the CLOB data type specifically would ensure the casting to VARCHAR occurs and this error is bypassed.
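To illustrate (assumption: clob_column and my_table are hypothetical names, and this is a sketch of the expression rather than DVT's exact generated SQL), the fix would have DVT emit an explicit cast so that TransUnicodeToUTF8 receives a supported input type:

```sql
-- Hypothetical Teradata-side expression for a CLOB column: the explicit
-- CAST gives TransUnicodeToUTF8 the VARCHAR input it requires.
SELECT TransUnicodeToUTF8(CAST(clob_column AS VARCHAR(32000)))
FROM my_table;
```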

@helensilva14 helensilva14 added priority: p0 Highest priority. Critical issue. Will be fixed prior to next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. type: feature request 'Nice-to-have' improvement, new feature or different behavior or design. labels Sep 6, 2024
@nehanene15 nehanene15 self-assigned this Sep 9, 2024
Collaborator

nehanene15 commented Sep 11, 2024

The TransUnicodeToUTF8 function requires a VARCHAR(32000) or CHAR data type. We can cast CLOB columns to VARCHAR, but there will be data loss: the maximum number of characters that can be stored in a VARCHAR is 64000, while the maximum for a CLOB is 2097088000. The cast would truncate any CLOB value longer than the VARCHAR limit.

@sundar-mudupalli-work
Collaborator

Hi,

CLOB is a variable-length character data type in Teradata that can hold strings longer than 64000 characters. Being a character data type, CLOB can store both the Latin1 and Unicode character sets. Strings in BigQuery (and other databases) are stored in UTF-8 encoding, so to compare a Teradata CLOB column with a BigQuery STRING column, we need to convert from the Latin1 / Unicode character set to UTF-8.

The only function in Teradata that converts Latin1 / Unicode to UTF-8 is TransUnicodeToUTF8, which takes a VARCHAR as input and outputs a VARBYTE limited to 64000 bytes. This conversion is difficult or impossible to do outside Teradata, because the Latin1 character set used by Teradata is not the same as the ISO/IEC 8859-1 standard. I am not aware of a Teradata function similar to TransUnicodeToUTF8 that takes a CLOB and converts it to a UTF-8 BLOB.

Here are the options to compare CLOB columns in Teradata with BigQuery:

  1. If the largest string in the CLOB column is shorter than 64000 Latin1 (or 32000 Unicode) characters - i.e. it fits in a VARCHAR - then CAST(clob_column AS VARCHAR(64000)) will convert the column to VARCHAR and the comparison can be done easily (you may need to use a custom query).
  2. If the largest string is in the 100-200K character range, it can be split into a number of VARCHAR fields; a similar split can be done on the BigQuery side, and the two sets of columns can be compared (you will need to use a custom query).
  3. If the size is much larger than 200K characters, the above approach is not feasible. We would then need a UDF similar to TransUnicodeToUTF8 that converts a Teradata CLOB to a BLOB while also performing the UTF-8 conversion; this UDF (in C?) would need to build on TransUnicodeToUTF8. We would also need a UDF like hash_sha256 that takes a BLOB and computes a SHA-256 hash, which could be an update to the hash_sha256 we have already created. Unfortunately, I don't have the Teradata expertise to create these functions - happy to guide someone if needed. I do not know if we have anyone in PSO with that level of expertise in Teradata UDFs.
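Option 2 above can be sketched in Python (assumption: this is an illustrative model of the chunk-and-compare idea, not DVT's implementation; the chunk size and names are hypothetical):

```python
import hashlib

# Sketch of the split-into-VARCHAR-sized-pieces idea: cut an oversized
# string into fixed-size chunks so each piece fits in a Teradata VARCHAR,
# then hash each chunk. Applying the same split on the BigQuery side
# yields hash lists that can be compared chunk by chunk.

CHUNK_SIZE = 32000  # Unicode VARCHAR limit in Teradata, in characters

def split_into_chunks(text: str, size: int = CHUNK_SIZE) -> list[str]:
    """Split `text` into consecutive pieces of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)] or [""]

def chunk_hashes(text: str) -> list[str]:
    """SHA-256 each chunk; two strings match iff their hash lists match."""
    return [hashlib.sha256(c.encode("utf-8")).hexdigest()
            for c in split_into_chunks(text)]

source_value = "x" * 100_000  # e.g. a 100K-character CLOB value
target_value = "x" * 100_000  # the BigQuery STRING counterpart

assert chunk_hashes(source_value) == chunk_hashes(target_value)
assert len(split_into_chunks(source_value)) == 4  # ceil(100000 / 32000)
```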

Sundar Mudupalli

@helensilva14 helensilva14 changed the title Support for CLOB data type in row validations Support for Teradata's CLOB data type in row validations Sep 26, 2024
@helensilva14 helensilva14 added priority: p3 Lowest priority. This can be something deprioritized or not time sensitive. and removed type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. priority: p0 Highest priority. Critical issue. Will be fixed prior to next release. labels Sep 26, 2024
Collaborator

helensilva14 commented Sep 26, 2024

Decreasing priority; this issue can be tackled in the future since workarounds are available, including the ones mentioned by Neha and Sundar above as well as the following:

  • Depending on how many characters the CLOB column holds, a concat validation might work.
  • Perform a hash validation on all other columns, excluding the CLOB one via the --exclude-columns flag, and then run a separate concat or comp-fields validation only for the CLOB column. comp-fields makes sense here since it is a one-column validation. This is also what we do for Oracle, because CLOB columns cannot be combined with other columns and need to be validated independently.
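The two-step workaround could look like this (assumptions: connection names, table names, the id primary key, and the clob_col column are placeholders; flag spellings follow the DVT CLI documentation and may differ between versions):

```shell
# Step 1: hash-validate every column except the CLOB one
data-validation validate row \
  --source-conn my_teradata_conn --target-conn my_bq_conn \
  --tables-list database.my_table=project.dataset.my_table \
  --primary-keys id \
  --hash '*' \
  --exclude-columns clob_col

# Step 2: validate the CLOB column on its own as a comparison field
data-validation validate row \
  --source-conn my_teradata_conn --target-conn my_bq_conn \
  --tables-list database.my_table=project.dataset.my_table \
  --primary-keys id \
  --comp-fields clob_col
```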
