
Support for Teradata's CLOB data type in row validations #1256

Open
karcot1 opened this issue Sep 5, 2024 · 3 comments
Assignees
Labels
priority: p3 Lowest priority. This can be something deprioritized or not time sensitive. type: feature request 'Nice-to-have' improvement, new feature or different behavior or design.

Comments


karcot1 commented Sep 5, 2024

Context: row validations
Source: Teradata
Target: BigQuery

Details:
Table in Teradata has a CLOB column which maps to a String column in BigQuery. When attempting row validations on this table, we are seeing the following error:

[Teradata Database] [Error 9881] Function 'TransUnicodeToUTF8' called with an invalid number or type of parameters

TransUnicodeToUTF8 cannot be applied to non-VARCHAR data types. Casting to VARCHAR normally happens for each column prior to this step, but this column is already a "string" field in Ibis, so the cast is not applied.

Adding support for the CLOB data type specifically would ensure the casting to VARCHAR occurs and this error is bypassed.
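To illustrate (assumption: clob_column and my_table are hypothetical names, and this is a sketch of the expression rather than DVT's exact generated SQL), the fix would have DVT emit an explicit cast so that TransUnicodeToUTF8 receives a supported input type:

```sql
-- Hypothetical Teradata-side expression for a CLOB column: the explicit
-- CAST gives TransUnicodeToUTF8 the VARCHAR input it requires.
SELECT TransUnicodeToUTF8(CAST(clob_column AS VARCHAR(32000)))
FROM my_table;
```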

@helensilva14 helensilva14 added priority: p0 Highest priority. Critical issue. Will be fixed prior to next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. type: feature request 'Nice-to-have' improvement, new feature or different behavior or design. labels Sep 6, 2024
@nehanene15 nehanene15 self-assigned this Sep 9, 2024
Collaborator

nehanene15 commented Sep 11, 2024

The TransUnicodeToUTF8 function requires a VARCHAR(32000) or CHAR data type. We can cast CLOB columns to VARCHAR, but there will be data loss: the maximum number of characters that can be stored in a VARCHAR is 64000, while the maximum for a CLOB is 2097088000. The cast would truncate any CLOB value longer than the VARCHAR limit.

@sundar-mudupalli-work
Collaborator

Hi,

CLOB is a variable-length character data type in Teradata that can hold strings longer than 64000 characters. Being a character data type, CLOB can store both the Latin1 and Unicode character sets. Strings in BigQuery (and other databases) are stored in UTF-8 encoding, so to compare a Teradata CLOB column with a BigQuery STRING column, we need to convert from the Latin1 / Unicode character set to UTF-8.

The only function in Teradata that converts Latin1 / Unicode to UTF-8 is TransUnicodeToUTF8, which takes a VARCHAR as input and outputs a VARBYTE limited to 64000 bytes. This conversion is difficult or impossible to do outside Teradata, because the Latin1 character set used by Teradata is not the same as the ISO/IEC 8859-1 standard. I am not aware of a Teradata function similar to TransUnicodeToUTF8 that takes a CLOB and converts it to a UTF-8 BLOB.

Here are the options to compare CLOB columns in Teradata with BigQuery:

  1. If the largest string in the CLOB column is shorter than 64000 Latin1 (or 32000 Unicode) characters - i.e. it fits in a VARCHAR - then CAST(clob_column AS VARCHAR(64000)) will convert the column to VARCHAR and the comparison can be done easily (you may need to use a custom query).
  2. If the largest string is in the 100-200K character range, it can be split into a number of VARCHAR fields; a similar split can be done on the BigQuery side, and the two sets of columns can be compared (you will need to use a custom query).
  3. If the size is much larger than 200K characters, the above approach is not feasible. We would then need a UDF similar to TransUnicodeToUTF8 that converts a Teradata CLOB to a BLOB while also performing the UTF-8 conversion; this UDF (in C?) would need to build on TransUnicodeToUTF8. We would also need a UDF like hash_sha256 that takes a BLOB and computes a SHA-256 hash, which could be an update to the hash_sha256 we have already created. Unfortunately, I don't have the Teradata expertise to create these functions - happy to guide someone if needed. I do not know if we have anyone in PSO with that level of expertise in Teradata UDFs.
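Option 2 above can be sketched in Python (assumption: this is an illustrative model of the chunk-and-compare idea, not DVT's implementation; the chunk size and names are hypothetical):

```python
import hashlib

# Sketch of the split-into-VARCHAR-sized-pieces idea: cut an oversized
# string into fixed-size chunks so each piece fits in a Teradata VARCHAR,
# then hash each chunk. Applying the same split on the BigQuery side
# yields hash lists that can be compared chunk by chunk.

CHUNK_SIZE = 32000  # Unicode VARCHAR limit in Teradata, in characters

def split_into_chunks(text: str, size: int = CHUNK_SIZE) -> list[str]:
    """Split `text` into consecutive pieces of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)] or [""]

def chunk_hashes(text: str) -> list[str]:
    """SHA-256 each chunk; two strings match iff their hash lists match."""
    return [hashlib.sha256(c.encode("utf-8")).hexdigest()
            for c in split_into_chunks(text)]

source_value = "x" * 100_000  # e.g. a 100K-character CLOB value
target_value = "x" * 100_000  # the BigQuery STRING counterpart

assert chunk_hashes(source_value) == chunk_hashes(target_value)
assert len(split_into_chunks(source_value)) == 4  # ceil(100000 / 32000)
```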

Sundar Mudupalli

@helensilva14 helensilva14 changed the title Support for CLOB data type in row validations Support for Teradata's CLOB data type in row validations Sep 26, 2024
@helensilva14 helensilva14 added priority: p3 Lowest priority. This can be something deprioritized or not time sensitive. and removed type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. priority: p0 Highest priority. Critical issue. Will be fixed prior to next release. labels Sep 26, 2024
Collaborator

helensilva14 commented Sep 26, 2024

Decreasing priority; this issue can be tackled in the future since workarounds are available, including the ones mentioned by Neha and Sundar above as well as the following:

  • Depending on how many characters the CLOB column holds, a concat validation might work.
  • Perform a hash validation on all other columns, excluding the CLOB one via the --exclude-columns flag, and then run a separate concat or comp-fields validation only for the CLOB column. comp-fields makes sense here since it is a one-column validation. This is also what we do for Oracle, because CLOB columns cannot be combined with other columns and need to be validated independently.
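The two-step workaround could look like this (assumptions: connection names, table names, the id primary key, and the clob_col column are placeholders; flag spellings follow the DVT CLI documentation and may differ between versions):

```shell
# Step 1: hash-validate every column except the CLOB one
data-validation validate row \
  --source-conn my_teradata_conn --target-conn my_bq_conn \
  --tables-list database.my_table=project.dataset.my_table \
  --primary-keys id \
  --hash '*' \
  --exclude-columns clob_col

# Step 2: validate the CLOB column on its own as a comparison field
data-validation validate row \
  --source-conn my_teradata_conn --target-conn my_bq_conn \
  --tables-list database.my_table=project.dataset.my_table \
  --primary-keys id \
  --comp-fields clob_col
```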
