Improve traceability of row validations when large number of partitions are validated #1276

Open
sundar-mudupalli-work opened this issue Sep 20, 2024 · 0 comments
Assignees
Labels
good first issue (Good issue for new DVT contributors)
priority: p2 (Medium priority. Fix may not be included in next release, e.g. minor documentation, cleanup)
type: feature request ('Nice-to-have' improvement, new feature or different behavior or design)

Comments

@sundar-mudupalli-work
Collaborator

Hi,

When generate-table-partitions generates YAML files with validations, it is very hard to trace validation output back to a specific YAML file and to a specific validation within that file, since one YAML file can contain validations for multiple partitions. We recommend using BigQuery for validation output and Cloud Run to run the validations. Within BigQuery, we can track validation output to a run id. However, we cannot go from a run id to a specific YAML file or Cloud Run task without either a) looking into the logs of each Cloud Run task or b) figuring out the YAML file from the primary keys reported in the validations.

I am suggesting two changes. The first is for generate-table-partitions: by default, add two labels to each validation, yaml-file (the YAML file name, e.g. 0004.yaml) and source-filter (the filter used on the source). The second label is needed because one YAML file can contain multiple validations and each validation has its own run id. generate-table-partitions can take a --no-labels or -nl option if the user does not want any labels.
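
To make the labeling proposal concrete, here is a minimal sketch of how the two labels might be attached to each validation of a partition config before it is written out. It is not DVT code: the function name `add_traceability_labels` is hypothetical, and the config layout (a `validations` list whose entries carry `filters` with a `source` key and `labels` as `[key, value]` pairs) is an assumption made for illustration.

```python
import copy


def add_traceability_labels(config, yaml_file_name, no_labels=False):
    """Return a copy of a partition config with traceability labels added.

    Hypothetical helper. Assumes `config` is a dict with a "validations"
    list, that each validation may have "filters" entries with a "source"
    key, and that labels are stored as [key, value] pairs.
    """
    labeled = copy.deepcopy(config)
    if no_labels:
        # Mirrors the proposed --no-labels / -nl option: leave config as-is.
        return labeled

    for validation in labeled.get("validations", []):
        filters = validation.get("filters", [])
        # Combine all source-side filter expressions into one label value.
        source_filter = " AND ".join(
            f["source"] for f in filters if "source" in f
        )
        labels = validation.setdefault("labels", [])
        labels.append(["yaml-file", yaml_file_name])
        labels.append(["source-filter", source_filter])
    return labeled


if __name__ == "__main__":
    example = {
        "validations": [
            {"filters": [{"type": "custom", "source": "id < 100", "target": "id < 100"}]},
            {"filters": [{"type": "custom", "source": "id >= 100", "target": "id >= 100"}]},
        ]
    }
    for v in add_traceability_labels(example, "0004.yaml")["validations"]:
        print(v["labels"])
```

Because every validation gets its own source-filter label, two validations from the same YAML file remain distinguishable in the results even though they share the yaml-file label.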

The second change is for configs run: accept a --labels or -l parameter so the user can inject labels when the YAML file is run in Cloud Run, for example: data-validation configs run -l task-exec-id="$CLOUD_RUN_EXECUTION",task-index="$CLOUD_RUN_TASK_INDEX" -cdir ...
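
As a rough illustration of the configs run side, the sketch below parses a comma-separated `key=value` string, as in the example command, and appends the resulting labels to every validation in a loaded config. The names `parse_labels` and `inject_labels` are hypothetical, and the `[key, value]` label layout is again an assumption rather than a description of how DVT handles this.

```python
def parse_labels(arg):
    """Parse "k1=v1,k2=v2" into a list of [key, value] pairs.

    Hypothetical helper for the proposed --labels / -l option.
    """
    labels = []
    if not arg:
        return labels
    for item in arg.split(","):
        key, sep, value = item.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"Label must be key=value, got: {item!r}")
        labels.append([key.strip(), value.strip()])
    return labels


def inject_labels(config, extra_labels):
    """Append run-time labels to every validation in the config (in place)."""
    for validation in config.get("validations", []):
        validation.setdefault("labels", []).extend(
            [list(pair) for pair in extra_labels]
        )
    return config


if __name__ == "__main__":
    import os

    # In a Cloud Run job these variables are set by the platform;
    # the fallbacks are only for running the sketch locally.
    arg = "task-exec-id={},task-index={}".format(
        os.environ.get("CLOUD_RUN_EXECUTION", "local"),
        os.environ.get("CLOUD_RUN_TASK_INDEX", "0"),
    )
    config = {"validations": [{"labels": [["yaml-file", "0004.yaml"]]}, {}]}
    print(inject_labels(config, parse_labels(arg)))
```

Labels injected at run time are appended after any labels already stored in the YAML file, so the yaml-file and source-filter labels from generation would be preserved.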

Sundar Mudupalli

@sundar-mudupalli-work sundar-mudupalli-work added the good first issue Good issue for new DVT contributors label Sep 20, 2024
@helensilva14 helensilva14 added type: feature request 'Nice-to-have' improvement, new feature or different behavior or design. priority: p2 Medium priority. Fix may not be included in next release (e.g. minor documentation, cleanup) labels Sep 20, 2024
@luispavaogoogle luispavaogoogle self-assigned this Oct 4, 2024
3 participants