Improve traceability of row validations when a large number of partitions are validated #1276
Labels: good first issue, priority: p2, type: feature request
Hi,
When generate-table-partitions generates YAML files with validations, it is very hard to trace validation output back to a specific YAML file and to a specific validation within that file, since one YAML file can contain validations for multiple partitions. We recommend that BigQuery be used for validation output and Cloud Run be used to run validations. Within BigQuery, we can trace a validation output to a run id. However, we cannot go from a run id to a specific YAML file or Cloud Run task without either a) looking into the logs of each Cloud Run task or b) figuring out the YAML file from the primary keys reported in the validations.
I am suggesting two changes.
The first is for `generate-table-partitions`: by default, add two labels, yaml-file (for the YAML file name, e.g. 0004.yaml) and source-filter (for the filter used on the source). The second label is needed because one YAML file can contain multiple validations and each validation has its own run id. `generate-table-partitions` can take a `--no-labels` or `-nl` option if the user does not want any labels.
The second is for `configs run`: take a `--labels` or `-l` parameter so the user can inject labels when the YAML file is run in Cloud Run, e.g. `data-validation configs run -l task-exec-id="$CLOUD_RUN_EXECUTION",task-index="$CLOUD_RUN_TASK_INDEX" -cdir ...`
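To make the proposal concrete, here is a minimal Python sketch of how the two kinds of labels could be built and combined. It is only an illustration under stated assumptions: the helper names `default_partition_labels`, `parse_labels` and `merge_labels` are hypothetical and are not existing DVT functions, and the `key=value,key=value` format is taken from the example command above.

```python
from typing import List, Tuple

Label = Tuple[str, str]


def default_partition_labels(
    yaml_file: str, source_filter: str, no_labels: bool = False
) -> List[Label]:
    """Hypothetical helper: labels that generate-table-partitions could
    attach to each validation unless --no-labels / -nl is given."""
    if no_labels:
        return []
    return [("yaml-file", yaml_file), ("source-filter", source_filter)]


def parse_labels(arg: str) -> List[Label]:
    """Hypothetical helper: parse a '-l key1=value1,key2=value2' argument."""
    labels: List[Label] = []
    for item in (part.strip() for part in arg.split(",")):
        if not item:
            continue  # tolerate a trailing comma or empty segment
        # Split on the first '=' only, so values may themselves contain '='.
        key, sep, value = item.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"Label must be key=value, got: {item!r}")
        labels.append((key.strip(), value.strip()))
    return labels


def merge_labels(file_labels: List[Label], run_labels: List[Label]) -> List[Label]:
    """Combine labels stored in the YAML file with labels injected at run
    time. On a key clash the run-time value wins (an assumption)."""
    merged = dict(file_labels)
    merged.update(run_labels)
    return list(merged.items())


if __name__ == "__main__":
    from_file = default_partition_labels("0004.yaml", "id >= 1 AND id < 100")
    from_cli = parse_labels("task-exec-id=exec-abc,task-index=3")
    for key, value in merge_labels(from_file, from_cli):
        print(f"{key}={value}")
```

In this sketch the source-filter label is passed as a ready-made pair rather than through the comma-separated string, because a filter expression can itself contain commas.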
Sundar Mudupalli