Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CT-985] [Feature] on-run-end hook and results variable support for source freshness command #5609

Closed
3 tasks done
Tracked by #9099
oravi opened this issue Aug 3, 2022 · 9 comments · Fixed by #9366
Closed
3 tasks done
Tracked by #9099
Assignees
Labels
enhancement New feature or request help_wanted Trickier changes, with a clear starting point, good for previous/experienced contributors

Comments

@oravi
Copy link

oravi commented Aug 3, 2022

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion

Describe the feature

  • Currently dbt source freshness does not support running an on-run-end hook.
  • Other dbt commands like dbt run, dbt test, dbt snapshot and more already support running an on-run-end hook and getting results jinja variable as their context.
  • Using an on-run-end hook and the results variable in the dbt source freshness command, we can upload the source freshness results to the data warehouse. This could be used for displaying source freshness results on a dashboard, monitoring, and alerting.

Describe alternatives you've considered

  • The dbt source freshness command already writes the freshness results to a file called sources.json.
  • A possible alternative is to process this json file to get the source freshness results.
  • This solution requires implementing an external solution that will process this json and automatically upload the results to the data warehouse after every execution of the dbt source freshness command.
  • This alternative is harder to implement and requires implementing code outside of the dbt project.
  • Additionally, this solution requires orchestrating this new tool and giving it access to the same project so it can extract the source freshness results from the sources.json file.
  • Support for on-run-end and results variable will simplify this dramatically, enabling it to be implemented in the dbt project itself.

Who will this benefit?

  • dbt users who rely on on-run-end hook and the results variable to monitor their dbt commands, executions and results.
  • dbt users who want to build dashboards to present their source freshness results.
  • dbt users who rely on dbt packages like Elementary to upload their dbt artifacts and monitor their dbt projects.

Are you interested in contributing this feature?

Yes

Anything else?

No response

@oravi oravi added enhancement New feature or request triage labels Aug 3, 2022
@github-actions github-actions bot changed the title [Feature] on-run-end hook and results variable support for source freshness command [CT-985] [Feature] on-run-end hook and results variable support for source freshness command Aug 3, 2022
@lostmygithubaccount lostmygithubaccount self-assigned this Aug 3, 2022
@jtalmi
Copy link

jtalmi commented Aug 8, 2022

@lostmygithubaccount is something I can contribute to?

@lostmygithubaccount lostmygithubaccount added help_wanted Trickier changes, with a clear starting point, good for previous/experienced contributors and removed triage labels Aug 11, 2022
@lostmygithubaccount
Copy link
Contributor

@jtalmi apologies I missed that comment! I've been messing around with how I check my GitHub notifications and I think it fell through the cracks. We would welcome contribution on this issue and can likely provide some guidance if needed.

@oravi thank you for opening. dbt source freshness is a bit different than the other commands that support on-run-end hooks, but we don't see any reason not to allow this. That said, it's likely not something we can prioritize in the near future and so we would be looking for help with the community (it sounds like @jtalmi might be up for it!). We'll leave this issue open on the backlog and mark it as help wanted.

@lostmygithubaccount lostmygithubaccount removed their assignment Aug 11, 2022
@jtcohen6
Copy link
Contributor

The on-run-* hook + context behavior is defined in the safe_run_hooks, before_run, and after_run methods of the RunTask:

def run_hooks(self, adapter, hook_type: RunHookType, extra_context):
ordered_hooks = self.get_hooks_by_type(hook_type)
# on-run-* hooks should run outside of a transaction. This happens
# b/c psycopg2 automatically begins a transaction when a connection
# is created.
adapter.clear_transaction()
if not ordered_hooks:
return
num_hooks = len(ordered_hooks)
with TextOnly():
fire_event(EmptyLine())
fire_event(HooksRunning(num_hooks=num_hooks, hook_type=hook_type))
startctx = TimestampNamed("node_started_at")
finishctx = TimestampNamed("node_finished_at")
for idx, hook in enumerate(ordered_hooks, start=1):
hook._event_status["started_at"] = datetime.utcnow().isoformat()
hook._event_status["node_status"] = RunningStatus.Started
sql = self.get_hook_sql(adapter, hook, idx, num_hooks, extra_context)
hook_text = "{}.{}.{}".format(hook.package_name, hook_type, hook.index)
hook_meta_ctx = HookMetadata(hook, self.index_offset(idx))
with UniqueID(hook.unique_id):
with hook_meta_ctx, startctx:
fire_event(
PrintHookStartLine(
statement=hook_text,
index=idx,
total=num_hooks,
node_info=hook.node_info,
)
)
with Timer() as timer:
if len(sql.strip()) > 0:
response, _ = adapter.execute(sql, auto_begin=False, fetch=False)
status = response._message
else:
status = "OK"
self.ran_hooks.append(hook)
hook._event_status["finished_at"] = datetime.utcnow().isoformat()
with finishctx, DbtModelState({"node_status": "passed"}):
hook._event_status["node_status"] = RunStatus.Success
fire_event(
PrintHookEndLine(
statement=hook_text,
status=status,
index=idx,
total=num_hooks,
execution_time=timer.elapsed,
node_info=hook.node_info,
)
)
# `_event_status` dict is only used for logging. Make sure
# it gets deleted when we're done with it
del hook._event_status["started_at"]
del hook._event_status["finished_at"]
del hook._event_status["node_status"]
self._total_executed += len(ordered_hooks)
with TextOnly():
fire_event(EmptyLine())
def safe_run_hooks(
self, adapter, hook_type: RunHookType, extra_context: Dict[str, Any]
) -> None:
try:
self.run_hooks(adapter, hook_type, extra_context)
except RuntimeException:
fire_event(DatabaseErrorRunning(hook_type=hook_type.value))
raise

def after_run(self, adapter, results):
# in on-run-end hooks, provide the value 'database_schemas', which is a
# list of unique (database, schema) pairs that successfully executed
# models were in. For backwards compatibility, include the old
# 'schemas', which did not include database information.
database_schema_set: Set[Tuple[Optional[str], str]] = {
(r.node.database, r.node.schema)
for r in results
if r.node.is_relational
and r.status not in (NodeStatus.Error, NodeStatus.Fail, NodeStatus.Skipped)
}
self._total_executed += len(results)
extras = {
"schemas": list({s for _, s in database_schema_set}),
"results": results,
"database_schemas": list(database_schema_set),
}
with adapter.connection_named("master"):
self.safe_run_hooks(adapter, RunHookType.End, extras)

The reason FreshnessTask doesn't run hooks today is that it inherits from the GraphRunnableTask (which doesn't know about on-run-* hooks), rather than the RunTask. To turn on hooks for source freshness, that hook code might need to move back onto GraphRunnableTask to accommodate. If that's the approach taken, CompileTask would need to disable those hook methods in order to preserve existing behavior (= not running on-run-* hooks).

@oravi
Copy link
Author

oravi commented Aug 19, 2022

Thanks @jtcohen6 ! could you please share a bit more context about GraphRunnableTask and RunTask? What's the unique purpose of each one of them? Specifically curious to learn why FreshnessTask does not inherit from RunTask.

Also in case we will go with the suggested approach - as you mentioned I saw that CompileTask inherits from GraphRunnableTask and RunTask inherits from CompileTask. When you say that we need to disable those hooks to avoid running them during compile - you mean we just need to override them inside CompileTask and do nothing inside their implementation (just return). Then in RunTask should we specifically call the implementation of GraphRunnableTask?

Thanks for all your help here!

@elongl
Copy link

elongl commented Jan 6, 2023

Hi @jtcohen6 @lostmygithubaccount,
I'm thinking of picking this task up soon.
I would appreciate a confirmation that this feature could be merged once implemented and reviewed.
Thanks.

@lostmygithubaccount
Copy link
Contributor

hi @elongl, Jeremy or one of the engineers should be able to answer for sure -- I think so, though there are some ongoing internal API refactors for v1.5 that could affect this

@jtcohen6 jtcohen6 self-assigned this Jan 9, 2023
@jtcohen6
Copy link
Contributor

@oravi @elongl Sorry for the long delay getting back to you!

Our current task logic is implemented with a lot of class inheritance, rather than composition, which makes this logic much trickier to follow than it should be. That is something we'd be hoping to improve as part of our API refactors over the next several months, but we haven't touched this part of the codebase just yet.

I think the desirable outcomes here are:

  • Move logic on running hooks up into GraphRunnableTask
  • Use that same logic in FreshnessTask, which should run hooks (this issue)
  • Disable that logic in CompileTask, which shouldn't run hooks

If one of you wanted to open a PR for that, I'd be happy to see it merged!

CompileTask, which shouldn't run hooks

This is status quo & my opinion — does anyone have a different opinion?

@ofek1weiss
Copy link
Contributor

Hi @jtcohen6 👋
I'd be happy to contribute this feature, tho i have a question about the solution you suggested:
As i look at the code now, it seems that more classes inherit from GraphRunnableTask (like CloneTask and ListTask), so your suggested solution of moving the logic to it and disabling it in the inheriting classes sounds a bit messy, so i thought of a few solutions:

  • Make FreshnessTask inherit from RunTask as well, tho there might be a reason i do not fully understand that explains why this cannot be the case.
  • Add a new class to the hierarchy, something like HooksTask that exists lower in the hierarchy tree, and both RunTask and FreshnessTask inherit.

@ofek1weiss
Copy link
Contributor

ofek1weiss commented Dec 31, 2023

Went ahead and opened this PR that implements this suggested solution:

  • Make FreshnessTask inherit from RunTask as well, tho there might be a reason i do not fully understand that explains why this cannot be the case.

Functionally it seems to work fine.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request help_wanted Trickier changes, with a clear starting point, good for previous/experienced contributors
Projects
None yet
7 participants