
[CT-344] New context + RefResolver for compile_sql + execute_sql methods #4851

Closed
jtcohen6 opened this issue Mar 10, 2022 · 3 comments
Labels
enhancement New feature or request stale Issues that have gone stale

Comments

@jtcohen6
Contributor

jtcohen6 commented Mar 10, 2022

Motivation

#4641 (comment)

It should be possible to run this code as arbitrary SQL, via the compile_sql or run_sql methods of dbt.lib:

select * 
from {{ metrics.metric(
    metric_name='new_customers',
    grain='week',
    dimensions=['plan', 'country']
) }}

That should run successfully, without needing to manually specify -- depends_on: {{ ref('dim_customers') }}, even though this macro does indeed include a nested reference to dim_customers, which cannot be known at parse time.

If the same code were included in a model, it should still require the manual hint, since it now has implications for DAG ordering; otherwise it fails with "dbt was unable to infer all dependencies..."

Background

The compile_sql + execute_sql methods of dbt.lib render arbitrary dbt-SQL using the context produced by generate_runtime_model_context, which uses RuntimeRefResolver to resolve ref(). The basic codepath: task.sql.GenericSqlRunner.compile → compile_node → _compile_node → _create_node_context.

If RuntimeRefResolver finds a ref at runtime that the current node didn't capture as a dependency at parse time, it raises an error. When defined inside a DAG node, that makes sense! We ask users to explicitly specify refs, to ensure that the dependency is captured at parse time, and the DAG runs in the right order (docs). This all works well.

When compiling/running a block of arbitrary SQL, outside the context of a DAG-running task, it makes a lot less sense. The SQL block simply isn't a DAG node. To that end, we support OperationRefResolver, for dbt run-operation, which does not care about finding a ref it didn't expect to be there (by skipping validate).
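The distinction between the two resolvers can be sketched in miniature. Everything below is a hypothetical stand-in, not dbt-core's actual code: RuntimeStyleResolver plays the role of RuntimeRefResolver (validate every ref against parse-time dependencies), and OperationStyleResolver plays the role of OperationRefResolver (same lookup, validation skipped).

```python
# Toy illustration of the two ref-resolution policies described above.
# Class names and the "analytics" schema are illustrative assumptions.

class MissingDependencyError(Exception):
    pass


class RuntimeStyleResolver:
    """Validates that every ref() was captured at parse time."""

    def __init__(self, parse_time_deps):
        # Dependencies recorded for this node during parsing.
        self.parse_time_deps = set(parse_time_deps)

    def validate(self, ref_name):
        if ref_name not in self.parse_time_deps:
            raise MissingDependencyError(
                f"dbt was unable to infer all dependencies: missing {ref_name}"
            )

    def resolve(self, ref_name):
        self.validate(ref_name)
        return f"analytics.{ref_name}"


class OperationStyleResolver(RuntimeStyleResolver):
    """Same lookup, but validation is skipped entirely."""

    def validate(self, ref_name):
        # Arbitrary SQL isn't a DAG node, so an unexpected ref is fine.
        pass
```

Under this sketch, a ref that was never captured at parse time raises for the runtime-style resolver but resolves cleanly for the operation-style one.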

The catch: OperationRefResolver does not support ref() of ephemeral models:

# In operations, we can't ref() ephemeral nodes, because
# ParsedMacros do not support set_cte
raise_compiler_error(
    "Operations can not ref() ephemeral nodes, but {} is ephemeral".format(
        target_model.name
    ),
    self.model,
)

Operations haven't been able to ref() ephemeral models for a long time (the comment goes back to #2085). This has to do with the compilation behavior of ephemeral models—they're really pointers that get stuck onto the referencing model, to be compiled/run when that model is compiled/run. That doesn't work with ParsedMacro today.

That's an ongoing bummer for dbt run-operation, but it's actually okay in this case! The compile_sql + run_sql methods create CompiledSqlNode (node type: SqlOperation), which inherits from CompiledNode and does understand how to add CTEs.
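To make the CTE behavior concrete, here is a toy sketch of how an ephemeral model's SQL gets "stuck onto" the referencing query at compile time: the ephemeral SQL becomes a CTE and the ref is rewritten to the CTE's name. This assumes nothing about dbt-core's real compiler beyond the __dbt__cte__ naming visible in the compiled output later in this issue; the function is purely illustrative.

```python
# Illustrative sketch (not dbt-core internals) of ephemeral-model
# compilation: inject the ephemeral SQL as a CTE and point the
# referencing query at it.

def inject_ephemeral_cte(referencing_sql, eph_name, eph_sql):
    cte_id = f"__dbt__cte__{eph_name}"
    # Rewrite the literal ref() call to the CTE name (toy substitution).
    pattern = f"{{{{ ref('{eph_name}') }}}}"
    body = referencing_sql.replace(pattern, cte_id)
    return f"with {cte_id} as (\n{eph_sql}\n) {body}"
```

A node that "understands how to add CTEs" is one that can accumulate such injections; a ParsedMacro has nowhere to put them, which is exactly the gap described above.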

Implementation details

I think we could pick one of two paths here:

  1. OperationRefResolver should check the class/node type of self.model (or better yet, hasattr(self.model, "set_cte")) before assuming that ephemeral models cannot be referenced.
  2. The SqlOperation node type ("compile/run this arbitrary SQL") deserves its own provider context, which serves as a halfway point between the standard runtime for DAG nodes and the operation runtime for macros.
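Option 1 might look like the following sketch. All names here (MacroNode, SqlOperationNode, resolve_ephemeral_ref) are hypothetical stand-ins for ParsedMacro, CompiledSqlNode, and the resolver logic; the point is only that the gate becomes a capability check rather than a resolver-type check.

```python
# Hypothetical sketch of option 1: gate the "can't ref ephemeral" error
# on whether the current node can accept a CTE, via hasattr.

class MacroNode:
    """Stand-in for ParsedMacro: no set_cte support."""
    pass


class SqlOperationNode:
    """Stand-in for CompiledSqlNode: knows how to accumulate CTEs."""

    def __init__(self):
        self.extra_ctes = []

    def set_cte(self, cte_id, sql):
        self.extra_ctes.append((cte_id, sql))


def resolve_ephemeral_ref(model, target_name, target_sql):
    # Check the capability, not the node type.
    if not hasattr(model, "set_cte"):
        raise RuntimeError(
            f"Operations can not ref() ephemeral nodes, "
            f"but {target_name} is ephemeral"
        )
    cte_id = f"__dbt__cte__{target_name}"
    model.set_cte(cte_id, target_sql)
    return cte_id
```

With this shape, a SqlOperation node would silently pick up the CTE while a macro operation still gets the existing compiler error.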

In either case, _create_node_context needs to know when and how to create the alternate context. Plumbing that through will be the challenge (and testing this, of course, given that we don't currently have any test coverage in dbt-core for these lib methods). By way of demonstration, here's a janky first stab with minimal code changes: feb37c5. I made the change for both CompiledSqlNode + CompiledRPCNode, since they're very very similar, and dbt-rpc was easier for me to test locally.

Example

With an ephemeral model named ephemeral, and a macro like:

{% macro nested_ref() %}

  {% if execute %}
    select * from {{ ref('ephemeral') }}
  {% endif %}

{% endmacro %}

SQL block:

{{ nested_ref() }}

This works in neither a model nor a macro operation.

Using the RPC run_sql method, before the code change:

Compilation Error in rpc my_first_query (from remote system)
  dbt was unable to infer all dependencies for the model "my_first_query".
  This typically happens when ref() is placed within a conditional block.

  To fix this, add the following hint to the top of the model "my_first_query":

  -- depends_on: {{ ref('ephemeral') }}

  > in macro nested_ref (macros/select_from_eph.sql)
  > called by rpc my_first_query (from remote system)
  > called by rpc my_first_query (from remote system)

After the code change in feb37c5:

with __dbt__cte__ephemeral as (
    select 1 as id
) select * from __dbt__cte__ephemeral

Questions

Should the rule of thumb be: Any code you can run as arbitrary SQL, you could also stick verbatim in a model (DAG node)? Or any code you can run as arbitrary SQL, you could also run as a macro operation?

The proposed answer in this issue is neither: you could ref things without worrying about DAG implications (no dice for models), and you can ref ephemeral models (no dice for macro operations... until we fix that, someday).

Note: there are cases where this may feel awkward. The way the dbt Cloud IDE works today, when code is written in a model, and passed to compile SQL / preview data, it would work—and then would quickly fail, with the exact same code, in a subsequent dbt run. (That feels like an issue worth solving, by compiling that SQL as a model node, rather than a chunk of arbitrary SQL, for lots of other reasons too: dbt-labs/dbt-rpc#46, #3931)

@jtcohen6 jtcohen6 added enhancement New feature or request Team:Language labels Mar 10, 2022
@github-actions github-actions bot changed the title New context + RefResolver for compile_sql + execute_sql methods [CT-344] New context + RefResolver for compile_sql + execute_sql methods Mar 10, 2022
@jtcohen6
Contributor Author

From this week's edition of "DMing with @dave-connors-3":

It's going to be very very common that folks want models/snapshots built off one or more metrics queries. The current metrics package macros will break DAG lineage; we should seek to mend it, following the thrust of this proposal.

Is there any chance this problem is just solved by... [waving hands furiously] ...making it possible to ref metrics, as real DAG nodes? Then the dependency on the underlying model would be passed along, like any parent's parent?

That might solve half of the problem, but leave another half unsolved. Compiling metric code would need to work a bit more like ephemeral model compilation (gross!). And ephemeral models (as explained above) aren't exactly supported by the macro-like use case: "I'm just a visitor to this DAG, not a member."

@github-actions
Contributor

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

@github-actions github-actions bot added the stale Issues that have gone stale label Sep 12, 2022
@github-actions
Contributor

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest; add a comment to notify the maintainers.
