
[CT-466] [Feature] Make run-operation accept selectors to be able to use the selected_resources Jinja variable #5005

Open
1 task done
b-per opened this issue Apr 7, 2022 · 12 comments
Labels
enhancement New feature or request paper_cut A small change that impacts lots of users in their day-to-day

Comments

@b-per
Contributor

b-per commented Apr 7, 2022

Is there an existing feature request for this?

  • I have searched the existing issues

Describe the Feature

This is a spin-off from #3471. PR #5001 adds selected_resources to the Jinja context, but that variable is currently not available to the run-operation command, which doesn't allow the use of selectors.

The approach I was thinking of following would be to:

  1. allow run-operation to accept selector arguments
  2. move the variables job_queue, _flattened_nodes and previous_state from the GraphRunnableTask to the ManifestTask:

```python
self.job_queue: Optional[GraphQueue] = None
self._flattened_nodes: Optional[List[CompileResultNode]] = None
self.run_count: int = 0
self.num_nodes: int = 0
self.node_results = []
self._skipped_children = {}
self._raise_next_tick = None
self.previous_state: Optional[PreviousState] = None
self.set_previous_state()
```

The impact of this, though, would be that run-operation would then build the graph, which it does not do today, implying a longer runtime for the command.

Another question is what type of parameters we want to allow run-operation to use.

My original thinking is to support all the ones build accepts, but I'd be keen for some feedback. The build flags are:

  • --select
  • --exclude
  • --state
  • --resource-type
  • --selector
  • (and I am not sure about --indirect-selection)
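As a purely illustrative sketch of the selection semantics behind these flags (the function, node shape, and matching logic here are hypothetical stand-ins, not dbt-core's actual API), resolving --select/--exclude into a list of unique IDs might look like:

```python
# Hypothetical sketch: reduce --select/--exclude criteria to a list of
# unique_ids, matching on node name or tag only (real dbt selector
# methods are far richer: graph operators, state:, path:, etc.).

def resolve_selected_resources(nodes, select=None, exclude=None):
    """Return unique_ids matching `select` minus those matching `exclude`."""
    def matches(node, criteria):
        return any(c in node["tags"] or c == node["name"] for c in criteria)

    selected = [n for n in nodes if not select or matches(n, select)]
    if exclude:
        selected = [n for n in selected if not matches(n, exclude)]
    return [n["unique_id"] for n in selected]

nodes = [
    {"unique_id": "model.proj.orders", "name": "orders", "tags": ["finance"]},
    {"unique_id": "model.proj.events", "name": "events", "tags": []},
]
print(resolve_selected_resources(nodes, select=["finance"]))
```

The resulting list is what a run-operation macro would then see as selected_resources.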

Describe alternatives you've considered

Not enabling run-operation to leverage the selected_resources variable.

Who will this benefit?

People who want to write macros and operations that leverage dbt selectors.

Are you interested in contributing this feature?

Yes, I have some draft code already

Anything else?

No response

@b-per b-per added enhancement New feature or request triage labels Apr 7, 2022
@github-actions github-actions bot changed the title [Feature] Make run-operation accept selectors to be able to use the selected_resources Jinja variable [CT-466] [Feature] Make run-operation accept selectors to be able to use the selected_resources Jinja variable Apr 7, 2022
@jtcohen6
Contributor

jtcohen6 commented Apr 7, 2022

Love the issue @b-per! I'll leave some comments below, but I also don't want to stop someone from the Execution team from chiming in with more.

From my perspective, there are two things that "real" dbt tasks have access to today which run-operation (custom user macros) doesn't:

  1. Real selection logic — accessible to the dbt-Jinja context as of Add selected_resources to the Jinja context #5001
  2. Parallelism. There's no simple way to make this work in Jinja, I think... and it's definitely come up as something that would be highly desirable: Multi-threading for dbt_external_tables dbt-external-tables#109

(There's one other distinction between run-operation and other "runnable" tasks, which is highly technical and hard to wrap your head around. Suffice to say, ref works slightly differently, and ephemeral models are unsupported. For an overview: #4851)

What you're getting at feels both like a natural extension of run-operation, and like a slightly different thing: The ability to compile a DAG, and perform multi-threaded processing of the nodes in that DAG, where each node undergoes a custom process, defined by a user-provided macro.

That's honestly not far off from how other tasks work today:

  • dbt source freshness does this with the collect_freshness macro (just sources, not in DAG order)
  • dbt run (seed/snapshot/etc) does this, where the user-provided macro is a materialization
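The per-node, multi-threaded pattern described above could be sketched as follows (a minimal illustration using a thread pool; the function name and node representation are stand-ins, not dbt's actual runner machinery):

```python
# Hedged sketch: apply a user-provided callable (standing in for a
# user-provided macro) to each selected node across a pool of threads.
from concurrent.futures import ThreadPoolExecutor

def run_operation_per_node(nodes, user_macro, threads=4):
    # map preserves input order even though execution is concurrent
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(user_macro, nodes))

results = run_operation_per_node(
    ["model.proj.a", "model.proj.b"],
    lambda node: f"processed {node}",
    threads=2,
)
print(results)
```

Note this ignores DAG ordering; a real runner would only dispatch a node once its parents had completed.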

So there's a lot of existing art here to leverage. The question is, how to factor this code, and present these concepts, in a way that makes sense. (Is this... custom tasks? #2381)

@iknox-fa iknox-fa removed the triage label Apr 7, 2022
@iknox-fa
Contributor

iknox-fa commented Apr 7, 2022

@jtcohen6 pretty much covered this already, but here are a few more thoughts:

> move the variables job_queue, _flattened_nodes and previous_state from the GraphRunnableTask to the ManifestTask

We probably want to make the RunOperationTask inherit from GraphRunnableTask rather than move the vars.
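A minimal sketch of that inheritance idea (class names and attributes are simplified stand-ins for dbt-core internals, not the actual implementation):

```python
# Sketch: RunOperationTask inherits the graph/selection machinery from
# GraphRunnableTask instead of moving those attributes up to ManifestTask.

class ManifestTask:
    def __init__(self):
        self.manifest = None  # loaded project manifest

class GraphRunnableTask(ManifestTask):
    def __init__(self):
        super().__init__()
        self.job_queue = None          # queue of selected nodes in DAG order
        self._flattened_nodes = None   # flattened list of selected nodes
        self.previous_state = None     # state artifacts for state: selectors

class RunOperationTask(GraphRunnableTask):
    """Gains selection/graph state by inheritance rather than by relocation."""
    def run(self, macro_name):
        return f"would run macro {macro_name} with graph context"

task = RunOperationTask()
print(task.run("my_macro"))
```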

> Parallelism. There's no simple way to make this work in Jinja, I think... and it's definitely come up as something that would be highly desirable

This is definitely going to be a thing, and I'd say it's a requirement for adding this as a proper feature to dbt.

That said, it seems like a sound idea and probably quite useful if we can sort out how to keep it parallelized.

@b-per
Contributor Author

b-per commented Apr 8, 2022

I'll admit that my original idea didn't take parallelism into consideration.

The use case I was thinking of at first was an existing run-operation that people might want to filter based on the --selector provided (like running dbt_meta_testing on just a selected set of models, using the power of dbt selectors). In those cases, it is really just one operation looping through the graph, which wouldn't get any value from parallelism since it doesn't even run any SQL.

Should those operations with no (or few) actual queries then be differentiated from operations that rely on many calls to the warehouse and would need parallelism to be performant?

I could help with "1. Real selection logic", but "2. Parallelism" looks like it requires much more work and more design decisions. I think that 1 without 2 could already cover some use cases, but if we want to solve both at the same time and avoid rework, I might just leave this to someone else.

@b-per
Contributor Author

b-per commented Apr 12, 2022

I have been looking at it a bit more and I think there is another design decision to be made:
what kind of resources do we want selected_resources to contain when generated from run-operation?

Currently, we have:

  • test returns tests only
  • run returns models only
  • build returns all nodes + tests, i.e. models, seeds, snapshots and tests

If we want to leverage selected_resources and run-operation for dbt_external_tables, we would most likely want to add the selected sources to the variable.

The question then becomes whether we also want to add the selected metrics and exposures.
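The comparison above can be summarized as follows (an illustrative sketch only; the run-operation entry is the proposal under discussion, not shipped behavior):

```python
# Which resource types each command's selected_resources currently
# contains, per the comment above.
CURRENT_SELECTED_TYPES = {
    "test": {"test"},
    "run": {"model"},
    "build": {"model", "seed", "snapshot", "test"},
}

# Proposed superset for run-operation: build's node types plus sources
# (for use cases like dbt_external_tables), and possibly metrics and
# exposures as well -- that last part is the open question.
PROPOSED_RUN_OPERATION_TYPES = (
    CURRENT_SELECTED_TYPES["build"] | {"source", "metric", "exposure"}
)

print(sorted(PROPOSED_RUN_OPERATION_TYPES))
```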

@jaysobel

jaysobel commented Oct 4, 2022

This would be helpful for the Slim CI in dbt Cloud workflow, which includes a run-operation to zero-copy-clone tables into the temporary CI schema. With 1,000+ models, this takes ~15+ minutes.

If the cloning run-operation could accept selectors then -s 1+state:modified would cut this down to <1 minute for most PRs, and cut down on total CI time immensely.

@jtcohen6 jtcohen6 added the paper_cut A small change that impacts lots of users in their day-to-day label Jan 12, 2023
@jtcohen6
Contributor

Coming back to this, I think it makes sense to treat the two paths I was describing above separately:

  1. run-operation + node selection — this issue
  2. User-space access to threads/parallelism — future, out of scope for this issue

Support for node selection in run-operation, whereby users can access both the graph and the selected_resources variable within the Jinja context, would be a compelling way to unlock/unblock some advanced use cases. We could even do a bit more of the legwork, and avoid some repeated boilerplate, by creating a context member that is just the selected subset of the graph. Pseudo-code: [node for node in graph.nodes.values() if node.unique_id in selected_resources], and the same for sources/exposures/metrics.
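That pseudo-code, run against a mocked graph context (the graph and selected_resources shapes here are illustrative stand-ins; real graph nodes are much richer objects):

```python
# Mocked stand-ins for the dbt-Jinja context members discussed above.
graph = {"nodes": {
    "model.proj.a": {"unique_id": "model.proj.a"},
    "model.proj.b": {"unique_id": "model.proj.b"},
}}
selected_resources = ["model.proj.a"]

# The proposed pre-filtered context member: just the selected subset,
# sparing users the boilerplate of filtering graph.nodes themselves.
selected_nodes = [
    node for node in graph["nodes"].values()
    if node["unique_id"] in selected_resources
]
print([n["unique_id"] for n in selected_nodes])
```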

Support for threads/parallelism. We shouldn't try exposing this to the Jinja context. Instead, we should (a) add built-in support for common use cases where multi-threading is essential, and (b) work toward a future where users can define their own custom tasks/commands (in Python).

@matt-winkler
Contributor

Would it be reasonable to enable an additional configuration on table, snapshot and incremental materialization to clone_from another database location / environment? Going the materialization route enables parallelism. It also feels reasonable that the result of a materialization is in fact a clone.

@dbeatty10
Contributor

@matt-winkler if we enabled this additional configuration for the table materialization,

  1. what might an example model {{ config(...) }} block look like?
  2. what SQL would you imagine being executed as a result of that configuration?
    • Two different examples would be nice, so let's suppose dbt-snowflake and dbt-postgres

@matt-winkler
Contributor

For Snowflake:

```sql
-- in models/fct_orders.sql
{{
    config(
        materialized = 'table',
        tags=['finance'],
        clone_from={'database': 'analytics_mwinkler_dbt_workspace', 'schema': 'dbt_mwinkler'}
    )
}}
```

```sql
-- in macros/materializations/snowflake__create_table_as.sql
{% macro snowflake__create_table_as(temporary, relation, compiled_code, language='sql') -%}
  {%- if language == 'sql' -%}
    {%- set transient = config.get('transient', default=true) -%}
    {%- set clone_from = config.get('clone_from', default=none) -%}
    {%- set cluster_by_keys = config.get('cluster_by', default=none) -%}
    {%- set enable_automatic_clustering = config.get('automatic_clustering', default=false) -%}
    {%- set copy_grants = config.get('copy_grants', default=false) -%}

    {%- if clone_from is not none -%}
      {#
        needs logic to validate clone_from has database / schema config'd
      #}
      {%- set clone_database = clone_from['database'] -%}
      {%- set clone_schema = clone_from['schema'] -%}
    {%- endif -%}

    {%- if cluster_by_keys is not none and cluster_by_keys is string -%}
      {%- set cluster_by_keys = [cluster_by_keys] -%}
    {%- endif -%}
    {%- if cluster_by_keys is not none -%}
      {%- set cluster_by_string = cluster_by_keys|join(", ")-%}
    {% else %}
      {%- set cluster_by_string = none -%}
    {%- endif -%}
    {%- set sql_header = config.get('sql_header', none) -%}

    {{ sql_header if sql_header is not none }}

        create or replace {% if temporary -%}
          temporary
        {%- elif transient -%}
          transient
        {%- endif %} table {{ relation }}
        {% if clone_from -%}
          clone {{ clone_database }}.{{ clone_schema }}.{{ relation.name }};

        {%- else -%}
          {#
            What of the below needs to be applied to a clone? Anything?
          #}
          {%- set contract_config = config.get('contract') -%}
          {%- if contract_config.enforced -%}
            {{ get_assert_columns_equivalent(sql) }}
            {{ get_table_columns_and_constraints() }}
            {% set compiled_code = get_select_subquery(compiled_code) %}
          {% endif %}
          {% if copy_grants and not temporary -%} copy grants {%- endif %} as
          (
            {%- if cluster_by_string is not none -%}
              select * from (
                {{ compiled_code }}
                ) order by ({{ cluster_by_string }})
            {%- else -%}
              {{ compiled_code }}
            {%- endif %}
          );

        {%- endif -%}

      {% if cluster_by_string is not none and not temporary -%}
        alter table {{relation}} cluster by ({{cluster_by_string}});
      {%- endif -%}
      {% if enable_automatic_clustering and cluster_by_string is not none and not temporary  -%}
        alter table {{relation}} resume recluster;
      {%- endif -%}

  {%- elif language == 'python' -%}
    {{ py_write_table(compiled_code=compiled_code, target_relation=relation, temporary=temporary) }}
  {%- else -%}
      {% do exceptions.raise_compiler_error("snowflake__create_table_as macro didn't get supported language, it got %s" % language) %}
  {%- endif -%}

{% endmacro %}
```

@jtcohen6
Contributor

@matt-winkler Is there anything that you prefer about a clone_from config, versus the proposal in #7258?

That proposal shakes out to:

  • a first-class clone materialization
  • a new command, dbt clone, that will clone all selected resources (using the clone materialization instead of their otherwise configured one)

@matt-winkler
Contributor

matt-winkler commented May 26, 2023

Even better @jtcohen6. Zero issues with that, and it will avoid stacking on top of existing materialization logic in a confusing way.

I'll pause on this approach. Let me know if there's anything I can help with to accelerate the other.

I think then the CI process looks like:

```shell
dbt clone -s config.materialized:incremental
dbt build -s @state:modified
```

yeah?


This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

@github-actions github-actions bot added the stale Issues that have gone stale label Feb 24, 2024
@dbeatty10 dbeatty10 removed the stale Issues that have gone stale label Feb 27, 2024