Support individual model runs (2 possible implementations) #754

Open
dwreeves opened this issue Dec 8, 2023 · 7 comments
Assignees
Labels
area:config Related to configuration, like YAML files, environment variables, or executer configuration area:execution Related to the execution environment/mode, like Docker, Kubernetes, Local, VirtualEnv, etc area:rendering Related to rendering, like Jinja, Airflow tasks, etc area:selector Related to selector, like DAG selector, DBT selector, etc dbt:run Primarily related to dbt run command or functionality epic-assigned
Milestone

Comments

@dwreeves
Collaborator

dwreeves commented Dec 8, 2023

Context

A common pattern is this: I deploy a code change, and I want to either do a full refresh of an existing model or run a new model so I can see it in production immediately, without waiting for the next scheduled DAG run. I also do not necessarily want to run the entire DAG when I do this.

I've been running dbt inside Airflow since before Astronomer Cosmos existed, and this usage pattern has come up a lot in my experience. Right now, I don't think Astronomer Cosmos supports it very well. The recent addition in #623 is much needed for supporting the pattern I am describing, but it does not go far enough.

Right now, in our Astronomer Cosmos deployment, we have a parameterized DAG that allows for this single-node invocation, but to implement it I needed to subclass the dbt operator and turn everything I wanted to parameterize into templated fields. I think astronomer-cosmos should support this pattern more directly, without requiring that.
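
For concreteness, here is a rough sketch of what that workaround can look like; the class name is made up, and the exact contents of template_fields depend on your Cosmos version:

from cosmos.operators.local import DbtRunLocalOperator

class TemplatedDbtRunOperator(DbtRunLocalOperator):
    # Extend the parent's template_fields so that "{{ params.* }}" strings
    # passed to these arguments get rendered by Airflow at runtime.
    template_fields = (*DbtRunLocalOperator.template_fields, "select", "full_refresh")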

Implementation idea #1 (simpler)

Allow the user to more easily parameterize a DAG with the following additions:

  • Make base_cmd a template field, and allow it to support str types (and if isinstance(base_cmd, str): base_cmd = [base_cmd]). Update: realized this is not necessary; just do "{{ [params.base_cmd] }}".
  • Make select, selector, etc. all template fields.

This will allow the user to set up a simple DAG, parameterized with Param()s, that looks something like this:

with DAG(
    ...,
    params={
        "full_refresh": Param(default=True, ...),
        "dbt_cmd": Param(default="build", ...)
        "select": Param(...)
    },
    render_template_as_native_obj=True
) as dag:
    DbtLocalOperator(
        task_id="dbt_task",
        full_refresh="{{ params.full_refresh }}",
        dbt_cmd=" {{ params.dbt_cmd }}"
        select="{{ params.select }}"
    )

The main benefit of this approach compared to the alternative is that it does not expand the high-level API surface; however, it does add functionality to the existing base operator that would be dangerous to ever deprecate. That said, I do not think the added functionality is hard to support, or even a downside to have: templating these fields could be beneficial to many users.

Implementation idea #2 (more opinionated, easier to set up)

Set up a pre-parameterized "individual run" DAG class for the user.

from cosmos import DbtParametrizedDag

parametrized_dag = DbtParametrizedDag(
    profile_config=...,
    dag_id="dbt_run_manually",
    schedule_interval=None,
    start_date=...
)

There are two advantages to this approach. First, it is easier for the user to set up and would be a supported API. Second, and optionally, it can be implemented without touching the APIs of the dbt base operators, if there is any desire to keep those unchanged.
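
To make the idea concrete, here is a purely illustrative sketch of what such a class could do; DbtParametrizedDag is not a real Cosmos class, and the parameter names and defaults below are assumptions:

from airflow import DAG
from airflow.models.param import Param

class DbtParametrizedDag(DAG):
    def __init__(self, *, profile_config, project_dir=None, **kwargs):
        # Pre-wire the params and native rendering the user would otherwise
        # have to set up by hand.
        kwargs.setdefault("render_template_as_native_obj", True)
        kwargs.setdefault(
            "params",
            {
                "select": Param("", type="string"),
                "full_refresh": Param(False, type="boolean"),
            },
        )
        super().__init__(**kwargs)
        self.profile_config = profile_config
        self.project_dir = project_dir
        # A single pre-parameterized dbt task (e.g. a build-style operator
        # wired to "{{ params.select }}" and "{{ params.full_refresh }}")
        # would be attached to the DAG here.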

@dosubot dosubot bot added area:config Related to configuration, like YAML files, environment variables, or executer configuration area:execution Related to the execution environment/mode, like Docker, Kubernetes, Local, VirtualEnv, etc area:rendering Related to rendering, like Jinja, Airflow tasks, etc area:selector Related to selector, like DAG selector, DBT selector, etc dbt:run Primarily related to dbt run command or functionality labels Dec 8, 2023

dosubot bot commented Dec 8, 2023

🤖

It's great to see your enthusiasm for improving the functionality of Cosmos! Both of your implementation ideas sound promising and could definitely enhance the user experience for running individual model runs.

Implementing these ideas would certainly make it easier for users to work with Cosmos and would be a valuable addition to the platform. Your suggestions show a clear understanding of the current limitations and how to address them.

I encourage you to go ahead and create a pull request with your proposed changes. Your contributions will be greatly appreciated by the Cosmos community. Keep up the great work!



@dwreeves
Collaborator Author

dwreeves commented Dec 9, 2023

  1. An implementation note: this would require running multiple dbt nodes in a single operator. That is feasible, but not a typical pattern in the Cosmos context, and it makes the threads arg in the profiles matter a little more than it currently does.

  2. Come to think of it, the first option would require a "build" operator, or something else, to work nicely. A big reason is that you cannot access --full-refresh via the base operator (unless that API decision were to change), meaning a single static operator type cannot handle everything users would want to run. The one command that could cover it all, dbt build, doesn't have an associated operator.

@joppevos
Contributor

joppevos commented Dec 11, 2023

Great description @dwreeves. I am really interested in this problem! For me, the addition I wrote in #623 has been usable in production. However, I agree that we can improve it even further. I prefer the first solution, as abstracting the parameters into their own DAG seems to offer limited benefit to the end user.

the first option would require a "build" operator, or something else, to work nicely. A big reason is that you cannot access --full-refresh via the base operator

Could you clarify this a bit more? We can only run full-refresh on the Seed and Run operators. Why do we need to access full-refresh from the base operator? The select and full-refresh don't need to be aware of each other, right?

@dwreeves
Collaborator Author

dwreeves commented Dec 11, 2023

Could you clarify this a bit more? We can only run full-refresh on the Seed and Run operators. Why do we need to access full-refresh from the base operator? The select and full-refresh don't need to be aware of each other, right?

The pattern is basically this:

  • Create a normal Airflow DAG (not a cosmos.DbtDag) with a single dbt operator.
  • The DAG is parameterized for the --select and --full-refresh.
  • This operator should be able to run seeds, snapshots, models, tests, etc.

^ Those are the requirements of our little system for manually scheduling dbt nodes.

So how do you achieve this with the current Cosmos API? The answer right now is that you need to subclass a dbt operator, and you'd still need to do so even after adding the template fields.

Alternate approaches and why they don't work:

  • You can use a dbt run operator, but this means you cannot run other node types like seeds that you may want to trigger.
  • You can use the dbt base operator and parametrize the base_cmd, but this means no --full-refresh.

So basically, the only way to meet the requirements of the system is to subclass; template fields alone don't fulfill them.

I think the requirements of the system are reasonable, as per the notes in my original post. It will usually be the case that you are running model nodes, but not always; for example, sometimes a downstream system like a dashboard may be selecting directly from a seed and you need to update the seed mid-day. Or maybe, to save on time and compute, your company has a policy of only running seeds manually. I don't know, but there are various reasons to want to run both seeds and models using the same parametrized DAG.


I'm not necessarily saying the dbt operator should have --full-refresh. That is one option though. Another option is to have an operator for dbt build.
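
For reference, here is a heavily hedged sketch of what the "subclass for dbt build" option can look like; the class name is made up, and the build_and_run_cmd call is modeled on how the seed/run operators pass flags, so the exact signature may differ by Cosmos version:

from cosmos.operators.local import DbtLocalBaseOperator

class MyDbtBuildOperator(DbtLocalBaseOperator):
    base_cmd = ["build"]
    template_fields = (*DbtLocalBaseOperator.template_fields, "select", "full_refresh")

    def __init__(self, full_refresh: bool = False, **kwargs):
        self.full_refresh = full_refresh
        super().__init__(**kwargs)

    def execute(self, context):
        # Mirror the seed/run operators: turn the boolean into a CLI flag and
        # hand it to the shared command builder/runner (assumed signature).
        cmd_flags = ["--full-refresh"] if self.full_refresh else []
        return self.build_and_run_cmd(context=context, cmd_flags=cmd_flags)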

@joppevos
Contributor

joppevos commented Dec 11, 2023

What do you think of keeping the operations separated?
Something like the example below (I still need to check how to get this working):

with DAG(
    ...,
    params={
        "full_refresh": Param(default=True, ...),
        "select": Param(default="+my_models")
    },
) as dag:
    seed = DbtSeedLocalOperator(
        task_id="seed",
        project_dir=CURRENT_DIR,
        profile_config=profile_config,
        full_refresh="{{ params.full_refresh }}",
        select="{{ params.select }}"
    )

    tg = DbtTaskGroup(
        task_id="dbt_task",
        full_refresh="{{ params.full_refresh }}",
        select="{{ params.select }}"
    )
    seed >> tg

You can use a dbt run operator, but this means you cannot run other node types like seeds that you may want to trigger.

This could run whatever operators the person has within the DAG. Then the user does not need to provide a parametrized base_cmd.

A user gives the two requested parameters in the Airflow console. The DAG is still rendered with all nodes, but only the selected nodes run. If the selection contains a seed, it will run the seed; otherwise dbt will skip over it. In my case, I have only seen users wanting to run dbt run from the console, mostly to backfill.

Do you see a strong case for having other dbt commands triggered from the console?

@dwreeves
Collaborator Author

dwreeves commented Jan 7, 2024

Sorry, just getting back to this now that I am looking to implement the templating of these fields.

@joppevos A note: the issue with that code example is that you cannot parameterize the task group. Although Airflow supports dynamically generated tasks, such tasks cannot have inter-dependencies. You could write code that dynamically generates all nodes selected via --select mymodel+, but the nodes would not have proper dependencies among each other.

You can have both a seed operator and a model run operator and pass the same {{ params.select }} to both, and that does work; alternatively, you can just have two separate DAGs. Or, just do what I am currently doing and subclass for dbt build. Those are all valid options.

I do feel that the most sensible way to have a manual run that triggers multiple nodes, and multiple node types, in a one-shot fashion without requiring subclassing is to have an operator that supports dbt build. I also feel this is a good pattern, and I would advocate for it. But I also understand that this is a little more on the niche side, since an operator for dbt build would not be used anywhere other than for this specific use case.


I'm going to split the template fields and either dbt build or a "DbtParametrizedDag" / "DbtBackfillDag" into separate PRs. I think adding template fields should be uncontroversial. I would like maintainers to give their $0.02 on the other question regarding manual run patterns, as the need to run things manually comes up a lot in practical settings, and the need to manually run seeds or snapshots specifically is rarer but still happens.

tatiana added a commit that referenced this issue Feb 29, 2024
This partially addresses #754 by allowing built-in templating
support for the `DbtBaseOperator`.

I also noticed `--full-refresh` was not documented so I added that in.

## Still missing

Manual run pattern is not documented; the fact that these fields are
templated is not documented. I don't really know where in the docs to
put this. The docs are very API-focused more than narrative-based or
suggestive, and Cosmos's maintainers prefer this style of documentation
so it's hard to find a spot for this. It's possible that that's fine and
we just keep this as a feature for more advanced users who dig into
source code to discover for themselves? 🤷

Co-authored-by: Tatiana Al-Chueyr <[email protected]>
@dwreeves
Collaborator Author

dwreeves commented Mar 5, 2024

With 1.4.0, I think we have an acceptable solution that makes this easier for users, and this issue can probably be closed, although I will keep it open for now with one caveat.

The DbtBuildOperator with full_refresh and select as templated fields means that setting up a manual, parametrized DAG which runs arbitrary node types does not require any subclassing. (Users who also want to parametrize the command will still need to subclass, but that's fine.)

The rest of the work to set something like that up (mostly passing params={} and then having a task like DbtBuildOperator(select="{{ params.select }}")) is very idiomatic within the Airflow world, and in my opinion does not require further, potentially un-idiomatic or obtrusive abstraction and simplification.
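
For anyone landing on this issue later, here is a minimal sketch of that pattern, assuming the local execution mode's build operator and a profiles.yml on disk; paths, profile names, and dates are placeholders:

from datetime import datetime

from airflow import DAG
from airflow.models.param import Param
from cosmos import ProfileConfig
from cosmos.operators.local import DbtBuildLocalOperator

profile_config = ProfileConfig(
    profile_name="my_profile",  # placeholder
    target_name="prod",  # placeholder
    profiles_yml_filepath="/path/to/profiles.yml",  # placeholder
)

with DAG(
    dag_id="dbt_run_manually",
    schedule=None,
    start_date=datetime(2024, 1, 1),
    params={
        "select": Param("", type="string"),
        "full_refresh": Param(False, type="boolean"),
    },
    render_template_as_native_obj=True,
) as dag:
    DbtBuildLocalOperator(
        task_id="dbt_build",
        project_dir="/path/to/dbt/project",  # placeholder
        profile_config=profile_config,
        select="{{ params.select }}",
        full_refresh="{{ params.full_refresh }}",
    )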

Here's the caveat for why this issue should perhaps stay open (or, more appropriately, be closed in favor of a new issue): I do think the documentation should cover this pattern, since it is not clear to users (1) that they should even do it in the first place, and (2) how to do it if they are new to Airflow and/or dbt. As I've mentioned elsewhere, there is not a great place to put something like this in the docs, which is part of the issue. Once this is documented, I would consider the issue fully complete.

@tatiana tatiana added this to the 1.5.0 milestone May 17, 2024
@tatiana tatiana added the triage-needed Items need to be reviewed / assigned to milestone label May 17, 2024
@tatiana tatiana self-assigned this May 17, 2024
@tatiana tatiana removed the triage-needed Items need to be reviewed / assigned to milestone label May 17, 2024
@tatiana tatiana modified the milestones: Cosmos 1.5.0, Cosmos 1.6.0 Jun 6, 2024
@tatiana tatiana modified the milestones: Cosmos 1.6.0, Cosmos 1.7.0 Jul 1, 2024
arojasb3 pushed a commit to arojasb3/astronomer-cosmos that referenced this issue Jul 14, 2024
@tatiana tatiana modified the milestones: Cosmos 1.7.0, Triage Sep 20, 2024