Support individual model runs (2 possible implementations) #754

Open
dwreeves opened this issue Dec 8, 2023 · 7 comments
Assignees
Labels
area:config Related to configuration, like YAML files, environment variables, or executer configuration area:execution Related to the execution environment/mode, like Docker, Kubernetes, Local, VirtualEnv, etc area:rendering Related to rendering, like Jinja, Airflow tasks, etc area:selector Related to selector, like DAG selector, DBT selector, etc dbt:run Primarily related to dbt run command or functionality epic-assigned
Milestone

Comments

@dwreeves
Collaborator

dwreeves commented Dec 8, 2023

Context

A common pattern is this: I deploy a code change, and I want to either do a full refresh of an existing model or run a new model so I can see it in production immediately, without waiting for the next scheduled DAG run. I also do not necessarily want to run the entire DAG when I do this.

I've been running dbt inside Airflow since before Astronomer Cosmos existed, and this usage pattern has come up a lot in my experience. Right now, I don't think Astronomer Cosmos supports it very well. The recent addition in #623 is much needed for supporting the pattern I am describing, but it does not go far enough.

Right now, in our Astronomer Cosmos deployment, we have a parameterized DAG that allows for this single-node invocation, but to implement it I needed to subclass the dbt operator and turn everything I wanted to parameterize into templated fields. I think astronomer-cosmos should support this pattern more directly, without requiring that.
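
For concreteness, here is a rough sketch of what that workaround can look like; the class name is made up, and the exact contents of template_fields depend on your Cosmos version:

from cosmos.operators.local import DbtRunLocalOperator

class TemplatedDbtRunOperator(DbtRunLocalOperator):
    # Extend the parent's template_fields so that "{{ params.* }}" strings
    # passed to these arguments get rendered by Airflow at runtime.
    template_fields = (*DbtRunLocalOperator.template_fields, "select", "full_refresh")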

Implementation idea #1 (simpler)

Allow the user to more easily parameterize a DAG with the following additions:

  • Make base_cmd a template field, and allow it to support str types (and if isinstance(base_cmd, str): base_cmd = [base_cmd]). Update: realized this is not necessary; just do "{{ [params.base_cmd] }}".
  • Make select, selector, etc. all template fields.

This will allow the user to set up a simple DAG, parameterized with Param()s, that looks something like this:

with DAG(
    ...,
    params={
        "full_refresh": Param(default=True, ...),
        "dbt_cmd": Param(default="build", ...)
        "select": Param(...)
    },
    render_template_as_native_obj=True
) as dag:
    DbtLocalOperator(
        task_id="dbt_task",
        full_refresh="{{ params.full_refresh }}",
        dbt_cmd=" {{ params.dbt_cmd }}"
        select="{{ params.select }}"
    )

The main benefit of this approach compared to the alternative is that it does not expand the high-level API surface; however, it does add functionality to the existing base operator that would be dangerous to ever deprecate. That said, I do not think the added functionality is hard to support, or even a downside to have: templating these fields could be beneficial to many users.

Implementation idea #2 (more opinionated, easier to set up)

Set up a pre-parameterized "individual run" DAG class for the user.

from cosmos import DbtParametrizedDag

parametrized_dag = DbtParametrizedDag(
    profile_config=...,
    dag_id="dbt_run_manually",
    schedule_interval=None,
    start_date=...
)

There are two advantages to this approach. First, it is easier for the user to set up and would be a supported API. Second, and optionally, it can be implemented without touching the APIs of the dbt base operators, if there is any desire to keep those unchanged.
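
To make the idea concrete, here is a purely illustrative sketch of what such a class could do; DbtParametrizedDag is not a real Cosmos class, and the parameter names and defaults below are assumptions:

from airflow import DAG
from airflow.models.param import Param

class DbtParametrizedDag(DAG):
    def __init__(self, *, profile_config, project_dir=None, **kwargs):
        # Pre-wire the params and native rendering the user would otherwise
        # have to set up by hand.
        kwargs.setdefault("render_template_as_native_obj", True)
        kwargs.setdefault(
            "params",
            {
                "select": Param("", type="string"),
                "full_refresh": Param(False, type="boolean"),
            },
        )
        super().__init__(**kwargs)
        self.profile_config = profile_config
        self.project_dir = project_dir
        # A single pre-parameterized dbt task (e.g. a build-style operator
        # wired to "{{ params.select }}" and "{{ params.full_refresh }}")
        # would be attached to the DAG here.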

@dosubot dosubot bot added area:config Related to configuration, like YAML files, environment variables, or executer configuration area:execution Related to the execution environment/mode, like Docker, Kubernetes, Local, VirtualEnv, etc area:rendering Related to rendering, like Jinja, Airflow tasks, etc area:selector Related to selector, like DAG selector, DBT selector, etc dbt:run Primarily related to dbt run command or functionality labels Dec 8, 2023

dosubot bot commented Dec 8, 2023

🤖

It's great to see your enthusiasm for improving the functionality of Cosmos! Both of your implementation ideas sound promising and could definitely enhance the user experience for running individual model runs.

Implementing these ideas would certainly make it easier for users to work with Cosmos and would be a valuable addition to the platform. Your suggestions show a clear understanding of the current limitations and how to address them.

I encourage you to go ahead and create a pull request with your proposed changes. Your contributions will be greatly appreciated by the Cosmos community. Keep up the great work!



@dwreeves
Collaborator Author

dwreeves commented Dec 9, 2023

  1. An implementation note: this would require running multiple dbt nodes in a single operator. That is feasible, but not a typical pattern in the Cosmos context, and it makes the threads arg in the profiles matter a little more than it currently does.

  2. Come to think of it, the first option would require a "build" operator, or something else, to work nicely. A big reason is that you cannot access --full-refresh via the base operator (unless that API decision were to change), meaning a single static operator type cannot handle everything users would want to run. The one command that could cover it all, dbt build, doesn't have an associated operator.

@joppevos
Contributor

joppevos commented Dec 11, 2023

Great description @dwreeves. I am really interested in this problem! For me, the addition I wrote in #623 has been usable in production. However, I agree that we can improve it even further. I prefer the first solution, as abstracting the parameters into their own DAG seems to offer limited benefit to the end user.

the first option would require a "build" operator, or something else, to work nicely. A big reason is that you cannot access --full-refresh via the base operator

Could you clarify this a bit more? We can only run full-refresh on the Seed and Run operators. Why do we need to access full-refresh from the base operator? The select and full-refresh don't need to be aware of each other, right?

@dwreeves
Collaborator Author

dwreeves commented Dec 11, 2023

Could you clarify this a bit more? We can only run full-refresh on the Seed and Run operators. Why do we need to access full-refresh from the base operator? The select and full-refresh don't need to be aware of each other, right?

The pattern is basically this:

  • Create a normal Airflow DAG (not a cosmos.DbtDag) with a single dbt operator.
  • The DAG is parameterized for the --select and --full-refresh.
  • This operator should be able to run seeds, snapshots, models, tests, etc.

^ Those are the requirements of our little system for manually scheduling dbt nodes.

So how do you achieve this with the current Cosmos API? The answer right now is that you need to subclass a dbt operator, and you'd still need to do so even after adding the template fields.

Alternate approaches and why they don't work:

  • You can use a dbt run operator, but this means you cannot run other node types like seeds that you may want to trigger.
  • You can use the dbt base operator and parametrize the base_cmd, but this means no --full-refresh.

So basically, the only way to meet the requirements of the system is to subclass; template fields alone don't fulfill them.

I think the requirements of the system are reasonable, as per the notes in my original post. It will usually be the case that you are running model nodes, but not always; for example, sometimes a downstream system like a dashboard may be selecting directly from a seed and you need to update the seed mid-day. Or maybe, to save on time and compute, your company has a policy of only running seeds manually. I don't know, but there are various reasons to want to run both seeds and models using the same parametrized DAG.


I'm not necessarily saying the dbt operator should have --full-refresh. That is one option though. Another option is to have an operator for dbt build.
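
For reference, here is a heavily hedged sketch of what the "subclass for dbt build" option can look like; the class name is made up, and the build_and_run_cmd call is modeled on how the seed/run operators pass flags, so the exact signature may differ by Cosmos version:

from cosmos.operators.local import DbtLocalBaseOperator

class MyDbtBuildOperator(DbtLocalBaseOperator):
    base_cmd = ["build"]
    template_fields = (*DbtLocalBaseOperator.template_fields, "select", "full_refresh")

    def __init__(self, full_refresh: bool = False, **kwargs):
        self.full_refresh = full_refresh
        super().__init__(**kwargs)

    def execute(self, context):
        # Mirror the seed/run operators: turn the boolean into a CLI flag and
        # hand it to the shared command builder/runner (assumed signature).
        cmd_flags = ["--full-refresh"] if self.full_refresh else []
        return self.build_and_run_cmd(context=context, cmd_flags=cmd_flags)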

@joppevos
Contributor

joppevos commented Dec 11, 2023

What do you think of keeping the operations separated?
Something like the example below (I still need to check how to get this working):

with DAG(
    ...,
    params={
        "full_refresh": Param(default=True, ...),
        "select": Param(default="+my_models")
    },
) as dag:
    seed = DbtSeedLocalOperator(
        task_id="seed",
        project_dir=CURRENT_DIR,
        profile_config=profile_config,
        full_refresh="{{ params.full_refresh }}",
        select="{{ params.select }}"
    )

    tg = DbtTaskGroup(
        task_id="dbt_task",
        full_refresh="{{ params.full_refresh }}",
        select="{{ params.select }}"
    )
    seed >> tg

You can use a dbt run operator, but this means you cannot run other node types like seeds that you may want to trigger.

This could run whatever operators the person has within the DAG. Then the user does not need to provide a parametrized base_cmd.

A user gives the two requested parameters in the Airflow console. The DAG is still rendered with all nodes, but only the selected nodes run. If the selection contains a seed, it will run the seed; otherwise dbt will skip over it. In my case, I have only seen users wanting to run dbt run from the console, mostly to backfill.

Do you see a strong case for having other dbt commands triggered from the console?

@dwreeves
Collaborator Author

dwreeves commented Jan 7, 2024

Sorry, just getting back to this now that I am looking to implement the templating of these fields.

@joppevos A note: the issue with that code example is that you cannot parameterize the task group. Although Airflow supports dynamically generated tasks, such tasks cannot have inter-dependencies. You could write code that dynamically generates all nodes selected via --select mymodel+, but the nodes would not have proper dependencies among each other.

You can have both a seed operator and a model run operator and pass the same {{ params.select }} to both, and that does work; alternatively, you can just have two separate DAGs. Or, just do what I am currently doing and subclass for dbt build. Those are all valid options.

I do feel that the most sensible way to have a manual run that triggers multiple nodes, and multiple node types, in a one-shot fashion without requiring subclassing is to have an operator that supports dbt build. I also feel this is a good pattern, and I would advocate for it. But I also understand that this is a little more on the niche side, since an operator for dbt build would not be used anywhere other than for this specific use case.


I'm going to split the template fields and either dbt build or a "DbtParametrizedDag" / "DbtBackfillDag" into separate PRs. I think adding template fields should be uncontroversial. I would like maintainers to give their $0.02 on the other question regarding manual run patterns, as the need to run things manually comes up a lot in practical settings, and the need to manually run seeds or snapshots specifically is rarer but still happens.

tatiana added a commit that referenced this issue Feb 29, 2024
This partially addresses #754 by allowing built-in templating
support for the `DbtBaseOperator`.

I also noticed `--full-refresh` was not documented so I added that in.

## Still missing

Manual run pattern is not documented; the fact that these fields are
templated is not documented. I don't really know where in the docs to
put this. The docs are very API-focused more than narrative-based or
suggestive, and Cosmos's maintainers prefer this style of documentation
so it's hard to find a spot for this. It's possible that that's fine and
we just keep this as a feature for more advanced users who dig into
source code to discover for themselves? 🤷

Co-authored-by: Tatiana Al-Chueyr <[email protected]>
@dwreeves
Collaborator Author

dwreeves commented Mar 5, 2024

With 1.4.0, I think we have an acceptable solution that makes this easier for users, and this issue can probably be closed, although I will keep it open for now with one caveat.

The DbtBuildOperator with full_refresh and select as templated fields means that setting up a manual, parametrized DAG which runs arbitrary node types does not require any subclassing. (Users who also want to parametrize the command will still need to subclass, but that's fine.)

The rest of the work to set something like that up (mostly passing params={} and then having a task like DbtBuildOperator(select="{{ params.select }}")) is very idiomatic within the Airflow world, and in my opinion does not require further, potentially un-idiomatic or obtrusive abstraction and simplification.
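
For anyone landing on this issue later, here is a minimal sketch of that pattern, assuming the local execution mode's build operator and a profiles.yml on disk; paths, profile names, and dates are placeholders:

from datetime import datetime

from airflow import DAG
from airflow.models.param import Param
from cosmos import ProfileConfig
from cosmos.operators.local import DbtBuildLocalOperator

profile_config = ProfileConfig(
    profile_name="my_profile",  # placeholder
    target_name="prod",  # placeholder
    profiles_yml_filepath="/path/to/profiles.yml",  # placeholder
)

with DAG(
    dag_id="dbt_run_manually",
    schedule=None,
    start_date=datetime(2024, 1, 1),
    params={
        "select": Param("", type="string"),
        "full_refresh": Param(False, type="boolean"),
    },
    render_template_as_native_obj=True,
) as dag:
    DbtBuildLocalOperator(
        task_id="dbt_build",
        project_dir="/path/to/dbt/project",  # placeholder
        profile_config=profile_config,
        select="{{ params.select }}",
        full_refresh="{{ params.full_refresh }}",
    )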

Here's the caveat for why this issue should perhaps stay open (or, more appropriately, be closed in favor of a new issue): I do think the documentation should cover this pattern, since it is not clear to users (1) that they should even do it in the first place, and (2) how to do it if they are new to Airflow and/or dbt. As I've mentioned elsewhere, there is not a great place to put something like this in the docs, which is part of the issue. Once this is documented, I would consider the issue fully complete.

@tatiana tatiana added this to the 1.5.0 milestone May 17, 2024
@tatiana tatiana added the triage-needed Items need to be reviewed / assigned to milestone label May 17, 2024
@tatiana tatiana self-assigned this May 17, 2024
@tatiana tatiana removed the triage-needed Items need to be reviewed / assigned to milestone label May 17, 2024
@tatiana tatiana modified the milestones: Cosmos 1.5.0, Cosmos 1.6.0 Jun 6, 2024
@tatiana tatiana modified the milestones: Cosmos 1.6.0, Cosmos 1.7.0 Jul 1, 2024
arojasb3 pushed a commit to arojasb3/astronomer-cosmos that referenced this issue Jul 14, 2024
@tatiana tatiana modified the milestones: Cosmos 1.7.0, Triage Sep 20, 2024