-
Notifications
You must be signed in to change notification settings - Fork 159
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support individual model runs (2 possible implementations) #754
Comments
🤖 It's great to see your enthusiasm for improving the functionality of Cosmos! Both of your implementation ideas sound promising and could definitely enhance the user experience for running individual model runs. Implementing these ideas would certainly make it easier for users to work with Cosmos and would be a valuable addition to the platform. Your suggestions show a clear understanding of the current limitations and how to address them. I encourage you to go ahead and create a pull request with your proposed changes. Your contributions will be greatly appreciated by the Cosmos community. Keep up the great work! This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research. Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant. If you want to continue the conversation, start your reply with @dosu-bot. |
|
Great description @dwreeves . I am really interested in this problem! For me the addition I wrote in #623 has been usable in production. However, I agree that we can improve it even further. I prefer the first solution as abstracting the parameters in their own DAG seems limited benefit to the end-user.
Could you clarify this a bit more? We can only run full-refresh on the Seed an RunOperator. Why do we need to access full-refresh from the base operator? The select and full-refresh don't need to be aware of each other right? |
The pattern is basically this:
^ Those are the requirements of our little system for manually scheduling dbt nodes. So how do you achieve this with the current Cosmos API? The answer right now is you need to subclass a dbt operator, and you'd still need to do this even after adding the template fields. Alternate approaches and why they don't work:
So basically, the only way to meet the requirements of the system is to subclass. Template fields alone don't fulfill the requirements of the system. I think the requirements of the system are reasonable, as per the notes in my original post. It will usually be the case that you are running model nodes, but not always; for example, sometimes a downstream system like a dashboard may be selecting directly from a seed and you need to update the seed mid-day. Or maybe, to save on time and compute, your company have a policy of only running seeds manually. I don't know, but there are various reasons to want to run both seeds and models using the same parametrized DAG. I'm not necessarily saying the dbt operator should have |
What do you think of keeping the operations separated? with DAG(
...,
params={
"full_refresh": Param(default=True, ...),
"select": Param(default="+my_models")
},
) as dag:
seed = DbtSeedLocalOperator(
task_id="seed",
project_dir=CURRENT_DIR,
profile_config=profile_config,
full_refresh="{{ params.full_refresh }}",
select="{{ params.select }}"
)
tg = DbtTaskGroup(
task_id="dbt_task",
full_refresh="{{ params.full_refresh }}",
select="{{ params.select }}"
)
seed > tg
This could run whatever operators the person has within the DAG. Then the user does not need to provide a parametrized A user gives the two requested parameters in Airflow Console. The DAG will still be rendered with all nodes, but would only run the selected nodes. If the select contains a seed, then it will run the seed. Otherwise dbt will skip over it. In my case, I only experienced users wanting to run Do you feel a strong case for having other dbt commands being trigger from console? |
Sorry, just getting back to this since I am looking to just implement the templating of these fields. @joppevos A note-- the issue with that code example is that you cannot parametrize the task group. Although Airflow supports dynamically generated tasks, such tasks cannot have inter-dependencies. You could create code that works to dynamically generate all nodes selected via You can have both a seed operator and model run operator and pass to both the same I do feel that the most sensible way to have a manual run that triggers on multiple nodes, and multiple node types in a one-shot fashion that doesn't require subclassing is to have an operator that supports I'm going to split template fields and either |
This partially addresses #754 via allowing for built-in templating support for the `DbtBaseOperator`. I also noticed `--full-refresh` was not documented so I added that in. ## Still missing Manual run pattern is not documented; the fact that these fields are templated is not documented. I don't really know where in the docs to put this. The docs are very API-focused more than narrative-based or suggestive, and Cosmos's maintainers prefer this style of documentation so it's hard to find a spot for this. It's possible that that's fine and we just keep this as a feature for more advanced users who dig into source code to discover for themselves? 🤷 Co-authored-by: Tatiana Al-Chueyr <[email protected]>
With 1.4.0, I think we have an acceptable solution that makes this easier for users, and this issue can probably be closed although I will keep it open for now, with one caveat. The The rest of the work for set something like that up (mostly passing Here's the caveat for maybe why this issue should stay open, or perhaps more appropriately this issue gets closed and we open a new issue: I do think that the documentation should document this pattern, since it is not clear to users (1) that they should even do it in the first place, and (2) how to do it, if they are new to Airflow and/or dbt. As I've mentioned elsewhere, there is not a great place to put something like this in the docs, which is part of the issue. Once this is documented, I would consider the issue fully complete. |
This partially addresses astronomer#754 via allowing for built-in templating support for the `DbtBaseOperator`. I also noticed `--full-refresh` was not documented so I added that in. ## Still missing Manual run pattern is not documented; the fact that these fields are templated is not documented. I don't really know where in the docs to put this. The docs are very API-focused more than narrative-based or suggestive, and Cosmos's maintainers prefer this style of documentation so it's hard to find a spot for this. It's possible that that's fine and we just keep this as a feature for more advanced users who dig into source code to discover for themselves? 🤷 Co-authored-by: Tatiana Al-Chueyr <[email protected]>
Context
A common pattern is-- I will deploy a code change, and I will want to either do a full refresh on an existing model, or run a new model so I can see it immediately in production without waiting on the next scheduled DAG run. Also, I do not necessarily want to run the entire DAG when I do this.
I've been running dbt inside Airflow since prior to Astronomer Cosmos even existing, and this usage pattern has come up a lot in my experience. Right now, I don't think right now that Astronomer Cosmos supports it very well. The recent addition in #623 is much needed for supporting the pattern I am describing, but it does not go far enough.
Right now, in our Astronomer Cosmos deployment, we have a parameterized DAG that allows for this single node invocation, but in order to implement it, I needed to subclass the dbt operator and make everything I wanted into templated fields. I think though that astronomer-cosmos should support this pattern more directly without needing to do that.
Implementation idea #1 (simpler)
Allow the user to more easily parameterize a DAG with the following additions:
base_cmd
a template field,and allow it to supportUpdate: realized this is not necessary; just dostr
types (andif isinstance(base_cmd, str): base_cmd = [base_cmd]
)"{{ [params.base_cmd] }}"
.select
,selector
, etc. all template fields.This will allow the user to set up a simple DAG that passes in
Param()
s that looks something like this:The main benefit to this approach compared to the alternative is it does not expand the high level API surface, however it does add functionality to the existing base operator that would be dangerous to ever deprecate. However I do not think the added functionality is hard to support or is even a downside to have; templating these fields could be beneficial to many users.
Implementation idea #2 (more opinionated, easier to set up)
Set up an "individual run" DAG class for the user, that is pre-parameterized.
There are two advantages to this approach. First, it is easier for the user to set up and would be a supported API. Second and optionally, this can be implemented without impacting the APIs of the dbt base operators, if there is any desire to do that.
The text was updated successfully, but these errors were encountered: