
When running a specific model, dbt should not try to create the schemas of all other models #2681

Closed
FurcyPin opened this issue Aug 4, 2020 · 4 comments
Labels: bug (Something isn't working), wontfix (Not a bug or out of scope for dbt-core)

Comments


FurcyPin commented Aug 4, 2020

Describe the bug

When running a specific model with dbt run -m model_1, the command fails because of permission
errors related to other models.

This is super annoying because adding a new model with a configuration error in prod will make ALL the other models fail,
and the error is not caught by the compile step.

Steps To Reproduce

Create two models, model_1 and model_2:

  • model_1 creates a table in project_1.dataset_1.table_1
  • model_2 creates a table in project_2.dataset_2.table_2

where the configured connection does not have rights on project_2.

Then run dbt run -m model_1.

You will get an error like this:

Traceback (most recent call last):
  File "/home/user/dbt/core/dbt/main.py", line 124, in main
    results, succeeded = handle_and_check(args)
  File "/home/user/dbt/core/dbt/main.py", line 202, in handle_and_check
    task, res = run_from_args(parsed)
  File "/home/user/dbt/core/dbt/main.py", line 255, in run_from_args
    results = task.run()
  File "/home/user/dbt/core/dbt/task/runnable.py", line 414, in run
    result = self.execute_with_hooks(selected_uids)
  File "/home/user/dbt/core/dbt/task/runnable.py", line 374, in execute_with_hooks
    self.before_run(adapter, selected_uids)
  File "/home/user/dbt/core/dbt/task/run.py", line 405, in before_run
    self.create_schemas(adapter, selected_uids)
  File "/home/user/dbt/core/dbt/task/runnable.py", line 518, in create_schemas
    create_future.result()
  File "/usr/lib/python3.6/concurrent/futures/_base.py", line 425, in result
    return self.__get_result()
  File "/usr/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/usr/lib/python3.6/concurrent/futures/thread.py", line 56, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/user/dbt/core/dbt/utils.py", line 469, in connected
    return func(*args, **kwargs)
  File "/home/user/dbt/core/dbt/task/runnable.py", line 480, in create_schema
    adapter.create_schema(relation)
  File "/home/user/test_project/venv/lib/python3.6/site-packages/dbt/adapters/bigquery/impl.py", line 298, in create_schema
    self.connections.create_dataset(database, schema)
  File "/home/user/test_project/venv/lib/python3.6/site-packages/dbt/adapters/bigquery/connections.py", line 355, in create_dataset
    self._retry_and_handle(msg='create dataset', conn=conn, fn=fn)
  File "/home/user/test_project/venv/lib/python3.6/site-packages/dbt/adapters/bigquery/connections.py", line 373, in _retry_and_handle
    deadline=None)
  File "/usr/lib/python3.6/contextlib.py", line 99, in __exit__
    self.gen.throw(type, value, traceback)
  File "/home/user/test_project/venv/lib/python3.6/site-packages/dbt/adapters/bigquery/connections.py", line 93, in exception_handler
    self.handle_error(e, message)
  File "/home/user/test_project/venv/lib/python3.6/site-packages/dbt/adapters/bigquery/connections.py", line 77, in handle_error
    raise DatabaseException(error_msg)
dbt.exceptions.DatabaseException: Database Error
  Access Denied: Project project_2: User does not have bigquery.datasets.create permission in project project_2.

Expected behavior

Running dbt run -m model_1 should not try to create the schemas used by model_2.


System information

Which database are you using dbt with?

  • [ ] postgres
  • [ ] redshift
  • [x] bigquery
  • [ ] snowflake
  • [ ] other (specify: ____________)

The output of dbt --version:

installed version: 0.18.0-b1
   latest version: 0.17.2

Your version of dbt is ahead of the latest release!

Plugins:
  - redshift: 0.18.0b1
  - bigquery: 0.18.0b1
  - postgres: 0.18.0b1

The operating system you're using:
Linux Mint 19 (based on ubuntu bionic)

The output of python --version:
Python 3.6.9


FurcyPin added the bug and triage labels Aug 4, 2020
jtcohen6 added the wontfix label and removed the triage label Aug 4, 2020

jtcohen6 commented Aug 4, 2020

Thanks for the detailed writeup, @FurcyPin. I totally get why this is frustrating. While I don't think we're going to prioritize a code change to resolve it, I'll explain our rationale and suggest some other steps you could take to support your specific use case.

Ever since we introduced caching, we try to enable dbt to do as much database introspection and administrivia as possible at the very start of the run. That means grabbing metadata for all schemata/datasets (in all databases/projects) where it plans to materialize models, or creating those schemata/datasets if they do not exist already. That caching happens once at the start of the run—any run—no matter how many models are actually selected to run.

In addition, as practitioners, we hold the opinionated view that:

  • dbt should have the ability to create new datasets/schemas in which it expects to create resources
  • dbt should have full managerial control over the datasets/schemas in which it creates resources
  • dbt projects are coherent entities. Any time dbt is run, it should have the ability to interact with all project resources it knows about—models, seeds, tests, source freshness—based on how they're configured in the current environment.

The purpose of the --models flag is efficiency only; it does not tell dbt to pretend that unselected models don't exist; dbt still compiles the entire project, including the graph of all enabled nodes and the cache of corresponding database metadata. As such, defining models in dbt that it doesn't have permissions to create is an anti-pattern. Configuring those models in an env-aware way, while it requires more code, is a pattern that we're committing to supporting.

So, here are three things you can do today:

  1. There are dedicated macros for controlling env-based behavior for custom databases/projects and schemas/datasets. Check out generate_schema_name and generate_database_name.
  2. Without writing or overriding macros, you can set the database/project and schema/dataset properties of model_2 to dynamic Jinja statements that depend on the target or env vars (using v0.16.0 or later):
models:
  my_other_project_models:
    +project: "{{ 'project_2' if target.name == 'prod' else 'project_1' }}"
    +dataset: "{{ 'dataset_2' if target.name == 'prod' else 'dataset_1' }}"
  3. You can force dbt to ignore an entire subset of models by using the enabled flag. (They'll be missing from the DAG, so this won't work if you still want to build those models' dependencies.) As above, you can set the value of enabled via a dynamic Jinja expression that depends on the target or env vars (using v0.17.1 or newer):
models:
  my_other_project_models:
    +enabled: "{{ ('true' if target.name == 'prod' else 'false') | as_boolean }}"

I'm going to close this issue for now, because we don't have immediate plans to change this behavior. That may change someday, though, if we find we need to question the principles I laid out above.


FurcyPin commented Aug 6, 2020

Thank you @jtcohen6 for the detailed answer.

I do understand and agree with your philosophy of wanting dbt to have full managerial control over the datasets/schemas in which it creates resources.

If you don't mind I would like to provide some more details about the way we are using dbt right now, and perhaps you can help me figure out how to do things differently.

We used to have a single BigQuery project used by our data-engineering team. All the tables within it were managed with dbt and it went well. Recently we migrated all the views of our data-analytics team to dbt. Historically, those tables have been located in another BigQuery project, because we prefer to separate access rights for the two teams.

Since most of the data analysts' tables are built on top of our data engineers' tables, it made sense for us to keep everything in the same dbt project.

So we had a folder structure like this:

models/
├── data_a
│   └── model_2.sql
└── data_e
    └── model_1.sql

And we wrote a custom generate_database_name macro that points each model to data_a_project or data_e_project depending on its parent folder.

Our dbt models are all executed separately, with a dbt run -m model_name command launched by Apache Airflow for each one. The reason is that we believe dbt should not handle the scheduling: if we built all tables with a single dbt run command, we would lose several features that Airflow provides to us, including:

  • Automatic retries of a single command
  • Email alerting on error with the name of the failed model
  • The capacity to resume only the failed parts rather than re-running everything

This is a matter of opinion, but I used to have a tool that was capable of running a whole dependency graph by itself, like dbt, and I have since moved away from that approach: I now believe that scheduling features (DAG execution, retries, failure recovery) should be left to schedulers.
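
To make this concrete, here is a minimal sketch of what one of our per-model DAGs looks like (the model names, paths, schedule, and alert address below are made up, and it assumes Airflow 1.10's BashOperator):

# Minimal sketch (hypothetical names and settings). Assumes Airflow 1.10.x,
# where BashOperator lives in airflow.operators.bash_operator.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "data-engineering",
    "retries": 2,                          # automatic retries of a single model
    "retry_delay": timedelta(minutes=10),
    "email": ["data-alerts@example.com"],  # hypothetical alert address
    "email_on_failure": True,              # the failed task id is the model name
}

with DAG(
    dag_id="dbt_data_e",
    start_date=datetime(2020, 8, 1),
    schedule_interval="0 2 * * *",
    default_args=default_args,
    catchup=False,
) as dag:
    # One Airflow task per dbt model, so only failed models are retried/resumed.
    model_1 = BashOperator(
        task_id="model_1",
        bash_command="cd /opt/dbt_project && dbt run -m model_1",
    )
    model_2 = BashOperator(
        task_id="model_2",
        bash_command="cd /opt/dbt_project && dbt run -m model_2",
    )

    model_1 >> model_2  # dependency declared by hand, mirroring the dbt graph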

Anyway, the reason I called the described behavior "super annoying" is that after we added these new data_a models and deployed them in prod (without scheduling them yet), it caused a production incident where all our pre-existing data_e models failed during the night, because the new data_a models were pointing to a project that dbt didn't yet have rights on (we were planning to test them properly in prod the next morning).
Of course we should not have made that mistake in the first place, but I would have preferred the system to be a little more robust.

Thanks again for your suggestions, and if you have more advice to share in light of this extended context, I would be happy to read it.


jtcohen6 commented Aug 9, 2020

Our dbt models are all executed separately, with a dbt run -m model_name command launched by Apache Airflow for each one. The reason is that we believe dbt should not handle the scheduling

This is an interesting perspective! I imagine there's quite a bit of work involved in updating the Airflow DAG any time a dbt dependency is added or changed?

if we built all tables with a single dbt run command, we would lose several features that Airflow provides to us, including:

  • Automatic retries of a single command
  • Email alerting on error with the name of the failed model
  • The capacity to resume only the failed parts rather than re-running everything

These are things that we're thinking hard about, and we have some progress in the works:

This is a matter of opinion, but I used to have a tool that was capable of running a whole dependency graph by itself, like dbt, and I have since moved away from that approach: I now believe that scheduling features (DAG execution, retries, failure recovery) should be left to schedulers.

I hear you on this broader point: dbt will never be an orchestration tool on par with Airflow/Jenkins/Dagster/Luigi/etc, nor should it ever try to be. I feel that this is mostly true for inter-process dependencies and coordination, however, rather than intra-process dependencies. While Airflow will always know more about the massive DAG of all data processes at your org, it will never know as much about dbt's DAG as dbt does, at least natively.

I do think there's a compelling handoff point whereby dbt is able to give Airflow exactly the information it needs to report on and retry build failures. I'd be curious to hear your feedback about the constructs we're using to get there.


FurcyPin commented Sep 1, 2020

Our dbt models are all executed separately, with a dbt run -m model_name command launched by Apache Airflow for each one. The reason is that we believe dbt should not handle the scheduling

This is an interesting perspective! I imagine there's quite a bit of work involved in updating the Airflow DAG any time a dbt dependency is added or changed?

Yes and no.
To give a little more context, we have nearly 1000 jobs that run every night (around 750 Spark jobs and 250 dbt queries).
They are all interconnected: the dependency graph is roughly one giant connected component.

We could easily make a script that would generate a huge single Airflow DAG, but we don't.
Here are the pros and cons:

Pros of an automatically generated DAG:

  • Less work to write and maintain
  • No risk of mistakes; always up to date when a dependency changes
  • Optimal level of parallelism

Cons:

  • Difficult for human beings to reason about
  • The displayed DAG would probably be super ugly
  • We do have some dependency violations that were made deliberately; an automatically generated DAG would not accept them (or we would have to write weird edge-case exceptions)
  • In order to keep our data workflow more robust, some jobs may still run even when their upstream dependencies have failed. Airflow is super useful for that.

This is why we prefer to write and maintain our Airflow DAGs manually, which does add some overhead to our development process but allows us to split that huge connected component into smaller DAGs that 'make sense' and are more readable (we often add DummyOperators as "flow control points" just to make the DAGs look nicer).

In order to avoid any mismatch between the dependencies declared in the DAGs and the actual dependencies between our jobs,
we wrote a small script that performs a validation check and reports any inconsistencies (missing dependencies, jobs not attached to any DAG, etc.). Depending on the criticality of each validation rule, the script may emit a warning or an error.

Currently this only checks the Spark dependencies, but we plan to integrate it with dbt to be able to check the BigQuery dependencies as well.
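
To sketch what that dbt integration could look like (assuming the manifest.json artifact that dbt writes to target/; the airflow_deps dict below is just a hard-coded stand-in for what our existing script extracts from the DAG files):

# Hypothetical sketch: extract model-to-model dependencies from dbt's
# manifest.json and compare them with the dependencies declared in Airflow.
import json


def dbt_dependencies(manifest_path="target/manifest.json"):
    """Return {model_name: set of upstream model names} from the dbt manifest."""
    with open(manifest_path) as f:
        manifest = json.load(f)
    deps = {}
    for unique_id, node in manifest["nodes"].items():
        if node["resource_type"] != "model":
            continue
        deps[node["name"]] = {
            dep.split(".")[-1]                    # "model.<project>.<name>" -> name
            for dep in node["depends_on"]["nodes"]
            if dep.startswith("model.")
        }
    return deps


def validate(airflow_deps, dbt_deps):
    """Report models missing from the DAGs or missing declared dependencies."""
    problems = []
    for model, upstream in dbt_deps.items():
        declared = airflow_deps.get(model)
        if declared is None:
            problems.append(f"{model} is not scheduled in any DAG")
        else:
            for missing in sorted(upstream - declared):
                problems.append(f"{model}: missing upstream dependency {missing}")
    return problems


if __name__ == "__main__":
    # In reality this would be parsed from our DAG files; hard-coded here.
    airflow_deps = {"model_1": set(), "model_2": {"model_1"}}
    for problem in validate(airflow_deps, dbt_dependencies()):
        print("WARNING:", problem)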

we have some progress in the works [...]

Indeed, storing the results of the last execution might be a way to better handle retries and recovery from errors.
As I said before, I tend to think that these kinds of features belong to the scope of what a scheduler does.

I think many people believe that making a scheduler is easy, but in fact making a good scheduler is a very hard task. This is demonstrated by the huge number of in-house scheduler projects: Yahoo made Oozie, LinkedIn made Azkaban, Facebook made Dataswarm, Criteo made Cuttle, Spotify made Luigi, Airbnb made Airflow...
Until Airflow and Luigi came along, nobody was satisfied with existing schedulers and many thought they could do better... and did not. Even Airflow still leaves plenty of room for improvement.

Just to give an example, even if you add the retry feature to dbt, it will still be lacking many features, like:

  • Handling failures by running specific on-fail tasks, or running a task despite upstream failures
  • Running workflows with non-SQL tasks that dbt can't handle
  • Visualizing what state your execution DAG is currently in (I assume you plan to add this to dbt Cloud, if it's not there already)
  • Monitoring the evolution of the duration and number of retries of your jobs
  • Sending alert emails, etc.
  • Being able to do dirty hacks / custom stuff when nothing else works

The last point is one of the things that makes Airflow great. Being able to declare your DAG with Python code
lets you do all kinds of custom things with it that other schedulers won't let you do.

I do think there's a compelling handoff point whereby dbt is able to give Airflow exactly the information it needs to report on and retry build failures. I'd be curious to hear your feedback about the constructs we're using to get there.

Indeed, there is a middle ground where you could use Airflow to execute dbt run commands that each build more than one model. In a way this would be like a sub-DAG entirely handled by dbt. This is actually the kind of thing we did at my previous job.

We used a tool called flamy that did pretty much what dbt does, except with Hive. We had an in-house Python scheduler that was comparable to an "Airflow without a GUI". We had workflows with multiple Hive queries intertwined with Spark jobs. Flamy wasn't able to automatically manage Spark dependencies, so our scheduler used flamy for all our Hive queries by running commands that looked like flamy run --from source1 [source2 ...] --to dest1 [dest2 ...]. The --from/--to options were great for delimiting a sub-graph and running everything within it. When we added a new job in the middle, it would automatically be run without us having to change our scheduling. One problem was that sometimes new jobs were added outside of this sub-DAG definition, so we forgot to run them.

We also had the same failure-recovery issue: "how to restart only the failed tasks". We solved it differently: instead of storing the result of each Hive query somewhere, we looked at each table's timestamp and compared it with its upstream tables' timestamps. If the data was not up to date, it meant the query had to be re-run. This worked well except for partitioned tables, where we tried to do some super fancy stuff that worked but was probably an over-complicated way to solve the problem.

Anyway, these days I tend to prefer the approach described above: let Airflow run each task, take a little more time to write the DAGs manually (but at least they look nicer and you can do exactly what you want), and use automated checks to avoid human mistakes.

One last thing: I recently participated in a discussion with the Airflow community that might relate to what you are trying to do.

In short, Airflow provides an XCom feature that lets tasks talk to each other, but Airflow currently doesn't provide a feature that lets a task store persistent information for its own future retries. What this means is that if you want to make a DbtOperator that can persist run_results.json and, in case of failure, read it on the next try, then this operator will have to persist it to some external storage (e.g. S3, GCS, Azure Blob Storage, or perhaps some custom dbt Cloud storage), which is doable but makes it less 'ready to use' as a plugin.
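
For what it's worth, here is a rough sketch of that persistence step using Google Cloud Storage (the bucket name, paths, and the run_results.json fields below are assumptions; the file's exact layout depends on the dbt version):

# Hypothetical sketch of what a retry-aware DbtOperator would have to do:
# on a retry, pull the previous attempt's run_results.json from external
# storage and narrow the selection to the models that failed; after the run,
# push the new run_results.json back for the next attempt.
import json
import subprocess

from google.cloud import storage


def run_dbt_with_resume(models, bucket_name="my-dbt-state", key="run_results.json"):
    blob = storage.Client().bucket(bucket_name).blob(key)

    if blob.exists():
        previous = json.loads(blob.download_as_string())
        failed = [
            r["node"]["name"]                 # field names are version-dependent
            for r in previous.get("results", [])
            if r.get("error") or r.get("fail")
        ]
        if failed:
            models = failed

    result = subprocess.run(["dbt", "run", "-m"] + list(models))

    # Persist this attempt's results before surfacing a failure to Airflow,
    # so the next retry can read them.
    blob.upload_from_filename("target/run_results.json")

    if result.returncode != 0:
        raise RuntimeError("dbt run failed; see run_results.json for details")


if __name__ == "__main__":
    run_dbt_with_resume(["model_1", "model_2"])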

There has been ample discussion among the Airflow maintainers about whether they should add such a feature.
