
[CT-2315] [Feature] Ignore duplicate edges while building the subgraph for the job queue #7191

Closed
3 tasks done
MatthieuBlais opened this issue Mar 19, 2023 · 12 comments · Fixed by #7192
Assignees
Labels
enhancement (New feature or request), performance

Comments

@MatthieuBlais
Contributor

MatthieuBlais commented Mar 19, 2023

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion

Describe the feature

Context: I'm investigating an issue with the command dbt build that became very slow in our project (1,300+ models and 20,000+ tests). It impacts user productivity (waiting time) and time-sensitive scheduled jobs.

I suspect the function Graph.get_subset_graph can become slow when running dbt build in a project with wide tables and many tests.

Let's consider a simple example:

Source A -----> Model B --------> Model C
     |             |                |
    \/             \/               \/ 
  Test A         Test B           Test C

The function loops through all the graph nodes; for each node that is not in the selected set, it extracts that node's upstream and downstream neighbours, builds the Cartesian product of those two sets, and adds the resulting edges to the subgraph:

for node in self:
    if node not in include_nodes:
        source_nodes = [x for x, _ in new_graph.in_edges(node)]
        target_nodes = [x for _, x in new_graph.out_edges(node)]

        new_edges = product(source_nodes, target_nodes)
        non_cyclic_new_edges = [
            (source, target) for source, target in new_edges if source != target
        ]  # removes cyclic refs

        new_graph.add_edges_from(non_cyclic_new_edges)
        new_graph.remove_node(node)
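
As an aside, for anyone who wants to experiment with this pattern outside of dbt, here is a minimal, self-contained sketch of the same parent-to-child reconnection on a toy networkx DiGraph. The node names and the include set are made up for illustration; this is not dbt source code:

from itertools import product

import networkx as nx

# Toy DAG mirroring the example above (hypothetical node names).
graph = nx.DiGraph()
graph.add_edges_from([
    ("source_a", "model_b"),
    ("model_b", "model_c"),
    ("source_a", "test_a"),
    ("model_b", "test_b"),
    ("model_c", "test_c"),
])

include_nodes = {"source_a", "model_b", "model_c"}
new_graph = graph.copy()

for node in list(graph.nodes):
    if node not in include_nodes:
        parents = [u for u, _ in new_graph.in_edges(node)]
        children = [v for _, v in new_graph.out_edges(node)]
        # Reconnect every parent to every child, skipping self-loops.
        new_graph.add_edges_from(
            (u, v) for u, v in product(parents, children) if u != v
        )
        new_graph.remove_node(node)

print(sorted(new_graph.edges))  # only the model-to-model edges remain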

During dbt build, when evaluating the node Test B (and looking at the test nodes only), source_nodes = [Test A] and target_nodes = [Test C]. So the edge to add to the subgraph is (Test A, Test C).
-> Is Test A considered a source node because of the add_test_edges function in the build task?

Now, let's consider we have 30 tests on Source A and 500 tests on Model C: that gives 15,000 new edges to add to the graph when processing node Test B.
And now let's also consider 100 tests on Model B. All of them have the same source nodes and target nodes, so the same 15,000 edges are added 100 times to the new graph (1,500,000 edges in total). This high number of edges makes new_graph.add_edges_from slow.

Since the same set of edges is generated 100 times over, I think we could avoid the duplicate work by ignoring edges that already exist in the new graph:

new_edges = [
    (source, target)
    for source, target in new_edges
    if source != target and not new_graph.has_edge(source, target)
]  # removes cyclic refs and edges already present in the new graph
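
As a variant (my own sketch, not what the PR does), the deduplication could also happen up front with a set comprehension, reusing product, source_nodes, target_nodes, and new_graph from the snippet above. A DiGraph never stores parallel edges, but add_edges_from still pays a per-edge cost for every duplicate it is handed, so shrinking the candidate list is where the time goes:

# Hedged sketch (not the dbt implementation): deduplicate candidate edges
# before handing them to networkx, so each (source, target) pair appears once.
unique_new_edges = {
    (source, target)
    for source, target in product(source_nodes, target_nodes)
    if source != target and not new_graph.has_edge(source, target)
}  # the set also drops duplicates generated within this batch
new_graph.add_edges_from(unique_new_edges)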

What do you think?

Note that the issue doesn't seem to happen if we run dbt run or dbt test (because source_nodes is empty in these cases).

Describe alternatives you've considered

I don't observe the same behavior when running dbt run or dbt test. So, alternatively, we could run these two commands sequentially instead of dbt build, but that is not quite equivalent.

Who will this benefit?

dbt build ... is a standard CLI command that many dbt users execute.

Are you interested in contributing this feature?

Yes

Anything else?

No response

@MatthieuBlais added the enhancement (New feature or request) and triage labels on Mar 19, 2023
@github-actions bot changed the title from [Feature] Ignore duplicate edges while building the subgraph for the job queue to [CT-2315] [Feature] Ignore duplicate edges while building the subgraph for the job queue on Mar 19, 2023
@ttusing
Contributor

ttusing commented Mar 19, 2023

Awesome! I am experiencing this same issue and have been fiddling with how to address it.

Similar concerns to #6073

I think your evaluation is correct that this is a special case for when there are a lot of tests, especially on wide models.

It also seems especially bad when you are selecting a small subset of your DAG, since it has to run this relatively expensive function to change the edge associations. For me, with ~300 models and ~4k tests, dbt build spins up in seconds, but dbt build --select model is very slow since it has to remove ~4200 nodes.

One efficiency idea I had (I'm trying to implement it but am a bit stuck on setting up my dev environment): sort the nodes by degree and first remove any non-selected nodes that have degree 0 or 1, before doing the edge replacement. Any node with 0 edges can be removed trivially, and any node with 1 edge can also be removed without running the edge replacement, because it has no pairs of connecting nodes to join on removal. Iterating on this before running the current method lets us "trim" the graph down efficiently, assuming that sorting the graph by degree is fast. I can imagine that removing a source model and then removing all of its tests this way is much faster than randomly trimming the tests out in the middle, reconnecting the graph, and only eventually getting back to the source.

Snippet:

# remove nodes that have degree 0 or 1 and are not in include_nodes, then repeat
while True:
    nodes_to_remove = [
        node
        for node in list(new_graph.nodes)  # iterate the pruned graph, not self, so removed nodes are skipped
        if node not in include_nodes and new_graph.degree(node) <= 1
    ]
    if not nodes_to_remove:
        break
    new_graph.remove_nodes_from(nodes_to_remove)
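
To make the effect concrete, here is a small, self-contained toy example (my own illustration with made-up node names, not dbt code) where the leaf tests and the degree-1 source fall away without computing any parent-by-child edge product:

import networkx as nx

# Toy DAG: one selected model with three leaf tests hanging off it.
new_graph = nx.DiGraph([
    ("source_a", "model_b"),
    ("model_b", "test_1"),
    ("model_b", "test_2"),
    ("model_b", "test_3"),
])
include_nodes = {"model_b"}

while True:
    nodes_to_remove = [
        node
        for node in list(new_graph.nodes)
        if node not in include_nodes and new_graph.degree(node) <= 1
    ]
    if not nodes_to_remove:
        break
    new_graph.remove_nodes_from(nodes_to_remove)

# The three leaf tests and the degree-1 source are trimmed in one pass.
print(sorted(new_graph.nodes))  # ['model_b']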

Similarly, I wonder if this issue is still present, which might be doubling the number of edges needlessly: #6073 (comment)

Thanks for the work, keeping an eye on this, and I will try out your PR if I can get my dev setup working :).

@ttusing
Contributor

ttusing commented Mar 20, 2023

Added #7195, and I've got my dev environment and PR up. Excited for multiple possible efficiency gains on this!

@MatthieuBlais
Contributor Author

Thanks for linking the issues @ttusing, I had not seen them.

I think the idea of trimming the graph is the way to go, as it is scalable. Preventing duplicate edges alone doesn't stop the runtime of dbt build from slowly increasing over time as the number of nodes grows. I don't know enough about the source code to evaluate the impact of your change, though, so I'll leave it to the maintainers.

That said, I've tested your changes and they look promising!

dbt build runtime in my project:

  • Current: ~10min
  • With this PR only: ~3min
  • With this PR and your PR: ~1:45min
  • With your PR only: ~1:45min

So I think if your change doesn't have any side effect, we could keep yours only!

@ttusing
Contributor

ttusing commented Mar 20, 2023

> So I think if your change doesn't have any side effect, we could keep yours only!

I think your changes can still improve runtime for some selectors that still require adding edges after removing nodes. For instance, if you have two disconnected subgraphs that are connected in the main DAG, or a model that is re-joined multiple times downstream, this could shave off a few seconds I'd bet.

Although, to be fair, I am still unsure why the subgraph needs to calculate these distant ancestors in the first place 😅. This type of selector comes up rarely enough in my usage patterns that I'm not too concerned about picking it apart.

@ttusing
Contributor

ttusing commented Mar 20, 2023

@MatthieuBlais I added an adjustment to my PR to sort the DAG by node degree before searching, and build time went down to 10 seconds (from 26 seconds on the last commit)! Hopefully this is also helpful for you.

#7194
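
For readers following along, a rough sketch of the idea, borrowing the new_graph and include_nodes names from the get_subset_graph snippet earlier in this thread (this is my illustration, not the code in the PR):

# Hypothetical sketch (the real change is in #7194): visit non-selected nodes
# in ascending degree order, so cheap degree-0/1 removals happen before any
# expensive parent-by-child edge replacement. Degrees shift as nodes are
# removed, so a one-shot sort is a heuristic rather than an exact ordering.
nodes_to_replace = sorted(
    (node for node in new_graph.nodes if node not in include_nodes),
    key=new_graph.degree,
)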

@MatthieuBlais
Contributor Author

> I think your changes can still improve runtime for some selectors that still require adding edges after removing nodes. For instance, if you have two disconnected subgraphs that are connected in the main DAG, or a model that is re-joined multiple times downstream, this could shave off a few seconds I'd bet.

Yes, you're right: after trimming the graph, the model selector matters, and I can see a few additional seconds being shaved off. With your change, the waiting time (excluding partial parsing time) is now down to just a few seconds, which looks awesome!

@jtcohen6
Contributor

@ttusing @MatthieuBlais Awesome conversation! Would it make sense to consolidate this issue with #7195, as well as the two associated PRs (#7192 + #7194)? Or would it still make sense to consider & review them independently?

@jtcohen6 removed the triage label on Mar 22, 2023
@iknox-fa self-assigned this on Mar 22, 2023
@ttusing
Contributor

ttusing commented Mar 22, 2023

> @ttusing @MatthieuBlais Awesome conversation! Would it make sense to consolidate this issue with #7195, as well as the two associated PRs (#7192 + #7194)? Or would it still make sense to consider & review them independently?

Could merge the issues! Looks like #7192 is almost ready to merge to main? Would be awesome to get both of these PRs in to speed up everyone's workflow 🚀

@peterallenwebb
Contributor

@ttusing @MatthieuBlais Thanks so much for your high-quality analysis and coding on this. Our subject-area expert still needs to weigh in on the PRs, but they look good to me. We've known this code was a bottleneck for some use cases, but to my knowledge the Core team hasn't had access to real-world dbt projects that exhibit the issue, which has limited our ability to optimize it. With that in mind, if you are willing and able to share your projects with us, it would be enormously helpful as we monitor and improve performance in real-world scenarios. I can arrange a private channel for that if it's something you can do.

@ttusing
Contributor

ttusing commented Mar 24, 2023

@peterallenwebb Yeah, that sounds tricky. I think our legal team would want to look at this first.

We use dbt Cloud; is that potentially a path?

It might also be helpful to establish a pool of testers to run their own benchmarks and report back on a variety of projects, connectors, machines, envs, etc. Does such a thing already exist?

@boxysean
Contributor

@ttusing, in support of @peterallenwebb's request, I have a little script here that takes in your manifest.json file and outputs a dbt models folder with the same DAG shape, but all models are simply SELECT 1. This project could then be run using (say) dbt-duckdb, for profiling purposes.

Would you be willing to share the output of the script?
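
(boxysean's actual script is at the link above. Purely to illustrate the idea for other readers, a minimal sketch of the approach might look like the following; the target/manifest.json path, the output layout, and the use of each node's depends_on.nodes are my assumptions here, not necessarily what the linked script does:)

import json
from pathlib import Path

# Hypothetical sketch: regenerate a project's DAG shape as `select 1` models
# from manifest.json, so the graph can be profiled without any real SQL.
manifest = json.loads(Path("target/manifest.json").read_text())
out_dir = Path("shape_only_project/models")
out_dir.mkdir(parents=True, exist_ok=True)

nodes = manifest["nodes"]
for node in nodes.values():
    if node.get("resource_type") != "model":
        continue
    # Reference each upstream model so dbt rebuilds the same edges.
    parent_names = [
        nodes[parent]["name"]
        for parent in node.get("depends_on", {}).get("nodes", [])
        if parent in nodes and nodes[parent].get("resource_type") == "model"
    ]
    lines = [f"-- depends_on: {{{{ ref('{name}') }}}}" for name in parent_names]
    lines.append("select 1 as placeholder")
    (out_dir / f"{node['name']}.sql").write_text("\n".join(lines) + "\n")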

@ttusing
Contributor

ttusing commented Mar 24, 2023

I think so, let me run it by our team.
