[CT-3361] Improve Docs Parsing Performance #9037

Closed
peterallenwebb opened this issue Nov 8, 2023 · 5 comments · Fixed by dbt-labs/dbt-common#205
Labels
enhancement New feature or request performance

Comments

@peterallenwebb
Contributor

peterallenwebb commented Nov 8, 2023

We've received a complaint that dbt-core's parsing performance is surprisingly slow for large docs files. On an M1 Mac, files of around 500 KB can take over a minute to parse, and parse time appears to increase super-linearly with file size. The critically slow step is the call to extract_toplevel_blocks() on the file contents. The extraction of top-level jinja blocks could likely be made much faster, but this is extremely critical code and we need to preserve existing behavior.

This does not appear to be a regression, but current performance is embarrassingly bad.

To generate a file which reproduces the performance problem, repeat the following snippet a few thousand times in a text file with the .md (markdown) extension, and add it to a dbt project, or call extract_toplevel_blocks() on it directly.

{% docs table_events %}

This table contains clickstream events from the marketing website.

The events in this table are recorded by Snowplow and piped into the warehouse on an hourly basis. The following pages of the marketing site are tracked:
 - /
 - /about
 - /team
 - /contact-us

{% enddocs %}
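
For anyone who wants to reproduce this without hand-editing a file, here is a minimal sketch that builds the repeated text and estimates how parse time grows with input size. The helper names (`build_docs_text`, `time_parser`, `growth_exponent`) are hypothetical and not part of dbt, and the import path in the trailing comment is an assumption that may differ between dbt-core and dbt-common releases.

```python
import math
import time

# The snippet from the issue, verbatim.
SNIPPET = """{% docs table_events %}

This table contains clickstream events from the marketing website.

The events in this table are recorded by Snowplow and piped into the warehouse on an hourly basis. The following pages of the marketing site are tracked:
 - /
 - /about
 - /team
 - /contact-us

{% enddocs %}
"""


def build_docs_text(repeats):
    """Return the snippet repeated `repeats` times as one string."""
    return SNIPPET * repeats


def time_parser(parse, repeat_counts):
    """Time `parse(text)` once per repeat count.

    Returns (sizes_in_characters, elapsed_seconds) as two parallel lists.
    """
    sizes, seconds = [], []
    for count in repeat_counts:
        text = build_docs_text(count)
        start = time.perf_counter()
        parse(text)
        seconds.append(time.perf_counter() - start)
        sizes.append(len(text))
    return sizes, seconds


def growth_exponent(sizes, seconds):
    """Least-squares slope of log(seconds) against log(size).

    A slope near 1 suggests linear scaling; a slope well above 1
    suggests super-linear scaling. All timings must be positive.
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(t) for t in seconds]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


# Example usage (assumes dbt is installed; the import path is an assumption):
#   from dbt.clients.jinja import extract_toplevel_blocks
#   sizes, seconds = time_parser(extract_toplevel_blocks, [250, 500, 1000, 2000])
#   print(growth_exponent(sizes, seconds))
#
# To write a file for a dbt project instead:
#   from pathlib import Path
#   Path("big_docs.md").write_text(build_docs_text(3000))
```

Doubling the repeat count at each step keeps the log-log fit evenly spaced. A single timing per size is noisy, so treat the exponent as a rough indication rather than a measurement.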

Impact on other teams

None

Needs backport?

Unsure

@github-actions github-actions bot changed the title Improve Docs Parsing Performance [CT-3361] Improve Docs Parsing Performance Nov 8, 2023
@dbeatty10 dbeatty10 removed the triage label Nov 8, 2023
@fredriv

fredriv commented Mar 21, 2024

Is there any progress on this issue? Our dbt docs are about 1 MB, and a full project parse (dbt parse --no-partial-parse) takes about 2-3 minutes on an M1 Mac.

@larssnek

larssnek commented Jun 6, 2024

@aranke here is the issue mentioned during the dbt meetup today on slow documentation parsing. There is also a closed PR that proposed a fix to this.

Hope you will be able to prioritize this 🙏🤩

@fredriv

fredriv commented Sep 10, 2024

Here is a flame graph of doing a full parse of our dbt project (~2300 models). Our documentation markdown file is just shy of 1MB.

As you can see, extract_toplevel_blocks() takes about 75% of the time of dbt parse:

[Image: dbt-full-parse-flamegraph]

If we empty out our Markdown docs file and remove all doc references from our config files, the dbt parse runs about 4x faster.

@fredriv

fredriv commented Sep 10, 2024

I have replicated the changes in #9045 in a new PR for dbt-common: dbt-labs/dbt-common#189

This change reduces dbt parse for our dbt project from 2m20s to 41s on my M1 Mac.

@peterallenwebb
Contributor Author

peterallenwebb commented Oct 15, 2024

@fredriv Thank you for keeping us focused on this issue. Because I wanted to make some further tweaks to the implementation, I opened a separate PR and it will soon be merged. I shouted you out in the PR notes.

dbt-labs/dbt-common#205
