[CT-3361] Improve Docs Parsing Performance #9037

peterallenwebb · 2023-11-08T16:55:53Z

We've received a complaint that dbt-core's parsing performance is surprisingly slow for large docs files. On an M1 Mac, files of around 500K can take over a minute to parse, and parse time appears to increase super-linearly with file size. The critically slow step is the call to extract_toplevel_blocks() on the file contents. The extraction of top-level jinja blocks could likely be made much faster, but this is extremely critical code and we need to preserve existing behavior.

This does not appear to be a regression, but current performance is embarrassingly bad.
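One way to check the super-linear claim is to time the parse at a few file sizes and estimate the exponent k in time ∝ size^k between consecutive measurements. The sketch below is only an illustration: the helper name scaling_exponents is hypothetical, and it takes whatever sizes and timings you measured yourself.

```python
import math


def scaling_exponents(sizes, times):
    """Estimate k in time ~ size**k between consecutive measurements.

    `sizes` and `times` are parallel lists, ordered by increasing size.
    """
    exponents = []
    pairs = list(zip(sizes, times))
    for (n1, t1), (n2, t2) in zip(pairs, pairs[1:]):
        # Slope of the line through the two points on a log-log plot.
        exponents.append(math.log(t2 / t1) / math.log(n2 / n1))
    return exponents
```

An exponent close to 1 indicates roughly linear growth; values clearly above 1 (for example, near 2 for quadratic behavior) would confirm the super-linear scaling described above.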

To generate a file which reproduces the performance problem, repeat the following snippet a few thousand times in a text file with the .md (markdown) extension, and add it to a dbt project, or call extract_toplevel_blocks() on it directly.

{% docs table_events %}

This table contains clickstream events from the marketing website.

The events in this table are recorded by Snowplow and piped into the warehouse on an hourly basis. The following pages of the marketing site are tracked:
 - /
 - /about
 - /team
 - /contact-us

{% enddocs %}{% docs table_events %}

This table contains clickstream events from the marketing website.

The events in this table are recorded by Snowplow and piped into the warehouse on an hourly basis. The following pages of the marketing site are tracked:
 - /
 - /about
 - /team
 - /contact-us

{% enddocs %}
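The repetition can be scripted rather than done by hand. This is a minimal sketch, assuming the single docs block from the snippet above; the function name write_repro_file, the output file name, and the default repeat count are all hypothetical choices, not part of dbt.

```python
from pathlib import Path

# One docs block, copied from the snippet above.
DOCS_BLOCK = """{% docs table_events %}

This table contains clickstream events from the marketing website.

The events in this table are recorded by Snowplow and piped into the warehouse on an hourly basis. The following pages of the marketing site are tracked:
 - /
 - /about
 - /team
 - /contact-us

{% enddocs %}
"""


def write_repro_file(path, repeats=3000):
    """Write `repeats` copies of DOCS_BLOCK to `path`; return the size in bytes."""
    text = DOCS_BLOCK * repeats
    Path(path).write_text(text, encoding="utf-8")
    return len(text.encode("utf-8"))
```

Calling write_repro_file("large_docs.md") produces a file that can be dropped into a dbt project or passed to extract_toplevel_blocks() directly; adjust repeats to reach the file size you want to test.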

Impact on other teams

None

Needs backport?

Unsure


fredriv · 2024-03-21T13:53:33Z

Is there any progress on this issue? Our dbt docs are about 1M, and a full project parse (dbt parse --no-partial-parse) takes about 2-3 minutes on an M1 Mac.

larssnek · 2024-06-06T16:57:08Z

@aranke here is the issue mentioned during the dbt meetup today on slow documentation parsing. There is also a closed PR that proposed a fix to this.

Hope you will be able to prioritize this 🙏🤩

fredriv · 2024-09-10T13:10:55Z

Here is a flame graph of doing a full parse of our dbt project (~2300 models). Our documentation markdown file is just shy of 1MB.

In the flame graph, extract_toplevel_blocks() takes about 75% of the time of dbt parse.

If we empty out our Markdown docs file and remove all doc references from our config files, the dbt parse runs about 4x faster.

fredriv · 2024-09-10T13:37:39Z

Have replicated the changes in #9045 in a new PR for dbt-common: dbt-labs/dbt-common#189

This change reduces dbt parse for our dbt project from 2m20s to 41s on my M1 Mac.

peterallenwebb · 2024-10-15T22:08:32Z

@fredriv Thank you for keeping us focused on this issue. Because I wanted to make some further tweaks to the implementation, I opened a separate PR and it will soon be merged. I shouted you out in the PR notes.

dbt-labs/dbt-common#205

peterallenwebb added the enhancement, triage, and performance labels Nov 8, 2023

github-actions bot changed the title ~~Improve Docs Parsing Performance~~ [CT-3361] Improve Docs Parsing Performance Nov 8, 2023

dbeatty10 removed the triage label Nov 8, 2023

peterallenwebb mentioned this issue Nov 9, 2023

Make extract_toplevel_blocks() Faster #9045

Closed

5 tasks

graciegoheen assigned peterallenwebb Nov 20, 2023

fredriv mentioned this issue Sep 10, 2024

Make extract_top_level_blocks() faster dbt-labs/dbt-common#189

Closed

5 tasks

peterallenwebb mentioned this issue Oct 15, 2024

Accelerate block tag iteration dbt-labs/dbt-common#205

Merged

5 tasks

peterallenwebb closed this as completed Oct 15, 2024
