
Add detailed tracking for start-of-invocation parse time #2823

Closed
jtcohen6 opened this issue Oct 8, 2020 · 3 comments · Fixed by #2893
Labels: enhancement (New feature or request), performance

jtcohen6 (Contributor) commented Oct 8, 2020

Describe the feature

  • We know that the time it takes to parse especially big projects is a major pain point, and that this parse time is increasing linearly with project size
  • We plan to dive deep into parsing over the coming months and identify areas for improvement
  • As we roll out changes that promise to decrease parse time, we need the ability to measure empirical treatment effects for active projects in the wild

Proposal: Extend our existing opt-out anonymous usage tracking to present a detailed picture of project size, resource distribution, and start-of-invocation steps with associated execution times.

As far as the implementation goes, I'm taking some indirect inspiration from Snowplow's performance timing context. I'm envisioning a custom structured event that does two things (a rough sketch follows the list below):

  • Splits out start-of-invocation tasks into discrete components—read files from disk, create hologrammed objects, run metadata queries against the database, etc.—and calculates the time it takes to do each of them
  • Includes a dictionary representation of project attributes:
    • Found 1 model, 1 test, 1 snapshot, 1 analysis, 143 macros, 0 operations, 4 seed files, 0 sources
    • Total number of files
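
A rough sketch of what such a structured event could look like, in Python. Everything here is illustrative: the timed_step helper, the field names, and the counts are assumptions, not dbt's actual tracking code or Snowplow schema.

# Hypothetical sketch: timed_step, the field names, and the counts are illustrative,
# not dbt's real tracking code or Snowplow schema.
import time

parse_timings = []

def timed_step(name, func, *args, **kwargs):
    """Run one start-of-invocation step and record its wall-clock duration."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    parse_timings.append({"step": name, "elapsed_s": round(time.perf_counter() - start, 3)})
    return result

# After each step has run (read files, build objects, run metadata queries, ...),
# the event body combines the per-step timings with the project attributes above:
event_payload = {
    "steps": parse_timings,
    "project": {
        "models": 1, "tests": 1, "snapshots": 1, "analyses": 1,
        "macros": 143, "operations": 0, "seeds": 4, "sources": 0,
        "total_files": 150,  # illustrative count
    },
}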

Describe alternatives you've considered

  • Use current tracking to accomplish a crummier version of this. We fire events on invocation start and for each model (if dbt run); we could take the difference between invocation start and the first model event to approximate total start-of-invocation-stuff time.
  • Develop an internal suite of sample projects that represent most possible project permutations (1000s of models, 1000s of schema tests, 1000s of macros, 1000s of docs blocks, etc.). Test potential changes against that suite during CI and hope it translates to real-world projects.
    • I don't think this is an alternative as much as something we should do in addition. We'll still want to know the on-the-ground impact of our changes, no matter how promising they seem in development.

Who will this benefit?

  • Especially large organizations that will have large dbt projects from day 1
  • Mature projects that continue to grow, slowly but surely
jtcohen6 added the enhancement and performance labels on Oct 8, 2020
jtcohen6 added this to the Kiyoshi Kuromiya milestone on Oct 8, 2020
drewbanin (Contributor) commented Oct 8, 2020

@jtcohen6 I've been thinking about this as a breakdown of parse time across node types. Maybe an example payload would be:

[
  {
    "type": "model",
    "parse_time": 10.2,
    "count_parsed": 1000
  },
  {
    "type": "source",
    "parse_time": 2.4,
    "count_parsed": 128
  }
]

I think it might be challenging to break down parse times across file loading / jinja interpolation / data encoding/decoding. Instead, we can use this dataset to understand empirical performance across node types, then drill into specific performance improvements in a testing/sample project using something like cProfile and snakeviz. That's just my 2 cents - you buy it?
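
A minimal sketch of that cProfile + snakeviz workflow, assuming a hypothetical parse_project() stand-in for whatever entry point does the parsing; cProfile and pstats are in the standard library, and snakeviz reads the same .prof file.

import cProfile
import pstats

def parse_project():
    ...  # placeholder for the parsing code under test

# Record a profile to disk, then print the 20 most expensive calls by cumulative time.
cProfile.run("parse_project()", "parse.prof")
pstats.Stats("parse.prof").sort_stats("cumulative").print_stats(20)

# The same .prof file can be explored interactively in the browser:
#   snakeviz parse.prof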

jtcohen6 (Contributor, Author) commented Oct 8, 2020

Totally. I imagined the difficulties here to be reversed: that it'd be reasonable to break down parse time by functional area of the codebase and quite challenging to break it down by specific node. I'm not exactly sure why I thought that, but I'm equally happy with the approach you outline.

kwigley self-assigned this on Oct 29, 2020
kwigley (Contributor) commented Nov 16, 2020

We settled on tracking the higher-level timing info that is already being recorded by #2883.
