
Add data processed info into dbt run logs for all statement types #2526

Closed
alepuccetti opened this issue Jun 10, 2020 · 16 comments · Fixed by #2530
Labels
bigquery enhancement New feature or request good_first_issue Straightforward + self-contained changes, good for new contributors!

Comments

@alepuccetti
Contributor

alepuccetti commented Jun 10, 2020

Describe the feature

When using BigQuery, dbt run logs should show the data scanned for create table statements, not just for scripts.

Describe alternatives you've considered

No alternative

Additional context

This specific improvement is about BigQuery. However, it might apply to other databases too; I do not have enough experience with them to give specific examples.

Who will this benefit?

Everyone using dbt who wants easier access to usage information.

@alepuccetti alepuccetti added enhancement New feature or request triage labels Jun 10, 2020
@jtcohen6
Contributor

Very cool idea!

The code for printing out bytes billed is fairly straightforward:
https://github.com/fishtown-analytics/dbt/blob/c9b34681b756ccd4b5c4f0a3bd8f12f48f23081d/plugins/bigquery/dbt/adapters/bigquery/connections.py#L250-L252

I think it would just be a matter of reporting the total_bytes_processed for other query_job statement types.
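The bytes-processed line in the linked code boils down to humanizing `query_job.total_bytes_processed` and appending it to the model's status. A minimal sketch of that idea — the `format_bytes` helper below is illustrative, not dbt's actual function, and the job value is faked rather than read from a real QueryJob:

```python
def format_bytes(num_bytes):
    # Walk up the binary-prefix units until the value is small enough to print.
    for unit in ("Bytes", "KB", "MB", "GB", "TB"):
        if abs(num_bytes) < 1024.0:
            return f"{num_bytes:3.1f} {unit}"
        num_bytes /= 1024.0
    return f"{num_bytes:3.1f} PB"

# A finished BigQuery QueryJob exposes total_bytes_processed;
# here a hard-coded number stands in for it.
total_bytes_processed = 1_234_567_890
status_suffix = f"processed {format_bytes(total_bytes_processed)}"
```

Reporting the same suffix for create table statements would then just mean reading `total_bytes_processed` off those jobs too, as suggested above.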

@alepuccetti Is this something you'd be interested in working on? I think this could be a great opportunity for a first contribution :)

@jtcohen6 jtcohen6 added bigquery good_first_issue Straightforward + self-contained changes, good for new contributors! and removed triage labels Jun 10, 2020
@alepuccetti
Contributor Author

alepuccetti commented Jun 10, 2020

@jtcohen6 Yes, I would be interested. You can assign it to me, if that is the procedure.

@jtcohen6
Contributor

It's all yours! Happy to help if you want more direction or get stuck

@alepuccetti alepuccetti changed the title Add data scanned info into dbt run logs Add data processed info into dbt run logs for all statement types Jun 10, 2020
@drewbanin drewbanin added this to the Marian Anderson milestone Jun 12, 2020
@bodschut
Contributor

bodschut commented Jun 22, 2020

Hi Guys,

While bytes processed is an interesting KPI for BigQuery customers on on-demand pricing, the average slot usage of a query is interesting for customers on a slot commitment.

It is straightforward to calculate from the query job object you get when launching a query: compute the query duration in milliseconds as query_job.ended - query_job.started, then divide query_job.slot_millis by that value.
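As a sketch of that arithmetic, with plain datetimes standing in for the job's started/ended timestamps (the helper name is invented for illustration):

```python
from datetime import datetime, timedelta

def average_slots(slot_millis, started, ended):
    # Query duration in milliseconds, as described above.
    elapsed_ms = (ended - started).total_seconds() * 1000.0
    return slot_millis / elapsed_ms if elapsed_ms else 0.0

started = datetime(2020, 6, 22, 10, 0, 0)
ended = started + timedelta(seconds=10)
# 50,000 slot-milliseconds over a 10,000 ms query -> 5 slots on average.
avg = average_slots(50_000, started, ended)
```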

Could we add this info in the same PR, or do we create a new issue for this?

@jtcohen6
Contributor

@bodschut Thanks for the suggestion! We're packing a lot of info into limited horizontal space right now. Your point has crystallized for me that we should give the user more control.

Here's what makes sense to me: a set of additional configuration parameters in profiles.yml to control what dbt includes in its info output.

config:
  bytes_processed: true | false
  slot_duration: true | false
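To make the proposal concrete, here is a rough sketch of how such flags might gate what goes into the status line. Everything here (`build_status_line`, the dict keys) is hypothetical, not dbt's actual API:

```python
def build_status_line(status, config, job_stats):
    # config mirrors the proposed profiles.yml flags;
    # job_stats stands in for values read off the finished query job.
    parts = [status]
    if config.get("bytes_processed") and "bytes" in job_stats:
        parts.append(f"processed {job_stats['bytes']}")
    if config.get("slot_duration") and "slot_millis" in job_stats:
        parts.append(f"{job_stats['slot_millis']} slot ms")
    return ", ".join(parts)

line = build_status_line(
    "CREATE TABLE (1.0 rows)",
    {"bytes_processed": True, "slot_duration": False},
    {"bytes": "1.1 GB", "slot_millis": 50_000},
)
```

With both flags off, the output collapses back to today's plain status, which keeps the default display uncluttered.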

@drewbanin @alepuccetti What do you think here? I know this throws one more wrench into #2530, though I feel a lot better knowing that we'd be:

  • supporting both BQ pricing mechanisms
  • displaying a less overwhelming amount of information by default

@alepuccetti
Contributor Author

@jtcohen6 Having the user configure it is a good idea. I second that. I am up for implementing the configuration parameters, though it might take a while since I am doing this in my spare time.

About average slot usage: I had a discussion with Google engineers on how to measure slot utilisation, to estimate whether moving to flat-rate pricing is worth it and how to evaluate the performance implications. The answer was that there is no formula; you can only try your workload and see how it performs.
Because of this, I am not convinced that the metric as proposed actually serves the purpose of measuring slot utilisation.
An example: if your model has multiple queries that can run in parallel, it is very difficult to understand what is happening, because queries running at the same time share the slots, but the metric as proposed will not account for this concurrency.

I am in favour of having a slots metric, but I think what it should be and how to interpret it deserve more discussion in a separate issue. For instance, I think even just slot_millis is a good metric.

@bodschut
Contributor

bodschut commented Jun 22, 2020

Hi @alepuccetti

You are right that slot usage is not the only parameter to consider when moving to a slot commitment. But it is an interesting indicator once you are on a slot commitment and need to optimize your slot reservations for certain projects. At that point, knowing which queries consume how many slots on average can help you optimize, decide on changing a certain slot reservation, or even upgrade your commitment with another package of 500 slots.

You could indeed include slot_millis like that, without calculating the query duration, but the average slot usage as calculated above is the best indicator of the 'slot load' of a query. I don't know whether the model duration that dbt logs by default is the actual query duration or whether it includes setup time. If it is the actual query duration, you could calculate the average slots used yourself.

@drewbanin
Contributor

Huh, I think that this is a string templating issue, and I think Jinja could be a good tool to solve that kind of problem :)

I think we should keep the focus narrow for this particular PR, but in the future, I would like to:

  1. Provide a config option that accepts a templated string for the model status
  2. Provide a context that has variables in scope (like job, materialization, status, runtime, etc)

What do y'all think?
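The two points above could be sketched with Jinja along these lines — the template string and context keys here are invented for illustration, not a proposed dbt config format:

```python
from jinja2 import Template

# Hypothetical user-supplied template for the model status line (point 1),
# rendered against a context of in-scope variables (point 2).
status_template = Template(
    "{{ status }} in {{ runtime }}s ({{ job.bytes_processed }} processed)"
)
line = status_template.render(
    status="OK created table",
    runtime=4.2,
    job={"bytes_processed": "1.1 GB"},
)
```

Users who wanted slot metrics, bytes billed, or anything else off the job object could then simply reference it in their template.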

@alepuccetti
Contributor Author

@drewbanin This is a great idea. I am not sure I will be able to implement something like that with my limited knowledge of dbt internals.


@alepuccetti
Contributor Author

@bodschut I like the idea of having a slot usage metric; I just think it needs more discussion, because it is not as straightforward as number of rows or bytes processed.

@bodschut
Contributor

Agreed, let's move it to another discussion. But with @drewbanin's suggestion it could all be included: if you have a job object in your variable scope, you can output the slot usage using macros, for example.

@jtcohen6
Contributor

I opened a new issue, let's keep the discussion going over there: #2580

This issue will (rightly) be closed when we merge #2530.

@meurant-naude

Hi there,
Thanks for adding this.
I was wondering whether total_bytes_billed can be added as well?
I'm assuming the data comes from INFORMATION_SCHEMA.JOBS_BY_PROJECT, and total_bytes_billed is now available too.

Thanks again,
Meurant
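For what it's worth, the Python BigQuery client exposes total_bytes_billed alongside total_bytes_processed on a finished query job, so a log line could report both without a separate INFORMATION_SCHEMA query. A sketch, with a plain dict standing in for the job object and a hypothetical helper name:

```python
def billing_summary(job_stats):
    # job_stats mimics attributes of a finished QueryJob (values in bytes).
    processed = job_stats.get("total_bytes_processed", 0)
    billed = job_stats.get("total_bytes_billed", 0)
    gib = 1024 ** 3
    return f"processed {processed / gib:.1f} GB, billed {billed / gib:.1f} GB"

# On-demand billing rounds up, so billed can exceed processed.
summary = billing_summary({
    "total_bytes_processed": 1 * 1024 ** 3,
    "total_bytes_billed": 10 * 1024 ** 3,
})
```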

@bodschut
Contributor

@meurant-naude I opened a new issue (#2747) to add the bytes_processed to other dbt commands, maybe better to take this question there.

@alepuccetti
Contributor Author


Agree, I will answer there.

@meurant-naude

Thanks all. Much appreciated

jtcohen6 pushed a commit to dbt-labs/dbt-bigquery that referenced this issue Sep 29, 2021