
Add data processed info into dbt run logs for all statement types #2526

Closed
alepuccetti opened this issue Jun 10, 2020 · 16 comments · Fixed by #2530
Labels
bigquery enhancement New feature or request good_first_issue Straightforward + self-contained changes, good for new contributors!

Comments

@alepuccetti
Contributor

alepuccetti commented Jun 10, 2020

Describe the feature

When using BigQuery, dbt run logs should show the data scanned for create table statements, not just for scripts.

Describe alternatives you've considered

No alternative

Additional context

This specific improvement is about BigQuery. However, it might apply to other databases too; I do not have enough experience with them to give specific examples.

Who will this benefit?

Everyone using dbt who wants easier access to usage information.

@alepuccetti alepuccetti added enhancement New feature or request triage labels Jun 10, 2020
@jtcohen6
Contributor

Very cool idea!

The code for printing out bytes billed is fairly straightforward:
https://github.com/fishtown-analytics/dbt/blob/c9b34681b756ccd4b5c4f0a3bd8f12f48f23081d/plugins/bigquery/dbt/adapters/bigquery/connections.py#L250-L252

I think it would just be a matter of reporting the total_bytes_processed for other query_job statement types.
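The bytes-processed line in the linked code boils down to humanizing `query_job.total_bytes_processed` and appending it to the model's status. A minimal sketch of that idea — the `format_bytes` helper below is illustrative, not dbt's actual function, and the job value is faked rather than read from a real QueryJob:

```python
def format_bytes(num_bytes):
    # Walk up the binary-prefix units until the value is small enough to print.
    for unit in ("Bytes", "KB", "MB", "GB", "TB"):
        if abs(num_bytes) < 1024.0:
            return f"{num_bytes:3.1f} {unit}"
        num_bytes /= 1024.0
    return f"{num_bytes:3.1f} PB"

# A finished BigQuery QueryJob exposes total_bytes_processed;
# here a hard-coded number stands in for it.
total_bytes_processed = 1_234_567_890
status_suffix = f"processed {format_bytes(total_bytes_processed)}"
```

Reporting the same suffix for create table statements would then just mean reading `total_bytes_processed` off those jobs too, as suggested above.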

@alepuccetti Is this something you'd be interested in working on? I think this could be a great opportunity for a first contribution :)

@jtcohen6 jtcohen6 added bigquery good_first_issue Straightforward + self-contained changes, good for new contributors! and removed triage labels Jun 10, 2020
@alepuccetti
Contributor Author

alepuccetti commented Jun 10, 2020

@jtcohen6 Yes, I would be interested. You can assign it to me, if that is the procedure.

@jtcohen6
Contributor

It's all yours! Happy to help if you want more direction or get stuck

@alepuccetti alepuccetti changed the title Add data scanned info into dbt run logs Add data processed info into dbt run logs for all statement types Jun 10, 2020
@drewbanin drewbanin added this to the Marian Anderson milestone Jun 12, 2020
@bodschut
Contributor

bodschut commented Jun 22, 2020

Hi Guys,

While bytes processed is an interesting KPI for BigQuery customers on on-demand pricing, the average slot usage of a query is interesting for customers on a slot commitment.

It is straightforward to calculate from the query job object you get when launching a query: compute the query duration in milliseconds as query_job.ended - query_job.started, then divide query_job.slot_millis by that value.
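As a sketch of that arithmetic, with plain datetimes standing in for the job's started/ended timestamps (the helper name is invented for illustration):

```python
from datetime import datetime, timedelta

def average_slots(slot_millis, started, ended):
    # Query duration in milliseconds, as described above.
    elapsed_ms = (ended - started).total_seconds() * 1000.0
    return slot_millis / elapsed_ms if elapsed_ms else 0.0

started = datetime(2020, 6, 22, 10, 0, 0)
ended = started + timedelta(seconds=10)
# 50,000 slot-milliseconds over a 10,000 ms query -> 5 slots on average.
avg = average_slots(50_000, started, ended)
```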

Could we add this info in the same PR, or do we create a new issue for this?

@jtcohen6
Contributor

@bodschut Thanks for the suggestion! We're packing a lot of info into limited horizontal space right now. Your point has crystallized for me that we should give the user more control.

Here's what makes sense to me: a set of additional configuration parameters in profiles.yml to control what dbt includes in its info output.

config:
  bytes_processed: true | false
  slot_duration: true | false
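To make the proposal concrete, here is a rough sketch of how such flags might gate what goes into the status line. Everything here (`build_status_line`, the dict keys) is hypothetical, not dbt's actual API:

```python
def build_status_line(status, config, job_stats):
    # config mirrors the proposed profiles.yml flags;
    # job_stats stands in for values read off the finished query job.
    parts = [status]
    if config.get("bytes_processed") and "bytes" in job_stats:
        parts.append(f"processed {job_stats['bytes']}")
    if config.get("slot_duration") and "slot_millis" in job_stats:
        parts.append(f"{job_stats['slot_millis']} slot ms")
    return ", ".join(parts)

line = build_status_line(
    "CREATE TABLE (1.0 rows)",
    {"bytes_processed": True, "slot_duration": False},
    {"bytes": "1.1 GB", "slot_millis": 50_000},
)
```

With both flags off, the output collapses back to today's plain status, which keeps the default display uncluttered.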

@drewbanin @alepuccetti What do you think here? I know this throws one more wrench into #2530, though I feel a lot better knowing that we'd be:

  • supporting both BQ pricing mechanisms
  • displaying a less overwhelming amount of information by default

@alepuccetti
Contributor Author

@jtcohen6 Having the user configure it is a good idea. I second that. I am up for implementing the configuration parameters, though it might take a while since I am doing this in my spare time.

About average slot usage: I had a discussion with Google engineers on how to measure slot utilisation, to estimate whether moving to flat-rate pricing is worth it and how to evaluate the performance implications. The answer was that there is no formula; you can only try your workload and see how it performs.
Because of this, I am not convinced that the metric as proposed actually serves the purpose of measuring slot utilisation.
An example: if your model has multiple queries that can run in parallel, it is very difficult to understand what is happening, because queries running at the same time share the slots, but the metric as proposed will not account for this concurrency.

I am in favour of having a slots metric, but I think what it should be and how to interpret it deserve more discussion in a separate issue. For instance, I think even just slot_millis is a good metric.

@bodschut
Contributor

bodschut commented Jun 22, 2020

Hi @alepuccetti

You are right that slot usage is not the only parameter to consider when moving to a slot commitment. But it is an interesting indicator once you are on a slot commitment and need to optimize your slot reservations for certain projects. At that point, knowing which queries consume how many slots on average can help you optimize, decide on changing a certain slot reservation, or even upgrade your commitment with another package of 500 slots.

You could indeed include slot_millis like that, without calculating the query duration, but the average slot usage as calculated above is the best indicator of the 'slot load' of a query. I don't know whether the model duration that dbt logs by default is the actual query duration or whether it includes setup time. If it is the actual query duration, you could calculate the average slots used yourself.

@drewbanin
Contributor

Huh, I think that this is a string templating issue, and I think Jinja could be a good tool to solve that kind of problem :)

I think we should keep the focus narrow for this particular PR, but in the future, I would like to:

  1. Provide a config option that accepts a templated string for the model status
  2. Provide a context that has variables in scope (like job, materialization, status, runtime, etc)

What do y'all think?
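The two points above could be sketched with Jinja along these lines — the template string and context keys here are invented for illustration, not a proposed dbt config format:

```python
from jinja2 import Template

# Hypothetical user-supplied template for the model status line (point 1),
# rendered against a context of in-scope variables (point 2).
status_template = Template(
    "{{ status }} in {{ runtime }}s ({{ job.bytes_processed }} processed)"
)
line = status_template.render(
    status="OK created table",
    runtime=4.2,
    job={"bytes_processed": "1.1 GB"},
)
```

Users who wanted slot metrics, bytes billed, or anything else off the job object could then simply reference it in their template.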

@alepuccetti
Contributor Author

@drewbanin This is a great idea. I am not sure I will be able to implement something like that with my limited knowledge of dbt internals.


@alepuccetti
Contributor Author

@bodschut I like the idea of having a slot usage metric; I just think it needs more discussion, because it is not as straightforward as number of rows or bytes processed.

@bodschut
Contributor

Agreed, let's move it to another discussion. But with @drewbanin's suggestion it could all be included: if you have a job object in your variable scope, you can output the slot usage using macros, for example.

@jtcohen6
Contributor

I opened a new issue, let's keep the discussion going over there: #2580

This issue will (rightly) be closed when we merge #2530.

@meurant-naude

Hi there,
Thanks for adding this.
I was wondering whether total_bytes_billed can be added as well?
I'm assuming the data comes from INFORMATION_SCHEMA.JOBS_BY_PROJECT, and total_bytes_billed is now available too.

Thanks again,
Meurant
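For what it's worth, the Python BigQuery client exposes total_bytes_billed alongside total_bytes_processed on a finished query job, so a log line could report both without a separate INFORMATION_SCHEMA query. A sketch, with a plain dict standing in for the job object and a hypothetical helper name:

```python
def billing_summary(job_stats):
    # job_stats mimics attributes of a finished QueryJob (values in bytes).
    processed = job_stats.get("total_bytes_processed", 0)
    billed = job_stats.get("total_bytes_billed", 0)
    gib = 1024 ** 3
    return f"processed {processed / gib:.1f} GB, billed {billed / gib:.1f} GB"

# On-demand billing rounds up, so billed can exceed processed.
summary = billing_summary({
    "total_bytes_processed": 1 * 1024 ** 3,
    "total_bytes_billed": 10 * 1024 ** 3,
})
```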

@bodschut
Contributor

@meurant-naude I opened a new issue (#2747) to add the bytes_processed to other dbt commands, maybe better to take this question there.

@alepuccetti
Contributor Author


Agree, I will answer there.

@meurant-naude

Thanks all. Much appreciated

jtcohen6 pushed a commit to dbt-labs/dbt-bigquery that referenced this issue Sep 29, 2021