
Add data processed info into dbt run logs for all statement types #2530

Merged: 5 commits merged into dbt-labs:dev/marian-anderson on Jun 23, 2020

Conversation

@alepuccetti (Contributor) commented Jun 10, 2020

resolves #2526

Description

Changes the log output of BigQuery CREATE_TABLE_AS_SELECT statements to include bytes processed.

Checklist

  • I have signed the CLA
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • I have updated the CHANGELOG.md and added information about my change to the "dbt next" section.

Sorry, but at the moment I cannot install all the requirements to run the test suite.
I am not sure what to write and where. @jtcohen6, can you offer some advice?

@alepuccetti changed the title from "plugins/bigquery: add processed bytes value for CREATE_TABLE_AS_SELECT" to "Add processed bytes value for BigQuery queries CREATE_TABLE_AS_SELECT statements" on Jun 10, 2020
@jtcohen6 (Contributor)

Sure thing! Have you taken a look at the contributing guide, specifically the section about testing? It looks like the failure in CircleCI is related to pep8 style. You can start by testing for this locally with a combination of virtualenv, tox, and flake8.

As far as the code itself: Did you also want to add bytes processed to the 'INSERT', 'DELETE', 'MERGE' query jobs? This would be relevant for incremental models during standard (non-full refresh) runs. Up to you.
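(For reference, a rough sketch of how those figures can be read off a finished query job with the google-cloud-bigquery client; the QueryJob attributes are the client's real ones, but the helper name and the status format below are illustrative, not the dbt adapter's actual code.)

```python
from google.cloud import bigquery


def describe_query_job(client: bigquery.Client, job: bigquery.QueryJob) -> str:
    # Bytes scanned is reported for every finished query job.
    bytes_processed = job.total_bytes_processed or 0

    if job.statement_type == 'CREATE_TABLE_AS_SELECT':
        # For CTAS, read the row count from the table the job wrote to.
        rows = client.get_table(job.destination).num_rows
        label = 'CREATE TABLE'
    elif job.statement_type in ('INSERT', 'DELETE', 'MERGE'):
        # DML jobs report how many rows they touched.
        rows = job.num_dml_affected_rows or 0
        label = job.statement_type
    else:
        return f"{job.statement_type} ({bytes_processed} bytes processed)"

    return f"{label} ({rows} rows, {bytes_processed} bytes processed)"
```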

@cla-bot (bot) added the cla:yes label on Jun 10, 2020
@alepuccetti (Contributor, Author) commented Jun 10, 2020

@jtcohen6 Thank you for the response.
I did not notice that the other statement types were also missing the bytes. After reading the testing guide, I understood I had to run all the tests. I have run the linter, so it should be fine now.
I could not run make test-unit because I would have to install a lot of things, and with my connection that would take forever. Sorry for the inconvenience.

@alepuccetti (Contributor, Author) commented Jun 10, 2020

@jtcohen6 I just realised that when I wrote "I am not sure what to write and where", I did not specify the fundamental part: I was talking about the CHANGELOG.

@alepuccetti changed the title from "Add processed bytes value for BigQuery queries CREATE_TABLE_AS_SELECT statements" to "Add data processed info into dbt run logs for all statement types" on Jun 10, 2020
@jtcohen6 (Contributor)

This worked for me locally! I just kicked off the rest of the integration tests.

My only hesitation on this is purely cosmetic: It's a lot more CLI text than we're used to printing to info. I wonder if we should try to summarize the row count, e.g. 78m rows instead of 78873386 rows. The word "processed" also takes up a lot of horizontal space.

[Screenshot: CLI output with the new bytes-processed info (Screen Shot 2020-06-11 at 4.07.41 PM)]

@drewbanin Could you lend an aesthetic eye?

@alepuccetti (Contributor, Author)

@jtcohen6
I agree that 78m is easier to read; it would follow the same idea of formatting the processed bytes. The word "processed" does seem a bit redundant; I just followed the format used for SCRIPT. CREATE TABLE (78M | 19.8GB) seems clear to me. I don't know whether a lowercase or capital m is better.

@drewbanin (Contributor)

This is groovy! I do agree - this is a lot of characters if we're working in an ~80 character budget. My vote would be for something like:

[CREATE TABLE (78m rows, 18.7 GB) in 60.1s]

I don't feel super strongly about that though - happy to discuss if anyone thinks differently.

This is really cool @alepuccetti - nice work so far!

@alepuccetti (Contributor, Author)

@drewbanin this looks great to me.

[CREATE TABLE (78m rows, 18.7 GB) in 60.1s]

I can write a format_rows_number to do so.

However, I have a couple of questions about the implementation.

  • Should we use [k,m,b] as units of measure?
  • I noticed that the current implementation of format_bytes returns > 1024 TB instead of a more accurate number. I understand that running queries that process more than 1024 TB is an incredibly rare edge case, but shouldn't we report the actual number (e.g. 3014 TB)?

An alternative for row-number formatting (I am not a fan, but it could be an option):
use a format like 1.3 * 10^12, always increasing the exponent by 3. So, for example, we could have 123.02 * 10^6.

Thoughts?
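(For illustration only, not code from this PR, here is roughly how the two candidate row-count formats would render the same value:)

```python
from math import floor, log10

rows = 78873386

# Abbreviated units: divide by 1000 until the value fits, then pick a suffix.
value, units, i = float(rows), ['', 'k', 'm', 'b', 't'], 0
while abs(value) >= 1000.0 and i < len(units) - 1:
    value /= 1000.0
    i += 1
print(f"{value:.1f}{units[i]}")              # -> 78.9m

# Scientific-notation alternative, stepping the exponent in multiples of 3.
exp = 3 * floor(log10(abs(rows)) / 3)
print(f"{rows / 10 ** exp:.2f} * 10^{exp}")  # -> 78.87 * 10^6
```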

@jtcohen6 (Contributor) commented Jun 12, 2020

I'm all for:

  • Using [k,m,b,t] to abbreviate row counts
  • Reporting the actual number of bytes processed, even if it's > 1024 TB (that's a five thousand dollar query!)

Big fan of scientific notation in general, but agree it doesn't feel right here :)

I'm also all for you coding up a format_rows_number function and applying it across the board. I don't think scripts have an associated rowcount, but we should keep its bytes reporting consistent with the other query types.

@alepuccetti (Contributor, Author)

@jtcohen6 Queries/scripts over 1024 TB are definitely ultra-niche, but a company might use flat-rate pricing instead of on-demand. Maybe we could also add PB.

I will make these changes in the next few days.

@alepuccetti (Contributor, Author) commented Jun 16, 2020

  • Added format_rows_number with tests.
  • Updated format_bytes and its tests. Also, I took the liberty of adding PB as a unit.
  • Updated the status message to use "," as the separator.

@jtcohen6, @drewbanin: I finally got the time to finish this.
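(For reference, a minimal sketch of what the updated byte formatter could look like with PB added; the exact implementation in core/dbt/utils.py may differ:)

```python
def format_bytes(num_bytes):
    # Walk up the binary units; adding PB means very large scans report a
    # real figure instead of capping out at "> 1024 TB".
    for unit in ['Bytes', 'KB', 'MB', 'GB', 'TB', 'PB']:
        if abs(num_bytes) < 1024.0:
            return f"{num_bytes:3.1f} {unit}"
        num_bytes /= 1024.0
    # Past petabytes, keep reporting in PB rather than truncating.
    return f"{num_bytes * 1024.0:3.1f} PB"
```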

@jtcohen6 (Contributor) left a comment

This is looking great! Let's reduce the print width a smidge further by cutting the word processed. I think it's fine to leave that word for script output, though, because:

  • it implies "total processed" / "processed overall"
  • there's no row count to show, so the widths end up about the same

I'm not sure why the py38 integration test failed on Postgres. Much more important is that the BigQuery integration tests passed.

core/dbt/utils.py: review thread resolved (outdated)
@alepuccetti (Contributor, Author) commented Jun 18, 2020

@jtcohen6 I think that not having "processed" can be misleading. It can be interpreted as the size of the results.

I squashed the proposed changes for the format_rows_number because I had to update the unit tests.

@jtcohen6 (Contributor) left a comment

Good point! Ok, I'm happy with how you've set this up, and I'm glad we have a future vision of how to make this more configurable for users (#2580).

In the meantime, let's get these tests running. Can you merge or rebase the changes from dev/marian-anderson? Doing so should kick off integration tests automatically once the unit tests are passing.

```python
def format_rows_number(rows_number):
    for unit in ['', 'k', 'm', 'b', 't']:
        if abs(rows_number) < 1000.0:
            return f"{rows_number:3.1f} {unit}".strip()
```

Missed this one! I think this is the cause of the failing unit test:

Suggested change:

```diff
-            return f"{rows_number:3.1f} {unit}".strip()
+            return f"{rows_number:3.1f}{unit}".strip()
```
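(With that suggestion applied, the whole helper would look roughly like this; the final fallback line is an assumption about how values beyond trillions are handled:)

```python
def format_rows_number(rows_number):
    # Abbreviate large row counts, e.g. 78873386 -> "78.9m".
    for unit in ['', 'k', 'm', 'b', 't']:
        if abs(rows_number) < 1000.0:
            return f"{rows_number:3.1f}{unit}".strip()
        rows_number /= 1000.0
    # Beyond trillions, keep the 't' suffix.
    return f"{rows_number * 1000.0:3.1f}t"


print(format_rows_number(78873386))  # -> 78.9m
```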

@alepuccetti (Contributor, Author)

This is as green as it gets. Very excited for my first contribution.

@jtcohen6 (Contributor) left a comment

@alepuccetti Looks great! I appreciate your patience on this. I left one last comment about consolidating the changelog notes; once that's set, this is good to merge.

CHANGELOG.md: two review threads resolved (one outdated)
Co-authored-by: Jeremy Cohen <[email protected]>
@alepuccetti (Contributor, Author)

@jtcohen6 Done ✅

@jtcohen6 (Contributor)

@beckjake Postgres integration test failed with this error:

server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.

Otherwise, this is good to merge from my point of view.

@jtcohen6 jtcohen6 merged commit 76f9f23 into dbt-labs:dev/marian-anderson Jun 23, 2020