
Add data processed info into dbt run logs for all statement types #2530

Merged: 5 commits merged into dbt-labs:dev/marian-anderson on Jun 23, 2020

Conversation

@alepuccetti (Contributor) commented Jun 10, 2020

resolves #2526

Description

Changes the log output of BigQuery CREATE_TABLE_AS_SELECT statements to include bytes processed.

Checklist

  • I have signed the CLA
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • I have updated the CHANGELOG.md and added information about my change to the "dbt next" section.

Sorry, but at the moment I cannot install all the requirements to run the test suite.
I am not sure what to write and where. @jtcohen6, can you offer some advice?

@alepuccetti changed the title from "plugins/bigquery: add processed bytes value for CREATE_TABLE_AS_SELECT" to "Add processed bytes value for BigQuery queries CREATE_TABLE_AS_SELECT statements" on Jun 10, 2020
@jtcohen6 (Contributor)

Sure thing! Have you taken a look at the contributing guide, specifically the section about testing? It looks like the failure in CircleCI is related to pep8 style. You can start by testing for this locally with a combination of virtualenv, tox, and flake8.

As far as the code itself: Did you also want to add bytes processed to the 'INSERT', 'DELETE', 'MERGE' query jobs? This would be relevant for incremental models during standard (non-full refresh) runs. Up to you.
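(For reference, a rough sketch of how those figures can be read off a finished query job with the google-cloud-bigquery client; the QueryJob attributes are the client's real ones, but the helper name and the status format below are illustrative, not the dbt adapter's actual code.)

```python
from google.cloud import bigquery


def describe_query_job(client: bigquery.Client, job: bigquery.QueryJob) -> str:
    # Bytes scanned is reported for every finished query job.
    bytes_processed = job.total_bytes_processed or 0

    if job.statement_type == 'CREATE_TABLE_AS_SELECT':
        # For CTAS, read the row count from the table the job wrote to.
        rows = client.get_table(job.destination).num_rows
        label = 'CREATE TABLE'
    elif job.statement_type in ('INSERT', 'DELETE', 'MERGE'):
        # DML jobs report how many rows they touched.
        rows = job.num_dml_affected_rows or 0
        label = job.statement_type
    else:
        return f"{job.statement_type} ({bytes_processed} bytes processed)"

    return f"{label} ({rows} rows, {bytes_processed} bytes processed)"
```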

@cla-bot (bot) added the cla:yes label on Jun 10, 2020
@alepuccetti (Contributor, Author) commented Jun 10, 2020

@jtcohen6 Thank you for the response.
I did not notice that the other statement types were also missing the bytes. After reading the testing guide, I understood I had to run all the tests. I have run the linter, so it should be fine now.
I could not run make test-unit because I would have to install a lot of things, and with my connection that would take forever. Sorry for the inconvenience.

@alepuccetti (Contributor, Author) commented Jun 10, 2020

@jtcohen6 I just realised that when I wrote "I am not sure what to write and where", I did not specify the fundamental part: I was talking about the CHANGELOG.

@alepuccetti changed the title from "Add processed bytes value for BigQuery queries CREATE_TABLE_AS_SELECT statements" to "Add data processed info into dbt run logs for all statement types" on Jun 10, 2020
@jtcohen6 (Contributor)

This worked for me locally! I just kicked off the rest of the integration tests.

My only hesitation on this is purely cosmetic: It's a lot more CLI text than we're used to printing to info. I wonder if we should try to summarize the row count, e.g. 78m rows instead of 78873386 rows. The word "processed" also takes up a lot of horizontal space.

[Screenshot: CLI output with the new bytes-processed info (Screen Shot 2020-06-11 at 4.07.41 PM)]

@drewbanin Could you lend an aesthetic eye?

@alepuccetti (Contributor, Author)

@jtcohen6
I agree that 78m is easier to read; it would follow the same idea of formatting the processed bytes. The word "processed" does seem a bit redundant; I just followed the format used for SCRIPT. CREATE TABLE (78M | 19.8GB) seems clear to me. I don't know whether a lowercase or capital m is better.

@drewbanin (Contributor)

This is groovy! I do agree - this is a lot of characters if we're working in an ~80 character budget. My vote would be for something like:

[CREATE TABLE (78m rows, 18.7 GB) in 60.1s]

I don't feel super strongly about that though - happy to discuss if anyone thinks differently.

This is really cool @alepuccetti - nice work so far!

@alepuccetti (Contributor, Author)

@drewbanin this looks great to me.

[CREATE TABLE (78m rows, 18.7 GB) in 60.1s]

I can write a format_rows_number to do so.

However, I have a couple of questions about the implementation.

  • Should we use [k,m,b] as units of measure?
  • I noticed that the current implementation of format_bytes returns > 1024 TB instead of a more accurate number. I understand that running queries that process more than 1024 TB is an incredibly rare edge case, but shouldn't we report the actual number (e.g. 3014 TB)?

An alternative for row-number formatting (I am not a fan, but it could be an option):
use a format like 1.3 * 10^12, always increasing the exponent by 3. So, for example, we could have 123.02 * 10^6.

Thoughts?
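(For illustration only, not code from this PR, here is roughly how the two candidate row-count formats would render the same value:)

```python
from math import floor, log10

rows = 78873386

# Abbreviated units: divide by 1000 until the value fits, then pick a suffix.
value, units, i = float(rows), ['', 'k', 'm', 'b', 't'], 0
while abs(value) >= 1000.0 and i < len(units) - 1:
    value /= 1000.0
    i += 1
print(f"{value:.1f}{units[i]}")              # -> 78.9m

# Scientific-notation alternative, stepping the exponent in multiples of 3.
exp = 3 * floor(log10(abs(rows)) / 3)
print(f"{rows / 10 ** exp:.2f} * 10^{exp}")  # -> 78.87 * 10^6
```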

@jtcohen6 (Contributor) commented Jun 12, 2020

I'm all for:

  • Using [k,m,b,t] to abbreviate row counts
  • Reporting the actual number of bytes processed, even if it's > 1024 TB (that's a five thousand dollar query!)

Big fan of scientific notation in general, but agree it doesn't feel right here :)

I'm also all for you coding up a format_rows_number function and applying it across the board. I don't think scripts have an associated rowcount, but we should keep its bytes reporting consistent with the other query types.

@alepuccetti (Contributor, Author)

@jtcohen6 Queries/scripts over 1024 TB are definitely ultra-niche, but a company might use flat-rate pricing instead of on-demand. Maybe we could also add PB.

I will make these changes in the next few days.

@alepuccetti (Contributor, Author) commented Jun 16, 2020

  • Added format_rows_number with tests.
  • Updated format_bytes and its tests. Also, I took the liberty of adding PB as a unit.
  • Updated the status message to use "," as the separator.

@jtcohen6, @drewbanin: I finally got the time to finish this.
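(For reference, a minimal sketch of what the updated byte formatter could look like with PB added; the exact implementation in core/dbt/utils.py may differ:)

```python
def format_bytes(num_bytes):
    # Walk up the binary units; adding PB means very large scans report a
    # real figure instead of capping out at "> 1024 TB".
    for unit in ['Bytes', 'KB', 'MB', 'GB', 'TB', 'PB']:
        if abs(num_bytes) < 1024.0:
            return f"{num_bytes:3.1f} {unit}"
        num_bytes /= 1024.0
    # Past petabytes, keep reporting in PB rather than truncating.
    return f"{num_bytes * 1024.0:3.1f} PB"
```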

@jtcohen6 (Contributor) left a comment

This is looking great! Let's reduce the print width a smidge further by cutting the word processed. I think it's fine to leave that word for script output, though, because:

  • it implies "total processed" / "processed overall"
  • there's no row count to show, so the widths end up about the same

I'm not sure why the py38 integration test failed on Postgres. Much more important is that the BigQuery integration tests passed.

core/dbt/utils.py: review thread resolved (outdated)
@alepuccetti (Contributor, Author) commented Jun 18, 2020

@jtcohen6 I think that not having "processed" can be misleading. It can be interpreted as the size of the results.

I squashed the proposed changes for the format_rows_number because I had to update the unit tests.

@jtcohen6 (Contributor) left a comment

Good point! Ok, I'm happy with how you've set this up, and I'm glad we have a future vision of how to make this more configurable for users (#2580).

In the meantime, let's get these tests running. Can you merge or rebase the changes from dev/marian-anderson? Doing so should kick off integration tests automatically once the unit tests are passing.

```python
def format_rows_number(rows_number):
    for unit in ['', 'k', 'm', 'b', 't']:
        if abs(rows_number) < 1000.0:
            return f"{rows_number:3.1f} {unit}".strip()
```

Missed this one! I think this is the cause of the failing unit test:

Suggested change:

```diff
-            return f"{rows_number:3.1f} {unit}".strip()
+            return f"{rows_number:3.1f}{unit}".strip()
```
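(With that suggestion applied, the whole helper would look roughly like this; the final fallback line is an assumption about how values beyond trillions are handled:)

```python
def format_rows_number(rows_number):
    # Abbreviate large row counts, e.g. 78873386 -> "78.9m".
    for unit in ['', 'k', 'm', 'b', 't']:
        if abs(rows_number) < 1000.0:
            return f"{rows_number:3.1f}{unit}".strip()
        rows_number /= 1000.0
    # Beyond trillions, keep the 't' suffix.
    return f"{rows_number * 1000.0:3.1f}t"


print(format_rows_number(78873386))  # -> 78.9m
```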

@alepuccetti (Contributor, Author)

This is as green as it gets. Very excited for my first contribution.

@jtcohen6 (Contributor) left a comment

@alepuccetti Looks great! I appreciate your patience on this. I left one last comment about consolidating the changelog notes; once that's set, this is good to merge.

CHANGELOG.md: two review threads resolved (one outdated)
Co-authored-by: Jeremy Cohen <[email protected]>
@alepuccetti (Contributor, Author)

@jtcohen6 Done ✅

@jtcohen6 (Contributor)

@beckjake Postgres integration test failed with this error:

server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.

Otherwise, this is good to merge from my point of view.

@jtcohen6 jtcohen6 merged commit 76f9f23 into dbt-labs:dev/marian-anderson Jun 23, 2020