-
Notifications
You must be signed in to change notification settings - Fork 4.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BEAM-12913] Enable configuration of query priority in ReadFromBigQuery #15536
Conversation
R: @pabloem |
118767c
to
07ce10f
Compare
Codecov Report
@@ Coverage Diff @@
## master #15536 +/- ##
==========================================
- Coverage 83.78% 83.77% -0.01%
==========================================
Files 444 444
Lines 60189 60273 +84
==========================================
+ Hits 50427 50493 +66
- Misses 9762 9780 +18
Continue to review full report at Codecov.
|
07ce10f
to
59b0fb9
Compare
@@ -1915,6 +1927,11 @@ class ReadFromBigQuery(PTransform): | |||
that dataset, and will remove it once it is not needed. Job needs access | |||
to create and delete tables within the given dataset. Dataset name | |||
should *not* start with the reserved prefix `beam_temp_dataset_`. | |||
query_priority (BigQueryQueryPriority): By default, this transform runs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am wondering can we make a batch priority by default in ReadFromBigQuery? What would be the use case for the user to make it interactive? I just want to make sure that the parameter that got exposed are really necessary for the end user and it confirms with style guide
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR does make the priority BATCH by default.
If you're asking about whether or not we should expose a query_priority
parameter, I don't have strong opinions either way. My original commit (b66e4b1) makes it non-configurable, but after seeing that it's configurable in the Java BigQueryIO, I decided to make the Python side consistent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
never mind, my comment. I was thinking initially that the user doesn't need to be aware of this parameter. There's no point executing the query if there's not enough resources. Thus, batch priority for dataflow job makes better sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for checking all of these - I was not sure about adding the extra parameter, bu since Java exposes it, I think it makes sense for it to be exposed by Python as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
@@ -1915,6 +1927,11 @@ class ReadFromBigQuery(PTransform): | |||
that dataset, and will remove it once it is not needed. Job needs access | |||
to create and delete tables within the given dataset. Dataset name | |||
should *not* start with the reserved prefix `beam_temp_dataset_`. | |||
query_priority (BigQueryQueryPriority): By default, this transform runs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for checking all of these - I was not sure about adding the extra parameter, bu since Java exposes it, I think it makes sense for it to be exposed by Python as well.
It looks like this PR is causing python postcommit to fail: https://ci-beam.apache.org/job/beam_PostCommit_Python37/4309/ @chunyang Could you PTAL. |
@y1chi yes I just saw that, will put together a fix right now. |
Changes:
ReadFromBigQuery
submit queries with BATCH priority by default. This mirrors the default behavior in the Java BigQueryIO and the legacy native Dataflow IO.query_priority
parameter toReadFromBigQuery
to allow toggling between INTERACTIVE and BATCH priority. Add a correspondingBigQueryQueryPriority
object to hold the priority constants.Submitting BigQuery queries with BATCH priority allows queries to be started when idle resources are available and allows queries submitted from Beam to not count toward a project's concurrent rate limit.
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
R: @username
).[BEAM-XXX] Fixes bug in ApproximateQuantiles
, where you replaceBEAM-XXX
with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.CHANGES.md
with noteworthy changes.See the Contributor Guide for more tips on how to make review process smoother.
ValidatesRunner
compliance status (on master branch)Examples testing status on various runners
Post-Commit SDK/Transform Integration Tests Status (on master branch)
Pre-Commit Tests Status (on master branch)
See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.
GitHub Actions Tests Status (on master branch)
See CI.md for more information about GitHub Actions CI.