-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
weather-dl
: Out-of-order license distribution error
#98
Labels
Comments
alxmrs
added a commit
that referenced
this issue
Mar 2, 2022
This is a useful short-term mitigation of #98.
alxmrs
added a commit
that referenced
this issue
Mar 2, 2022
* Lower default num-requests for MARS to make it more robust. This is a useful short-term mitigation of #98. * Fixed test.
alxmrs
added a commit
that referenced
this issue
Mar 3, 2022
In this new scheme, we make use of `beam.Partition` to split work by available license (via param subsections). This simplifies a lot of the pipeline logic. E2E tests will validate if this, indeed, fixes #98.
alxmrs
added a commit
that referenced
this issue
Mar 15, 2022
In this change, I divide up the download requets to workers in a data-driven way. Namely, after fanning out the partitions, I group by both the license and number of requests. Then, I've restructured the fetch step to execute each group of configs in order. This way, we can guarantee that all licenses are being utilized to their max capacity and within rate limits. This is a much better fix for #98, as it also obviates the needs to limit the max number of workers (this is actually autoscale friendly!) Thanks to [email protected] for the idea and an example for how to implement this.
alxmrs
added a commit
that referenced
this issue
Mar 17, 2022
* New Data-Oriented Task distribution strategy. In this change, I divide up the download requets to workers in a data-driven way. Namely, after fanning out the partitions, I group by both the license and number of requests. Then, I've restructured the fetch step to execute each group of configs in order. This way, we can guarantee that all licenses are being utilized to their max capacity and within rate limits. This is a much better fix for #98, as it also obviates the needs to limit the max number of workers (this is actually autoscale friendly!) Thanks to [email protected] for the idea and an example for how to implement this. * Removed configure workers method. Now that we have a data-oriented strategy for evenly distributing tasks, we don't need to fight autoscaling anymore. Thus, we can get rid of the code that sets the max number of workers. * Incrementing weather-dl version. * Better logging in the fetch step. * Nit: Better counter key. * Fix: assert + indentation error. * Improved readability of `parse_target_name`. * Parser tests are parameterized. * A simpler, more parallel version of the pipeline.
Closing, as this was fixed in #116! The fix will be in the next release. |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
In the weather downloader, we underutilize licenses due to the current way they are distributed. Worse, if we use near the max number of queued workers (like we do by default for the MARS downloads), pipelines will fail with a "MaxAPIRequests" error from ECMWF.
My leading hypothesis about what's causing this is that our license distribution scheme depends on a particular order of future requests. However, we've made recent changes to better work with the dataflow autoscaling algorithm to parallelize download requests, which allocates the
fetch_data
step in a random order.How to reproduce
weather-dl
with multiple licenses (CDS is fine).Outcome: There will be an uneven distribution of download requests. Some may have duplicates while others don't have any work dispatched. This leads to underutilization of licenses and in some cases brittle pipelines.
Expected: Each worker should at least have their request limit maxed out (licenses fully utilized). Requests, in general, should be evenly distributed. Pipelines should never hit the "Too many requests" error.
Workarounds
The strategy to get a robust yet high utilization of downloads is to split data requests into multiple configs and run
weather-dl
twice (or multiple times). It especially helps to leverage the-n
flag to limit the number of data requests to the access limit (for CDS, the defaults are currently fine. For MARS, set-n 2
).Potential fix
A good strategy would be to restructure the pipeline such that:
beam.Create()
steps).zip
in Python) after a reshuffle so licenses are evenly distributed.The text was updated successfully, but these errors were encountered: