weather-dl: Out-of-order license distribution error #98

Closed
alxmrs opened this issue Jan 28, 2022 · 1 comment
Labels: bug, P1, weather-dl

Comments

alxmrs (Collaborator) commented Jan 28, 2022

In the weather downloader, licenses are underutilized due to the way they are currently distributed. Worse, if we run near the maximum number of queued workers (as we do by default for MARS downloads), pipelines will fail with a "MaxAPIRequests" error from ECMWF.

My leading hypothesis is that our license distribution scheme depends on a particular ordering of the download requests. However, we've made recent changes to work better with the Dataflow autoscaling algorithm and parallelize download requests, which causes the fetch_data step to process elements in a random order.
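To make the hypothesis concrete, here is a minimal sketch (plain Python, not the actual weather-dl code) of an order-dependent assignment scheme; the license and config names are hypothetical:

```python
# Hypothetical, order-dependent license assignment: each partition gets a
# license based on its position in the stream.
licenses = ['license-0', 'license-1']
partitions = ['config-a', 'config-b', 'config-c', 'config-d']

# If the runner reorders `partitions` (as Dataflow may after a fan-out),
# the license each partition receives changes from run to run, so one
# license can receive duplicate requests while another sits idle.
assignments = {p: licenses[i % len(licenses)]
               for i, p in enumerate(partitions)}
print(assignments)
```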

How to reproduce

  1. Run weather-dl with multiple licenses (CDS is fine).
  2. For each license/account, look up the client's request queue (the client's docstrings link to the appropriate URL).

Outcome: Download requests are unevenly distributed: some licenses have duplicate requests queued while others have no work dispatched at all. This underutilizes licenses and, in some cases, makes pipelines brittle.
Expected: Each worker should at least have its request limit maxed out (licenses fully utilized). Requests, in general, should be evenly distributed. Pipelines should never hit the "Too many requests" error.

Workarounds

The strategy for achieving robust yet high utilization of downloads is to split data requests into multiple configs and run weather-dl once per config. It especially helps to use the -n flag to limit the number of data requests to the access limit (for CDS, the defaults are currently fine; for MARS, set -n 2).

Potential fix

A good strategy would be to restructure the pipeline such that:

  • License creation is separate from config partitioning (i.e. different beam.Create() steps).
  • The two streams (licenses and partitions) get joined together (like a zip in Python) after a reshuffle, so licenses are evenly distributed (see the sketch below).
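A minimal Beam sketch of this restructuring, assuming two hypothetical licenses; the zip-like join is done here with `beam.CoGroupByKey`, which is an assumption about the implementation, not the final design:

```python
import zlib

import apache_beam as beam

# Hypothetical stand-ins: two licenses, four config partitions.
LICENSES = ['license-0', 'license-1']
CONFIGS = [{'target': f'era5-{i}.nc'} for i in range(4)]

with beam.Pipeline() as p:
    # 1. Separate beam.Create() steps for licenses and partitions.
    licenses = p | 'licenses' >> beam.Create(list(enumerate(LICENSES)))
    partitions = (
        p
        | 'partitions' >> beam.Create(CONFIGS)
        | beam.Reshuffle()  # decouple distribution from element order
        # Key each partition onto a license index (stable hash, not the
        # salted builtin `hash`, so keys agree across workers).
        | beam.Map(lambda c: (zlib.crc32(c['target'].encode())
                              % len(LICENSES), c)))

    # 2. Zip-like join: each group of configs meets the license that
    # shares its index.
    ({'license': licenses, 'configs': partitions}
     | beam.CoGroupByKey()
     | beam.MapTuple(lambda idx, d: print(d['license'], '->', d['configs'])))
```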
alxmrs added the bug, P0, and weather-dl labels on Jan 28, 2022
alxmrs added the P1 label and removed the P0 label on Mar 2, 2022
alxmrs added a commit that referenced this issue Mar 2, 2022
This is a useful short-term mitigation of #98.
alxmrs added a commit that referenced this issue Mar 2, 2022
* Lower default num-requests for MARS to make it more robust.

This is a useful short-term mitigation of #98.

* Fixed test.
alxmrs added a commit that referenced this issue Mar 3, 2022
In this new scheme, we make use of `beam.Partition` to split work by available license (via param subsections). This simplifies a lot of the pipeline logic.

E2E tests will validate whether this indeed fixes #98.
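As a rough illustration of the scheme this commit describes, a `beam.Partition` split might look like the following; the license names are hypothetical stand-ins for the param subsections:

```python
import zlib

import apache_beam as beam

LICENSES = ['license-0', 'license-1']  # hypothetical param subsections

def by_license(config, num_partitions):
    # beam.Partition calls this with (element, num_partitions) and expects
    # an int in [0, num_partitions) selecting the output shard.
    return zlib.crc32(config['target'].encode()) % num_partitions

with beam.Pipeline() as p:
    configs = p | beam.Create(
        [{'target': f'era5-{i}.nc'} for i in range(4)])
    shards = configs | beam.Partition(by_license, len(LICENSES))
    # One fetch branch per license, each fed only its own shard of work.
    for name, shard in zip(LICENSES, shards):
        shard | f'fetch-{name}' >> beam.Map(
            lambda c, lic=name: print(lic, 'fetching', c['target']))
```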
alxmrs added a commit that referenced this issue Mar 15, 2022
In this change, I divide up the download requests among workers in a data-driven way. Namely, after fanning out the partitions, I group by both the license and the number of requests. Then, I restructure the fetch step to execute each group of configs in order.

This way, we can guarantee that all licenses are utilized to their max capacity and stay within rate limits. This is a much better fix for #98, as it also obviates the need to limit the max number of workers (this is actually autoscale-friendly!).

Thanks to [email protected] for the idea and an example for how to implement this.
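A sketch, under assumptions, of the data-driven grouping described above: each config is keyed by (license, request slot), grouped, and fetched sequentially per group. All names and the slot-hashing detail are illustrative, not the actual weather-dl internals:

```python
import zlib

import apache_beam as beam

# Hypothetical stand-ins for the fanned-out config partitions.
CONFIGS = [{'license': 'license-0', 'target': f'era5-{i}.nc'}
           for i in range(6)]
NUM_REQUESTS = 2  # per-license concurrent request limit (e.g. MARS)

def to_license_slot(config):
    """Key each config by (license, slot): one group per concurrent
    request slot of each license."""
    slot = zlib.crc32(config['target'].encode()) % NUM_REQUESTS
    return ((config['license'], slot), config)

def fetch_in_order(key, group):
    # The real fetch step would download sequentially here, guaranteeing
    # at most one in-flight request per slot, keeping within rate limits.
    for config in group:
        print(key, 'fetching', config['target'])

with beam.Pipeline() as p:
    (p
     | beam.Create(CONFIGS)
     | beam.Map(to_license_slot)
     | beam.GroupByKey()  # one group per (license, request slot)
     | beam.MapTuple(fetch_in_order))
```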
alxmrs added a commit that referenced this issue Mar 17, 2022
* New Data-Oriented Task distribution strategy.

In this change, I divide up the download requests among workers in a data-driven way. Namely, after fanning out the partitions, I group by both the license and the number of requests. Then, I restructure the fetch step to execute each group of configs in order.

This way, we can guarantee that all licenses are utilized to their max capacity and stay within rate limits. This is a much better fix for #98, as it also obviates the need to limit the max number of workers (this is actually autoscale-friendly!).

Thanks to [email protected] for the idea and an example for how to implement this.

* Removed configure workers method.

Now that we have a data-oriented strategy for evenly distributing tasks, we don't need to fight autoscaling anymore. Thus, we can get rid of the code that sets the max number of workers.

* Incrementing weather-dl version.

* Better logging in the fetch step.

* Nit: Better counter key.

* Fix: assert + indentation error.

* Improved readability of `parse_target_name`.

* Parser tests are parameterized.

* A simpler, more parallel version of the pipeline.
alxmrs (Collaborator, Author) commented Mar 24, 2022

Closing, as this was fixed in #116! The fix will be in the next release.

alxmrs closed this as completed Mar 24, 2022