weather-dl: Out-of-order license distribution error #98

Closed
alxmrs opened this issue Jan 28, 2022 · 1 comment
Labels: bug, P1, weather-dl

Comments

alxmrs (Collaborator) commented Jan 28, 2022

In the weather downloader, licenses are underutilized due to the way they are currently distributed. Worse, if we run near the maximum number of queued workers (as we do by default for MARS downloads), pipelines will fail with a "MaxAPIRequests" error from ECMWF.

My leading hypothesis is that our license distribution scheme depends on a particular ordering of the download requests. However, we've made recent changes to work better with the Dataflow autoscaling algorithm and parallelize download requests, which causes the fetch_data step to process elements in a random order.
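To make the hypothesis concrete, here is a minimal sketch (plain Python, not the actual weather-dl code) of an order-dependent assignment scheme; the license and config names are hypothetical:

```python
# Hypothetical, order-dependent license assignment: each partition gets a
# license based on its position in the stream.
licenses = ['license-0', 'license-1']
partitions = ['config-a', 'config-b', 'config-c', 'config-d']

# If the runner reorders `partitions` (as Dataflow may after a fan-out),
# the license each partition receives changes from run to run, so one
# license can receive duplicate requests while another sits idle.
assignments = {p: licenses[i % len(licenses)]
               for i, p in enumerate(partitions)}
print(assignments)
```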

How to reproduce

  1. Run weather-dl with multiple licenses (CDS is fine).
  2. For each license/account, look up the client's request queue (the client's docstrings link to the appropriate URL).

Outcome: Download requests are unevenly distributed: some licenses have duplicate requests queued while others have no work dispatched at all. This underutilizes licenses and, in some cases, makes pipelines brittle.
Expected: Each worker should at least have its request limit maxed out (licenses fully utilized). Requests, in general, should be evenly distributed. Pipelines should never hit the "Too many requests" error.

Workarounds

The strategy for achieving robust yet high utilization of downloads is to split data requests into multiple configs and run weather-dl once per config. It especially helps to use the -n flag to limit the number of data requests to the access limit (for CDS, the defaults are currently fine; for MARS, set -n 2).

Potential fix

A good strategy would be to restructure the pipeline such that:

  • License creation is separate from config partitioning (i.e. different beam.Create() steps).
  • The two streams (licenses and partitions) get joined together (like a zip in Python) after a reshuffle, so licenses are evenly distributed (see the sketch below).
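A minimal Beam sketch of this restructuring, assuming two hypothetical licenses; the zip-like join is done here with `beam.CoGroupByKey`, which is an assumption about the implementation, not the final design:

```python
import zlib

import apache_beam as beam

# Hypothetical stand-ins: two licenses, four config partitions.
LICENSES = ['license-0', 'license-1']
CONFIGS = [{'target': f'era5-{i}.nc'} for i in range(4)]

with beam.Pipeline() as p:
    # 1. Separate beam.Create() steps for licenses and partitions.
    licenses = p | 'licenses' >> beam.Create(list(enumerate(LICENSES)))
    partitions = (
        p
        | 'partitions' >> beam.Create(CONFIGS)
        | beam.Reshuffle()  # decouple distribution from element order
        # Key each partition onto a license index (stable hash, not the
        # salted builtin `hash`, so keys agree across workers).
        | beam.Map(lambda c: (zlib.crc32(c['target'].encode())
                              % len(LICENSES), c)))

    # 2. Zip-like join: each group of configs meets the license that
    # shares its index.
    ({'license': licenses, 'configs': partitions}
     | beam.CoGroupByKey()
     | beam.MapTuple(lambda idx, d: print(d['license'], '->', d['configs'])))
```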
alxmrs added the bug, P0, and weather-dl labels on Jan 28, 2022
alxmrs added the P1 label and removed the P0 label on Mar 2, 2022
alxmrs added a commit that referenced this issue Mar 2, 2022
This is a useful short-term mitigation of #98.
alxmrs added a commit that referenced this issue Mar 2, 2022
* Lower default num-requests for MARS to make it more robust.

This is a useful short-term mitigation of #98.

* Fixed test.
alxmrs added a commit that referenced this issue Mar 3, 2022
In this new scheme, we make use of `beam.Partition` to split work by available license (via param subsections). This simplifies a lot of the pipeline logic.

E2E tests will validate whether this indeed fixes #98.
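As a rough illustration of the scheme this commit describes, a `beam.Partition` split might look like the following; the license names are hypothetical stand-ins for the param subsections:

```python
import zlib

import apache_beam as beam

LICENSES = ['license-0', 'license-1']  # hypothetical param subsections

def by_license(config, num_partitions):
    # beam.Partition calls this with (element, num_partitions) and expects
    # an int in [0, num_partitions) selecting the output shard.
    return zlib.crc32(config['target'].encode()) % num_partitions

with beam.Pipeline() as p:
    configs = p | beam.Create(
        [{'target': f'era5-{i}.nc'} for i in range(4)])
    shards = configs | beam.Partition(by_license, len(LICENSES))
    # One fetch branch per license, each fed only its own shard of work.
    for name, shard in zip(LICENSES, shards):
        shard | f'fetch-{name}' >> beam.Map(
            lambda c, lic=name: print(lic, 'fetching', c['target']))
```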
alxmrs added a commit that referenced this issue Mar 15, 2022
In this change, I divide up the download requests among workers in a data-driven way. Namely, after fanning out the partitions, I group by both the license and the number of requests. Then, I restructure the fetch step to execute each group of configs in order.

This way, we can guarantee that all licenses are utilized to their max capacity and stay within rate limits. This is a much better fix for #98, as it also obviates the need to limit the max number of workers (this is actually autoscale-friendly!).

Thanks to [email protected] for the idea and an example for how to implement this.
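A sketch, under assumptions, of the data-driven grouping described above: each config is keyed by (license, request slot), grouped, and fetched sequentially per group. All names and the slot-hashing detail are illustrative, not the actual weather-dl internals:

```python
import zlib

import apache_beam as beam

# Hypothetical stand-ins for the fanned-out config partitions.
CONFIGS = [{'license': 'license-0', 'target': f'era5-{i}.nc'}
           for i in range(6)]
NUM_REQUESTS = 2  # per-license concurrent request limit (e.g. MARS)

def to_license_slot(config):
    """Key each config by (license, slot): one group per concurrent
    request slot of each license."""
    slot = zlib.crc32(config['target'].encode()) % NUM_REQUESTS
    return ((config['license'], slot), config)

def fetch_in_order(key, group):
    # The real fetch step would download sequentially here, guaranteeing
    # at most one in-flight request per slot, keeping within rate limits.
    for config in group:
        print(key, 'fetching', config['target'])

with beam.Pipeline() as p:
    (p
     | beam.Create(CONFIGS)
     | beam.Map(to_license_slot)
     | beam.GroupByKey()  # one group per (license, request slot)
     | beam.MapTuple(fetch_in_order))
```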
alxmrs added a commit that referenced this issue Mar 17, 2022
* New Data-Oriented Task distribution strategy.

In this change, I divide up the download requests among workers in a data-driven way. Namely, after fanning out the partitions, I group by both the license and the number of requests. Then, I restructure the fetch step to execute each group of configs in order.

This way, we can guarantee that all licenses are utilized to their max capacity and stay within rate limits. This is a much better fix for #98, as it also obviates the need to limit the max number of workers (this is actually autoscale-friendly!).

Thanks to [email protected] for the idea and an example for how to implement this.

* Removed configure workers method.

Now that we have a data-oriented strategy for evenly distributing tasks, we don't need to fight autoscaling anymore. Thus, we can get rid of the code that sets the max number of workers.

* Incrementing weather-dl version.

* Better logging in the fetch step.

* Nit: Better counter key.

* Fix: assert + indentation error.

* Improved readability of `parse_target_name`.

* Parser tests are parameterized.

* A simpler, more parallel version of the pipeline.
alxmrs (Collaborator, Author) commented Mar 24, 2022

Closing, as this was fixed in #116! The fix will be in the next release.

alxmrs closed this as completed Mar 24, 2022