New data-oriented task distribution strategy. #116
Conversation
In this change, I divide up the download requests among workers in a data-driven way. Namely, after fanning out the partitions, I group them by both license and request number. Then, I've restructured the fetch step to execute each group of configs in order. This way, we can guarantee that all licenses are utilized to their maximum capacity while staying within rate limits. This is a much better fix for #98, as it also obviates the need to limit the max number of workers (this approach is actually autoscale friendly!). Thanks to [email protected] for the idea and an example of how to implement it.
Now that we have a data-oriented strategy for evenly distributing tasks, we don't need to fight autoscaling anymore. Thus, we can get rid of the code that sets the max number of workers.
In an end-to-end test on Dataflow, I can confirm that eventually – even for smaller downloads – this strategy autoscales up until all licenses / requests are in use. :)
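To make the grouping idea concrete, here is a runnable toy sketch (the data and field names like `license` and `request` are made up for illustration, not the project's real schema): configs are bucketed by (license, request number), and each bucket is then fetched in order.

```python
import collections

# Hypothetical fanned-out download configs; field names are illustrative.
configs = [
    {'license': f'license-{i % 2}', 'request': i % 3, 'url': f'https://example.test/{i}'}
    for i in range(12)
]

# Group by (license, request number): each key represents one unit of a
# license's request capacity.
groups = collections.defaultdict(list)
for cfg in configs:
    groups[(cfg['license'], cfg['request'])].append(cfg)

# Executing one group at a time keeps every license fully utilized while
# never exceeding its per-license rate limit.
for key, group in sorted(groups.items()):
    print(key, [cfg['url'] for cfg in group])
```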
This generally looks good, although there are some sections that seem a bit harder to read than is ideal. I commented on them. Let me know if you need clarification on any of this.
```python
            if isinstance(params, dict)] or [('default', {})]


def prepare_partitions(config: Config, store: t.Optional[Store] = None) -> t.Iterator[Partition]:
```
This seems like a super complicated way to solve this problem. Is there a way to simplify this, perhaps by making it a bit less general?
I can start thinking about a simpler way to solve this...
```python
    params_loop = itertools.cycle(get_subsections(config))

    partition_configs = filter(
        lambda it: new_downloads_only(it, store),
```
Why filter here? If I'm reading this correctly, you're actually hitting storage here to decide what to filter. Why not do that in its own DoFn so it can run in parallel? This could be many orders of magnitude slower than the rest of this method. It seems like you'd want to just generate all possible paths here, with a separate filter step that performs whatever test is required to decide whether to actually download later.
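For concreteness, a rough sketch of what that could look like; `all_candidate_partitions` and the stand-in `new_downloads_only` below are hypothetical stubs, not the project's real implementations:

```python
import apache_beam as beam


def all_candidate_partitions(config):
    # Hypothetical stand-in: cheaply enumerate every possible partition.
    return [{'day': d} for d in range(10)]


def new_downloads_only(partition, store=None):
    # Stand-in for the real (slow) storage check; pretend odd days are new.
    return partition['day'] % 2 == 1


with beam.Pipeline() as pipeline:
    _ = (
        pipeline
        | 'GenerateCandidates' >> beam.Create(all_candidate_partitions({}))
        # The storage check now runs in parallel as its own Filter step.
        | 'SkipExistingDownloads' >> beam.Filter(new_downloads_only)
        | 'Fetch' >> beam.Map(print)  # stand-in for the real fetch DoFn
    )
```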
A little context: this filter is here because of a bug we found earlier. If we skipped some downloads in a beam.Filter step, the subsections (licenses) wouldn't be evenly distributed across the non-skipped downloads.
It occurs to me: I could probably perform the cycle step as a separate Map operation that occurs after a Filter, and keep these steps really simple.
Yes, that's what I was going to propose. I think separately filtering and then mapping for the cycle would be the simplest and possibly the most efficient approach.
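A runnable toy model of that restructuring (made-up data); the point is that the cycle only ever sees partitions that survived the filter, so skipped downloads can no longer skew the license distribution:

```python
import itertools

licenses = ['license-a', 'license-b']

# Made-up partitions; every third one pretends to be already downloaded.
partitions = [{'day': d, 'downloaded': d % 3 == 0} for d in range(10)]

# Step 1 (the Filter): drop partitions we already have.
pending = [p for p in partitions if not p['downloaded']]

# Step 2 (the Map): pair each surviving partition with the next license in
# round-robin order. Because the cycle runs after the filter, every license
# stays evenly loaded.
for license_name, part in zip(itertools.cycle(licenses), pending):
    print(license_name, part['day'])
```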
```python
    )

    yield from ((*name_and_params, config) for name_and_params, config in zip(params_loop, partition_configs))
```
This is a subtle enough statement to warrant a comment about what exactly it is doing. I would also accept just splitting this into two lines: a for loop and then a yield statement.
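For example, the one-liner could become something like this (same behavior; the inner `config` is renamed here for clarity, which is a suggestion rather than the PR's actual code):

```python
# Pair each partition config with the next (name, params) entry from the
# license cycle, so licenses are handed out round-robin across partitions.
for name_and_params, partition_config in zip(params_loop, partition_configs):
    yield (*name_and_params, partition_config)
```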
Thanks for the changes.