This repository has been archived by the owner on Dec 16, 2022. It is now read-only.

implement manual distributed sharding for SNLI reader #89

Merged

epwalsh merged 7 commits into master from snli-manual-sharding on Jul 14, 2020

Conversation

epwalsh (Member) commented on Jul 8, 2020

This gives a huge speed-up to SNLI/MNLI dataset reading in distributed training.
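
Roughly, "manual distributed sharding" means the reader slices the example stream by worker rank itself, up front, rather than having every worker read and tokenize the whole file and then discard the other workers' instances. A minimal sketch of the idea, assuming `torch.distributed` is the source of rank and world size (the `start_index`/`step_size` names match the diff below; the helper itself is illustrative, not the reader's exact code):

```python
import itertools

import torch.distributed as dist


def shard_iterable(iterable):
    """Yield only this worker's share of `iterable`."""
    if dist.is_available() and dist.is_initialized():
        start_index = dist.get_rank()       # offset by this worker's rank
        step_size = dist.get_world_size()   # take every Nth item
    else:
        start_index, step_size = 0, 1       # not distributed: keep everything
    return itertools.islice(iterable, start_index, None, step_size)
```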

epwalsh requested a review from dirkgr on July 13, 2020 at 20:15
```diff
@@ -45,6 +46,7 @@ def __init__(
         combine_input_fields: Optional[bool] = None,
         **kwargs,
     ) -> None:
+        kwargs["manual_distributed_sharding"] = True
         super().__init__(**kwargs)
```
dirkgr (Member) commented:

I think it would be more readable to call `super().__init__(manual_distributed_sharding=True, **kwargs)`.
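
That is, instead of mutating `kwargs`, the constructor above would read something like this (mirroring the diff, with `self` and the `Optional` import assumed from the surrounding file):

```python
def __init__(
    self,
    combine_input_fields: Optional[bool] = None,
    **kwargs,
) -> None:
    super().__init__(manual_distributed_sharding=True, **kwargs)
```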

```python
# SNLI uses gold_label "-" for examples with no annotator consensus;
# those are filtered out before sharding.
filtered_example_iter = (
    example for example in example_iter if example["gold_label"] != "-"
)
# start_index and step_size select this worker's share of the examples.
for example in itertools.islice(filtered_example_iter, start_index, None, step_size):
```
dirkgr (Member) commented:

You could do the `islice` before the `json.loads`, so that it doesn't have to parse the JSON for lines that get discarded anyway. The problem is that the filtering would then have to happen afterwards, which would mean different workers get different numbers of instances. This is not ideal, but not the end of the world.

I'm just leaving this here as a suggestion.
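
A sketch of what that reordering might look like, assuming each raw line holds one JSON-encoded example (names here are illustrative, not the reader's actual code):

```python
import itertools
import json


def read_sharded(lines, start_index, step_size):
    # Hypothetical reordering: take this worker's share of the raw
    # lines first, and only then pay the cost of json.loads.
    for line in itertools.islice(lines, start_index, None, step_size):
        example = json.loads(line)
        # The gold-label filter now runs after sharding, so workers
        # may end up with slightly different numbers of instances.
        if example["gold_label"] != "-":
            yield example
```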

epwalsh (Author) replied:

I had it that way initially, but it didn't have a significant impact on speed.

epwalsh merged commit 4b2178b into master on Jul 14, 2020
epwalsh deleted the snli-manual-sharding branch on July 14, 2020 at 17:21