This repository has been archived by the owner on Dec 16, 2022. It is now read-only.

implement manual distributed sharding for SNLI reader #89

Merged

epwalsh merged 7 commits into master from snli-manual-sharding on Jul 14, 2020

Conversation

epwalsh (Member) commented on Jul 8, 2020

This gives a huge speed-up to SNLI/MNLI dataset reading in distributed training.
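
Roughly, "manual distributed sharding" means the reader slices the example stream by worker rank itself, up front, rather than having every worker read and tokenize the whole file and then discard the other workers' instances. A minimal sketch of the idea, assuming `torch.distributed` is the source of rank and world size (the `start_index`/`step_size` names match the diff below; the helper itself is illustrative, not the reader's exact code):

```python
import itertools

import torch.distributed as dist


def shard_iterable(iterable):
    """Yield only this worker's share of `iterable`."""
    if dist.is_available() and dist.is_initialized():
        start_index = dist.get_rank()       # offset by this worker's rank
        step_size = dist.get_world_size()   # take every Nth item
    else:
        start_index, step_size = 0, 1       # not distributed: keep everything
    return itertools.islice(iterable, start_index, None, step_size)
```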

epwalsh requested a review from dirkgr on July 13, 2020 at 20:15
```diff
@@ -45,6 +46,7 @@ def __init__(
         combine_input_fields: Optional[bool] = None,
         **kwargs,
     ) -> None:
+        kwargs["manual_distributed_sharding"] = True
         super().__init__(**kwargs)
```
dirkgr (Member) commented:

I think it would be more readable to call `super().__init__(manual_distributed_sharding=True, **kwargs)`.
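
That is, instead of mutating `kwargs`, the constructor above would read something like this (mirroring the diff, with `self` and the `Optional` import assumed from the surrounding file):

```python
def __init__(
    self,
    combine_input_fields: Optional[bool] = None,
    **kwargs,
) -> None:
    super().__init__(manual_distributed_sharding=True, **kwargs)
```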

```python
# SNLI uses gold_label "-" for examples with no annotator consensus;
# those are filtered out before sharding.
filtered_example_iter = (
    example for example in example_iter if example["gold_label"] != "-"
)
# start_index and step_size select this worker's share of the examples.
for example in itertools.islice(filtered_example_iter, start_index, None, step_size):
```
dirkgr (Member) commented:

You could do the `islice` before the `json.loads`, so that it doesn't have to parse the JSON for lines that get discarded anyway. The problem is that the filtering would then have to happen afterwards, which would mean different workers get different numbers of instances. This is not ideal, but not the end of the world.

I'm just leaving this here as a suggestion.
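
A sketch of what that reordering might look like, assuming each raw line holds one JSON-encoded example (names here are illustrative, not the reader's actual code):

```python
import itertools
import json


def read_sharded(lines, start_index, step_size):
    # Hypothetical reordering: take this worker's share of the raw
    # lines first, and only then pay the cost of json.loads.
    for line in itertools.islice(lines, start_index, None, step_size):
        example = json.loads(line)
        # The gold-label filter now runs after sharding, so workers
        # may end up with slightly different numbers of instances.
        if example["gold_label"] != "-":
            yield example
```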

epwalsh (Author) replied:

I had it that way initially, but it didn't have a significant impact on speed.

epwalsh merged commit 4b2178b into master on Jul 14, 2020
epwalsh deleted the snli-manual-sharding branch on July 14, 2020 at 17:21