
[k8s jobsets] add startup policy. #2063

Merged: 1 commit, Sep 26, 2024
6 changes: 5 additions & 1 deletion metaflow/plugins/kubernetes/kubernetes_decorator.py
@@ -543,7 +543,11 @@ def _save_package_once(cls, flow_datastore, package):

# TODO: Unify this method with the multi-node setup in @batch
def _setup_multinode_environment():
# FIXME: what about MF_MASTER_PORT
# TODO [FIXME SOON]
# Kubernetes may deploy the control pod before the worker pods, but there is
# still a chance that a worker pod starts before the control pod. When that
# happens, the worker pod cannot resolve the control pod's IP address and
# fails. This function should account for that race in the near future.
import socket

try:
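The race described in the TODO above could be mitigated with a bounded retry loop when resolving the control pod's hostname. A minimal sketch, not part of this PR; the retry count and delay are illustrative values:

```python
import socket
import time


def resolve_with_retry(host, retries=30, delay=2.0):
    """Resolve `host` to an IPv4 address, retrying while the control
    pod's DNS record may not yet exist."""
    last_err = None
    for _ in range(retries):
        try:
            return socket.gethostbyname(host)
        except socket.gaierror as err:
            last_err = err
            time.sleep(delay)
    raise RuntimeError("could not resolve %s" % host) from last_err
```

With the InOrder startup policy added elsewhere in this PR, the control pod should already be running by the time workers start, so the loop would normally succeed on the first attempt.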
8 changes: 7 additions & 1 deletion metaflow/plugins/kubernetes/kubernetes_jobsets.py
@@ -866,7 +866,13 @@ def dump(self):
spec=dict(
replicatedJobs=[self.control.dump(), self.worker.dump()],
suspend=False,
startupPolicy=None,
startupPolicy=dict(
    # Explicitly set an InOrder startup policy so that the control
    # pod is guaranteed to start before the worker pods. This ensures
    # the control pod's IP address is resolvable by the time the
    # worker pods try to reach it.
    startupPolicyOrder="InOrder"
),
successPolicy=None,
# The Failure Policy helps setting the number of retries for the jobset.
# but we don't rely on it and instead rely on either the local scheduler
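For reference, the JobSet spec fragment this change produces can be sketched in plain Python. Field names follow the JobSet API as shown in the diff; the function name and the job arguments are placeholders, not part of the PR:

```python
def jobset_spec(control_job, worker_job):
    # Minimal sketch of the spec built by dump(): the startupPolicy
    # block makes the replicated jobs start in list order, so the
    # control job (listed first) is running before any worker starts.
    return dict(
        replicatedJobs=[control_job, worker_job],
        suspend=False,
        startupPolicy=dict(startupPolicyOrder="InOrder"),
        successPolicy=None,
    )
```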