Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Valay/refactor on eddie's PR #3

Draft
wants to merge 5 commits into
base: outerbounds-multicloud
Choose a base branch
from

Conversation

valayDave
Copy link
Member

No description provided.

- Unfinished
- Ran black
- no notification system in place.
- re-orged stuff into a module
- Failure of workers triggers failure of control.
- Failure of control results in failure of workers.
- Added more comments.
- control tasks have heartbeat thread that write for a long enough period of time.
….deepspeed`

- ensures separation of concern and also helps when debugging user code failures as there are distinct .
self.environment = environment
self.flow_datastore = flow_datastore

self.is_batch = False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where is is_batch and is_k8s used?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is not. Will remove.


for deco in decos:
if deco.name in ["resources", "kubernetes", "batch"]:
compute_deco_attrs = compute_resource_attributes(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this function needed here? we are planning on removing this from metaflow OSS soon.

DEEPSPEED_SUFFIX = "mf.deepspeed_datastore"


HOSTFILE_IP_KEY = "hostfile_ips"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these are available in constants.py ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forgot. Can move.

TaskFailedException,
)

DEEPSPEED_SUFFIX = "mf.deepspeed_datastore"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor nit - just mf.deepspeed perhaps

class DeepspeedExecutor:
"""
Instances of the DeepspeedExecutor class are responsible for launching the Deepspeed command. There is one per Metaflow @step annotated with @deepspeed.
DeepspeedExecutor consumes a list of hosts aka nodes aka k8s pods aka metaflow task containers, and information about how many processes to run on each.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can be batch jobs too

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The executor is called within the user code. It represents current.deepspeed

- made the hello example work out of the box.
- refactor on fixes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants