Valay/refactor on eddie's PR #3

Draft
wants to merge 5 commits into
base: outerbounds-multicloud

Commits on Mar 4, 2024

  1. [deepspeed-deco] Refactoring code.

    - Unfinished
    - Ran black
    valayDave committed Mar 4, 2024
    6f1daa4

Commits on Mar 6, 2024

  1. [deepspeed-deco] Refactoring code finished.

    - no notification system in place.
    - re-orged stuff into a module
    valayDave committed Mar 6, 2024
    3e3304f
  2. [deepspeed-deco] heartbeat for monitoring success/failure.

    - Failure of workers triggers failure of control.
    - Failure of control results in failure of workers.
    - Added more comments.
    - control tasks have a heartbeat thread that writes for a long enough period of time.
    valayDave committed Mar 6, 2024
    05d6283
  3. [deepspeed-deco] known hosts written in setup_mpi instead of `current.deepspeed`
    
    - ensures separation of concerns and also helps when debugging user code failures as there are distinct .
    valayDave committed Mar 6, 2024
    b0cb53c
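
The heartbeat commit (05d6283) describes control tasks running a heartbeat thread so that a failure on one side can be detected by the other. The commit message does not show how the heartbeat is stored or checked, so the following is only a minimal sketch of the general pattern, assuming a file-based heartbeat. `HeartbeatWriter`, `is_alive`, the file path, and the interval and timeout values are all hypothetical names chosen for illustration and are not taken from this PR.

```python
import os
import tempfile
import threading
import time


class HeartbeatWriter:
    """Hypothetical helper: writes a timestamp to a file from a background thread."""

    def __init__(self, path, interval=0.05):
        self.path = path
        self.interval = interval
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _beat(self):
        # Write to a temp file and rename it, so readers never see a partial write.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            f.write(str(time.time()))
        os.replace(tmp, self.path)

    def _run(self):
        while not self._stop.is_set():
            self._beat()
            # wait() returns early once stop() is called.
            self._stop.wait(self.interval)

    def start(self):
        self._beat()  # first beat is synchronous so the file exists immediately
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()


def is_alive(path, timeout):
    """Return True if the last heartbeat in `path` is at most `timeout` seconds old."""
    try:
        with open(path) as f:
            last = float(f.read())
    except (FileNotFoundError, ValueError):
        return False
    return (time.time() - last) <= timeout


if __name__ == "__main__":
    hb_path = os.path.join(tempfile.mkdtemp(), "control.heartbeat")
    writer = HeartbeatWriter(hb_path, interval=0.05)
    writer.start()
    time.sleep(0.2)
    print("control alive:", is_alive(hb_path, timeout=1.0))
    writer.stop()
```

In this sketch, a worker would poll `is_alive` on the control task's heartbeat and exit once it goes stale; the reverse direction (worker failure failing the control task) would need a similar signal from each worker.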

Commits on Mar 7, 2024

  1. [deepspeed-deco] make it work without ssh restart and other changes

    - made the hello example work out of the box.
    - refactor on fixes.
    valayDave committed Mar 7, 2024
    ba74b72