Blog Post: Introducing JobSet #45759

Open. Wants to merge 9 commits into base: main

danielvegamyhre (Member) commented

We would like to publish a blog post introducing JobSet, a K8s native API for distributed ML training and HPC workloads.

cc @ahg-g @kannon92 I think we still need to align on one example and ideally make it more concrete and polished. We should also explain the user story above it.

@k8s-ci-robot k8s-ci-robot added area/blog Issues or PRs related to the Kubernetes Blog subproject cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. language/en Issues or PRs related to English language sig/docs Categorizes an issue or PR as relevant to SIG Docs. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 2, 2024

## Why JobSet?

The Kubernetes community’s recent enhancements to the batch ecosystem on Kubernetes attracted ML engineers who have found it to be a natural fit for the requirements of running distributed training workloads. Large ML models (particularly LLMs) which cannot fit into the memory of the GPU or TPU chips on a single host are often distributed across tens of thousands of accelerator chips, which in turn may span thousands of hosts.
Contributor

Please add some new lines. (I can only comment on this entire paragraph).

Contributor

With this, I try to follow a newline per sentence.

Member Author

Hmm, this paragraph is only 2 sentences. Do we really need/want each sentence to be its own paragraph? In the Kueue blog post they use paragraphs: https://kubernetes.io/blog/2022/10/04/introducing-kueue/

Contributor

I think it will be rendered the same. But if you have no new lines, people can only suggest the entire paragraph.

Contributor

Localization teams ask for Markdown where the source is wrapped at around 100 characters.

Please do that.

Member Author

Updated so column width is max of ~100 chars (except in long links)


**Authors**: Daniel Vega-Myhre (Google), Abdullah Gharaibeh (Google), Kevin Hannon (Red Hat)

In this article, we introduce [JobSet](https://jobset.sigs.k8s.io/), an open source API for representing distributed jobs. The goal of JobSet is to provide a stable API for running/building APIs with AI/ML and HPC use cases in mind.
Contributor

I'm unsure whether we should define HPC.

What do you mean by "running/building APIs"?

Member

I think HPC is a known term, but I also don't understand the running/building APIs part.

Comment on lines 152 to 157
    command:
    - bash
    - -xc
    - |
      sleep 10000
- name: workers
Contributor

Are there any tangible examples of this? I assume not, since you're using bash here, but I find it odd that this is so abstract.

Member Author

I agree, we should make the example more concrete.

Contributor

The examples I know of require TPUs/GCP and I am not sure if we can do that for a general k8s blog post. So we decided to do simple options that don't require accelerators or lock you into a specific vendor.

Member Author

I think we can probably come up with a GPU based example and simply use fake node labels like "cloud.provider.com/gpu-label" instead of a vendor specific one

Contributor

Maybe you can link to GKE related use cases as concrete examples?

If @haircommander's point is taken literally, we should also show concrete use cases.

Contributor

What do you think about that @sftim? Should we try to keep the post self contained?

kannon92 (Contributor) commented Apr 2, 2024

/cc @haircommander

Trying to find an impartial reviewer ;)

haircommander (Contributor) commented

I have one note but I found this informative while doing a good job of laying the groundwork needed.

LGTM (assuming the note is unaddressable)

sftim (Contributor) left a comment

Some partial feedback (not yet a full review)

Contributor

We prefer SVG:

  • easier to localize
  • scalable

Could you draw this as SVG instead?

Member Author

Replaced png diagram with svg version

sftim (Contributor) commented Apr 2, 2024

(if this is not yet ready for review by the blog team, please change the title to start with [WIP])

sftim (Contributor) commented Apr 2, 2024

/hold

pending assignment of publication date

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 2, 2024
@danielvegamyhre danielvegamyhre changed the title Blog Post: Introducing JobSet [WIP] Blog Post: Introducing JobSet Apr 2, 2024
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 2, 2024
danielvegamyhre (Member Author) commented

> (if this is not yet ready for review by the blog team, please change the title to start with [WIP])

Thanks Tim, added [WIP] to the title.


sftim (Contributor) commented Apr 3, 2024

I propose the 26th of April as publication date. Does that work?

sftim (Contributor) commented May 3, 2024

Let's pick a new publication date. How about the 7th of May?

sftim (Contributor) commented May 3, 2024

You should remove [WIP] from the PR title @danielvegamyhre if / when you think this is ready to be reviewed.

ahg-g (Member) commented May 31, 2024

/lgtm

Thanks @danielvegamyhre !

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 31, 2024
@danielvegamyhre danielvegamyhre changed the title [WIP] Blog Post: Introducing JobSet Blog Post: Introducing JobSet May 31, 2024
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 31, 2024
sftim (Contributor) left a comment

Publication date is now in the past

/lgtm cancel

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 1, 2024
@k8s-ci-robot k8s-ci-robot requested a review from ahg-g June 1, 2024 20:20
sftim (Contributor) commented Jun 23, 2024

LGTMs from #45759 (comment) and #45759 (comment) mean that all this needs is a new publication date.

Sorry about the wait; we have very few active blog reviewers.

I suggest aiming for the 17th of July; does that work?

accelerators, without resorting to scripting or Helm charts to generate many versions of the
same job but with different names.

![JobSet Architecture Diagram](jobset_diagram.svg "JobSet Architecture")
Contributor

nit: using a figure shortcode, with a caption, is even better. You can also set the diagram-medium class; I recommend doing that.

sftim (Contributor) left a comment

This article introduces an unregistered official annotation.

  • You / WG Batch should register it, especially if any in-project code already uses the annotation.
    (you do that by updating https://kubernetes.io/docs/reference/labels-annotations-taints/)
  • We can either register the annotation before publication, which is pretty straightforward, or we - blog editors - can add an editors' note that the annotation is not yet official.

I'm opposed to publicising unregistered annotations because if we do it once, people will keep asking for a similar exception.

TPU multislice training out-of-the-box with very little configuration required by the user.

```yaml
# Run a simple Jax workload on
```
Contributor

Did you mean the following?

Suggested change: replace `# Run a simple Jax workload on` with `# Run a simple Jax workload on Google Cloud`

  name: multislice
  annotations:
    # Give each child Job exclusive usage of a TPU slice
    alpha.jobset.sigs.k8s.io/exclusive-topology: cloud.google.com/gke-nodepool
Contributor

We should register (and, ideally, deprecate) this annotation before we publicize it.
The page to register it on is https://kubernetes.io/docs/reference/labels-annotations-taints/


It needs registering because although it uses a deprecated pattern (“alpha” in the key), it's inside an official Kubernetes namespace for annotations.

Contributor

If WG Batch do deprecate this annotation and switch to something different, that's a nice but separate improvement.

We still need the original one registered if any released code inside https:/kubernetes/ uses it, or previously used it.

Contributor

I don't know if we should register alpha annotations/labels. Sounds like drift waiting to happen.

I don't know if most sig projects register the labels/annotations on that website.

sftim (Contributor) commented Aug 5, 2024

Marking annotations as alpha or beta is deprecated; we should eventually just use the plain form (e.g. `jobset.kubernetes.io/exclusive-topology`). AIUI, if a SIG wants to mark something as experimental, and for that not to be registered, it needs to be outside kubernetes.io or k8s.io.

We should register this one, or pick something different and blog about the updated annotation.

Contributor

@danielvegamyhre WDYT? Could we have a post without this annotation for now? I know that changing these annotations could break users.

Contributor

@ahg-g if you're wondering how much work is involved in registering the annotation, it's a 10 line PR to update https://kubernetes.io/docs/reference/labels-annotations-taints/

Please consider that if writing >= 10 lines of objection.

Member

I am fine with registering it if that addresses your concern, I was just asking if we are allowed to register annotations that are not part of upstream k8s.

Contributor

Got it! The way I've done it, anything with an in-project origin can be registered (even if it shouldn't have been used; as soon as code ships that might use the annotation, we have to document it).

Contributor

Registered means “make a record that it's already in use”; it's not like an API review - deliberately!

Contributor

Please wrap the markdown at 100 or so (source) characters per line.

Once you've done that, I recommend squashing commits.

k8s-ci-robot (Contributor) commented

[APPROVALNOTIFIER] This PR is NOT APPROVED

Once this PR has been reviewed and has the lgtm label, please ask for approval from sftim. For more information see the Kubernetes Code Review Process.


sftim (Contributor) commented Jun 24, 2024

Content LGTMs noted, but publication date does need to change.

/lgtm cancel

sftim (Contributor) commented Aug 6, 2024

There's a long disagreement over whether it's OK to publish this, given it would tacitly bless using annotations the wrong way (the JobSet controller looks for annotations and labels that look Kubernetes official, but aren't yet on our list of known annotations).

I'm personally loth to publish this without settling that disagreement. Making an exception once has me thinking it could set a precedent we end up having to live with.

danielvegamyhre (Member Author) commented

> There's a long disagreement over whether it's OK to publish this, given it would tacitly bless using annotations the wrong way (the JobSet controller looks for annotations and labels that look Kubernetes official, but aren't yet on our list of known annotations).
>
> I'm personally loth to publish this without settling that disagreement. Making an exception once has me thinking it could set a precedent we end up having to live with.

I'm fine with registering the annotations before publishing

danielvegamyhre (Member Author) commented

Update: we are close to merging the PR registering JobSet labels/annotations (#47383). Once that is done we can move forward with publishing the blog post here.

Labels
area/blog Issues or PRs related to the Kubernetes Blog subproject cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. language/en Issues or PRs related to English language sig/docs Categorizes an issue or PR as relevant to SIG Docs. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
Status: Requires update
7 participants