Blog Post: Introducing JobSet #45759
## Why JobSet?

The Kubernetes community’s recent enhancements to the batch ecosystem on Kubernetes attracted ML engineers who have found it to be a natural fit for the requirements of running distributed training workloads. Large ML models (particularly LLMs) which cannot fit into the memory of the GPU or TPU chips on a single host are often distributed across tens of thousands of accelerator chips, which in turn may span thousands of hosts.
Please add some new lines. (I can only comment on this entire paragraph).
With this, I try to follow a newline per sentence.
Hmm, this paragraph is only 2 sentences. Do we really need/want each sentence to be its own paragraph? In the Kueue blog post they use paragraphs: https://kubernetes.io/blog/2022/10/04/introducing-kueue/
I think it will be rendered the same. But if you have no new lines, people can only suggest the entire paragraph.
Localization teams ask for Markdown where the source is wrapped at around 100 characters.
Please do that.
Updated so column width is max of ~100 chars (except in long links)
**Authors**: Daniel Vega-Myhre (Google), Abdullah Gharaibeh (Google), Kevin Hannon (Red Hat)

In this article, we introduce [JobSet](https://jobset.sigs.k8s.io/), an open source API for representing distributed jobs. The goal of JobSet is to provide a stable API for running/building APIs with AI/ML and HPC use cases in mind.
I'm not sure whether we should define HPC.
What do you mean by "running/building APIs"?
I think HPC is a known term, but I also don't understand the running/building APIs part.
…2-introducing-jobset.md Co-authored-by: Kevin Hannon <[email protected]>
        command:
        - bash
        - -xc
        - |
          sleep 10000
  - name: workers
Are there any tangible examples of this? I assume not, since you're using bash here, but I find it odd that this is so abstract.
I agree, we should make the example more concrete.
The examples I know of require TPUs/GCP and I am not sure if we can do that for a general k8s blog post. So we decided to do simple options that don't require accelerators or lock you into a specific vendor.
I think we can probably come up with a GPU based example and simply use fake node labels like "cloud.provider.com/gpu-label" instead of a vendor specific one
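A minimal sketch of what such a vendor-neutral GPU example might look like, for illustration only. It assumes the JobSet `v1alpha2` API; the JobSet name, container image, and command are hypothetical, and `cloud.provider.com/gpu-label` is the placeholder label suggested above, not a real node label. The `nvidia.com/gpu` resource assumes the NVIDIA device plugin is installed on the cluster.

```yaml
# Hypothetical sketch: a JobSet whose worker Pods each request one GPU.
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: gpu-example                # hypothetical name
spec:
  replicatedJobs:
  - name: workers
    replicas: 1                    # number of child Jobs created from this template
    template:                      # a standard batch/v1 Job template
      spec:
        parallelism: 2
        completions: 2
        backoffLimit: 0
        template:
          spec:
            restartPolicy: Never
            nodeSelector:
              # Placeholder label from the review thread, not a vendor-specific one
              cloud.provider.com/gpu-label: "true"
            containers:
            - name: trainer
              image: registry.example.com/trainer:latest   # placeholder image
              command: ["python", "train.py"]               # placeholder entrypoint
              resources:
                limits:
                  nvidia.com/gpu: 1   # assumes the NVIDIA device plugin
```

Readers would swap the placeholder label for whatever label their provider applies to GPU nodes, which keeps the example itself free of vendor lock-in.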
Maybe you can link to GKE-related use cases as concrete examples?
If we take @haircommander's point literally, we should also show concrete use cases.
What do you think about that @sftim? Should we try to keep the post self contained?
/cc @haircommander Trying to find an impartial reviewer ;)
I have one note, but I found this informative, and it does a good job of laying the groundwork needed. LGTM (assuming the note is unaddressable)
Some partial feedback (not yet a full review)
We prefer SVG:
- easier to localize
- scalable
Could you draw this as SVG instead?
Replaced png diagram with svg version
(if this is not yet ready for review by the blog team, please change the title to start with |
/hold pending assignment of publication date
Thanks Tim, added |
I propose the 26th of April as publication date. Does that work?
Let's pick a new publication date. How about the 7th of May?
You should remove |
/lgtm Thanks @danielvegamyhre!
Publication date is now in the past
/lgtm cancel
LGTMs from #45759 (comment) and #45759 (comment) mean that all this needs is a new publication date. Sorry about the wait; we have very few active blog reviewers. I suggest aiming for the 17th of July; does that work?
accelerators, without resorting to scripting or Helm charts to generate many versions of the
same job but with different names.

![JobSet Architecture Diagram](jobset_diagram.svg "JobSet Architecture")
nit: using a figure shortcode, with a caption, is even better. You can also set the diagram-medium class; I recommend doing that.
This article introduces an unregistered official annotation.
- You / WG Batch should register it, especially if any in-project code already uses the annotation
  (you do that by updating https://kubernetes.io/docs/reference/labels-annotations-taints/).
- We can either register the annotation before publication, which is pretty straightforward, or we - blog editors - can add an editors' note that the annotation is not yet official.
I'm opposed to publicising unregistered annotations because if we do it once, people will keep asking for a similar exception.
TPU multislice training out-of-the-box with very little configuration required by the user.

```yaml
# Run a simple Jax workload on
```
Did you mean "# Run a simple Jax workload on Google Cloud"?
  name: multislice
  annotations:
    # Give each child Job exclusive usage of a TPU slice
    alpha.jobset.sigs.k8s.io/exclusive-topology: cloud.google.com/gke-nodepool
We should register (and, ideally, deprecate) this annotation before we publicize it.
The page to register it on is https://kubernetes.io/docs/reference/labels-annotations-taints/
It needs registering because although it uses a deprecated pattern (“alpha” in the key), it's inside an official Kubernetes namespace for annotations.
If WG Batch do deprecate this annotation and switch to something different, that's a nice but separate improvement.
We still need the original one registered if any released code inside https:/kubernetes/ uses it, or previously used it.
I don't know if we should register alpha annotations/labels. Sounds like drift waiting to happen.
I don't know if most sig projects register the labels/annotations on that website.
Marking annotations as alpha or beta is deprecated; we should eventually just use the plain form (eg jobset.kubernetes.io/exclusive-topology). AIUI if a SIG wants to mark something as experimental, and for that not to be registered, it needs to be outside kubernetes.io or k8s.io.
We should register this one, or pick something different and blog about the updated annotation.
@danielvegamyhre WDYT? Could we have a post without this annotation for now? I know that changing these annotations could break users.
@ahg-g if you're wondering how much work is involved in registering the annotation, it's a 10 line PR to update https://kubernetes.io/docs/reference/labels-annotations-taints/
Please consider that if writing >= 10 lines of objection.
I am fine with registering it if that addresses your concern, I was just asking if we are allowed to register annotations that are not part of upstream k8s.
Got it! The way I've done it, anything with an in-project origin can be registered (even if it shouldn't have been used; as soon as code ships that might use the annotation, we have to document it).
Registered means “make a record that it's already in use”; it's not like an API review - deliberately!
Please wrap the markdown at 100 or so (source) characters per line.
Once you've done that, I recommend squashing commits.
Content LGTMs noted, but publication date does need to change. /lgtm cancel
There's a long disagreement over whether it's OK to publish this, given it would tacitly bless using annotations the wrong way (the JobSet controller looks for annotations and labels that look Kubernetes official, but aren't yet on our list of known annotations). I'm personally loth to publish this without settling that disagreement. Making an exception once has me thinking it could set a precedent we end up having to live with.
I'm fine with registering the annotations before publishing
Update: we are close to merging the PR registering JobSet labels/annotations (#47383). Once that is done we can move forward with publishing the blog post here.
We would like to publish a blog post introducing JobSet, a K8s native API for distributed ML training and HPC workloads.
cc @ahg-g @kannon92 I think we still need to align on one example and ideally make it more concrete and polished. We should also explain the user story above it.