This repository has been archived by the owner on Nov 16, 2023. It is now read-only.

user quota. fix https:/microsoft/pai/issues/5503 #41

Closed
wants to merge 2 commits into from



@siaimes siaimes commented May 11, 2022

fix microsoft/pai#5503

Signed-off-by: siaimes [email protected]

Member

abuccts commented May 12, 2022

Hi, as I understand the design in this PR, it will:

  1. configure scheduler with a quota API endpoint (another service or rest-server) and tokens etc.
  2. when a new job arrives, the scheduler will calculate the current user's used GPUs, request the quota from the configured API, then compare the two.
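The comparison in step 2 can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `job` type, `usedGPUs`, and `withinQuota` are hypothetical names, and counting the incoming job's GPUs against the quota is an assumption about how "compare" is meant.

```go
package main

import "fmt"

// job is a hypothetical, simplified view of a running or incoming job.
type job struct {
	User string
	GPUs int
}

// usedGPUs sums the GPUs held by one user's running jobs.
func usedGPUs(running []job, user string) int {
	total := 0
	for _, j := range running {
		if j.User == user {
			total += j.GPUs
		}
	}
	return total
}

// withinQuota reports whether admitting the incoming job keeps its user at
// or below the given quota (assumption: the request counts toward usage).
func withinQuota(running []job, incoming job, quota int) bool {
	return usedGPUs(running, incoming.User)+incoming.GPUs <= quota
}

func main() {
	running := []job{{"alice", 2}, {"bob", 4}, {"alice", 1}}
	// alice already holds 3 GPUs; her quota here is 4.
	fmt.Println(withinQuota(running, job{"alice", 1}, 4))
	fmt.Println(withinQuota(running, job{"alice", 2}, 4))
}
```

A job that fails this check would be left waiting rather than rejected, which is the behavior the PR is after.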

However, there are two issues:

  • it will make the scheduler and rest-server depend on each other.
  • the scheduler only has the concept of "VC", not "user", and the affinity group may not follow the "job~username" naming if the job is not submitted from rest-server.

Maybe it would be better and easier to implement this in rest-server directly if you're using OpenPAI: you can configure the quota and pre-check the user's used GPUs (by querying the database or the scheduler API) in rest-server before accepting the job.

Author

siaimes commented May 12, 2022

@abuccts thank you for your reply.

Some useful commits from this issue: microsoft/pai#5503

  1. interdependence issues:

a. This is a weak dependency. If the API has not been started yet, the scheduler will return and wait for the next scheduling round, so as long as the API eventually starts, scheduling will run normally.

	resp, err := (&http.Client{Timeout: 10 * time.Second}).Do(req)
	// Request failed
	if err != nil {
		return 0, "Get quota failed: " + err.Error(), true
	}

b. The API does not have to be provided by rest-server; any service will do, as long as the response body of api{username} contains "quota": N.

  2. naming problem:

a. The new feature does not affect existing systems, as long as they do not use this feature.

b. If third-party systems relying on hivedscheduler want to use this feature, they must change their naming to username~jobname.

c. I think that, as a base component, it is reasonable for hivedscheduler to add new features (a) by introducing new constraints (b) without affecting existing systems.
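To make constraint (b) concrete, here is a small sketch of how a user could be derived from an affinity group name. `splitGroupName` is a hypothetical helper, not a function from hivedscheduler, and it assumes the `username~jobname` order given in this comment (the earlier review comment writes the convention as "job~username").

```go
package main

import (
	"fmt"
	"strings"
)

// splitGroupName splits an affinity group name of the assumed form
// "username~jobname" at the first '~'. ok is false when the name does not
// follow the convention, in which case no user can be derived from it.
func splitGroupName(name string) (user, jobName string, ok bool) {
	i := strings.Index(name, "~")
	if i <= 0 || i == len(name)-1 {
		return "", "", false
	}
	return name[:i], name[i+1:], true
}

func main() {
	for _, name := range []string{"alice~train-resnet", "legacy-group"} {
		user, jobName, ok := splitGroupName(name)
		fmt.Printf("%q -> user=%q job=%q ok=%v\n", name, user, jobName, ok)
	}
}
```

Names that do not match would simply carry no user, so a system that ignores the convention would see no quota applied, consistent with point (a).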

  3. Implement this feature through rest-server:

a. There is an issue with using rest-server to implement this feature. It cannot put jobs that exceed the quota into the waiting queue; it can only prevent users from submitting more jobs, which reduces the flexibility of the cluster.

b. By design, the quota should be implemented by the scheduler instead of the rest-server.

Author

siaimes commented May 21, 2022

Hi @abuccts

I changed the implementation and now there is no interdependence, please review. Thanks.

Author

siaimes commented May 21, 2022

This must work with microsoft/pai#5777.

Author

siaimes commented May 21, 2022

Preview:

microsoft/pai#5503 (comment)

Fix GitHub Action config.
Author

siaimes commented Jun 1, 2022

I've had these two PRs merged into my production environment for over 10 days, and user feedback so far has all been positive.

@abuccts @yqwang-ms I would like to know whether there has been any discussion in your group about merging this PR. Is there any need for further optimization?

Thanks.

Member

abuccts commented Jun 6, 2022

Sorry for the late response due to quarantine.
For the user quota problem you want to solve: after going through the original issue, I think that in the current scheduler's design the virtual cluster is the unit for cell quota, and mapping one "user" to one VC should address this problem at the design level, regardless of the implementation issue of slow restarts, where each VC is configured as one stateful set for per-VC queuing to avoid cross-VC starvation.
Your current solution still introduces new concepts like "quota" and "user" which are only used for this feature, so it is more like a hack. A general solution could be to leverage the new k8s scheduling framework and implement a customized PreFilter. We had a plan to migrate to that framework but didn't have time for it.

Anyway, thanks for your contributions, but since OpenPAI is in maintenance mode and hasn't been actively accepting new features for a while, please feel free to keep the forked versions.

@siaimes siaimes closed this Jun 6, 2022
@siaimes siaimes deleted the quota branch December 13, 2022 11:36
Development

Successfully merging this pull request may close these issues.

Add SKU quota for each user respectively.
2 participants