
[RFE] QUADS Self-Scheduling Phase 1 via API #487

Open
sadsfae opened this issue Jun 14, 2024 · 14 comments

@sadsfae
Member

sadsfae commented Jun 14, 2024

This is the skeleton feature that will enable future self-scheduling via the UI here: #98

  • Design a flexible API scheduling framework using dedicated RBAC roles / token auth (perhaps we pre-define a set number of "slots" as users, the way we do with cloud environments)

  • Design a threshold activation mechanism based on both cloud capacity % and per-model % usage, e.g. a host level metadata attribute like ss_enabled: true/false

    • We can manage this with a cron-interval tool to check / set as needed like selfservice_manager.py
    • We can provide a global argparse override like --enable-ss true/false
  • Design what the tenant workflow might look like, e.g.

  • Query available self scheduling
  • If enabled, what models/systems?
  • Use API to obtain a "self scheduling" cloud/user
  • Perform set commands to define the cloud and add systems (schedule times are pre-set, e.g. starting now and lasting X days or a week)
  • Possibly include API mechanisms to auto-extend so long as thresholds are still met (a rough sketch of the threshold check follows below).
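
As a rough illustration only, here is a minimal sketch of what the cron-driven threshold check in a tool like selfservice_manager.py could look like. The function, attribute names and threshold values are assumptions for illustration, not actual QUADS internals.

```python
# Hypothetical sketch: flip a per-host "ss_enabled" metadata flag based on
# overall cloud capacity and per-model usage thresholds. Names and numbers
# here are illustrative assumptions, not real QUADS code.
CAPACITY_THRESHOLD = 0.70   # disable self-scheduling above 70% overall usage
MODEL_THRESHOLD = 0.80      # disable per model above 80% usage of that model

def update_self_service_flags(hosts, cloud_usage, model_usage, force=None):
    """Set ss_enabled on each host unless thresholds (or a global override) say otherwise."""
    for host in hosts:
        if force is not None:                        # global --enable-ss true/false override
            host.metadata["ss_enabled"] = force
            continue
        over_capacity = cloud_usage >= CAPACITY_THRESHOLD
        over_model = model_usage.get(host.model, 0.0) >= MODEL_THRESHOLD
        host.metadata["ss_enabled"] = not (over_capacity or over_model)
```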
@jtaleric
Contributor

This is awesome!

Just to give a high-level what we are looking to do with Quads with our baremetal CPT, here is a quick and dirty visual ;-)

[image: baremetal CPT diagram]

Our use-case for integration would be for leases <= 24 hours -- depending on the cluster size we are requesting. Basically these allocations will be ephemeral in nature. Not ideal for long investigations, but great for quick response to hardened performance automation.

Let us know if this fits the model you describe here! Thanks Will!

@josecastillolema

josecastillolema commented Jun 14, 2024

Thanks @sadsfae and @jtaleric ! I think this is a great step in the right direction to enable dynamic allocations in the scale lab.

After discussing with @jtaleric the way I have thought of the process is:

  • Step 1 - The request: This would be a new Prow step that would query the QUADS API for the leftovers (servers that are not statically assigned in today's dynamic inventory via the usual reservation process). The reservation request would look something like:

    • NUM_SERVERS: the number of servers being requested
    • SERVERS_LABELS: Just some examples: sriov_intel, sriov_mellanox, dell_640, disk_nvme, disk_ssd. I don't know if these labels exist today or will have to be implemented. (A rough sketch of such a request is shown after this list.)

    If the servers are available:

    • interact with the QUADS API to create a new cloud with those servers: this would setup the networking, BMCs, VLANs, generate the corresponding OCPINV, etc.
    • interact with the QUADS API (or the foreman one, not sure) to deploy the bastion server
  • Step 2 - Deploy OCP corresponds to the current openshift-qe-installer-bm-deploy which we would need to improve by implementing a dynamic inventory

  • Step 3 - Run benchmarks, no planned changes here, this is already implemented

  • Step 4 - Return the machines, a new Prow step to cleanup, delete the cloud, etc.
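
Purely as an illustration of what the Step 1 request might send and check, here is a minimal sketch under assumed names: the endpoint path, filter parameter and response shape are guesses, while NUM_SERVERS / SERVERS_LABELS are the parameters proposed above.

```python
# Illustrative only: roughly what the Prow "request" step could do.
# The endpoint path and filter semantics are assumptions, not the real QUADS API.
import requests

QUADS_API = "http://quads.example.com/api/v3"     # placeholder URL

request_spec = {
    "NUM_SERVERS": 6,
    "SERVERS_LABELS": ["sriov_mellanox", "disk_nvme"],
}

# Query for currently unassigned ("leftover") servers matching the labels.
resp = requests.get(f"{QUADS_API}/available",
                    params={"filter": ",".join(request_spec["SERVERS_LABELS"])})
resp.raise_for_status()
candidates = resp.json()

if len(candidates) >= request_spec["NUM_SERVERS"]:
    print("enough servers free:", candidates[: request_spec["NUM_SERVERS"]])
else:
    print("not enough servers available, skipping dynamic allocation")
```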

@sadsfae
Member Author

sadsfae commented Jun 28, 2024

Great feedback so far, keep it coming. We have a work-in-progress design document that's mostly done but needs internal review before we can share it; I gave @jtaleric and @radez a preview.

I think what you'd care about most is the tenant API request workflow. Ideally it would look something like this, broken down into one GET request to obtain eligible systems and three API POSTs to get your own workload delivered (a rough sketch in code follows after the list).

You'd receive JSON returns for each of the API requests below that can be fed into any automation you have; note the switch to using the generated API token for the later, authenticated calls.

1) Query available, eligible systems (OPEN)

  • API GET to return list of systems eligible for self-scheduling (if self-scheduling is unavailable, return as such)

2) Request assignment (OPEN)

  • API POST to Request a new self-service assignment
    • Sends: Workload description, tenant kerberos name e.g. dradez (mapped to email), optionally a public VLAN requirement (we would auto-allocate a free one) and Q-in-Q design
    • Receives: Permanent unique ID for their user (creates ID if it doesn’t exist)
    • Receives: Newly generated API auth token (for this request only)
    • Receives: JIRA ticket URL for the new JIRA issue (--cloud-ticket for this request)

3) Acquire assignment (Authenticated)

  • API POST to Acquire an assignment
    • Sends: Generated token, JIRA ticket number and permanent unique ID
    • Receives: JSON return with cloud number, ticket, description

4) Deploy Assignment (Authenticated)

  • Curl API POST to Deploy their assignment

    • Sends: Token, unique ID and list of hosts they want
    • Receives: JSON status return on success/failure
  • After step 4 you can also query the open public API to view your list of machines

  • After step 4 you can also poll the release status of the cloud environment and have your automation gate on that before proceeding.
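
To make the sequence above concrete, here is a minimal sketch in code. Every endpoint path and field name below is a placeholder guessed from the description; the actual QUADS 2.x self-scheduling API is still a work in progress.

```python
# Rough sketch of the four-call tenant workflow described above. All paths and
# field names are assumptions, not the finalized QUADS API.
import requests

API = "http://quads.example.com/api/v3"          # placeholder base URL

# 1) Query eligible systems (open, unauthenticated)
eligible = requests.get(f"{API}/self-schedule/available").json()

# 2) Request a new self-service assignment (open)
req = requests.post(f"{API}/self-schedule/request", json={
    "description": "bm CPT ephemeral cluster",
    "owner": "dradez",                           # kerberos name, mapped to email
    "vlan": None,                                # optionally request a public VLAN
}).json()
token, ticket, user_id = req["token"], req["ticket"], req["user_id"]

auth = {"Authorization": f"Bearer {token}"}

# 3) Acquire the assignment (authenticated with the generated token)
cloud = requests.post(f"{API}/self-schedule/acquire", headers=auth,
                      json={"ticket": ticket, "user_id": user_id}).json()

# 4) Deploy it with the hosts picked from step 1 (max 10)
status = requests.post(f"{API}/self-schedule/deploy", headers=auth,
                       json={"user_id": user_id,
                             "hosts": [h["name"] for h in eligible[:10]]}).json()
print(cloud, status)
```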

More Details WIP

(excuse the screenshot, didn't convert to markdown)

[screenshot: additional WIP design details]

@sadsfae
Member Author

sadsfae commented Jun 28, 2024

Thanks @sadsfae and @jtaleric ! I think this is a great step in the right direction to enable dynamic allocations in the scale lab.

After discussing with @jtaleric the way I have thought of the process is:

  • Step 1 - The request: This would be a new Prow step that would query the QUADS API for the leftovers (servers that are not statically assigned in today's dynamic inventory via the usual reservation process). The reservation request would look something like:

    • NUM_SERVERS: the number of servers being requested
    • SERVERS_LABELS: Just some examples: sriov_intel, sriov_mellanox, dell_640, disk_nvme, disk_ssd. I don't know if these labels exist today or will have to be implemented.

This already exists in our metadata models and filters for --ls-available

https://github.com/redhat-performance/quads/blob/latest/docs/quads-host-metadata-search.md

This can also be expanded upon and import anything of value via lshw and our Python metadata import tool:

https://github.com/redhat-performance/quads/blob/22bb54a2bf989103532057486542eef52fe29d1a/src/quads/tools/lshw2meta.py

If the servers are available:

  • interact with the QUADS API to create a new cloud with those servers: this would setup the networking, BMCs, VLANs, generate the corresponding OCPINV, etc.
  • interact with the QUADS API (or the foreman one, not sure) to deploy the bastion server

You would get all the frills and features of a normal, deliberately scheduled future QUADS assignment, so all of that would be included. What you'd receive would not differ at all from what someone with an assignment scheduled by us receives; the only difference is that you can request it yourself with a few API calls (we need several because we do a lot on the backend, including talking to JIRA and Foreman).

  • Step 2 - Deploy OCP corresponds to the current openshift-qe-installer-bm-deploy which we would need to improve by implementing a dynamic inventory
  • Step 3 - Run benchmarks, no planned changes here, this is already implemented
  • Step 4 - Return the machines, a new Prow step to cleanup, delete the cloud, etc.

The world is your oyster, but we're just auto-delivering the hardware/networks here. Any additional hour-zero work, like deploying OCP and running your workloads, is in your purview only to action. You will be able to release the systems (or extend them) with the same set of APIs, though.

One thing to keep in mind is that our provisioning/release time is what it is; there are no significant speedups beyond our usage of asyncio (already in the codebase and current QUADS) and what having multiple, concurrent gunicorn threads/listeners provides. Bare-metal / IPMI / boot mechanics are just slow, prone to weird issues and often need hands-on work, so the only thing I'd add here is to keep your expectations reasonable: we're not going to have within-the-hour deployments 100% of the time. It may take a few hours to get your systems once the dust clears, longer if hands-on work is required to push them through validation like any normal QUADS assignment.

@josecastillolema

Thanks for the great explanation @sadsfae ,
some newbie questions about the scale lab internals:

  • When a new cloud is assigned and goes through validation, do all of its hosts get provisioned through Foreman? Is this needed?
    • Assuming the answer to the previous question is yes, and considering our automated deployment scenario where we only need the bastion node deployed (the other servers will be handled by the bastion installer), would it be possible to skip the provisioning of the rest of the nodes, or is the provisioning itself needed in order to do the validation?
  • Could the validation of the servers be done when the previous cloud assignment finishes instead of when the new cloud assignment is released, or is it dependent on the new assignment (i.e. VLANs, etc.)?

@sadsfae
Member Author

sadsfae commented Jul 1, 2024

Thanks for the great explanation @sadsfae , some newbie questions about the scale lab internals:

  • When a new cloud is assigned and goes through validation, do all of its hosts get provisioned through Foreman? Is this needed?

Yes, it is absolutely needed. We have no other way to ensure system data, settings, etc. are cleaned. More importantly, we need a way to physically test/validate network functionality. We have a series of fping tests and other validation testing to ensure the hardware is working 100%, that traffic passes on ports and that data and settings from previous tenants are removed. We have to deploy our own Foreman RHEL because it sets up templates for all of the VLAN interfaces with deliberate IP schemes to facilitate this testing.
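
For illustration only, the reachability part of that validation boils down to something like the sketch below; this is the general pattern (shelling out to fping against per-VLAN test IPs), not the actual QUADS validation code.

```python
# Minimal sketch of an fping-style reachability check, assuming fping is
# installed and each host exposes one test IP per VLAN interface.
import subprocess

def hosts_reachable(test_ips):
    """Return True only when fping reports every per-VLAN test IP as reachable."""
    # fping exits 0 only if all targets answered
    return subprocess.run(["fping", *test_ips], capture_output=True).returncode == 0
```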

  • Assuming the answer to the previous question is yes, and considering our automated deployment scenario where we only need the bastion node deployed (the other servers will be handled by the bastion installer), would it be possible to skip the provisioning of the rest of the nodes, or is the provisioning itself needed in order to do the validation?

Yes, but also no. There is no way to perform any validation at all without wiping the systems for a new tenant. We have an option called "no wipe" which skips all provisioning and even reboots; it simply performs the network VLAN and switch automation needed to "move" an assignment to a new cloud. We can allow this as an option via the API phases, but you would have to be 100% sure they were the same systems you used before and that nobody else has used them since. Otherwise you get stuck with effectively the same running OS/systems as whoever had those systems before you, and then you'd have to burn extra time provisioning them.

No-wipe is used more frequently in expansion scenarios, when we can't be sure there aren't conflicting broadcast services on the internal VLANs that would hijack kickstart (like DHCP/PXE from an internal tenant installer or service), or on occasion when we need to resurrect an expired assignment. It's really not designed for new assignments unless you can be sure of the systems' integrity, which I don't see being a tenable or reliable situation in a self-service pool of changing hardware.

We also have no way of knowing ahead of time which system you may pick for your bastion node before you get your systems, as availability is only based on what's free at the current time. The Foreman-deployed OS, while perfectly fine as a generic RHEL, is more a vehicle for us to validate hardware and network functionality, ensure clean baselines and catch any hands-on issues that occur with bare-metal, which happens more often than we'd like when you have thousands of systems. There is just no getting around this, and trying to skip it would cause a lot more headache. An environment doesn't get released until all systems pass our validation phases.

  • Could the validation of the servers be done when the previous cloud assignment finishes instead of when the new cloud assignment is released, or is it dependent on the new assignment (i.e. VLANs, etc.)?

When machines finish their allocation they roll directly to the next tenant anyway if they have an active schedule, but they still need to be wiped/validated/tested and pass our hardware/network validation and release gating; in other words, they immediately go into this process anyway if they have another place to be.

What you're asking about doesn't save a lot of time anyway. Kickstart via Foreman is a fairly fast part of the QUADS workflow (network automation is the speediest, taking 5-10 seconds or so per machine).

It takes around 5-8 minutes or less, not counting reboots, for a modern system to fully kickstart with our local SDN mirror, and it's all done in parallel via the asyncio-capable Foreman QUADS library. Even if we used image-based deployments it would still take a comparable amount of time, because disk images have to be written and reboots still have to happen, and it's more inflexible because we'd have to maintain many different flavors of images to account for hardware variety. In general this step is fundamental to ensuring we deliver and properly validate 100% functional systems to tenants.

@sadsfae sadsfae self-assigned this Jul 1, 2024
@jtaleric
Contributor

jtaleric commented Jul 1, 2024

@sadsfae ack! Thanks for the detailed responses!

My concern was that if we are trying to have a highly dynamic, transactional lease with QUADS, there would be an enormous amount of time spent on validation/provisioning. From the above description, I don't think we are talking hours, but maybe minutes depending on the size of the request.

Initially, I envision the request size around 25 nodes: 1 bastion, 3 workers, 3 infra, 18 workers -- this would be our "largest" deployment to start with in this POC. The lease would be 24-48 hours. We might want to have the ability to run 2-3 jobs in parallel, so my hope would be that the "dynamic pool" of machines is ~75, if we can accommodate that for a POC?

@sadsfae
Member Author

sadsfae commented Jul 1, 2024

@sadsfae ack! Thanks for the detailed responses!

My concern was that if we are trying to have a highly dynamic, transactional lease with QUADS, there would be an enormous amount of time spent on validation/provisioning. From the above description, I don't think we are talking hours, but maybe minutes depending on the size of the request.

Hey Joe, I would set your expectations to 45 minutes to an hour if nothing goes wrong, from the time you finish all the API POST(s) needed until you receive fully validated, fresh hardware/networks. That's a reasonable goal. The number of systems doesn't matter as much because almost everything is done in parallel, but of course one machine not passing validation holds up the rest because we demand 100% validation integrity. Most of the time this entire process goes off without a hitch unless tenants really trash the systems before you get them.

Initially, I envision the request size around 25 nodes: 1 bastion, 3 workers, 3 infra, 18 workers -- this would be our "largest" deployment to start with in this POC. The lease would be 24-48 hours. We might want to have the ability to run 2-3 jobs in parallel, so my hope would be that the "dynamic pool" of machines is ~75, if we can accommodate that for a POC?

You'll have whatever limits we need for testing, as I think you'll be our first "beta" adopters, but afterwards we will be setting per-user limits for multiple concurrent self-scheduled requests; we simply don't want one user occupying the whole self-schedule pool. We don't know what that limit is yet, but we'll figure out something that allows what you're trying to do while curbing hogging as well.

I just don't know what our dynamic pool will look like when this is ready; it depends on what's free at the time, because regardless of the priority of this feature we need to operate the R&D and product scale business first and foremost. Yes, I think 75+ to let it really grind is a great number to aim for, and very reachable looking at usage so far this summer.

One other note about delivery expectations: there are certainly areas where we can speed up the QUADS phases (move_and_rebuild and validate_environment), but our resources are aimed first at functionality, so we'd expect the same or slightly faster delivery than QUADS 1.1.x due to moving to independent gunicorn listeners and the benefits of nginx. We will need to pull out some profiling tools after this is working well to tighten things up where it makes sense as future RFEs.

One area we need to revisit is the Foreman library: while we are using asyncio, we do limit API activity via semaphores because we've overloaded Foreman in the past and never got back to digging into RoR and their Sinatra API too deeply. I think this needs revisiting, and there's likely tuning on the Foreman API side we need to do too, beyond what we do with mod_passenger. cc: @grafuls

https://quads.dev/2019/10/04/concurrent-provisioning-with-asyncio/
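
For readers unfamiliar with the pattern being described, here is a minimal, generic sketch of semaphore-limited asyncio calls; it illustrates the technique only and is not the actual QUADS Foreman library code (the concurrency limit is an arbitrary example).

```python
# Generic sketch: cap how many Foreman API calls run concurrently.
import asyncio

MAX_CONCURRENT_FOREMAN_CALLS = 10        # illustrative limit, tune to the Foreman side

async def foreman_call(host, sem):
    async with sem:                      # at most N requests in flight at once
        await asyncio.sleep(0.1)         # stand-in for the real Foreman HTTP call
        return f"{host}: ok"

async def rebuild_all(hosts):
    sem = asyncio.Semaphore(MAX_CONCURRENT_FOREMAN_CALLS)
    return await asyncio.gather(*(foreman_call(h, sem) for h in hosts))

# Example: asyncio.run(rebuild_all([f"host{i}.example.com" for i in range(40)]))
```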

Edit: I just checked a few 30+ node assignments that went out flawlessly in the last week and they were completely validated and released in around 40-45 minutes, so I think 1 hour is a good target, possibly two hours maximum assuming no hardware/switch issues requiring hands-on work.

@jtaleric
Contributor

jtaleric commented Jul 1, 2024

Super helpful to set expectations, thanks Will!

@sadsfae
Member Author

sadsfae commented Jul 2, 2024

Super helpful to set expectations, thanks Will!

Hey, for sure, no problem. Better to set a target we're pretty sure we can hit and then start profiling and looking to see where we can reduce the delivery time going forward. So much has changed design/architecture-wise that I don't want to be too aggressive on estimates. Once we have 2.0 running in production we'll have a better baseline; ideally I'd like to aim for 30 minutes or less as a start-to-end delivery goal.

@grafuls
Contributor

grafuls commented Aug 16, 2024

There has been some internal discussion on the design, and here's a first look at what the flow of the system would be.

[diagram: Self Scheduling Flow]

@jtaleric
Contributor

Thanks @grafuls -- so for our BM CPT we will focus on 2 paths I assume

  1. Pass our CloudID for our LTA -- which it will check if things are "free"
  2. "Find Free Cloud" -- which I assume we will send some concept of "machines with these characteristics, and number of nodes"

Thanks for throwing this together!

@sadsfae
Member Author

sadsfae commented Aug 19, 2024

Hey Joe, this is how it'll likely work.

Thanks @grafuls -- so for our BM CPT we will focus on 2 paths I assume

1. Pass our CloudID for our LTA -- which it will check if things are "free"

You'll first do an open GET request to obtain a JSON list of systems (max 10) and store it somewhere; you'll pass it along later as part of a bearer-auth authenticated POST payload once the first successful 201 response comes back with your temporary token, generated JIRA ticket number and the cloud that's auto-allocated.

We are using the concept of users like --cloud-owner, so you could for example also specify others as --cc-users here, because we need physical people as contacts for the requests. Free cloud slots are also chosen randomly here based on what is available. Roles are then associated for that token/user with the specific cloud environment slot for the lifetime of the schedules.

2. "Find Free Cloud" -- which I assume we will send some concept of "machines with these characteristics, and number of nodes"

The way we have it scoped now, you'll just pick from the actual physical systems (max of 10) that get returned. But we do have the capability to support filtering based on hardware, model or capability to curate this API response.

Thanks for throwing this together!

@grafuls
Contributor

grafuls commented Aug 19, 2024

  1. Pass our CloudID for our LTA -- which it will check if things are "free"

For self-scheduling we are expecting users not to pass any cloud, in which case QUADS, on the backend, will auto-select a cloud name that is available. We are also opening up the possibility of passing a cloud name (which should also be available) to override the auto-selection.
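
As a tiny illustration of that choice (field names are guesses, not the finalized API), the request body would differ only in whether a cloud is named:

```python
# Hypothetical request payloads: omit "cloud" to let QUADS auto-select a free
# cloud slot, or pass one (that must also be free) to override the selection.
auto_selected = {"description": "bm CPT run", "owner": "jtaleric"}
explicit_cloud = {"description": "bm CPT run", "owner": "jtaleric", "cloud": "cloud31"}
```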

2. "Find Free Cloud" -- which I assume we will send some concept of "machines with these characteristics, and number of nodes"

These are kind of separate concepts.
We have a set number of cloud names [cloud01..cloud99] which represent the naming of the environments. "Find free cloud" looks for cloud names that are not assigned to currently scheduled clouds.
In order to get a list of available servers with specific characteristics, you would use the /available/ endpoint, passing the filters as arguments on the request.
E.g.:

curl "http://quads2.example.com/api/v3/available?interfaces.vendor=Mellanox+Technologies&model=R640&disks.disk_type=nvme"
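
The same query can be issued from Python with requests, which handles the URL encoding of values like "Mellanox Technologies" (the host and parameters are simply the ones from the curl example above):

```python
# Same /available query via Python requests; params are URL-encoded automatically.
import requests

params = {
    "interfaces.vendor": "Mellanox Technologies",
    "model": "R640",
    "disks.disk_type": "nvme",
}
available = requests.get("http://quads2.example.com/api/v3/available", params=params)
print(available.json())
```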
