
Edge runs on Azure Pipeline 'queued' #21692

Closed

stephenmcgruer opened this issue Feb 10, 2020 · 8 comments

@stephenmcgruer
Contributor

Whilst investigating #21691, I noticed that epochs/daily and epochs/three_hourly have also been stuck since Friday. They all have pending jobs that are queued waiting for Edge {Stable, Dev}.

epochs/daily and example run

epochs/three_hourly and example run

Assigning to @mustjab, as this definitely seems to be an Azure Pipelines problem.

@foolip
Member

foolip commented Feb 10, 2020

https://dev.azure.com/web-platform-tests/_settings/agentpools shows 147 queued jobs:

It does look like there are 25 jobs running, but it'll presumably take a long time to catch up.

@stephenmcgruer
Contributor Author

That presumes they are making progress. Can you check the 25 jobs to see if any of them have non-queued Edge runs? (I don't have access to that link)
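For anyone else without access to that page, here is a rough sketch of an alternative check: the Azure DevOps Builds REST API can list builds by status, which gives a queued count without portal access. The organization and project names below come from the links in this thread; the personal access token and the exact filtering are assumptions for illustration, not something that was actually run here.

```python
# Hypothetical sketch: count wpt builds still waiting to start, via the
# Azure DevOps Builds REST API. The PAT is a placeholder with Build (read) scope.
import requests

ORG, PROJECT = "web-platform-tests", "wpt"   # taken from the links in this thread
PAT = "<personal-access-token>"              # placeholder

url = f"https://dev.azure.com/{ORG}/{PROJECT}/_apis/build/builds"
params = {"statusFilter": "notStarted", "api-version": "5.1"}
resp = requests.get(url, params=params, auth=("", PAT))  # PAT as basic-auth password
resp.raise_for_status()

queued = resp.json()["value"]
print(f"{len(queued)} builds are still waiting to start")
```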

@mustjab
Contributor

mustjab commented Feb 10, 2020

Looks like the failures started to happen on Friday morning with this job:

https://dev.azure.com/web-platform-tests/wpt/_build/results?buildId=41439&jobId=633549cb-1448-570f-8318-690805c54ad8&view=results

And after that there were no successful runs until our automated jobs rebuilt the VMs on Sunday. Runs are progressing now, so it will just take some time to work through the backlog. Since the VMs got rebuilt, I can't debug how they got into this state, but I will send mail to the Azure Pipelines folks to see if they have any additional data from the agent logs.

I'm seeing errors like this on the agents that were running that job: ##[error]We stopped hearing from agent w10c00000O. Verify the agent machine is running and has a healthy network connection. Anything that terminates an agent process, starves it for CPU, or blocks its network access can cause this error. For more information, see: https://go.microsoft.com/fwlink/?linkid=846610

@stephenmcgruer
Contributor Author

Thanks @mustjab! I believe they have now caught up, albeit the runs are currently failing due to #21691 (but that's on us to sort out). Closing this - please feel free to reopen if you think we can get any useful information about why it was hanging before the VM rebuild.

@LukeZielinski
Contributor

Reopening as I'm seeing jobs queuing up again (https://dev.azure.com/web-platform-tests/wpt/_build?definitionId=1&repositoryFilter=1&branchFilter=1091). Last successful run was 3 days ago (Sat. Feb 15).

Apologies if this isn't the same issue.

@LukeZielinski reopened this Feb 18, 2020
@mustjab
Contributor

mustjab commented Feb 18, 2020

The root cause of this issue is different, but the end result is the same. We use an Azure service principal to manage the VMs, and the password for that account has expired. I'm working on renewing it now and will kick off a job to re-generate the Windows VMs. I'll also add a few extra VMs to increase the number of runs we can process and hopefully clear the backlog faster.
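As an aside, a hypothetical way to confirm this failure mode from the outside is to try acquiring a management-plane token with the service principal's current secret; an expired secret surfaces as an authentication error. The identifiers below are placeholders, and this is not the tooling actually used for the wpt VMs.

```python
# Hypothetical sketch: check whether a service principal secret is still valid
# by requesting an ARM token with it. All values below are placeholders.
from azure.identity import ClientSecretCredential
from azure.core.exceptions import ClientAuthenticationError

cred = ClientSecretCredential(
    tenant_id="<tenant-id>",
    client_id="<service-principal-app-id>",
    client_secret="<current-secret>",
)
try:
    token = cred.get_token("https://management.azure.com/.default")
    print("secret is valid; token expires at", token.expires_on)
except ClientAuthenticationError as exc:
    print("auth failed (secret may have expired):", exc)
```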

@mustjab
Contributor

mustjab commented Feb 18, 2020

VMs are back up and running; let me know if you see any other issues.

@LukeZielinski
Contributor

Looking good the last few days; reclosing. Thanks!
