Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some agents aren't deallocated #74

Closed
tom-dudley opened this issue Nov 7, 2017 · 4 comments
Closed

Some agents aren't deallocated #74

tom-dudley opened this issue Nov 7, 2017 · 4 comments

Comments

@tom-dudley
Copy link

tom-dudley commented Nov 7, 2017

I have some agents which are shutting down, but not being deallocated.

This is a similar, if not the same, issue as #64 however the fix given there doesn't work in my case. We did originally have a different idle timeout to our current one, but the agents which are failing to deallocate where created after the timeout was last changed.

Here is the relevant server log, where you can see that tcagent3 has been found to be stopped. You can also see an agent which was successfully deallocated (tcagent2) gets the Stopping virtual machine tcagent2 message. There is not a Stopping virtual machine tcagent3 message.

[2017-11-04 01:30:39,254]   INFO [ey'{id=arm-1} 2] - base.tasks.UpdateInstancesTask - Updated instance 'tcagent3' status to Stopped based on API information 
[2017-11-04 01:30:39,254]   INFO [ey'{id=arm-1} 2] - uds.base.AbstractCloudInstance - Changing tcagent3(247c3648) status from Running to Stopped  
[2017-11-04 01:30:40,647]   INFO [uled executor 5] - .server.impl.CloudEventsLogger - Cloud instance entered 'stopped' state, profile 'Azure'{id=arm-1}, AzureCloudInstance{myName='tcagent4'} 
[2017-11-04 01:30:40,647]   INFO [uled executor 5] - .server.impl.CloudEventsLogger - Cloud agent Unregistered agent "tcagent4" {id=298} is no longer recognized as cloud agent because instance has gone: profileId=arm-1, imageId=tcagent, instanceId=tcagent4 
[2017-11-04 01:30:40,647]   INFO [uled executor 5] - ver.instances.AgentRemoverImpl - Removing cloud agent Unregistered agent "tcagent4" {id=298} because: Cloud instance (id=tcagent4, imageId=tcagent, profileId=arm-1) has gone (is not reported by cloud profile) 
[2017-11-04 01:30:40,663]   INFO [uled executor 5] - .server.impl.CloudEventsLogger - Cloud instance entered 'stopped' state, profile 'Azure'{id=arm-1}, AzureCloudInstance{myName='tcagent3'} 
[2017-11-04 01:30:40,665]   INFO [uled executor 5] - .server.impl.CloudEventsLogger - Cloud agent Unregistered agent "tcagent3" {id=299} is no longer recognized as cloud agent because instance has gone: profileId=arm-1, imageId=tcagent, instanceId=tcagent3 
[2017-11-04 01:30:40,666]   INFO [uled executor 5] - ver.instances.AgentRemoverImpl - Removing cloud agent Unregistered agent "tcagent3" {id=299} because: Cloud instance (id=tcagent3, imageId=tcagent, profileId=arm-1) has gone (is not reported by cloud profile) 
[2017-11-04 01:34:44,822]   INFO [uled executor 5] - ances.KillIdleRunningInstances - Terminating instance AzureCloudInstance{myName='tcagent2'}, profile 'Azure'{id=arm-1} with idle time 15 minutes (idle time is 15 minutes) 
[2017-11-04 01:34:44,834]   INFO [uled executor 5] - l.instances.StopInstanceAction - Terminating instance: AzureCloudInstance{myName='tcagent2'}, profile 'Azure'{id=arm-1}, reason: idle timeout of 15 minutes reached 
[2017-11-04 01:34:44,837]   INFO [uled executor 5] - r.impl.DBCloudStateManagerImpl - Image: AzureCloudImage{myName='tcagent'}, profile: profile 'Azure'{id=arm-1} was marked to CONTAIN agent 
[2017-11-04 01:34:44,840]   INFO [uled executor 5] - uds.base.AbstractCloudInstance - Changing tcagent2(47caa6e6) status from Running to Scheduled to stop  
[2017-11-04 01:34:44,840]   INFO [onPool-worker-1] - ouds.azure.arm.AzureCloudImage - Stopping virtual machine tcagent2 
[2017-11-04 01:34:45,787]   INFO [uled executor 1] - .server.impl.CloudEventsLogger - Cloud instance entered 'scheduled to stop' state, profile 'Azure'{id=arm-1}, AzureCloudInstance{myName='tcagent2'} 
[2017-11-04 01:34:52,681]   INFO [nio-80-exec-188] - ver.instances.AgentRemoverImpl - Removing cloud agent "tcagent2" {id=294} because: Instance has been terminated. AzureCloudInstance{myName='tcagent2'}, profile 'Azure'{id=arm-1} 
[2017-11-04 01:34:52,708]   INFO [nio-80-exec-188] - .server.impl.CloudEventsLogger - Cloud agent Unregistered agent "tcagent2" {id=294} is no longer recognized as cloud agent because instance has gone: profileId=arm-1, imageId=tcagent, instanceId=tcagent2 
[2017-11-04 01:34:52,709]   INFO [nio-80-exec-188] - ver.instances.AgentRemoverImpl - Removing cloud agent Unregistered agent "tcagent2" {id=294} because: Cloud instance (id=tcagent2, imageId=tcagent, profileId=arm-1) has gone (is not reported by cloud profile) 
[2017-11-04 01:37:15,210]   INFO [onPool-worker-1] - uds.base.AbstractCloudInstance - Changing tcagent2(47caa6e6) status from Scheduled to stop to Stopped  
[2017-11-04 01:37:15,210]   INFO [onPool-worker-1] - ouds.azure.arm.AzureCloudImage - Virtual machine tcagent2 has been successfully terminated 
[2017-11-04 01:37:15,826]   INFO [uled executor 2] - .server.impl.CloudEventsLogger - Cloud instance entered 'stopped' state, profile 'Azure'{id=arm-1}, AzureCloudInstance{myName='tcagent2'} 
[2017-11-04 02:44:45,752]   INFO [uled executor 4] - ances.KillIdleRunningInstances - Terminating instance AzureCloudInstance{myName='tcagent5'}, profile 'Azure'{id=arm-1} with idle time 15 minutes (idle time is 15 minutes) 
[2017-11-04 02:44:45,757]   INFO [uled executor 4] - l.instances.StopInstanceAction - Terminating instance: AzureCloudInstance{myName='tcagent5'}, profile 'Azure'{id=arm-1}, reason: idle timeout of 15 minutes reached 
[2017-11-04 02:44:45,759]   INFO [uled executor 4] - r.impl.DBCloudStateManagerImpl - Image: AzureCloudImage{myName='tcagent'}, profile: profile 'Azure'{id=arm-1} was marked to CONTAIN agent 
[2017-11-04 02:44:45,761]   INFO [uled executor 4] - uds.base.AbstractCloudInstance - Changing tcagent5(68ab4c2e) status from Running to Scheduled to stop 

Here is the agent log from tcagent3 where we can see it has decided to shutdown. This log is a bit odd since our timeout is currently 15 minutes.

[2017-11-04 01:24:33,067]   INFO - l.directories.DirectoryMapImpl - Cleaning up old checkout directories. Default lifetime = 192 hour(s) 
[2017-11-04 01:29:32,582]   WARN - rver.clouds.azure.IdleShutdown - Agent was idle for 25 minutes. Cloud profile timeout was 25 minutes. Instance will be shut down 
[2017-11-04 01:29:32,582]   WARN - rver.clouds.azure.IdleShutdown - Agent was idle for 25 minutes. Cloud profile timeout was 25 minutes. Instance will be shut down 
[2017-11-04 01:29:32,582]   INFO - rver.clouds.azure.IdleShutdown - To change this command define 'teamcity.agent.shutdown.command' property with proper shutdown command in the buildAgent.properties file 
[2017-11-04 01:29:32,582]   INFO - rver.clouds.azure.IdleShutdown - To change this command define 'teamcity.agent.shutdown.command' property with proper shutdown command in the buildAgent.properties file 
[2017-11-04 01:29:32,582]   INFO - rver.clouds.azure.IdleShutdown - Shutdown instance commands for windows server 2012 r2 
[2017-11-04 01:29:32,582]   INFO - rver.clouds.azure.IdleShutdown - Shutdown instance commands for windows server 2012 r2 
[2017-11-04 01:29:32,582]   INFO - rver.clouds.azure.IdleShutdown - Shutting down agent with command: shutdown -s -t 1 -c "TeamCity Agent Azure Instance shutdown on idle time" -f 
[2017-11-04 01:29:32,582]   INFO - rver.clouds.azure.IdleShutdown - Shutting down agent with command: shutdown -s -t 1 -c "TeamCity Agent Azure Instance shutdown on idle time" -f 
[2017-11-04 01:29:32,798]   INFO - rver.clouds.azure.IdleShutdown - Shutdown exit code:0 
[2017-11-04 01:29:32,798]   INFO - rver.clouds.azure.IdleShutdown - Shutdown call: 

 
[2017-11-04 01:29:32,798]   INFO - rver.clouds.azure.IdleShutdown - Shutting down agent with command: shutdown /s /t 1 /c "TeamCity Agent Azure Instance shutdown on idle time" /f 
[2017-11-04 01:29:32,798]   INFO - rver.clouds.azure.IdleShutdown - Shutdown exit code:1190 
[2017-11-04 01:29:32,798]   INFO - rver.clouds.azure.IdleShutdown - Shutdown call: 

A system shutdown has already been scheduled.(1190)
 
[2017-11-04 01:29:32,798]   INFO - rver.clouds.azure.IdleShutdown - Shutting down agent with command: shutdown /s /t 1 /c "TeamCity Agent Azure Instance shutdown on idle time" /f 
[2017-11-04 01:29:32,813]   INFO - rver.clouds.azure.IdleShutdown - Shutdown exit code:1190 
[2017-11-04 01:29:32,813]   INFO - rver.clouds.azure.IdleShutdown - Shutdown call: 

A system shutdown has already been scheduled.(1190)
 
[2017-11-04 01:29:32,813]   INFO - rver.clouds.azure.IdleShutdown - Shutdown exit code:1190 
[2017-11-04 01:29:32,813]   INFO - rver.clouds.azure.IdleShutdown - Shutdown call: 

A system shutdown has already been scheduled.(1190)
 
[2017-11-04 01:29:33,048]   INFO - r.artifacts.impl.HttpDiskCache - Cleaning up items with life time in cache greater than 172800 seconds 
[2017-11-04 01:29:33,048]   INFO - r.artifacts.impl.HttpDiskCache - Finished cleaning up expired items, 0 items removed 
[2017-11-04 01:29:33,048]   INFO - dDirectoryBasedCleanupRegistry - Removing files from C:\BuildAgent\temp\.old 
[2017-11-04 01:29:33,048]   INFO - dDirectoryBasedCleanupRegistry - Removing files from C:\BuildAgent\work\.old 
[2017-11-04 01:29:33,067]   INFO - l.directories.DirectoryMapImpl - Cleaning up old checkout directories. Default lifetime = 192 hour(s) 
[2017-11-04 01:29:35,152]   INFO -    jetbrains.buildServer.AGENT - Received stop command from launcher 
[2017-11-04 01:29:35,152]   INFO -    jetbrains.buildServer.AGENT - Stop command was not performed. No build to stop with reason: BuildInterruptReason.AGENT_SHUTDOWN 
[2017-11-04 01:29:35,152]   INFO -    jetbrains.buildServer.AGENT - Starting agent shutdown sequence, reason: Stop command called 
[2017-11-04 01:29:35,168]   INFO - agent.impl.AgentPortFileWriter - Delete agent runtime file from C:\BuildAgent\logs\buildAgent.xmlRpcPort 
[2017-11-04 01:29:35,168]   INFO -    jetbrains.buildServer.AGENT - Unregistering from the server. Agent id: 298 
[2017-11-04 01:29:35,230]   INFO -    jetbrains.buildServer.AGENT - Shutdown agent WebServer start 
[2017-11-04 01:29:35,261]   INFO -    jetbrains.buildServer.AGENT - Shutdown agent WebServer finish 
[2017-11-04 01:29:35,261]   INFO -    jetbrains.buildServer.AGENT - Shutdown agent finish 
[2017-11-04 01:29:35,261]   INFO -    jetbrains.buildServer.AGENT - Shutdown agent and exit with code DIE_USER_COMMAND 
[2017-11-04 01:29:35,261]   INFO - ent.impl.upgrade.AgentExitCode - Agent exited. Stop command called 
[2017-11-04 01:29:35,261]   INFO - buildServer.agent.AgentMain2$2 - Closing jetbrains.buildServer.agent.AgentMain2$2@2286778: startup date [Sat Nov 04 01:03:59 UTC 2017]; root of context hierarchy

Since the Stopping virtual machine tcagent3 message never appears in the server log, the terminateInstance method doesn't seem to be getting called for it, despite the agent deciding to shutdown. The shutdown of the agent VM causes it to be in the 'Stopped (incurring charges)' state but since the server-side plugin didn't enter terminateInstance (and then stopVmAsync), the deallocation never happens.

So the bug here seems to be that the agent is shutting down when the server-side plugin isn't in terminateInstance. I don't know why this is the case. Let me know if I can provide further logs which might be of use.

@dtretyakov
Copy link
Contributor

@tom-dudley, I see that you are using the ARM plugin to start these machines, but in the agent log from vm tcagent3 mentioned IdleShutdown which is not used in the latest plugin builds.

Could you please verify that you're using latest version of ARM plugin and check that you don't have installed Azure Classic plugin? You could do that on the Administration -> Plugins List page.

@tom-dudley
Copy link
Author

tom-dudley commented Dec 6, 2017

We have the following relevant plugins on our server:

  • Azure integration 0.2.51
  • Azure Support: Resource Manager integration SNAPSHOT-20170918103035

I assume the first plugin is the Classic plugin and we're fine to remove it since we only use ARM (I guess they're in conflict with one another)

@dtretyakov
Copy link
Contributor

@tom-dudley, yes, Azure integration is old name of Azure Classic plugin and it affects build agents started by ARM plugin.

@tom-dudley
Copy link
Author

Thanks @dtretyakov , feel free to close this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants