Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slow announcer for orchestrator #15

Open
tomgco opened this issue Feb 26, 2015 · 10 comments
Open

Slow announcer for orchestrator #15

tomgco opened this issue Feb 26, 2015 · 10 comments

Comments

@tomgco
Copy link
Member

tomgco commented Feb 26, 2015

$ fleetctl -strict-host-key-checking=false -endpoint=http://172.17.8.101:4001 journal paz-orchestrator-announce.service
Feb 26 17:38:59 core-01 sh[1195]: grep: HostIp:0.0.0.0: No such file or directory
Feb 26 17:39:00 core-01 sh[1195]: grep: HostIp:0.0.0.0: No such file or directory
Feb 26 17:39:01 core-01 sh[1195]: grep: HostIp:0.0.0.0: No such file or directory
Feb 26 17:39:02 core-01 sh[1195]: grep: HostIp:0.0.0.0: No such file or directory
Feb 26 17:39:03 core-01 sh[1195]: grep: HostIp:0.0.0.0: No such file or directory
Feb 26 17:39:04 core-01 sh[1195]: grep: HostIp:0.0.0.0: No such file or directory
Feb 26 17:39:05 core-01 sh[1195]: grep: HostIp:0.0.0.0: No such file or directory
Feb 26 17:39:06 core-01 sh[1195]: grep: HostIp:0.0.0.0: No such file or directory
Feb 26 17:39:07 core-01 sh[1195]: grep: HostIp:0.0.0.0: No such file or directory
Feb 26 17:39:08 core-01 sh[1195]: grep: HostIp:0.0.0.0: No such file or directory

However this timed out and then:

Feb 26 17:41:31 core-01 systemd[1]: paz-orchestrator-announce.service start-pre operation timed out. Terminating.
Feb 26 17:41:31 core-01 systemd[1]: Failed to start paz-orchestrator announce.
Feb 26 17:41:31 core-01 systemd[1]: Unit paz-orchestrator-announce.service entered failed state.
Feb 26 17:41:31 core-01 systemd[1]: paz-orchestrator-announce.service failed.
Feb 26 17:41:31 core-01 systemd[1]: paz-orchestrator-announce.service holdoff time over, scheduling restart.
Feb 26 17:41:31 core-01 systemd[1]: Stopping paz-orchestrator announce...
Feb 26 17:41:31 core-01 systemd[1]: Starting paz-orchestrator announce...
Feb 26 17:41:31 core-01 sh[26833]: Waiting for 49153/tcp...
Feb 26 17:41:31 core-01 systemd[1]: Started paz-orchestrator announce.
Feb 26 17:41:31 core-01 sh[26857]: Connected to 172.17.8.101:49153/tcp and 172.17.8.101:49154, publishing to etcd..

However the service seemed to be up and running:

$ fleetctl -strict-host-key-checking=false -endpoint=http://172.17.8.101:4001 journal paz-orchestrator.service
-- Logs begin at Thu 2015-02-26 16:40:27 UTC, end at Thu 2015-02-26 17:44:04 UTC. --
Feb 26 17:05:35 core-01 systemd[1]: Started paz-orchestrator: Main API for all paz services and monitor of services in etcd..
Feb 26 17:05:37 core-01 bash[16821]: {}
Feb 26 17:05:37 core-01 bash[16821]: { disabled: 'true',
Feb 26 17:05:37 core-01 bash[16821]: provider: 'dnsimple',
Feb 26 17:05:37 core-01 bash[16821]: email: '[email protected]',
Feb 26 17:05:37 core-01 bash[16821]: apiKey: '312487532487',
Feb 26 17:05:37 core-01 bash[16821]: domain: 'paz' }
Feb 26 17:05:37 core-01 bash[16821]: {"name":"paz-orchestrator_log","hostname":"1add37e0c392","pid":9,"level":30,"msg":"Starting server","time":"2015-02-26T17:05:37.887Z","src":{"file":"/usr/src/app/server.js","line":205},"v":0}
Feb 26 17:05:37 core-01 bash[16821]: {"name":"paz-orchestrator_log","hostname":"1add37e0c392","pid":9,"level":30,"msg":"paz-orchestrator is now running on port 9000","time":"2015-02-26T17:05:37.921Z","src":{"file":"/usr/src/app/server.js","line":194},"v":0}
Feb 26 17:05:37 core-01 bash[16821]: {"name":"paz-orchestrator_log","hostname":"1add37e0c392","pid":9,"level":30,"svcdir-url":"http://paz-service-directory.paz","msg":"","time":"2015-02-26T17:05:37.921Z","src":{"file":"/usr/src/app/server.js","line":195},"v":0}
@tomgco
Copy link
Member Author

tomgco commented Feb 26, 2015

Here is the start of the announcer:

Feb 26 16:41:31 core-01 systemd[1]: Starting paz-orchestrator announce...
Feb 26 17:05:36 core-01 sh[1195]: Waiting for 49153 HostIp:0.0.0.0/tcp...
Feb 26 17:05:36 core-01 sh[1195]: grep: HostIp:0.0.0.0: No such file or directory

So it means that the following code is causing the above error message: me!

  until netstat -lnt | grep :$port >/dev/null; \
      do sleep 1; \
  done"

@tomgco
Copy link
Member Author

tomgco commented Feb 26, 2015

So all I can find is that we have a discrepancy when trying to sed for the bound ports when using docker inspect.

How it should look:

Waiting for 49153/tcp...

What it is currently:

Waiting for 49153 HostIp:0.0.0.0/tcp...

My guess is something funky is going on with $port / a race condition in the bootstrapping of the container. https:/yldio/paz/blob/d53997470d5263a2334a16b3835380a8d849dd22/unitfiles/1/paz-orchestrator-announce.service#L19-L21

  port=$(docker inspect -f '{{ index .NetworkSettings.Ports \"9000/tcp\"}}' paz-orchestrator \
    | sed 's/.*Port://' \
    | sed 's/].+*//'); \

Maybe @sublimino might be able to offer some input into robustifying the shell code to bail out if this condition occurs.

@lukebond
Copy link
Contributor

@tomgco thanks for all the details! i'll be taking a look at this on the weekend.

@lukebond
Copy link
Contributor

lukebond commented Mar 7, 2015

I've seen this before when units fail to start up correctly, probably, as you identify @tomgco, when things don't start in the right order. This is an issue in and of itself, but one I haven't any insight into atm.

In the meantime, if this bash code can be made more robust then perhaps we may not see it again.

Got any ideas, @sublimino?

@sublimino
Copy link
Contributor

I think adding dependencies to the units and leveraging systemd's native
ordering is the first port of call, and I'll have a glance at the BASH
scripts when I'm at a computer tomorrow.
On 7 Mar 2015 19:39, "Luke Bond" [email protected] wrote:

I've seen this before when units fail to start up correctly, probably, as
you identify @tomgco https:/tomgco, when things don't start
in the right order. This is an issue in and of itself, but one I haven't
any insight into atm.

In the meantime, if this bash code can be made more robust then perhaps we
may not see it again.

Got any ideas, @sublimino https:/sublimino?


Reply to this email directly or view it on GitHub
#15 (comment).

@lukebond
Copy link
Contributor

lukebond commented Mar 7, 2015

👍

@lukebond
Copy link
Contributor

above @sublimino is referring to #30 btw

@tomgco
Copy link
Member Author

tomgco commented May 17, 2015

This cropped up again when provisioning a machine on Digital Ocean on paz-orchestrator-announce.service, maybe something is missing from the unit file?

@tomgco
Copy link
Member Author

tomgco commented May 17, 2015

A fleetctl stop and start on paz-orchestrator-announce.service fixed this.

@lukebond
Copy link
Contributor

Yeah I've seen this I've seen this quite a bit this weekend :/

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants