Telegraf stops sending metrics when an input plugin hangs #771

Closed
PierreF wants to merge 4 commits

@PierreF PierreF commented Feb 29, 2016

This PR improves the timeout handling of input plugins so that a single hanging plugin no longer blocks all inputs.

For example, with Nginx, if the server does not respond at all (because a firewall DROPs the packets or the host has just been shut down), zero metrics are written for about 2 minutes, then one batch is written, followed by another 2 minutes with nothing, and so on.

To reproduce:

$ telegraf -sample-config -input-filter nginx:cpu -output-filter influxdb > test.conf
$ sed -i 's@http://localhost/status@http://1.2.3.4/status@' test.conf    # point to a non-responding IP address.
$ telegraf -config test.conf
2016/02/29 18:07:08 Starting Telegraf (version 0.10.4.1)
2016/02/29 18:07:08 Loaded outputs: influxdb
2016/02/29 18:07:08 Loaded inputs: cpu nginx
2016/02/29 18:07:08 Tags enabled: host=ubuntu
2016/02/29 18:07:08 Agent Config: Interval:10s, Debug:false, Quiet:false, Hostname:"ubuntu", Flush Interval:10s 
2016/02/29 18:07:20 Wrote 0 metrics to output influxdb in 761.636µs
2016/02/29 18:07:30 Wrote 0 metrics to output influxdb in 554.27µs
2016/02/29 18:07:40 Wrote 0 metrics to output influxdb in 438.092µs
2016/02/29 18:07:50 Wrote 0 metrics to output influxdb in 468.076µs
2016/02/29 18:08:00 Wrote 0 metrics to output influxdb in 469.61µs
2016/02/29 18:08:10 Wrote 0 metrics to output influxdb in 432.655µs
2016/02/29 18:08:20 Wrote 0 metrics to output influxdb in 406.972µs
2016/02/29 18:08:30 Wrote 0 metrics to output influxdb in 820.321µs
2016/02/29 18:08:40 Wrote 0 metrics to output influxdb in 461.684µs
2016/02/29 18:08:50 Wrote 0 metrics to output influxdb in 482.844µs
2016/02/29 18:09:00 Wrote 0 metrics to output influxdb in 472.513µs
2016/02/29 18:09:10 Wrote 0 metrics to output influxdb in 472.036µs
2016/02/29 18:09:17 Error in input [nginx]: error making HTTP request to http://1.2.3.4/status: Get http://1.2.3.4/status: dial tcp 1.2.3.4:80: getsockopt: connection timed out
2016/02/29 18:09:17 Gathered metrics, (10s interval), from 2 inputs in 2m7.259746986s
2016/02/29 18:09:20 Wrote 5 metrics to output influxdb in 4.064824ms
2016/02/29 18:09:30 Wrote 0 metrics to output influxdb in 649.197µs
[...]

With this PR applied:

$ ./build/linux/amd64/telegraf -config test.conf 
2016/02/29 18:10:07 Starting Telegraf (version 0.10.4.1-9-g322871d)
2016/02/29 18:10:07 Loaded outputs: influxdb
2016/02/29 18:10:07 Loaded inputs: cpu nginx
2016/02/29 18:10:07 Tags enabled: host=ubuntu
2016/02/29 18:10:07 Agent Config: Interval:10s, Debug:false, Quiet:false, Hostname:"ubuntu", Flush Interval:10s 
2016/02/29 18:10:14 Error in input [nginx]: error making HTTP request to http://1.2.3.4/status: Get http://1.2.3.4/status: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2016/02/29 18:10:14 Gathered metrics, (10s interval), from 2 inputs in 4.006359088s
2016/02/29 18:10:20 Wrote 5 metrics to output influxdb in 4.016221ms
2016/02/29 18:10:24 Error in input [nginx]: error making HTTP request to http://1.2.3.4/status: Get http://1.2.3.4/status: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2016/02/29 18:10:24 Gathered metrics, (10s interval), from 2 inputs in 4.000246823s
2016/02/29 18:10:30 Wrote 5 metrics to output influxdb in 3.470205ms

This can happen in a real situation with Docker. Right after a docker stop of a monitored Nginx container, Telegraf takes ~2 minutes to fail with a connection timeout on each run of the nginx input plugin.

I've tested that the following inputs work (both in the normal case and after a docker stop): memcached, mysql, zookeeper, redis, nginx, apache, dovecot

I've changed 3 kinds of inputs:

  • TCP using net.Dial: changed to net.DialTimeout, followed by conn.SetDeadline
  • net/http: added Timeout to http.Client (for the connection timeout) and ensured that ResponseHeaderTimeout is set on http.Transport
  • mysql: modified the DSN to add a timeout parameter. I don't know whether Telegraf should modify the DSN itself or only strongly suggest that users use a DSN with a timeout.

sparrc commented Feb 29, 2016

wow, this looks excellent, thanks @PierreF, looks like there are just a few timeout parameters that you need to adjust for unit tests

PierreF commented Feb 29, 2016

Test fixed :)

@sparrc sparrc closed this in fe43fb4 Mar 1, 2016
geodimm pushed a commit to miketonks/telegraf that referenced this pull request Mar 10, 2016
@PierreF PierreF deleted the timeout branch August 4, 2018 13:20