added result_code field #5043

LTKH · 2018-11-27T06:19:20Z

Hello,
added the result_code field to the procstate input plugin, similar to the http_response and net_response plugins.
result_code (int, success = 0, not_running = 1, error_getting_process = 2)
Please consider the request.

danielnelson · 2018-11-27T22:18:54Z

You can get this information from the procstat_lookup measurement. The not running and the error case are generally the same case, the error occurs when the process exits after it has been found but before the metrics are gathered for it.

The reason we used a separate measurement is because it is possible to search for multiple processes using a regular expression, and since we don't know the pid or how many metrics will be created we can't always add a new value to the series.

LTKH · 2018-11-28T02:46:52Z

Hello. Yes, we tested procstat_lookup. The problem is that even if the application is not running, but the pid file exists (application crashed), then pid_count! = 0. What makes it impossible to use it. And in our company, monitoring is mainly organized precisely by the pid-file.

danielnelson · 2018-11-28T03:11:00Z

Perhaps you can leverage the fact that the procstat_loop is non-zero but there is no corresponding procstat entry to detect this? Generally the best way to alert on this plugin is by configuring a deadman.

LTKH · 2018-11-28T03:19:23Z

We tried to use deadmean. faced another problem. If the application is no longer in use (removed from the monitoring agent), the deadmean continues to send alerts. This creates serious problems when monitoring thousands of applications.

danielnelson · 2018-11-28T03:41:31Z

Wouldn't you have the same problem with this? If the process is removed the alert must be removed as well.

LTKH · 2018-11-28T03:54:22Z

If you are not using the deadmean, you can determine that the application has been removed.
result_code:
0 - success
1 - does not work
2 - getting process error
no data - application deleted (and wrong agent setting based on this)

The list of applications often changes on the servers.
Unfortunately, other options do not give the desired effect, I thought it would be useful for everyone. I really didn’t want to continue using my own build.

danielnelson · 2018-11-29T00:01:21Z

Okay, thanks for helping me understand. This change has been proposed in the paste unfortunately it doesn't work well with some configurations. For example:

[[inputs.procstat]]
  pattern = "influxd"
  pid_tag = true

This normally creates something like:

> procstat_lookup,pattern=influxd pid_count=1i
> procstat,pattern=influxd,pid=7948,process_name=influxd,user=influxdb cpu_time_guest=0

But if the process isn't found, we are unable to set most of the tags and of course none of the fields, which would leave us with something like:

> procstat_lookup,pattern=influxd pid_count=1i
> procstat,pattern=influxd result_code=2i

It gets even more complicated and strange when you use a pattern that matches multiple processes.

Perhaps instead we could add more info to the procstat_lookup measurement, to cover the cases where the pid is found but the process is not. What if we added a running field with the number of processes that were actually found?

> procstat_lookup,pattern=influxd pid_count=1i,running=0i

LTKH · 2018-11-29T11:59:42Z

Hello. I agree, it looks quite difficult. Although ... I can not imagine a situation where the query is based on the pid number.
Usually we use additional tags (path to the pid file or the pid number can change).

[[inputs.procstat]]
  pattern = "service"

  [inputs.procstat.tags]
    application = "app_name"

The running option is a good idea, but it does not make a difference between a stopped and fallen application.
How we use it (system administrators do not rush to start the service if the service is stopped correctly, maybe there is no problem).

Move result_code to the procstat_lookup - it may be correct.
What do you think about it?
P.S. Sorry for my English

danielnelson · 2018-11-29T21:37:02Z

The running option is a good idea, but it does not make a difference between a stopped and fallen application.

We can't actually detect any distinction between these generally in the procstat plugin, but it sounds like you have been defining it as when a pidfile exists but the process is not found. With the idea above it would be if the two fields, pid_count and running, are not equal.

LTKH · 2018-11-29T22:27:32Z

I understood. Yes, this is a good option.

added result_code field

5fc3d97

glinton approved these changes Nov 27, 2018

View reviewed changes

glinton added feat Improvement on an existing feature such as adding a new setting/mode to an existing plugin area/procstat labels Nov 27, 2018

danielnelson closed this Nov 27, 2018

danielnelson mentioned this pull request Nov 30, 2018

Add running field to procstat_lookup #5069

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

added result_code field #5043

added result_code field #5043

LTKH commented Nov 27, 2018

danielnelson commented Nov 27, 2018

LTKH commented Nov 28, 2018

danielnelson commented Nov 28, 2018

LTKH commented Nov 28, 2018

danielnelson commented Nov 28, 2018

LTKH commented Nov 28, 2018 •

edited

Loading

danielnelson commented Nov 29, 2018

LTKH commented Nov 29, 2018

danielnelson commented Nov 29, 2018

LTKH commented Nov 29, 2018

added result_code field #5043

added result_code field #5043

Conversation

LTKH commented Nov 27, 2018

danielnelson commented Nov 27, 2018

LTKH commented Nov 28, 2018

danielnelson commented Nov 28, 2018

LTKH commented Nov 28, 2018

danielnelson commented Nov 28, 2018

LTKH commented Nov 28, 2018 • edited Loading

danielnelson commented Nov 29, 2018

LTKH commented Nov 29, 2018

danielnelson commented Nov 29, 2018

LTKH commented Nov 29, 2018

LTKH commented Nov 28, 2018 •

edited

Loading