Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

added result_code field #5043

Closed
wants to merge 1 commit into from
Closed

added result_code field #5043

wants to merge 1 commit into from

Conversation

LTKH
Copy link

@LTKH LTKH commented Nov 27, 2018

Hello,
added the result_code field to the procstate input plugin, similar to the http_response and net_response plugins.
result_code (int, success = 0, not_running = 1, error_getting_process = 2)
Please consider the request.

@glinton glinton added feat Improvement on an existing feature such as adding a new setting/mode to an existing plugin area/procstat labels Nov 27, 2018
@danielnelson
Copy link
Contributor

You can get this information from the procstat_lookup measurement. The not running and the error case are generally the same case, the error occurs when the process exits after it has been found but before the metrics are gathered for it.

The reason we used a separate measurement is because it is possible to search for multiple processes using a regular expression, and since we don't know the pid or how many metrics will be created we can't always add a new value to the series.

@LTKH
Copy link
Author

LTKH commented Nov 28, 2018

Hello. Yes, we tested procstat_lookup. The problem is that even if the application is not running, but the pid file exists (application crashed), then pid_count! = 0. What makes it impossible to use it. And in our company, monitoring is mainly organized precisely by the pid-file.

@danielnelson
Copy link
Contributor

Perhaps you can leverage the fact that the procstat_loop is non-zero but there is no corresponding procstat entry to detect this? Generally the best way to alert on this plugin is by configuring a deadman.

@LTKH
Copy link
Author

LTKH commented Nov 28, 2018

We tried to use deadmean. faced another problem. If the application is no longer in use (removed from the monitoring agent), the deadmean continues to send alerts. This creates serious problems when monitoring thousands of applications.

@danielnelson
Copy link
Contributor

Wouldn't you have the same problem with this? If the process is removed the alert must be removed as well.

@LTKH
Copy link
Author

LTKH commented Nov 28, 2018

If you are not using the deadmean, you can determine that the application has been removed.
result_code:
0 - success
1 - does not work
2 - getting process error
no data - application deleted (and wrong agent setting based on this)

The list of applications often changes on the servers.
Unfortunately, other options do not give the desired effect, I thought it would be useful for everyone. I really didn’t want to continue using my own build.

@danielnelson
Copy link
Contributor

Okay, thanks for helping me understand. This change has been proposed in the paste unfortunately it doesn't work well with some configurations. For example:

[[inputs.procstat]]
  pattern = "influxd"
  pid_tag = true

This normally creates something like:

> procstat_lookup,pattern=influxd pid_count=1i
> procstat,pattern=influxd,pid=7948,process_name=influxd,user=influxdb cpu_time_guest=0

But if the process isn't found, we are unable to set most of the tags and of course none of the fields, which would leave us with something like:

> procstat_lookup,pattern=influxd pid_count=1i
> procstat,pattern=influxd result_code=2i

It gets even more complicated and strange when you use a pattern that matches multiple processes.

Perhaps instead we could add more info to the procstat_lookup measurement, to cover the cases where the pid is found but the process is not. What if we added a running field with the number of processes that were actually found?

> procstat_lookup,pattern=influxd pid_count=1i,running=0i

@LTKH
Copy link
Author

LTKH commented Nov 29, 2018

Hello. I agree, it looks quite difficult. Although ... I can not imagine a situation where the query is based on the pid number.
Usually we use additional tags (path to the pid file or the pid number can change).

[[inputs.procstat]]
  pattern = "service"

  [inputs.procstat.tags]
    application = "app_name"

The running option is a good idea, but it does not make a difference between a stopped and fallen application.
How we use it (system administrators do not rush to start the service if the service is stopped correctly, maybe there is no problem).
2018-11-29 21 36 34

Move result_code to the procstat_lookup - it may be correct.
What do you think about it?
P.S. Sorry for my English

@danielnelson
Copy link
Contributor

The running option is a good idea, but it does not make a difference between a stopped and fallen application.

We can't actually detect any distinction between these generally in the procstat plugin, but it sounds like you have been defining it as when a pidfile exists but the process is not found. With the idea above it would be if the two fields, pid_count and running, are not equal.

@LTKH
Copy link
Author

LTKH commented Nov 29, 2018

I understood. Yes, this is a good option.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/procstat feat Improvement on an existing feature such as adding a new setting/mode to an existing plugin
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants