Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

1.7.0-rc1 PerfMon % Processor Time discrepancy #4250

Closed
matthenning opened this issue Jun 7, 2018 · 14 comments · Fixed by #4267
Closed

1.7.0-rc1 PerfMon % Processor Time discrepancy #4250

matthenning opened this issue Jun 7, 2018 · 14 comments · Fixed by #4267
Labels
area/windows Related to windows plugins (win_eventlog, win_perf_counters, win_services) bug unexpected problem or unintended behavior
Milestone

Comments

@matthenning
Copy link

Relevant telegraf.conf:

[global_tags]

[agent]
  interval = "60s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "5s"
  flush_jitter = "0s"
  precision = ""
  debug = false
  quiet = false
  logfile = ""
  hostname = "***"
  omit_hostname = false

[[outputs.influxdb]]
  urls = ["https://***:8086"]
  database = "telegraf"
  retention_policy = ""
  write_consistency = "any"
  timeout = "5s"
  username = "***"
  password = "***"

[[inputs.win_perf_counters]]
  [[inputs.win_perf_counters.object]]
    ObjectName = "Processor"
    Instances = ["*"]
    Counters = ["% Idle Time", "% Interrupt Time", "% Privileged Time", "% User Time", "% Processor Time"]
    Measurement = "win_cpu"
    IncludeTotal=true

System info:

Telegraf: 1.7.0-rc1
Windows: Server 2012 R2

Steps to reproduce:

Compare % Process Time in Grafana (or the graphing tool of your choice) and perfmon

Expected behavior:

The collected values are identical to the actual values on the system

Actual behavior:

During the time span where CPU Load was at 100% in the following screenshot perfmon reported a steady load of 82-88% of the same exact counter, so 12-15% less than the collected value.
grafik

@danielnelson
Copy link
Contributor

Thank you for the help testing. I'm not sure what would cause the difference here, would it be possible to upload an image of your perfmon output against the raw data in influxdb using the influx command and running select Percent_Processor_time from win_cpu where time > $starttime and time < $endtime with the appropriate time bounds.

@danielnelson danielnelson added bug unexpected problem or unintended behavior area/windows Related to windows plugins (win_eventlog, win_perf_counters, win_services) labels Jun 7, 2018
@matthenning
Copy link
Author

For this test we reduced the interval from 60s to 10s:

Query output (all times without +02:00):

> SELECT "Percent_Processor_Time" FROM "win_cpu" WHERE "host" = '***' AND "instance" = '_Total' AND time > NOW()-2m
name: win_cpu
time                 Percent_Processor_Time
----                 ----------------------
2018-06-08T07:24:01Z 60.84709548950195
2018-06-08T07:24:25Z 77.2288589477539
2018-06-08T07:24:33Z 53.76494598388672
2018-06-08T07:24:41Z 49.0409049987793
2018-06-08T07:24:53Z 39.62447738647461
2018-06-08T07:25:04Z 52.0313606262207
2018-06-08T07:25:15Z 55.34902572631836
2018-06-08T07:25:25Z 64.83545684814453
2018-06-08T07:25:31Z 52.79405975341797
2018-06-08T07:25:43Z 65.84320068359375

PerfMon output with the original data in red and the influx query output approximately added in green:
grafik

@matthenning
Copy link
Author

This is a multi processor system in case that's important.

@danielnelson
Copy link
Contributor

I wonder if this could be because we are using the Telegraf time instead of the counter time. @vlastahajek We should bring in the UsePerfCounterTime feature from #4076.

@matthenning
Copy link
Author

I'm available for testing of required

@vlastahajek
Copy link
Contributor

When using Telegraf 1.6, are the results ok?

@matthenning
Copy link
Author

matthenning commented Jun 11, 2018

In 1.6 there was a different issue with this counter: #4198
So I'm not sure whether the values we'd receive from 1.6 would be viable for comparison.

vlastahajek added a commit to bonitoo-io/telegraf that referenced this issue Jun 11, 2018
@danielnelson danielnelson added this to the 1.8.0 milestone Jun 30, 2018
@matthenning
Copy link
Author

Thank you for your effort folks.

@danielnelson
Copy link
Contributor

@matthenning Would it be possible for you to test out the fix with a nightly builds to see if it helps with the accuracy? The change was a little larger than I would normally backport for a patch release but if it solves the issue and is well tested we can consider it.

@matthenning
Copy link
Author

Will do. I'll report back.

@matthenning
Copy link
Author

I fear the problem isn't solved yet.

Perfmon:
grafik

Grafana:
grafik

Query:

SELECT 
	last("Percent_Processor_Time") 
FROM 
	"win_cpu" 
WHERE (
	"host" = '***' 
	AND "instance" = '_Total') 
	AND $timeFilter 
GROUP BY 
	time($__interval)
fill(linear)

Relevant Telegraf config:

[global_tags]

[agent]
  interval = "60s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "5s"
  flush_jitter = "0s"
  precision = ""
  debug = false
  quiet = false
  logfile = ""
  hostname = "***"
  omit_hostname = false

[[outputs.influxdb]]
  urls = ["***"]
  database = "telegraf"
  retention_policy = ""
  write_consistency = "any"
  timeout = "5s"
  username = "***"
  password = "***"
   

[[inputs.win_perf_counters]]
  UseWildcardsExpansion = true
  UsePerfCounterTime = true
  
  [[inputs.win_perf_counters.object]]
    ObjectName = "Processor"
    Instances = ["*"]
    Counters = ["% Idle Time", "% Interrupt Time", "% Privileged Time", "% User Time", "% Processor Time"]
    Measurement = "win_cpu"
    IncludeTotal=true

Telegraf v1.8.0~abfbf4f4 (git: master abfbf4f)

@vlastahajek
Copy link
Contributor

It seems that Perfmon uses different way of gathering data. On my laptop it even offers counters that are not available by Telegraf or Typeperf (MS command line tool for reading performance counters). E.g. \Battery Status

I did test over shorter period of time with 2s interval:

If I try Typeperf, which uses the same mechanism of accessing counters as Telegraf, it returns very similar data (it's hard to precisely synchronize points of measurements) :

image

Same measurement in perfmon:
image

To show even different values, see CPU activity progress from Taskmgr measured over the same period:
image

I will try longer test tomorrow.

@matthenning
Copy link
Author

Taskmgr values almost match your typeperf data.
Maybe perfmon is the problem and there is nothing to fix here?

@vlastahajek
Copy link
Contributor

The main difference is that taskmgr shows even different scale of values. The values there reach 100%, which wasn't noted by any other method of measurement.

I didn't investigated the science behind CPU activity measurement, so it is hard to say what is wrong and what is right. For sure we can say that Telegraf provides good results with respect to the way how it collects data from performance counters.

Is there any other metrics you are interested in we can look at and compare?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/windows Related to windows plugins (win_eventlog, win_perf_counters, win_services) bug unexpected problem or unintended behavior
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants