HostMonitor method in system tests does not detect logs correctly #4526

pro-akim · 2023-09-19T12:49:52Z

Running the tests on different EC2 instances in #4525 revealed that the HostMonitor method in the system tests is unstable. The instability grows as the performance of the EC2 instance decreases.

In basic_cluster, it produces false negatives on T2 and T3 xlarge EC2 instances (the problem disappears on C5.xlarge). In agentless_cluster, it consistently produces a false negative even on instances as large as C5.xlarge.

Since this instability persists in the agentless_cluster with a false negative, further investigation is required.

According to the analysis of test results, the log that HostMonitor is searching for is present but is not being detected by the method.


pro-akim · 2023-09-19T15:58:03Z

The problem

The HostMonitor has several parts that could be vulnerable when running on a low-performance computer because it operates as follows:

When a host is performing its operations, HostMonitor takes its logs and stores them line by line in a temporary file (file_composer). This process is done in a loop using the host_manager.get_file_content function and is performed without a predefined frequency or waiting time.

Once the log is stored, HostMonitor checks if the desired messages exist in that temporary file. This task has a timeout defined by the test executor in message.yml and is configurable for each message. After that, it generates a dictionary with successes and failures, and if there is a failure, it produces a timeout error message.

There is also a time_step, which represents a pause between different steps, and it is hardcoded in the HostMonitor function.
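The flow described above can be sketched roughly as follows. This is a minimal illustration, not the actual HostMonitor code: `wait_for_messages` and `fetch_log` are hypothetical names (`fetch_log` stands in for host_manager.get_file_content), and the composer and the checker are collapsed into a single loop for brevity.

```python
import re
import time
from typing import Callable, Dict


def wait_for_messages(
    fetch_log: Callable[[], str],  # hypothetical stand-in for host_manager.get_file_content
    patterns: Dict[str, float],    # regex -> timeout in seconds, one entry per expected message
    time_step: float = 0.5,        # pause between checks (hardcoded in the code described above)
) -> Dict[str, bool]:
    """Return, for each pattern, whether it appeared before its own timeout."""
    results = {}
    for pattern, timeout in patterns.items():
        deadline = time.monotonic() + timeout
        found = False
        while True:
            composed = fetch_log()  # on a slow host, each fetch eats into the timeout
            if re.search(pattern, composed):
                found = True
                break
            if time.monotonic() >= deadline:
                break  # rigid timeout: reported as a failure even if the log arrives later
            time.sleep(time_step)
        results[pattern] = found
    return results
```

With this shape, a message that is written to the host log but fetched too slowly is indistinguishable from a message that was never logged.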

Vulnerabilities:

It has a rigid timeout. This timeout is set in message.yml and is determined for each message that needs to be waited for during the test execution.
If the get_file_content function takes too long or fails to retrieve the result, it will not be recorded in the temporary file, leading to incorrect validation.

Test with agentless_cluster_env

Changing the timeout values in test_integrity_sync/data/messages.yml:

From	To
120	180
60	120
100	180

Results of agentless_cluster_env 🟢

The test passed on an EC2 Ubuntu 22.04.3 t3.large instance with a 15 GB disk, where it used to fail.

Before changes 🔴
timeout_before_changes.zip

After changes 🟢
timeout_after_changes.zip

This means that when the EC2 instance is short on resources, increasing the log exposure time (timeout) can reduce the likelihood of false negatives without raising the system requirements.

Time analysis - Cost analysis

Results of agentless_cluster_env

EC2 Ubuntu 22.04.3 LTS t3.large 15GB HD 🟢

Changing test_integrity_sync/data/messages.yml

From	To
120	180
60	120
100	180

report_agentless_cluster.html.zip

Execution time: 2627.19 seconds (Around 48 mins)

EC2 Ubuntu 22.04.3 LTS C5n.2Xlarge 15GB HD 🟢

report_agentlessC5n.zip

Execution time: 1909.50 seconds (Around 32 mins)

Adding the installation time, which can take around 15 mins, gives the following totals:

Environment	Total consumed time	$ per hour	$ per execution	Additional notes
T3.large	63 min	$0.083	$0.08715	It requires timeout fixing only once
C5n.2Xlarge	45 min	$0.8	$0.6	

Modifying the timeouts is therefore clearly more cost-effective for agentless_cluster, even if the test fails and has to be re-run.
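The per-execution figures follow from the hourly price multiplied by the total time in hours. A small sketch of that arithmetic, using the totals and prices from the table (the function name is only illustrative):

```python
def cost_per_execution(total_minutes: float, usd_per_hour: float) -> float:
    """Cost of one run: hourly price times the run length in hours."""
    return usd_per_hour * total_minutes / 60


t3 = cost_per_execution(63, 0.083)   # T3.large with extended timeouts
c5n = cost_per_execution(45, 0.8)    # C5n.2Xlarge with original timeouts

print(f"T3.large: ${t3:.5f}  C5n.2Xlarge: ${c5n:.2f}  ratio: {c5n / t3:.1f}x")
```

Even two full runs on T3.large cost less than a single run on C5n.2Xlarge, which is why a repeated execution does not change the conclusion.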

Test with basic_environment_env

Test with basic_environment_env time_out_extended 🔴

Changing test_agent_auth/data/message.yml

From	To
30	240

Changing test_enrollment/data/message.yml

From	To
60	120

basic_environment_timeout_extended.zip

Test with basic_environment_env reducing time_step in FileTailer and adding a sleep after get_file_content in HostMonitor.file_composer 🔴

basic_environment_reduced_timecheck.zip

pro-akim · 2023-09-20T14:57:21Z

Conclusion

In the failures obtained, the log is generated but is not successfully stored in the temporary file. This means that file_composer needs refactoring (dynamic waits and control points) to become independent of the machine/EC2 performance.

After discussing with the team, it is understood that refactoring HostMonitor is not a priority, as it is due to be deprecated.
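As an illustration of what "dynamic waits and control points" could mean here, the sketch below retries a log fetch with a growing delay and accepts a snapshot only if it is at least as large as the previous one. It is an assumption about one possible design, not the planned refactor: `fetch_with_control_point` and `fetch_log` are hypothetical names, and the size check assumes the log is not rotated or truncated during the test.

```python
import time
from typing import Callable, Optional


def fetch_with_control_point(
    fetch_log: Callable[[], Optional[str]],  # hypothetical stand-in for host_manager.get_file_content
    previous_size: int,                      # length of the last snapshot stored in the temporary file
    max_attempts: int = 5,
    base_delay: float = 0.1,
) -> str:
    """Retry the fetch until it returns content no shorter than the last snapshot."""
    delay = base_delay
    for _ in range(max_attempts):
        try:
            content = fetch_log()
        except OSError:
            content = None  # treat a failed retrieval as "try again"
        # Control point: an empty or shrunken log means the fetch was not reliable.
        if content is not None and len(content) >= previous_size:
            return content
        time.sleep(delay)
        delay *= 2  # dynamic wait: back off further on a slow host
    raise TimeoutError(f"log not retrieved after {max_attempts} attempts")
```

The caller would then append only validated snapshots to the temporary file, so a slow or failed fetch delays the check instead of silently dropping lines.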

All tests can be run in a stable environment after the investigation, so we are closing this issue, incorporating the timeout modifications found in agentless_cluster.

pro-akim added level/task Task issue type/bug labels Sep 19, 2023

pro-akim mentioned this issue Sep 19, 2023

Basic_environment_env and agentless_env tests are not stable executed in EC2 #4525

Closed

pro-akim self-assigned this Sep 19, 2023

pro-akim added a commit that referenced this issue Sep 20, 2023

fix(#4526): Agentless timeout changed

9ff0e01

pro-akim mentioned this issue Sep 20, 2023

Agentless_cluster_env system tests timeout changed in order to reduce EC2 requirements #4534

Merged

pro-akim linked a pull request Sep 20, 2023 that will close this issue

Agentless_cluster_env system tests timeout changed in order to reduce EC2 requirements #4534

Merged

pro-akim mentioned this issue Sep 20, 2023

System tests not working on virtual environments #4509

Closed

pro-akim added a commit that referenced this issue Sep 20, 2023

fix(#4526): fixes after review

ebbcdb5

pro-akim added the qa_known Issues that are already known by the QA team label Sep 20, 2023

juliamagan closed this as completed Sep 21, 2023
