HostMonitor method in system tests does not detect logs correctly #4526

Closed
pro-akim opened this issue Sep 19, 2023 · 2 comments · Fixed by #4534
Labels: level/task, qa_known, type/bug

pro-akim commented Sep 19, 2023

Running tests on different EC2 instances at #4525 has revealed that the HostMonitor method in the system tests is unstable. The instability grows as the performance of the EC2 instance decreases.

In basic_cluster, it produces false negatives on T2 and T3 xlarge EC2 instances (resolved on C5.xlarge). In agentless_cluster, it consistently produces a false negative even on instances up to C5.xlarge.

Since this instability persists in the agentless_cluster with a false negative, further investigation is required.

According to the analysis of test results, the log that HostMonitor is searching for is present but is not being detected by the method.

pro-akim commented Sep 19, 2023

The problem

The HostMonitor has several parts that could be vulnerable when running on a low-performance computer because it operates as follows:

When a host is performing its operations, HostMonitor takes its logs and stores them line by line in a temporary file (file_composer). This process is done in a loop using the host_manager.get_file_content function and is performed without a predefined frequency or waiting time.

Once the log is stored, HostMonitor checks if the desired messages exist in that temporary file. This task has a timeout defined by the test executor in message.yml and is configurable for each message. After that, it generates a dictionary with successes and failures, and if there is a failure, it produces a timeout error message.

There is also a time_step, which represents a pause between different steps, and it is hardcoded in the HostMonitor function.
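The flow described above can be sketched roughly as follows. This is a minimal, single-process illustration and not the actual HostMonitor code: `compose_file`, `wait_for_message`, the `get_file_content` callable, and the sample log line are all hypothetical stand-ins. It shows why a message that exists on the host is still reported as missing if it has not reached the temporary file before the timeout expires.

```python
import re
import tempfile
import time
from pathlib import Path


def compose_file(get_file_content, tmp_path, seen):
    """Append host log lines that were not copied yet to the temporary file."""
    lines = get_file_content().splitlines()
    with open(tmp_path, "a", encoding="utf-8") as tmp:
        for line in lines[seen:]:
            tmp.write(line + "\n")
    return len(lines)  # number of host log lines copied so far


def wait_for_message(tmp_path, regex, timeout, time_step=0.5):
    """Search the temporary file until the regex matches or the timeout expires."""
    pattern = re.compile(regex)
    deadline = time.monotonic() + timeout
    while True:
        if pattern.search(Path(tmp_path).read_text(encoding="utf-8")):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(time_step)  # pause between checks


# Made-up host log; the message text is an assumption for the example.
host_log = "agent started\nIntegrity sync finished\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    tmp_file = Path(tmp_dir) / "composed.log"
    tmp_file.touch()

    # The copy has not happened yet: the message exists on the host,
    # but the check times out (false negative).
    missed = wait_for_message(tmp_file, r"Integrity sync finished", timeout=0.2, time_step=0.05)

    # Once the copy catches up, the same check succeeds.
    seen = compose_file(lambda: host_log, tmp_file, seen=0)
    found = wait_for_message(tmp_file, r"Integrity sync finished", timeout=0.2, time_step=0.05)
```

In this sketch the first check fails only because the copy step ran late, which mirrors the behaviour observed on the slower instances.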

Vulnerabilities:

  1. It has a rigid timeout. This timeout is set in message.yml and is determined for each message that needs to be waited for during the test execution.
  2. If the get_file_content function takes too long or fails to retrieve the result, it will not be recorded in the temporary file, leading to incorrect validation.

Test with agentless_cluster_env

Changing the timeout values in test_integrity_sync/data/messages.yml:

| From | To  |
|------|-----|
| 120  | 180 |
| 60   | 120 |
| 100  | 180 |

Results of agentless_cluster_env 🟢

The test passed on an EC2 Ubuntu 22.04.3 t3.large instance with a 15 GB disk, where it used to fail.

Before changes 🔴
timeout_before_changes.zip

After changes 🟢
timeout_after_changes.zip

This means that when the EC2 instance is short on resources, increasing the timeout (the time allowed for the log to appear) can reduce the likelihood of false negatives without raising the system requirements.

Time analysis - Cost analysis

Results of agentless_cluster_env

EC2 Ubuntu 22.04.3 LTS t3.large 15GB HD 🟢

Changing test_integrity_sync/data/messages.yml

| From | To  |
|------|-----|
| 120  | 180 |
| 60   | 120 |
| 100  | 180 |

report_agentless_cluster.html.zip

Execution time: 2627.19 seconds (Around 48 mins)

EC2 Ubuntu 22.04.3 LTS C5n.2Xlarge 15GB HD 🟢

report_agentlessC5n.zip

Execution time: 1909.50 seconds (Around 32 mins)

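Adding the installation time, which can take around 15 mins, gives the following totals:

| Environment | Total consumed time | $ per hour | $ per execution | Additional notes                    |
|-------------|---------------------|------------|-----------------|-------------------------------------|
| T3.large    | 63 min              | $0.083     | $0.08715        | It requires timeout fixing only once |
| C5n.2Xlarge | 45 min              | $0.8       | $0.6            |                                     |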

Modifying the timeouts is therefore clearly more cost-effective for agentless_cluster, even if the test fails and has to be run again.
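The per-execution figures in the table follow from multiplying the total time (in hours) by the hourly price. A small sketch of that arithmetic, using only the values reported above (the function name is just for illustration):

```python
def cost_per_execution(total_minutes, dollars_per_hour):
    """Cost of one run: total time converted to hours, times the hourly price."""
    return total_minutes / 60 * dollars_per_hour


t3_large = cost_per_execution(63, 0.083)    # T3.large row
c5n_2xlarge = cost_per_execution(45, 0.8)   # C5n.2Xlarge row

# How many T3.large runs cost as much as one C5n.2Xlarge run.
ratio = c5n_2xlarge / t3_large

print(f"T3.large:    ${t3_large:.5f}")
print(f"C5n.2Xlarge: ${c5n_2xlarge:.2f}")
print(f"ratio:       {ratio:.1f}x")
```

With these numbers a single C5n.2Xlarge run costs roughly 6.9 times as much as a T3.large run, so even one repeated T3.large execution stays far cheaper.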

Test with basic_environment_env

Test with basic_environment_env time_out_extended 🔴

Changing test_agent_auth/data/message.yml

| From | To  |
|------|-----|
| 30   | 240 |

Changing test_enrollment/data/message.yml

| From | To  |
|------|-----|
| 60   | 120 |

basic_environment_timeout_extended.zip

Test with basic_environment_env reducing time_step in FileTailer and adding a sleep after get_file_content in HostMonitor.file_composer 🔴

basic_environment_reduced_timecheck.zip

pro-akim commented Sep 20, 2023

Conclusion

In the failures obtained, the log is generated but is not successfully stored in the temporary file. This means that file_composer needs refactoring (dynamic waits and control points) to become independent of the machine/EC2 performance.

After discussing with the team, it is understood that refactoring HostMonitor is not a priority, as it is slated for deprecation.
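As a rough idea of what "dynamic waits and control points" could look like, here is a minimal sketch. It is not a proposed patch for HostMonitor: `wait_until`, `ComposedLog`, `flaky_fetch`, and the sample log line are hypothetical names and data made up for the example. The control point is that the copy step reports whether retrieval actually succeeded, and the dynamic wait polls that result instead of assuming the copy finished.

```python
import time


def wait_until(condition, timeout, poll_interval=0.1):
    """Poll condition() until it returns a truthy value or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll_interval)


class ComposedLog:
    """Hypothetical stand-in for the temporary file filled by file_composer."""

    def __init__(self):
        self.lines = []

    def sync(self, fetch):
        """Control point: copy the source and report whether the copy succeeded."""
        content = fetch()
        if content is None:  # retrieval failed or timed out; report it
            return False
        self.lines = content.splitlines()
        return True


# Simulated slow host: the first two retrievals fail, the third succeeds.
attempts = {"count": 0}


def flaky_fetch():
    attempts["count"] += 1
    if attempts["count"] < 3:
        return None
    return "agent started\nIntegrity sync finished\n"


composed = ComposedLog()
synced = wait_until(lambda: composed.sync(flaky_fetch), timeout=2.0, poll_interval=0.01)
detected = bool(synced) and any("Integrity sync finished" in line for line in composed.lines)
```

Here the message check only runs after the copy is confirmed, so a slow or failed retrieval shows up as an explicit sync failure instead of a missing-log false negative.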

All tests can be run in a stable environment after the investigation, so we are closing this issue, incorporating the timeout modifications found in agentless_cluster.
