HostMonitor method in system tests does not detect logs correctly #4526
The problem
HostMonitor has several parts that can be fragile when running on a low-performance computer, because it operates as follows:
While a host is performing its operations, HostMonitor takes its logs and stores them line by line in a temporary file (file_composer). This is done in a loop using the host_manager.get_file_content function, without a predefined frequency or waiting time.
Once the log is stored, HostMonitor checks whether the desired messages exist in that temporary file. This check has a timeout defined by the test executor in message.yml and is configurable for each message.
After that, it generates a dictionary with successes and failures, and if there is a failure, it produces a timeout error message.
There is also a time_step, which represents a pause between the different steps and is hardcoded in the HostMonitor function.
Vulnerabilities:
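The flow described above can be modelled with a short Python sketch. This is a simplified illustration, not the actual wazuh-qa code: `compose_file`, `check_messages`, the `read_log` callable (standing in for host_manager.get_file_content) and the `messages` mapping (standing in for the per-message timeouts in message.yml) are all hypothetical names.

```python
import re
import time
from pathlib import Path
from typing import Callable, Dict


def compose_file(read_log: Callable[[], str], tmp_path: Path) -> None:
    """Copy the host log, line by line, into a temporary file."""
    with tmp_path.open("w") as tmp:
        for line in read_log().splitlines():
            tmp.write(line + "\n")


def check_messages(
    read_log: Callable[[], str],
    messages: Dict[str, float],  # hypothetical: regex -> timeout in seconds
    tmp_path: Path,
    time_step: float = 0.5,      # fixed pause between polls (assumed value)
) -> Dict[str, bool]:
    """Return {regex: found} for every expected message."""
    results: Dict[str, bool] = {}
    for regex, timeout in messages.items():
        deadline = time.monotonic() + timeout
        found = False
        while True:
            # Refresh the temporary copy, then search it.
            compose_file(read_log, tmp_path)
            if re.search(regex, tmp_path.read_text()):
                found = True
                break
            # If the host is slow and the line lands after the deadline,
            # the message is reported as missing even though it is logged.
            if time.monotonic() >= deadline:
                break
            time.sleep(time_step)
        results[regex] = found
    return results
```

In this model, a message that reaches the log after its deadline is reported as a failure, which is why a larger timeout can hide the problem on a slow machine without removing it.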
Test with agentless_cluster_env
Changing the timeout figures in test_integrity_sync/data/messages.yml:
Results of agentless_cluster_env 🟢
The test worked fine on an EC2 Ubuntu 22.04.3 t3.large 15GB HD instance, where it used to fail.
Before changes 🔴 After changes 🟢
This means that when the EC2 instance is short on resources, increasing the time allowed for the log to appear (timeout) can reduce the possibility of false negatives without increasing system requirements.
Time analysis - Cost analysis
Results of agentless_cluster_env
EC2 Ubuntu 22.04.3 LTS t3.large 15GB HD 🟢
Changing test_integrity_sync/data/messages.yml
Execution time: 2627.19 seconds (around 44 mins)
EC2 Ubuntu 22.04.3 LTS C5n.2Xlarge 15GB HD 🟢
Execution time: 1909.50 seconds (around 32 mins)
Considering the installation time, which can take around 15 mins
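The two execution times above can be compared with a small calculation. The sketch below uses only the reported durations and excludes installation time; the function names are hypothetical and no instance prices are assumed.

```python
# Reported execution times for agentless_cluster_env, in seconds.
T3_LARGE_SECONDS = 2627.19
C5N_2XLARGE_SECONDS = 1909.50


def minutes(seconds: float) -> float:
    """Convert seconds to minutes."""
    return seconds / 60.0


def break_even_price_ratio(slow_seconds: float, fast_seconds: float) -> float:
    """Hourly price ratio (fast / slow) above which the slower run is cheaper.

    Cost per run is duration * hourly price, so the slower instance wins
    whenever fast_price / slow_price exceeds slow_seconds / fast_seconds.
    """
    return slow_seconds / fast_seconds


if __name__ == "__main__":
    print(f"t3.large:    {minutes(T3_LARGE_SECONDS):.1f} min")
    print(f"C5n.2xlarge: {minutes(C5N_2XLARGE_SECONDS):.1f} min")
    ratio = break_even_price_ratio(T3_LARGE_SECONDS, C5N_2XLARGE_SECONDS)
    print(f"Break-even hourly price ratio: {ratio:.2f}")
```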
This implies that modifying the timeouts is clearly more cost-effective.
Test with basic_environment_env
Test with basic_environment_env time_out_extended 🔴
Changing test_agent_auth/data/message.yml
Changing test_enrollment/data/message.yml
Test with basic_environment_env, reducing time_step in FileTailer and adding a sleep after get_file_content in HostMonitor.file_composer 🔴
Conclusion
In the failures obtained, the log is generated but is not successfully stored in the temporary file. This means that file_composer requires refactoring (dynamic waits and control points) in order to be independent of the machine/EC2 performance.
After discussing with the team, it is understood that refactoring HostMonitor is not a priority, as it is due to be deprecated. After the investigation, all tests can be run in a stable environment, so we are closing this issue and incorporating the timeout modifications found for agentless_cluster.
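As an illustration of what "dynamic waits and control points" could look like, here is a minimal Python sketch. It is not the HostMonitor implementation and was not part of the investigation: `wait_for_message`, `read_log` and the wait parameters are hypothetical, and the "control point" is assumed to be a count of complete lines already inspected.

```python
import re
import time
from typing import Callable, Optional


def wait_for_message(
    read_log: Callable[[], str],  # hypothetical stand-in for the log reader
    regex: str,
    timeout: float,
    initial_wait: float = 0.1,
    max_wait: float = 2.0,
) -> Optional[str]:
    """Return the first matching log line, or None if the timeout expires."""
    pattern = re.compile(regex)
    deadline = time.monotonic() + timeout
    wait = initial_wait
    checked = 0  # control point: number of complete lines already inspected

    while True:
        text = read_log()
        # Only consider newline-terminated lines, so a line that is still
        # being written is re-read on the next poll instead of being skipped.
        lines = text[: text.rfind("\n") + 1].splitlines()
        if len(lines) < checked:
            checked = 0  # log was truncated or rotated, start over

        for line in lines[checked:]:
            if pattern.search(line):
                return line

        if len(lines) > checked:
            checked = len(lines)
            wait = initial_wait  # log is advancing: poll quickly
        else:
            wait = min(wait * 2, max_wait)  # log is idle: back off

        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return None
        time.sleep(min(wait, remaining))
```

The wait adapts to how fast the log grows rather than to a hardcoded time_step, and the line counter avoids re-scanning lines that were already checked.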
Running tests on different EC2 instances at #4525 has revealed that the HostMonitor method in the system tests is unstable. The instability increases as the performance of the EC2 instance decreases.
In the case of basic_cluster, it results in false negatives when using T2, T3 xlarge EC2 instances (fixed on C5.xlarge), and in agentless_cluster, it consistently produces a false negative even when using up to a C5.xlarge instance.
Since this instability persists in agentless_cluster with a false negative, further investigation is required.
According to the analysis of the test results, the log that HostMonitor is searching for is present but is not being detected by the method.