[Stack 8.16.0-SNAPSHOT] [cyberarkpas] Failing test daily: system test: tls in cyberarkpas.audit #10620
Looking at the set of failures in these, I see (modulo the tenable_io issue, which is addressed by #10940), over a wide range of packages: no hits; changes in processor behaviour; missing field definitions that are present in the definitions files; and failed CEL evaluations, in the case where the error is reported as a 404. This looks very much to me like a global CI failure rather than a cyberarkpas-specific failure. All of the elastic-agent logs retained in the builders' stores have staled out, so to investigate this further we will need to wait for a new failure.
@elastic/ecosystem can you have a look?
In all the cases that I've looked at here (all of 16022 and one of 15974), I see a final line in the agent log that looks like this:
message: |
I think that error comes from the unenrollment/removal of the Elastic Agent.

Looking at the logs of build 15974, some documents were ingested (180), but not all of the documents expected (343) according to the system test configuration file. Testing locally, I've been able to reproduce it too, but not consistently: it fails randomly, or at least I could not find a pattern. From the logs retrieved in build 15974:
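Since the failure is intermittent, one way to chase it is to rerun the system test in a loop until it fails. A minimal sketch: the real command would be `elastic-package test system --data-streams audit -v` run from the cyberarkpas package directory, but it is kept behind a variable here so the loop itself can be dry-run with any command.

```shell
# Rerun the (assumed) system test command until it fails, to catch the
# intermittent case. Override TEST_CMD to dry-run the loop with another command.
test_cmd="${TEST_CMD:-elastic-package test system --data-streams audit -v}"
for i in 1 2 3 4 5 6 7 8 9 10; do
  echo "run $i"
  if ! $test_cmd; then
    echo "failed on run $i"
    break
  fi
done
```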
There are messages like these showing the number of events sent to Elasticsearch, and their totals match the documents that were ingested:

```json
{"log.level":"debug","@timestamp":"2024-09-17T01:48:48.643Z","message":"doBulkRequest: 50 events have been sent to elasticsearch in 1.13726457s.","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"tcp-default","type":"tcp"},"log":{"source":"tcp-default"},"ecs.version":"1.6.0","log.logger":"elasticsearch","log.origin":{"file.line":301,"file.name":"elasticsearch/client.go","function":"github.com/elastic/beats/v7/libbeat/outputs/elasticsearch.(*Client).doBulkRequest"},"service.name":"filebeat","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-09-17T01:48:48.750Z","message":"doBulkRequest: 50 events have been sent to elasticsearch in 106.270895ms.","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"tcp-default","type":"tcp"},"log":{"source":"tcp-default"},"log.logger":"elasticsearch","log.origin":{"file.line":301,"file.name":"elasticsearch/client.go","function":"github.com/elastic/beats/v7/libbeat/outputs/elasticsearch.(*Client).doBulkRequest"},"service.name":"filebeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-09-17T01:48:49.015Z","message":"doBulkRequest: 50 events have been sent to elasticsearch in 264.559873ms.","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"tcp-default","type":"tcp"},"log":{"source":"tcp-default"},"log.origin":{"file.line":301,"file.name":"elasticsearch/client.go","function":"github.com/elastic/beats/v7/libbeat/outputs/elasticsearch.(*Client).doBulkRequest"},"service.name":"filebeat","ecs.version":"1.6.0","log.logger":"elasticsearch","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-09-17T01:48:49.807Z","message":"doBulkRequest: 30 events have been sent to elasticsearch in 53.129321ms.","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"tcp-default","type":"tcp"},"log":{"source":"tcp-default"},"ecs.version":"1.6.0","log.logger":"elasticsearch","log.origin":{"file.line":301,"file.name":"elasticsearch/client.go","function":"github.com/elastic/beats/v7/libbeat/outputs/elasticsearch.(*Client).doBulkRequest"},"service.name":"filebeat","ecs.version":"1.6.0"}
```

That message is the last bulk operation performed; one minute later there is an error connecting to Elasticsearch:

```json
{"log.level":"debug","@timestamp":"2024-09-17T01:49:49.810Z","message":"Error reading from connection: read tcp 172.18.0.6:36554->172.18.0.3:9200: use of closed network connection","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"tcp-default","type":"tcp"},"log":{"source":"tcp-default"},"log.logger":"esclientleg","network":"tcp","ecs.version":"1.6.0","log.origin":{"file.line":51,"file.name":"transport/logging.go","function":"github.com/elastic/elastic-agent-libs/transport.(*loggingConn).Read"},"service.name":"filebeat","address":"elasticsearch:9200","ecs.version":"1.6.0"}
```

And later there is another error, connecting to a different port (the TLS service container that has been killed by elastic-package with a SIGHUP signal?):

```json
{"log.level":"debug","@timestamp":"2024-09-17T01:51:18.414Z","message":"Client error","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"tcp-default","type":"tcp"},"log":{"source":"tcp-default"},"address":"0.0.0.0:9999","error":{"message":"read tcp 172.19.0.2:9999->172.19.0.3:39940: read: connection timed out"},"ecs.version":"1.6.0","log.logger":"tcp","log.origin":{"file.line":202,"file.name":"streaming/listener.go","function":"github.com/elastic/beats/v7/filebeat/inputsource/common/streaming.(*Listener).handleConnection"},"service.name":"filebeat","remote_address":"172.19.0.3:39940","ecs.version":"1.6.0"}
```

I just realized that these kinds of errors also appeared for previous stack versions, starting on July 13th for 8.15.0-SNAPSHOT: #10481. I was able to reproduce it locally in 8.15.0-SNAPSHOT too.

Looking at the Buildkite stats, this package started to fail in the build run on July 13th. Older failures are not related; they were caused by artifacts-api, the stack not being able to bootstrap, or other CI issues.

There are quite a few packages that use this same approach of ingesting logs via the stream docker container. Not sure what could be causing not all of the expected documents to be ingested into Elasticsearch.
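As a cross-check, the per-bulk event counts in those debug lines can be summed and compared with the number of ingested documents; the four `doBulkRequest` lines above add up to 50 + 50 + 50 + 30 = 180, matching the 180 documents ingested in build 15974. A minimal sketch (`elastic-agent.log` is a placeholder for the retrieved agent log file):

```shell
# Sum the event counts from filebeat's doBulkRequest debug lines.
# "elastic-agent.log" is a hypothetical path to the retrieved agent log.
grep -o 'doBulkRequest: [0-9]* events' elastic-agent.log \
  | awk '{sum += $2} END {print sum + 0}'
```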
Adding some more context/findings from the logs of that CI step, but nothing that shows the root cause of this error.

According to the logs sent by the stream container, the total number of log lines sent is 343, which matches the expected document count:

```console
$ grep "Log data sent" buildkite_integrations_main-integrations-15974_insecure-logs_cyberarkpas_container-logs_cyberarkpas-audit-tls-1726538329934607865.log | awk -F '|' '{print $2}' | jq -r '.total_lines' | awk '{s+=$1} END {printf "%.0f\n", s}'
343
```

From the point of view of Elasticsearch, there are no errors in that container between the time of the last bulk operation sent (01:48:49) and the "Error reading from connection" message (01:49:49):

```json
{"@timestamp":"2024-09-17T01:48:48.947Z", "log.level": "INFO", "message":"[.ds-logs-cyberarkpas.audit-94816-2024.09.17-000001/HqrdhoeiQzqGG34YTWlrHg] update_mapping [_doc]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[8697ca59cc19][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.cluster.metadata.MetadataMappingService","elasticsearch.cluster.uuid":"Hqgbt5zMRomDk4_megDCvA","elasticsearch.node.id":"SIEJUddDQdSe-JyIMVeidw","elasticsearch.node.name":"8697ca59cc19","elasticsearch.cluster.name":"elasticsearch"}
{"@timestamp":"2024-09-17T01:58:50.295Z", "log.level": "INFO", "message":"removing data stream [logs-cyberarkpas.audit-94816]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[8697ca59cc19][masterService#updateTask][T#4]","log.logger":"org.elasticsearch.datastreams.action.DeleteDataStreamTransportAction","elasticsearch.cluster.uuid":"Hqgbt5zMRomDk4_megDCvA","elasticsearch.node.id":"SIEJUddDQdSe-JyIMVeidw","elasticsearch.node.name":"8697ca59cc19","elasticsearch.cluster.name":"elasticsearch"}
```

And looking at the latest 5 builds at the moment of writing this, the number of documents ingested (hits found) in Elasticsearch is different for each build: 231, 180, 228, 225 and 236.
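The per-build gap can be quantified by comparing the data stream's `_count` against the expected line total. A sketch of that check, using a canned example of what `GET /<data-stream>/_count` returns; against a live stack the response would instead come from something like `curl -u <user>:<pass> https://localhost:9200/logs-cyberarkpas.audit-*/_count`:

```shell
# Compare Elasticsearch's _count with the expected number of documents.
# "response" is a canned example of a GET /<data-stream>/_count reply;
# against a live stack it would come from curl instead.
response='{"count":180,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0}}'
expected=343
count=$(printf '%s' "$response" | jq '.count')
echo "ingested $count of $expected expected ($((expected - count)) missing)"
```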
I think this is the same as in #11075 and #11224. I was able to reproduce it locally by removing the system tests and containers that aren't for this data stream.
And these services from
The container that streams the data does indeed say it's sending all the data. It's a little odd that the per-line log output is sometimes incomplete, while the per-file logging seems to be complete. I think it is sending everything, but those debug log lines are getting delayed or limited somehow. I tried different options. I was able to reproduce the problem using just
Failure:
First build failed: https://buildkite.com/elastic/integrations/builds/14060
Latest 5 failed builds: