[Stack 8.16.0-SNAPSHOT] [cyberarkpas] Failing test daily: system test: tls in cyberarkpas.audit #10620

Open
elastic-vault-github-plugin-prod bot opened this issue Jul 27, 2024 · 6 comments
Labels: automation, flaky-test, Integration:cyberarkpas, Team:Ecosystem, Team:Security-Service Integrations

elastic-vault-github-plugin-prod bot commented Jul 27, 2024

  • Stack version: 8.16.0-SNAPSHOT
  • Package: cyberarkpas
  • Failing test: system test: tls
  • DataStream: audit
  • Owners:
    • @elastic/security-service-integrations

Failure:

test case failed: could not find hits in logs-cyberarkpas.audit-76857 data stream

First build failed: https://buildkite.com/elastic/integrations/builds/14060

Latest 5 failed builds:

elastic-vault-github-plugin-prod bot added the automation and flaky-test labels on Jul 27, 2024
andrewkroh added the Integration:cyberarkpas and Team:Security-Service Integrations labels on Aug 2, 2024
efd6 (Contributor) commented Sep 3, 2024

Looking at the set of failures in these (modulo the tenable_io issue, which is addressed by #10940), I see, across a wide range of packages: no hits; changes in processor behaviour; missing field definitions that are in fact present in the definition files; and failed CEL evaluations (in the cases where an error is reported, it is a 404). This looks very much to me like a global CI failure rather than a cyberarkpas-specific failure.

All of the elastic-agent logs retained in the builders' stores have aged out, so to investigate this further we will need to see a new failure.

narph (Contributor) commented Sep 17, 2024

@elastic/ecosystem can you have a look?

narph added the Team:Ecosystem label on Sep 17, 2024
efd6 (Contributor) commented Sep 18, 2024

In all the cases that I've looked at here (all of build 16022 and one of build 15974), I see a final line in the agent log that looks like this:

elastic-agent-1  | {"log.level":"warn","@timestamp":"2024-09-18T02:00:53.497Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/gateway/fleet.(*FleetGateway).doExecute","file.name":"fleet/fleet_gateway.go","file.line":183},"message":"Possible transient error during checkin with fleet-server, retrying","log":{"source":"elastic-agent"},"error":{"message":"fail to checkin to fleet-server: all hosts failed: requester 0/1 to host https://fleet-server:8220/ errored: Post \"https://fleet-server:8220/api/fleet/agents/e610ce2b-12de-4956-ba91-7cd154eafcf5/checkin?\": invalid api key to authenticate with fleet"},"request_duration_ns":253705707,"failed_checkins":1,"retry_after_ns":67291627342,"ecs.version":"1.6.0"}

message: "fail to checkin to fleet-server: all hosts failed: requester 0/1 to host https://fleet-server:8220/ errored: Post \"https://fleet-server:8220/api/fleet/agents/e610ce2b-12de-4956-ba91-7cd154eafcf5/checkin?\": invalid api key to authenticate with fleet"

mrodm (Contributor) commented Sep 18, 2024

message: "fail to checkin to fleet-server: all hosts failed: requester 0/1 to host https://fleet-server:8220/ errored: Post \"https://fleet-server:8220/api/fleet/agents/e610ce2b-12de-4956-ba91-7cd154eafcf5/checkin?\": invalid api key to authenticate with fleet"

I think that error comes from the unenrollment/removal of the Elastic Agent that elastic-package performs at the end of each test. So I'd say it's expected that this log message appears at the end of each file.

Looking at the logs of build 15974, some documents were ingested (180), but not all of the documents expected (343) according to the system test configuration file.

Testing locally, I've been able to reproduce it too, but not consistently. It fails seemingly at random, or at least I could not find a pattern to it.

From the logs retrieved in build 15974:

2024/09/17 01:48:47 DEBUG found 0 hits in logs-cyberarkpas.audit-94816 data stream
2024/09/17 01:48:48 DEBUG found 0 hits in logs-cyberarkpas.audit-94816 data stream
2024/09/17 01:48:49 DEBUG found 92 hits in logs-cyberarkpas.audit-94816 data stream
2024/09/17 01:48:50 DEBUG found 180 hits in logs-cyberarkpas.audit-94816 data stream
2024/09/17 01:48:51 DEBUG found 180 hits in logs-cyberarkpas.audit-94816 data stream
2024/09/17 01:48:52 DEBUG found 180 hits in logs-cyberarkpas.audit-94816 data stream
...

There are messages like these showing the number of events sent to Elasticsearch, and they add up to the number of documents that were ingested:

 {"log.level":"debug","@timestamp":"2024-09-17T01:48:48.643Z","message":"doBulkRequest: 50 events have been sent to elasticsearch in 1.13726457s.","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"tcp-default","type":"tcp"},"log":{"source":"tcp-default"},"ecs.version":"1.6.0","log.logger":"elasticsearch","log.origin":{"file.line":301,"file.name":"elasticsearch/client.go","function":"github.com/elastic/beats/v7/libbeat/outputs/elasticsearch.(*Client).doBulkRequest"},"service.name":"filebeat","ecs.version":"1.6.0"}
 {"log.level":"debug","@timestamp":"2024-09-17T01:48:48.750Z","message":"doBulkRequest: 50 events have been sent to elasticsearch in 106.270895ms.","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"tcp-default","type":"tcp"},"log":{"source":"tcp-default"},"log.logger":"elasticsearch","log.origin":{"file.line":301,"file.name":"elasticsearch/client.go","function":"github.com/elastic/beats/v7/libbeat/outputs/elasticsearch.(*Client).doBulkRequest"},"service.name":"filebeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
 {"log.level":"debug","@timestamp":"2024-09-17T01:48:49.015Z","message":"doBulkRequest: 50 events have been sent to elasticsearch in 264.559873ms.","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"tcp-default","type":"tcp"},"log":{"source":"tcp-default"},"log.origin":{"file.line":301,"file.name":"elasticsearch/client.go","function":"github.com/elastic/beats/v7/libbeat/outputs/elasticsearch.(*Client).doBulkRequest"},"service.name":"filebeat","ecs.version":"1.6.0","log.logger":"elasticsearch","ecs.version":"1.6.0"}
 {"log.level":"debug","@timestamp":"2024-09-17T01:48:49.807Z","message":"doBulkRequest: 30 events have been sent to elasticsearch in 53.129321ms.","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"tcp-default","type":"tcp"},"log":{"source":"tcp-default"},"ecs.version":"1.6.0","log.logger":"elasticsearch","log.origin":{"file.line":301,"file.name":"elasticsearch/client.go","function":"github.com/elastic/beats/v7/libbeat/outputs/elasticsearch.(*Client).doBulkRequest"},"service.name":"filebeat","ecs.version":"1.6.0"}

The last of those doBulkRequest messages is the final bulk operation performed; one minute later there is an error connecting to Elasticsearch:

 {"log.level":"debug","@timestamp":"2024-09-17T01:49:49.810Z","message":"Error reading from connection: read tcp 172.18.0.6:36554->172.18.0.3:9200: use of closed network connection","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"tcp-default","type":"tcp"},"log":{"source":"tcp-default"},"log.logger":"esclientleg","network":"tcp","ecs.version":"1.6.0","log.origin":{"file.line":51,"file.name":"transport/logging.go","function":"github.com/elastic/elastic-agent-libs/transport.(*loggingConn).Read"},"service.name":"filebeat","address":"elasticsearch:9200","ecs.version":"1.6.0"}

And later there is another error, on a different port (from the tls service container that has been killed by elastic-package with a SIGHUP signal?):

 {"log.level":"debug","@timestamp":"2024-09-17T01:51:18.414Z","message":"Client error","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"tcp-default","type":"tcp"},"log":{"source":"tcp-default"},"address":"0.0.0.0:9999","error":{"message":"read tcp 172.19.0.2:9999->172.19.0.3:39940: read: connection timed out"},"ecs.version":"1.6.0","log.logger":"tcp","log.origin":{"file.line":202,"file.name":"streaming/listener.go","function":"github.com/elastic/beats/v7/filebeat/inputsource/common/streaming.(*Listener).handleConnection"},"service.name":"filebeat","remote_address":"172.19.0.3:39940","ecs.version":"1.6.0"}

I just realized that there were also these kinds of errors for previous stack versions, starting on July 13th with 8.15.0-SNAPSHOT: #10481. I was able to reproduce it locally on 8.15.0-SNAPSHOT too.

Looking at the Buildkite stats, this package started to fail with the build run on July 13th. Older failures are not related; they were caused by artifacts-api, the stack failing to bootstrap, or other CI issues.

There are quite a few packages that use this same approach of ingesting logs via the stream docker container. For instance, the tines package expects 241 documents, and it is passing in the latest daily builds.

I'm not sure what could be causing not all of the expected documents to be ingested into Elasticsearch.
Any other ideas, @elastic/ecosystem?

mrodm (Contributor) commented Sep 20, 2024

Adding some more context/findings from the logs of that CI step, though nothing that points to the root cause of this error.

According to the logs sent by the tls service container (logs from build 15974), all expected lines were sent to the Elastic Agent (343 as defined in the configuration file):

 $ grep "Log data sent" buildkite_integrations_main-integrations-15974_insecure-logs_cyberarkpas_container-logs_cyberarkpas-audit-tls-1726538329934607865.log | awk -F '|' '{print $2}' | jq -r '.total_lines' | awk '{s+=$1} END {printf "%.0f\n", s}'
343

From Elasticsearch's point of view, there are no errors in that container in the time range between the last bulk operation (01:48:49) and the "Error reading from connection" message (01:49:49):

{"@timestamp":"2024-09-17T01:48:48.947Z", "log.level": "INFO", "message":"[.ds-logs-cyberarkpas.audit-94816-2024.09.17-000001/HqrdhoeiQzqGG34YTWlrHg] update_mapping [_doc]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[8697ca59cc19][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.cluster.metadata.MetadataMappingService","elasticsearch.cluster.uuid":"Hqgbt5zMRomDk4_megDCvA","elasticsearch.node.id":"SIEJUddDQdSe-JyIMVeidw","elasticsearch.node.name":"8697ca59cc19","elasticsearch.cluster.name":"elasticsearch"}
{"@timestamp":"2024-09-17T01:58:50.295Z", "log.level": "INFO", "message":"removing data stream [logs-cyberarkpas.audit-94816]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[8697ca59cc19][masterService#updateTask][T#4]","log.logger":"org.elasticsearch.datastreams.action.DeleteDataStreamTransportAction","elasticsearch.cluster.uuid":"Hqgbt5zMRomDk4_megDCvA","elasticsearch.node.id":"SIEJUddDQdSe-JyIMVeidw","elasticsearch.node.name":"8697ca59cc19","elasticsearch.cluster.name":"elasticsearch"}

And looking at the latest 5 builds at the moment of writing this, the number of documents ingested (hits found) into Elasticsearch is different for each build: 231, 180, 228, 225 and 236.
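As an aside, a quick way to pull that plateau count out of each build's elastic-package output is something like this (the output file name is a placeholder):

 $ grep -o 'found [0-9]* hits' elastic-package-output.log | awk '{print $2}' | sort -n | tail -n 1
 180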

chrisberkhout (Contributor) commented

I think this is the same as in #11075 and #11224 (although for the latter it's tcp as well as tls).

I was able to reproduce it locally by removing the system tests and containers that aren't for tls. That is, deleting these files:

data_stream/audit/_dev/test/system/test-logfile-config.yml
data_stream/audit/_dev/test/system/test-tcp-config.yml
data_stream/audit/_dev/test/system/test-udp-config.yml

And removing these services from _dev/deploy/docker/docker-compose.yml (a shell sketch of the cleanup follows the list):

cyberarkpas-audit-logfile
cyberarkpas-audit-udp
cyberarkpas-audit-tcp
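Roughly, the cleanup looks like this (a sketch; run from the package root, and the docker-compose edit still has to be done by hand):

 $ rm data_stream/audit/_dev/test/system/test-logfile-config.yml data_stream/audit/_dev/test/system/test-tcp-config.yml data_stream/audit/_dev/test/system/test-udp-config.yml
 $ $EDITOR _dev/deploy/docker/docker-compose.yml   # drop the cyberarkpas-audit-logfile, -udp and -tcp services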

The container that streams the data does indeed report sending all of the data. It's a little odd that the per-line log output is sometimes incomplete while the per-file logging appears to be complete. I think it is sending everything, but those debug log lines are getting delayed or limited somehow.

I tried different --delay settings for the log command and found that it didn't affect the error rate, which made me think that the problem is about missing the end of the data rather than the start.

I was able to reproduce the problem using just stream log to send and openssl s_server to receive. I filed an issue with the details: elastic/stream#120.
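For anyone who wants to attempt a similar standalone reproduction, here is a rough sketch (not the exact commands from elastic/stream#120; the stream flags are assumed from how the package's docker-compose invokes it, and the certificate, port, and input file are placeholders):

 $ openssl req -x509 -newkey rsa:2048 -nodes -keyout key.pem -out cert.pem -days 1 -subj '/CN=localhost'
 $ openssl s_server -accept 9999 -cert cert.pem -key key.pem -quiet > received.log
 $ stream log --addr localhost:9999 -p=tls --insecure audit.log   # flags assumed; check the stream README
 $ wc -l received.log audit.log   # compare how many lines actually arrived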
