[Stack 8.16.0-SNAPSHOT] [cyberarkpas] Failing test daily: system test: tls in cyberarkpas.audit #10620

Open
elastic-vault-github-plugin-prod bot opened this issue Jul 27, 2024 · 6 comments
Labels: automation, flaky-test, Integration:cyberarkpas, Team:Ecosystem, Team:Security-Service Integrations

elastic-vault-github-plugin-prod bot commented Jul 27, 2024

  • Stack version: 8.16.0-SNAPSHOT
  • Package: cyberarkpas
  • Failing test: system test: tls
  • DataStream: audit
  • Owners:
    • @elastic/security-service-integrations

Failure:

test case failed: could not find hits in logs-cyberarkpas.audit-76857 data stream

First build failed: https://buildkite.com/elastic/integrations/builds/14060

Latest 5 failed builds:

elastic-vault-github-plugin-prod bot added the automation and flaky-test labels on Jul 27, 2024
andrewkroh added the Integration:cyberarkpas and Team:Security-Service Integrations labels on Aug 2, 2024
efd6 (Contributor) commented Sep 3, 2024

Looking at the set of failures in these (modulo the tenable_io issue, which is addressed by #10940), I see, across a wide range of packages: no hits; changes in processor behaviour; missing field definitions that are in fact present in the definition files; and failed CEL evaluations (in the cases where an error is reported, it is a 404). This looks very much to me like a global CI failure rather than a cyberarkpas-specific failure.

All of the elastic-agent logs retained in the builders' stores have aged out, so to investigate this further we will need to see a new failure.

narph (Contributor) commented Sep 17, 2024

@elastic/ecosystem can you have a look?

narph added the Team:Ecosystem label on Sep 17, 2024
efd6 (Contributor) commented Sep 18, 2024

In all the cases that I've looked at here (all of build 16022 and one of build 15974), I see a final line in the agent log that looks like this:

elastic-agent-1  | {"log.level":"warn","@timestamp":"2024-09-18T02:00:53.497Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/gateway/fleet.(*FleetGateway).doExecute","file.name":"fleet/fleet_gateway.go","file.line":183},"message":"Possible transient error during checkin with fleet-server, retrying","log":{"source":"elastic-agent"},"error":{"message":"fail to checkin to fleet-server: all hosts failed: requester 0/1 to host https://fleet-server:8220/ errored: Post \"https://fleet-server:8220/api/fleet/agents/e610ce2b-12de-4956-ba91-7cd154eafcf5/checkin?\": invalid api key to authenticate with fleet"},"request_duration_ns":253705707,"failed_checkins":1,"retry_after_ns":67291627342,"ecs.version":"1.6.0"}

message: "fail to checkin to fleet-server: all hosts failed: requester 0/1 to host https://fleet-server:8220/ errored: Post \"https://fleet-server:8220/api/fleet/agents/e610ce2b-12de-4956-ba91-7cd154eafcf5/checkin?\": invalid api key to authenticate with fleet"

mrodm (Contributor) commented Sep 18, 2024

message: "fail to checkin to fleet-server: all hosts failed: requester 0/1 to host https://fleet-server:8220/ errored: Post \"https://fleet-server:8220/api/fleet/agents/e610ce2b-12de-4956-ba91-7cd154eafcf5/checkin?\": invalid api key to authenticate with fleet"

I think that error comes from the unenrollment/removal of the Elastic Agent that elastic-package performs at the end of each test. So I'd say it's expected that this log message appears at the end of each file.

Looking at the logs of build 15974, some documents were ingested (180), but not all of the documents expected (343) according to the system test configuration file.

Testing locally, I've been able to reproduce it too, but not consistently. It fails seemingly at random, or at least I could not find a pattern to it.

From the logs retrieved in build 15974:

2024/09/17 01:48:47 DEBUG found 0 hits in logs-cyberarkpas.audit-94816 data stream
2024/09/17 01:48:48 DEBUG found 0 hits in logs-cyberarkpas.audit-94816 data stream
2024/09/17 01:48:49 DEBUG found 92 hits in logs-cyberarkpas.audit-94816 data stream
2024/09/17 01:48:50 DEBUG found 180 hits in logs-cyberarkpas.audit-94816 data stream
2024/09/17 01:48:51 DEBUG found 180 hits in logs-cyberarkpas.audit-94816 data stream
2024/09/17 01:48:52 DEBUG found 180 hits in logs-cyberarkpas.audit-94816 data stream
...

There are messages like these showing the number of events sent to Elasticsearch, and they add up to the number of documents that were ingested:

 {"log.level":"debug","@timestamp":"2024-09-17T01:48:48.643Z","message":"doBulkRequest: 50 events have been sent to elasticsearch in 1.13726457s.","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"tcp-default","type":"tcp"},"log":{"source":"tcp-default"},"ecs.version":"1.6.0","log.logger":"elasticsearch","log.origin":{"file.line":301,"file.name":"elasticsearch/client.go","function":"github.com/elastic/beats/v7/libbeat/outputs/elasticsearch.(*Client).doBulkRequest"},"service.name":"filebeat","ecs.version":"1.6.0"}
 {"log.level":"debug","@timestamp":"2024-09-17T01:48:48.750Z","message":"doBulkRequest: 50 events have been sent to elasticsearch in 106.270895ms.","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"tcp-default","type":"tcp"},"log":{"source":"tcp-default"},"log.logger":"elasticsearch","log.origin":{"file.line":301,"file.name":"elasticsearch/client.go","function":"github.com/elastic/beats/v7/libbeat/outputs/elasticsearch.(*Client).doBulkRequest"},"service.name":"filebeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
 {"log.level":"debug","@timestamp":"2024-09-17T01:48:49.015Z","message":"doBulkRequest: 50 events have been sent to elasticsearch in 264.559873ms.","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"tcp-default","type":"tcp"},"log":{"source":"tcp-default"},"log.origin":{"file.line":301,"file.name":"elasticsearch/client.go","function":"github.com/elastic/beats/v7/libbeat/outputs/elasticsearch.(*Client).doBulkRequest"},"service.name":"filebeat","ecs.version":"1.6.0","log.logger":"elasticsearch","ecs.version":"1.6.0"}
 {"log.level":"debug","@timestamp":"2024-09-17T01:48:49.807Z","message":"doBulkRequest: 30 events have been sent to elasticsearch in 53.129321ms.","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"tcp-default","type":"tcp"},"log":{"source":"tcp-default"},"ecs.version":"1.6.0","log.logger":"elasticsearch","log.origin":{"file.line":301,"file.name":"elasticsearch/client.go","function":"github.com/elastic/beats/v7/libbeat/outputs/elasticsearch.(*Client).doBulkRequest"},"service.name":"filebeat","ecs.version":"1.6.0"}

The last of those doBulkRequest messages is the final bulk operation performed; one minute later there is an error connecting to Elasticsearch:

 {"log.level":"debug","@timestamp":"2024-09-17T01:49:49.810Z","message":"Error reading from connection: read tcp 172.18.0.6:36554->172.18.0.3:9200: use of closed network connection","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"tcp-default","type":"tcp"},"log":{"source":"tcp-default"},"log.logger":"esclientleg","network":"tcp","ecs.version":"1.6.0","log.origin":{"file.line":51,"file.name":"transport/logging.go","function":"github.com/elastic/elastic-agent-libs/transport.(*loggingConn).Read"},"service.name":"filebeat","address":"elasticsearch:9200","ecs.version":"1.6.0"}

And later there is another error, on a different port (from the tls service container that has been killed by elastic-package with a SIGHUP signal?):

 {"log.level":"debug","@timestamp":"2024-09-17T01:51:18.414Z","message":"Client error","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"tcp-default","type":"tcp"},"log":{"source":"tcp-default"},"address":"0.0.0.0:9999","error":{"message":"read tcp 172.19.0.2:9999->172.19.0.3:39940: read: connection timed out"},"ecs.version":"1.6.0","log.logger":"tcp","log.origin":{"file.line":202,"file.name":"streaming/listener.go","function":"github.com/elastic/beats/v7/filebeat/inputsource/common/streaming.(*Listener).handleConnection"},"service.name":"filebeat","remote_address":"172.19.0.3:39940","ecs.version":"1.6.0"}

I just realized that there were also these kinds of errors for previous stack versions, starting on July 13th with 8.15.0-SNAPSHOT: #10481. I was able to reproduce it locally on 8.15.0-SNAPSHOT too.

Looking at the Buildkite stats, this package started to fail with the build run on July 13th. Older failures are not related; they were caused by artifacts-api, the stack failing to bootstrap, or other CI issues.

There are quite a few packages that use this same approach of ingesting logs via the stream docker container. For instance, the tines package expects 241 documents, and it is passing in the latest daily builds.

I'm not sure what could be causing not all of the expected documents to be ingested into Elasticsearch.
Any other ideas, @elastic/ecosystem?

mrodm (Contributor) commented Sep 20, 2024

Adding some more context/findings from the logs of that CI step, though nothing that points to the root cause of this error.

According to the logs sent by the tls service container (logs from build 15974), all expected lines were sent to the Elastic Agent (343 as defined in the configuration file):

 $ grep "Log data sent" buildkite_integrations_main-integrations-15974_insecure-logs_cyberarkpas_container-logs_cyberarkpas-audit-tls-1726538329934607865.log | awk -F '|' '{print $2}' | jq -r '.total_lines' | awk '{s+=$1} END {printf "%.0f\n", s}'
343

From Elasticsearch's point of view, there are no errors in that container in the time range between the last bulk operation (01:48:49) and the "Error reading from connection" message (01:49:49):

{"@timestamp":"2024-09-17T01:48:48.947Z", "log.level": "INFO", "message":"[.ds-logs-cyberarkpas.audit-94816-2024.09.17-000001/HqrdhoeiQzqGG34YTWlrHg] update_mapping [_doc]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[8697ca59cc19][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.cluster.metadata.MetadataMappingService","elasticsearch.cluster.uuid":"Hqgbt5zMRomDk4_megDCvA","elasticsearch.node.id":"SIEJUddDQdSe-JyIMVeidw","elasticsearch.node.name":"8697ca59cc19","elasticsearch.cluster.name":"elasticsearch"}
{"@timestamp":"2024-09-17T01:58:50.295Z", "log.level": "INFO", "message":"removing data stream [logs-cyberarkpas.audit-94816]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[8697ca59cc19][masterService#updateTask][T#4]","log.logger":"org.elasticsearch.datastreams.action.DeleteDataStreamTransportAction","elasticsearch.cluster.uuid":"Hqgbt5zMRomDk4_megDCvA","elasticsearch.node.id":"SIEJUddDQdSe-JyIMVeidw","elasticsearch.node.name":"8697ca59cc19","elasticsearch.cluster.name":"elasticsearch"}

And looking at the latest 5 builds at the moment of writing this, the number of documents ingested (hits found) into Elasticsearch is different for each build: 231, 180, 228, 225 and 236.
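As an aside, a quick way to pull that plateau count out of each build's elastic-package output is something like this (the output file name is a placeholder):

 $ grep -o 'found [0-9]* hits' elastic-package-output.log | awk '{print $2}' | sort -n | tail -n 1
 180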

chrisberkhout (Contributor) commented

I think this is the same as in #11075 and #11224 (although for the latter it's tcp as well as tls).

I was able to reproduce it locally by removing the system tests and containers that aren't for tls. That is, deleting these files:

data_stream/audit/_dev/test/system/test-logfile-config.yml
data_stream/audit/_dev/test/system/test-tcp-config.yml
data_stream/audit/_dev/test/system/test-udp-config.yml

And removing these services from _dev/deploy/docker/docker-compose.yml (a shell sketch of the cleanup follows the list):

cyberarkpas-audit-logfile
cyberarkpas-audit-udp
cyberarkpas-audit-tcp
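Roughly, the cleanup looks like this (a sketch; run from the package root, and the docker-compose edit still has to be done by hand):

 $ rm data_stream/audit/_dev/test/system/test-logfile-config.yml data_stream/audit/_dev/test/system/test-tcp-config.yml data_stream/audit/_dev/test/system/test-udp-config.yml
 $ $EDITOR _dev/deploy/docker/docker-compose.yml   # drop the cyberarkpas-audit-logfile, -udp and -tcp services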

The container that streams the data does indeed report sending all of the data. It's a little odd that the per-line log output is sometimes incomplete while the per-file logging appears to be complete. I think it is sending everything, but those debug log lines are getting delayed or limited somehow.

I tried different --delay settings for the log command and found that it didn't affect the error rate, which made me think that the problem is about missing the end of the data rather than the start.

I was able to reproduce the problem using just stream log to send and openssl s_server to receive. I filed an issue with the details: elastic/stream#120.
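For anyone who wants to attempt a similar standalone reproduction, here is a rough sketch (not the exact commands from elastic/stream#120; the stream flags are assumed from how the package's docker-compose invokes it, and the certificate, port, and input file are placeholders):

 $ openssl req -x509 -newkey rsa:2048 -nodes -keyout key.pem -out cert.pem -days 1 -subj '/CN=localhost'
 $ openssl s_server -accept 9999 -cert cert.pem -key key.pem -quiet > received.log
 $ stream log --addr localhost:9999 -p=tls --insecure audit.log   # flags assumed; check the stream README
 $ wc -l received.log audit.log   # compare how many lines actually arrived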
