Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Elasticsearch output does not recover after connection failure #40705

Closed
belimawr opened this issue Sep 6, 2024 · 17 comments
Closed

Elasticsearch output does not recover after connection failure #40705

belimawr opened this issue Sep 6, 2024 · 17 comments
Assignees
Labels
bug Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team

Comments

@belimawr
Copy link
Contributor

belimawr commented Sep 6, 2024

For confirmed bugs, please report:

  • Version: main, v8.15.1
  • Operating System: Linux

Update: The problematic part of the change was reverted in #40776 and is targeted for release in 8.15.2.

Steps to reproduce

  1. Deploy a Filebeat sending data to a remote Elasticsearch using a domain name in the configuration
  2. Confirm Filebeat is running and sending data
  3. Disable the network
  4. Wait for a DNS lookup error and the publisher errors:
    {
      "log.level": "warn",
      "@timestamp": "2024-09-06T08:30:33.240-0400",
      "log.logger": "transport",
      "log.origin": {
        "function": "github.com/elastic/elastic-agent-libs/transport.TestNetDialer.func1",
        "file.name": "transport/tcp.go",
        "file.line": 53
      },
      "message": "DNS lookup failure \"remote-es.elastic.cloud\": lookup remote-es.elastic.cloud: Temporary failure in name resolution",
      "service.name": "filebeat",
      "ecs.version": "1.6.0"
    }
    {
      "log.level": "error",
      "@timestamp": "2024-09-06T15:50:15.140-0400",
      "log.logger": "publisher_pipeline_output",
      "log.origin": {
        "function": "github.com/elastic/beats/v7/libbeat/publisher/pipeline.(*netClientWorker).run",
        "file.name": "pipeline/client_worker.go",
        "file.line": 148
      },
      "message": "Failed to connect to backoff(elasticsearch(https://remote-es.elastic.cloud:443)): Get \"https://remote-es.elastic.cloud:443\": context canceled",
      "service.name": "filebeat",
      "ecs.version": "1.6.0"
    }
    {
      "log.level": "info",
      "@timestamp": "2024-09-06T15:50:15.140-0400",
      "log.logger": "publisher_pipeline_output",
      "log.origin": {
        "function": "github.com/elastic/beats/v7/libbeat/publisher/pipeline.(*netClientWorker).run",
        "file.name": "pipeline/client_worker.go",
        "file.line": 139
      },
      "message": "Attempting to reconnect to backoff(elasticsearch(https://remote-es.elastic.cloud:443)) with 475 reconnect attempt(s)",
      "service.name": "filebeat",
      "ecs.version": "1.6.0"
    }
  5. Enable the network
  6. Ensure the machine can reach the internet and the remote Elasticsearch
  7. Filebeat will keep logging the same publisher errors and not sending data

The configuration I used:

filebeat.inputs:
  - type: filestream
    id: my-filestream-id
    paths:
      - /tmp/some-logs.txt

output.elasticsearch:
  hosts: ["https://remote-es.elastic.cloud:443"]
  username: elastic
  password: some-very-secret-password
@belimawr belimawr added bug Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team labels Sep 6, 2024
@elasticmachine
Copy link
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@cmacknz
Copy link
Member

cmacknz commented Sep 9, 2024

Get "https://remote-es.elastic.cloud:443\": context canceled

This isn't a DNS error, so to me it looks like we did recover from the DNS failure but something else is wrong, perhaps a context that expired when it shouldn't have somewhere in one of the initial connection callbacks. The Get \"https://remote-es.elastic.cloud:443\" is likely to get the version and deployment type.

esVersion := conn.GetVersion()

func (conn *Connection) execRequest(
method, url string,
body io.Reader,
) (int, []byte, error) {
req, err := http.NewRequestWithContext(conn.reqsContext, method, url, body)
if err != nil {
conn.log.Warnf("Failed to create request %+v", err)
return 0, nil, err
}

// Close closes a connection.
func (conn *Connection) Close() error {
conn.HTTP.CloseIdleConnections()
conn.cancelReqs()
return nil
}

Perhaps we are closing the connection after the DNS failure instead of reusing it.

@belimawr
Copy link
Contributor Author

belimawr commented Sep 9, 2024

That makes sense. I didn't investigate the issue, I only managed to reproduce and report it and the DNS error seemed to be the trigger for not being able to connect to ES again.

@cmacknz cmacknz changed the title Elasticsearch output does not recover from DNS lookup failure Elasticsearch output does not recover after DNS lookup failure Sep 9, 2024
@cmacknz cmacknz changed the title Elasticsearch output does not recover after DNS lookup failure Elasticsearch output does not recover after connection failure Sep 11, 2024
@cmacknz
Copy link
Member

cmacknz commented Sep 11, 2024

Potentially related: #40572

@cmacknz
Copy link
Member

cmacknz commented Sep 11, 2024

^Looking at the changes there I see:

// Close closes a connection.
func (conn *Connection) Close() error {
	conn.HTTP.CloseIdleConnections()
+	conn.cancelReqs()
	return nil
}

Creating a new context only happens when NewConnection is called and there must be code paths that do not do that properly after close is called.

This change is only in 8.15.1 from what I can tell.

@cmacknz
Copy link
Member

cmacknz commented Sep 11, 2024

Linking the 8.15.0 backport: b19844f

git tag --contains b19844ffdae6861feaf2ee02ce11936d80b243cb
v8.15.1

@cmacknz
Copy link
Member

cmacknz commented Sep 11, 2024

The first thing we need to do is write a test that reproduces this problem.

@cmacknz
Copy link
Member

cmacknz commented Sep 11, 2024

I think we should revert #40572 once we double check removing it fixes the problem, then work on re-adding it back with a fix + test for this.

This issue is more severe than the original problem that PR was trying to fix.

@cmacknz
Copy link
Member

cmacknz commented Sep 11, 2024

FYI @marc-gr

@belimawr
Copy link
Contributor Author

I've been testing and you're correct Craig the conn.cancelRequs() is the issue. The Close method from the connection

func (conn *Connection) Close() error {
conn.HTTP.CloseIdleConnections()
conn.cancelReqs()
return nil
}

Did not use to cancel any in-flight requests nor it rendered the connection unusable.

#40572 cannot be automatically reverted :/, I'll create a PR removing the culprit line, which effectively restores the old behaviour from the Elasticsearch client of not cancelling in-flight requests. Everything else added by #40572 should still work fine.

belimawr added a commit to belimawr/beats that referenced this issue Sep 11, 2024
This commit removes a call to conn.cancelReqs() that causes the
Connection to be unusable after Close() is called, leading to the bug
described by elastic#40705.
@belimawr
Copy link
Contributor Author

The "revert-ish" PR: #40769

@belimawr
Copy link
Contributor Author

Investigating more, I found the root cause of the issue. On error our backoffClient will call Close() on the client:

func (b *backoffClient) Publish(ctx context.Context, batch publisher.Batch) error {
err := b.client.Publish(ctx, batch)
if err != nil {
b.client.Close()
}
backoff.WaitOnError(b.backoff, err)
return err
}

The Client then calls Close in the connection:

func (client *Client) Close() error {
return client.conn.Close()
}

The connection Close method is

func (conn *Connection) Close() error {
conn.HTTP.CloseIdleConnections()
conn.cancelReqs()
return nil
}

When #40572 was merged, the call to conn.cancelReqs() was introduced, which cancels the context created by NewConnection

ctx, cancelFunc := context.WithCancel(context.Background())
conn := Connection{
ConnectionSettings: s,
HTTP: esClient,
Encoder: encoder,
log: logger,
responseBuffer: bytes.NewBuffer(nil),
reqsContext: ctx,
cancelReqs: cancelFunc,
}

that is used in every request (L404)

func (conn *Connection) execRequest(
method, url string,
body io.Reader,
) (int, []byte, error) {
req, err := http.NewRequestWithContext(conn.reqsContext, method, url, body)
if err != nil {
conn.log.Warnf("Failed to create request %+v", err)
return 0, nil, err
}
if body != nil {
conn.Encoder.AddHeader(&req.Header)
}
return conn.execHTTPRequest(req)
}

and never recreated, which renders the whole Connection unusable, which was not the old behavour.

belimawr added a commit that referenced this issue Sep 11, 2024
This commit removes a call to conn.cancelReqs() that causes the
Connection to be unusable after Close() is called, leading to the bug
described by #40705.
mergify bot pushed a commit that referenced this issue Sep 11, 2024
This commit removes a call to conn.cancelReqs() that causes the
Connection to be unusable after Close() is called, leading to the bug
described by #40705.

(cherry picked from commit b0e4f85)
mergify bot pushed a commit that referenced this issue Sep 11, 2024
This commit removes a call to conn.cancelReqs() that causes the
Connection to be unusable after Close() is called, leading to the bug
described by #40705.

(cherry picked from commit b0e4f85)
belimawr added a commit that referenced this issue Sep 12, 2024
This commit removes a call to conn.cancelReqs() that causes the
Connection to be unusable after Close() is called, leading to the bug
described by #40705.

(cherry picked from commit b0e4f85)

Co-authored-by: Tiago Queiroz <[email protected]>
belimawr added a commit that referenced this issue Sep 12, 2024
This commit removes a call to conn.cancelReqs() that causes the
Connection to be unusable after Close() is called, leading to the bug
described by #40705.

(cherry picked from commit b0e4f85)

Co-authored-by: Tiago Queiroz <[email protected]>
@marc-gr
Copy link
Contributor

marc-gr commented Sep 12, 2024

@belimawr thanks for taking a look, would be enough to recreate the context on close to make the client reusable? I think just removing the call to cancelReqs might make the stop racey since IIRC this was the main reason publishers were not closed before on stop.

@belimawr
Copy link
Contributor Author

would be enough to recreate the context on close to make the client reusable?

Mostly, the same instance of the connection is also used in a callback, so I also made sure they both hold a pointer reference so when the context is recreated both can use the new one.

I think just removing the call to cancelReqs might make the stop racey since IIRC this was the main reason publishers were not closed before on stop.

The PR removing the call to cancelReqs was just a quick patch to keep main releasable, I've just created a new PR with the proper fix.

@cmacknz
Copy link
Member

cmacknz commented Sep 20, 2024

The problematic part of the change that introduced this was reverted in reverted in #40776 and is targeted for release in 8.15.2.

Reverting this change will likely bring back these two bugs which were fixed in the change that introduced this problem:

The PR linked to this issue now #40794 introduces a regression test for this situation and also brings in a fix for the two problems above that handles interrupted connections properly.

I have updated the description to note that this issue will not exist in 8.15.2.

@cmacknz
Copy link
Member

cmacknz commented Sep 20, 2024

I am going to close this to indicate we have resolved the underlying issue in the upcoming 8.15.2 release and move tracking the remaining work separately.

@cmacknz
Copy link
Member

cmacknz commented Sep 20, 2024

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team
Projects
None yet
Development

No branches or pull requests

4 participants