Topic autocreation stalls producer for 10 seconds #114
Comments
I looked into this and will change some internal code to hopefully improve this situation.

Currently, an unknown topic triggers an immediate metadata update. When metadata requests are used to create topics, Kafka does not assign a leader immediately, so the topic / partitions have no leader. This is considered a "failure" internally, and since the metadata request came from an immediate trigger, the client immediately retries. This retry happens twice (three tries total), all very fast back to back, and Kafka has not assigned a leader in this time (<1s, per your logs). After three failures, the client falls back to loading metadata slowly (the metadata min age). This is the 10s pause you see, after which it requests again, and by that time Kafka has assigned a leader.

I'm going to change the code to not immediately re-issue a metadata request. The new code will only allow one extra retry when the metadata request has a true error (connection cut), since this case already retries 3x internally before getting to this point (it's retries all the way down). When the request is non-errored but has internal errors (no leader), I will allow retries up to a total of 8 requests, with a 250ms sleep between them. This gives immediate triggers 2s to be successful, rather than the prior basically 0s. As well, I'm going to change the default metadata min age to 5s. So, if Kafka still has not stabilized on a leader within 2s, the client will fall back to waiting 5s between attempts rather than 10s.

All told, this will increase the number of metadata requests on per-partition failures. Immediate triggers only happen on per-partition errors, so ideally the increased number of metadata requests only happens when things are going south, and it may help the client recover sooner.
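For illustration only, a simplified sketch of the retry shape described above; this is not the client's actual code, and waitForLeaders / refreshMetadata / metadataMinAge are hypothetical stand-ins:

// Hypothetical sketch of the described behavior: up to 8 fast attempts 250ms
// apart while metadata succeeds but partitions have no leader yet, then fall
// back to the slower metadata min age cadence (default moving to 5s).
func waitForLeaders(refreshMetadata func() bool, metadataMinAge time.Duration) {
	for attempt := 0; attempt < 8; attempt++ {
		if refreshMetadata() { // stand-in: true once all partitions have leaders
			return // succeeded within the ~2s window of fast retries
		}
		time.Sleep(250 * time.Millisecond)
	}
	// Still no leader: wait the configured metadata min age before trying again.
	time.Sleep(metadataMinAge)
}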
Should be closed with 8325ba7; if it is not, I'm not sure there's much more to do within the client to help this case.
(Commenting on this because it seems related.) I think this introduced an issue when you start a consumer and create the topic immediately after. It seems like it's sleeping for ~2.5s upon calling PollFetches.
^That sounds somewhat related, but I don't think there's much to be improved here. The client doesn't back off for 2.5s; it retries a few times and then waits 2.5s. It's possible that the partitions don't have a leader immediately?
Maybe the retries are too fast? Or maybe the 2.5s wait could be replaced by an exponential backoff. wdyt?
This retries every 250ms, 8 times in a row. What specific problem are you running into?
I wrote a snippet to illustrate what I think the issue is:

package repro

import (
	"context"
	"fmt"
	"sync"
	"testing"
	"time"

	"github.com/google/uuid"
	"github.com/twmb/franz-go/pkg/kadm"
	"github.com/twmb/franz-go/pkg/kgo"
)

func TestFranzPollHanging(t *testing.T) {
	topicID := uuid.NewString()
	// Create client used to consume.
	subCl, err := kgo.NewClient(
		kgo.SeedBrokers("localhost:9092"),
		kgo.ConsumeTopics(topicID),
	)
	if err != nil {
		t.Fatal(err)
	}
	// Start polling for fetches.
	wg := new(sync.WaitGroup)
	wg.Add(1)
	go func() {
		fmt.Println(">fetch", time.Now())
		for {
			cctx, cancel := context.WithTimeout(context.Background(), 300*time.Millisecond)
			rs := subCl.PollFetches(cctx)
			cancel() // cancel promptly; deferring inside the loop would pile up contexts
			if rs.Err() == nil {
				fmt.Println("<fetch", time.Now())
				wg.Done()
				break
			}
			fmt.Println(" fetch", rs.Err())
		}
	}()
	// Create the topic (error handling omitted for brevity in this repro).
	admCl := kadm.NewClient(subCl)
	admCl.CreateTopics(context.Background(), 1, 1, nil, topicID)
	// Produce with a new client.
	fmt.Println(">produce", time.Now())
	pubCl, err := kgo.NewClient(kgo.SeedBrokers("localhost:9092"))
	if err != nil {
		t.Fatal(err)
	}
	pubCl.ProduceSync(context.Background(), &kgo.Record{Value: []byte("waiting"), Topic: topicID})
	fmt.Println("<produce", time.Now())
	wg.Wait()
}

This outputs the following:
As you can see, the producer runs virtually instantly after topic creation and 100ms after the fetch was called, but the fetch hangs for 2.5s. I'd expect it to return sooner, probably after 250ms, during the first retry. My hypothesis is that the retry logic is broken - maybe some caching issue?
This is benign behavior that would only be encountered when creating a client and consuming immediately after topic creation. However, the fix is small, so I've done so here: f981856

I still expect the client to take a bit if you, for example, use regex consuming and create a topic; the client only periodically updates metadata. As well, if your topic doesn't exist, you try to consume it, and you only create the topic a minute later, it will still take the min metadata wait to discover it.
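If the startup delay matters in a setup like the snippet above, one workaround is to create the topic before constructing the consuming client, so its first metadata load already sees a leader. A rough sketch, reusing the kgo/kadm calls from the snippet (the broker address and topic name are illustrative, and error handling is simplified):

// Create the topic first with a throwaway bootstrap client, then build the consumer.
boot, err := kgo.NewClient(kgo.SeedBrokers("localhost:9092"))
if err != nil {
	panic(err)
}
adm := kadm.NewClient(boot)
if _, err := adm.CreateTopics(context.Background(), 1, 1, nil, "my-topic"); err != nil {
	panic(err) // a pre-existing topic surfaces as a per-topic error in the response
}
boot.Close()

// The consumer's first metadata request should now find the topic with a leader.
consumer, err := kgo.NewClient(
	kgo.SeedBrokers("localhost:9092"),
	kgo.ConsumeTopics("my-topic"),
)
if err != nil {
	panic(err)
}
defer consumer.Close()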
We are using topic autocreation with franz-go and we experience a noticeable delay when producing the first message to a non-existing topic with the ProduceSync method. After some debugging I found that reducing MetadataMinAge reduces this delay as well. Is that expected behavior?

Relevant client configuration:
franz-go version: v1.2.3
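This is not the reporter's actual configuration (which is not shown above), but a minimal sketch of the kind of tuning being described, assuming the kgo.MetadataMinAge option mentioned in the report and an illustrative broker address:

// Hypothetical client with a lower metadata refresh floor, which the report
// says reduces the delay on the first produce to a not-yet-existing topic.
cl, err := kgo.NewClient(
	kgo.SeedBrokers("localhost:9092"),  // illustrative address
	kgo.MetadataMinAge(2*time.Second),  // default discussed above is 5s (was 10s)
)
if err != nil {
	panic(err)
}
defer cl.Close()

res := cl.ProduceSync(context.Background(), &kgo.Record{Topic: "new-topic", Value: []byte("hi")})
if err := res.FirstErr(); err != nil {
	panic(err)
}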