Quarkus Rest Client experiences massive slowdowns under concurrent load starting from Quarkus 3.13.0 #42855

Closed
derkork opened this issue Aug 29, 2024 · 9 comments
Labels
area/kotlin, area/rest-client, kind/bug, triage/out-of-date

Comments

derkork commented Aug 29, 2024

Describe the bug

We have a project where we use the Quarkus REST Client to call other services from within a service of our own. We use a simple REST client like this:

import jakarta.ws.rs.GET
import jakarta.ws.rs.Path
import org.eclipse.microprofile.rest.client.inject.RegisterRestClient

@RegisterRestClient(configKey = "test-client")
interface TestRestClient {

    @GET
    @Path("/external/hello")
    suspend fun helloExternal(): String
}

We then call this Rest client from within a controller of our service:

@Path("/hello")
class GreetingResource(@RestClient private val testRestClient: TestRestClient) {

    @GET
    @Produces(MediaType.TEXT_PLAIN)
    suspend fun hello(): String {
        var result: String
        val millis = measureTimeMillis {
            result = testRestClient.helloExternal()
        }
        Log.info("Upstream call took $millis ms")
        return result
    }
}

When we make a sufficient number of concurrent calls with this REST client, we experience massive slowdowns starting with Quarkus 3.13.0: calls that usually take 300-400 ms start to take 4000-5000 ms under load. The behaviour is only reproducible when putting some load on the client, and only in Quarkus versions 3.13.0 or later. In our specific case we migrated from version 3.11.3 to 3.13.0, but I have tested all versions up to 3.13.0; the last "good" version was 3.12.3.

Expected behavior

The client's response times do not deteriorate under heavy load, given that the external system responds in a timely fashion.

Actual behavior

The client's response times become much slower the more concurrent calls the client is executing.

Output from 3.13.3 under heavy load (50 concurrent calls; note how response times deteriorate as more requests are in flight):

2024-08-29 10:53:10,291 INFO  [org.acm.GreetingResource] (vert.x-eventloop-thread-0 @coroutine#53) Upstream call took 404 ms
2024-08-29 10:53:10,292 INFO  [org.acm.GreetingResource] (vert.x-eventloop-thread-0 @coroutine#54) Upstream call took 357 ms
2024-08-29 10:53:10,292 INFO  [org.acm.GreetingResource] (vert.x-eventloop-thread-1 @coroutine#55) Upstream call took 342 ms
2024-08-29 10:53:10,293 INFO  [org.acm.GreetingResource] (vert.x-eventloop-thread-1 @coroutine#52) Upstream call took 406 ms
2024-08-29 10:53:10,293 INFO  [org.acm.GreetingResource] (vert.x-eventloop-thread-0 @coroutine#56) Upstream call took 344 ms
2024-08-29 10:53:10,541 INFO  [org.acm.GreetingResource] (vert.x-eventloop-thread-1 @coroutine#57) Upstream call took 590 ms

...

2024-08-29 10:53:11,579 INFO  [org.acm.GreetingResource] (vert.x-eventloop-thread-0 @coroutine#85) Upstream call took 1621 ms
2024-08-29 10:53:11,826 INFO  [org.acm.GreetingResource] (vert.x-eventloop-thread-1 @coroutine#78) Upstream call took 1868 ms
2024-08-29 10:53:11,829 INFO  [org.acm.GreetingResource] (vert.x-eventloop-thread-1 @coroutine#80) Upstream call took 1870 ms
2024-08-29 10:53:11,832 INFO  [org.acm.GreetingResource] (vert.x-eventloop-thread-0 @coroutine#87) Upstream call took 1873 ms
2024-08-29 10:53:11,833 INFO  [org.acm.GreetingResource] (vert.x-eventloop-thread-0 @coroutine#89) Upstream call took 1874 ms
2024-08-29 10:53:11,835 INFO  [org.acm.GreetingResource] (vert.x-eventloop-thread-1 @coroutine#82) Upstream call took 1876 ms

...

2024-08-29 10:53:12,602 INFO  [org.acm.GreetingResource] (vert.x-eventloop-thread-1 @coroutine#96) Upstream call took 2638 ms
2024-08-29 10:53:12,603 INFO  [org.acm.GreetingResource] (vert.x-eventloop-thread-1 @coroutine#100) Upstream call took 2637 ms

Output from 3.11.3 (50 concurrent calls; there is also a slight deterioration here, but that may be caused by the upstream server not responding fast enough):

2024-08-29 10:56:00,275 INFO  [org.acm.GreetingResource] (vert.x-eventloop-thread-0 @coroutine#63) Upstream call took 330 ms
2024-08-29 10:56:00,275 INFO  [org.acm.GreetingResource] (vert.x-eventloop-thread-1 @coroutine#54) Upstream call took 343 ms
2024-08-29 10:56:00,277 INFO  [org.acm.GreetingResource] (vert.x-eventloop-thread-0 @coroutine#65) Upstream call took 331 ms
2024-08-29 10:56:00,277 INFO  [org.acm.GreetingResource] (vert.x-eventloop-thread-1 @coroutine#62) Upstream call took 334 ms

...

2024-08-29 10:56:00,529 INFO  [org.acm.GreetingResource] (vert.x-eventloop-thread-1 @coroutine#82) Upstream call took 579 ms
2024-08-29 10:56:00,529 INFO  [org.acm.GreetingResource] (vert.x-eventloop-thread-1 @coroutine#83) Upstream call took 578 ms
2024-08-29 10:56:00,532 INFO  [org.acm.GreetingResource] (vert.x-eventloop-thread-0 @coroutine#71) Upstream call took 584 ms
2024-08-29 10:56:00,533 INFO  [org.acm.GreetingResource] (vert.x-eventloop-thread-0 @coroutine#79) Upstream call took 583 ms

...

2024-08-29 10:56:00,793 INFO  [org.acm.GreetingResource] (vert.x-eventloop-thread-0 @coroutine#96) Upstream call took 838 ms
2024-08-29 10:56:00,793 INFO  [org.acm.GreetingResource] (vert.x-eventloop-thread-0 @coroutine#101) Upstream call took 837 ms

How to Reproduce?

Reproducer:

https://github.com/derkork/quarkus-performance-issue-reproducer

The project consists of a minimal server, a client, and a test. The test fires up a WireMock server that simulates a 200 ms delay in the upstream system and then runs 10 rounds of 50 concurrent requests against the Quarkus application. It then asserts that the average response time is below 1000 ms.

The committed version is set up to use Quarkus 3.13.3, and with that version the test should fail. If you change the Quarkus version in the pom.xml to 3.11.3, the test passes and the response times improve significantly.
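
For readers who don't want to open the reproducer right away, here is a minimal sketch of what such a load test could look like. This is not the reproducer's actual test code: the endpoint URL and client setup are assumptions, only the round count (10), the concurrency (50), and the 1000 ms threshold come from the description above, and it leaves out the WireMock stub that simulates the 200 ms upstream delay.

import java.net.URI
import java.net.http.HttpClient
import java.net.http.HttpRequest
import java.net.http.HttpResponse
import kotlin.system.measureTimeMillis
import kotlinx.coroutines.async
import kotlinx.coroutines.awaitAll
import kotlinx.coroutines.coroutineScope
import kotlinx.coroutines.future.await
import kotlinx.coroutines.runBlocking

fun main() = runBlocking {
    val client = HttpClient.newHttpClient()
    // Hypothetical endpoint of the Quarkus application under test.
    val request = HttpRequest.newBuilder(URI.create("http://localhost:8080/hello")).build()

    val allTimings = mutableListOf<Long>()
    repeat(10) { round ->
        // One round: fire 50 concurrent requests and measure each call's wall-clock time.
        val timings = coroutineScope {
            (1..50).map {
                async {
                    measureTimeMillis {
                        client.sendAsync(request, HttpResponse.BodyHandlers.ofString()).await()
                    }
                }
            }.awaitAll()
        }
        allTimings += timings
        println("Round ${round + 1}: average ${timings.average().toInt()} ms")
    }
    // The reproducer's assertion, as described above: the average response time stays below 1000 ms.
    check(allTimings.average() < 1000) { "Average response time exceeded 1000 ms" }
}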

Output of uname -a or ver

Darwin .local 23.5.0 Darwin Kernel Version 23.5.0: Wed May 1 20:09:52 PDT 2024; root:xnu-10063.121.3~5/RELEASE_X86_64 x86_64

Output of java -version

openjdk version "22.0.2" 2024-07-16

Quarkus version or git rev

3.13.0 or later

Build tool (ie. output of mvnw --version or gradlew --version)

Apache Maven 3.9.8 (36645f6c9b5079805ea5009217e36f2cffd34256)

Additional information

This also happens in the native build of the software running in a Docker container; it is not limited to macOS. We have seen this in production under heavy load, although it doesn't seem to be consistently reproducible in every environment.

derkork added the kind/bug label on Aug 29, 2024

quarkus-bot bot commented Aug 29, 2024

/cc @cescoffier (rest-client), @geoand (kotlin,rest-client)

gsmet commented Aug 29, 2024

Could you check whether #42823 makes any difference for you?

gsmet commented Aug 29, 2024

From what I can see, things are better with 999-SNAPSHOT, but I'm not entirely sure they are on par with the previous behavior.

derkork commented Aug 29, 2024

I'm currently building it on my machine.

gsmet commented Aug 29, 2024

I think what's remaining might be due to #42858, but from what I can see, the behavior is already a lot saner in main.

derkork commented Aug 29, 2024

Hmm, how would I use this locally? I did a ./mvnw install, but it looks like Maven cannot resolve the 999-SNAPSHOT dependencies.

geoand commented Aug 29, 2024

You need to use:

      <dependency>
        <groupId>io.quarkus</groupId>
        <artifactId>quarkus-bom</artifactId>
        <version>999-SNAPSHOT</version>
        <type>pom</type>
        <scope>import</scope>
      </dependency>

as the BOM
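
Note that for the 999-SNAPSHOT BOM to resolve, the Quarkus main branch has to be built and installed into the local Maven repository first (the ./mvnw install mentioned above, or a plain ./mvnw clean install -DskipTests of the Quarkus repository).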

derkork commented Aug 29, 2024

For me, 999-SNAPSHOT (avg. 580 ms) is slightly faster than 3.11.3 (avg. 620 ms), so it would seem the patches have fixed the problem and even slightly improved the speed (at least on my machine).

geoand commented Aug 29, 2024

Cool, thanks a lot for checking!

Given that #42823 is already marked for backport, let's go ahead and close this.

geoand closed this as completed on Aug 29, 2024
geoand added the triage/out-of-date label on Aug 29, 2024