Gateway 2.X performance issues #1861

Closed · v3i1y opened this issue May 16, 2022 · 12 comments

@v3i1y commented May 16, 2022

We are using Apollo Server 3.6.7 with Apollo Gateway 2.0.2. We are seeing longer event loop delays and generally fewer event loop iterations after the upgrade.

This is a graph of our runtime metrics before and after we shipped the upgrade.
[Screenshot: runtime metrics (event loop delay and event loop iterations) before and after the upgrade]

The performance of our graph took a hit after the upgrade, and I suspect this is the cause: our P99 doubled and our P95 is 1.3x longer.

Another observation from this graph (that I couldn't quite explain) is that before the upgrade, peaks in event loop delay correspond to dips in event loop iterations, which makes sense; but after the upgrade, dips in event loop delay correspond to even fewer event loop iterations.
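
For context on how this kind of metric can be collected: Node's built-in perf_hooks module can sample event loop delay in-process. The sketch below is a minimal example, not necessarily the metrics pipeline behind the graph above.

```ts
import { monitorEventLoopDelay } from "perf_hooks";

// Sample event loop delay with Node's built-in histogram (values are nanoseconds).
const histogram = monitorEventLoopDelay({ resolution: 20 });
histogram.enable();

// Log p50/p99/max delay every 10 seconds, then reset the window.
setInterval(() => {
  console.log({
    p50DelayMs: histogram.percentile(50) / 1e6,
    p99DelayMs: histogram.percentile(99) / 1e6,
    maxDelayMs: histogram.max / 1e6,
  });
  histogram.reset();
}, 10_000);
```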

@benweatherman (Contributor) commented

Howdy @v3i1r4in! Thanks for upgrading and filing this issue! Really sorry there are some performance problems. While we look into this, do you mind giving a few extra details?

  • What version of node are you running?
  • What version of the gateway are you upgrading from?
  • Can you share the code/config you use for starting the gateway?
  • Any more info you can give about subgraph setup and types of queries being run? We want to ensure our query plan caching is working as intended.

@v3i1y (Author) commented May 17, 2022

Here are the answers:

  • We are running Node 16.
  • We upgraded from 0.48.1. The gateway was the only lib that was upgraded; Apollo Server is running the same version as before.
  • The only config we are passing in is buildService, where we use a customized version of RemoteGraphQLDataSource (roughly like the sketch below).
  • I can't share these publicly, but we are an enterprise client of Apollo Studio. Maybe @lennyburdette can help here.
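
For reference, the overall shape of the setup is something like the sketch below; the class name and header are placeholders rather than our real code:

```ts
import { ApolloServer } from "apollo-server";
import { ApolloGateway, RemoteGraphQLDataSource } from "@apollo/gateway";

// Placeholder stand-in for the customized data source mentioned above.
class CustomDataSource extends RemoteGraphQLDataSource {
  willSendRequest({ request, context }: { request: any; context: any }) {
    // e.g. forward a header to every subgraph request
    request.http?.headers.set("x-request-id", context?.requestId ?? "unknown");
  }
}

const gateway = new ApolloGateway({
  // Managed federation: Apollo Studio supplies the supergraph via the
  // APOLLO_KEY / APOLLO_GRAPH_REF environment variables.
  buildService({ url }) {
    return new CustomDataSource({ url });
  },
});

const server = new ApolloServer({ gateway });
server.listen().then(({ url }) => console.log(`Gateway ready at ${url}`));
```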

@v3i1y (Author) commented May 17, 2022

Hey, I want to add another observation: the performance seems to improve a little bit when the service redeploys. This is really weird given all the caching that's happening; I would expect the reverse to be true.
[Screenshot: runtime metrics showing a brief performance improvement right after the service redeploys]

@benweatherman (Contributor) commented

I believe this is happening because fetching from subgraphs now has maxSockets: Infinity as a result of the fix for #1647. If you update the remote data source fetcher to set maxSockets, performance should improve. I used 100 below, but you may need to experiment to find the value that works best; 15 was the previous default.

import { RemoteGraphQLDataSource } from "@apollo/gateway";
import { Fetcher } from "@apollo/utils.fetcher";
import * as makeFetchHappen from "make-fetch-happen";

class CustomDataSource extends RemoteGraphQLDataSource {
  fetcher: Fetcher = makeFetchHappen.defaults({ maxSockets: 100 });
}

Investigation

During my testing, I found similar performance characteristics between 0.48.1, 2.0.1, and 2.0.2 when a maxSockets value was set. I could serve 3-4x the number of requests with 2.0.2, at the expense of per-request performance.

Stats for maxSockets: 15


➜ wrk --threads 25 --connections 250 --duration 30s --header 'Content-type: application/json' --script wrk-custom.lua --latency http://localhost:4000
Running 30s test @ http://localhost:4000
  25 threads and 250 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.46s   170.25ms   1.81s    84.67%
    Req/Sec    16.63     13.20    60.00     56.74%
  Latency Distribution
     50%    1.47s
     75%    1.56s
     90%    1.62s
     99%    1.74s
  5029 requests in 30.08s, 1.49MB read
  Socket errors: connect 0, read 117, write 0, timeout 0
Requests/sec:    167.18
Transfer/sec:     50.77KB

Stats for maxSockets: Infinity


➜ wrk --threads 25 --connections 250 --duration 30s --header 'Content-type: application/json' --script wrk-custom.lua --latency http://localhost:4000
Running 30s test @ http://localhost:4000
  25 threads and 250 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   287.68ms  316.44ms   2.00s    88.18%
    Req/Sec    36.40     19.12   111.00     68.49%
  Latency Distribution
     50%  155.45ms
     75%  356.14ms
     90%  664.58ms
     99%    1.51s
  26393 requests in 30.10s, 8.06MB read
  Socket errors: connect 0, read 100, write 3, timeout 313
Requests/sec:    876.71
Transfer/sec:    274.27KB

Why does performance improve on deploys?

I think the reason we're seeing improved performance during the deploy is that we either have more pods servicing requests or individual pods are serving fewer requests while the routing layer routes new requests to new pods. Does performance improve if you scale up the number of gateway pods?

@v3i1y (Author) commented May 18, 2022

Hey, thanks for the detailed investigation! That makes a lot of sense. We discovered yesterday that it might be something related to the fetcher: we were using our own customized HTTP agent for connection and keep-alive behavior and were passing it in through willSendRequest. However, it looks like the new version of RemoteGraphQLDataSource no longer passes that down to the fetcher (before vs after). We are rolling out a fix that passes the agent through the constructor; hopefully this will fix the perf issue.

We are currently perf testing on one pod, so no scale-up yet.

@benweatherman (Contributor) commented

The change to no longer allow extra stuff to be passed down to the HTTP agent (such as via willSendRequest) is also related to #1647, as we tried to standardize on some types that work everywhere: https://github.com/apollographql/apollo-utils/blob/af6ddf1ad25820a035fa22df8f92f686e90a935a/packages/fetcher/src/index.ts#L1-L11. Hopefully passing a fetcher to the constructor is an easy enough change, and hopefully it helps with this performance issue.
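
A rough sketch of what that change might look like (the maxSockets value is just an example to tune):

```ts
import { ApolloGateway, RemoteGraphQLDataSource } from "@apollo/gateway";
import { Fetcher } from "@apollo/utils.fetcher";
import * as makeFetchHappen from "make-fetch-happen";

// Configure connection behavior on the fetcher itself, instead of passing a
// custom http.Agent through willSendRequest.
const fetcher: Fetcher = makeFetchHappen.defaults({ maxSockets: 100 });

const gateway = new ApolloGateway({
  // ...plus your usual supergraph / managed federation configuration
  buildService({ url }) {
    // The constructor assigns config fields onto the instance, so this fetcher
    // is used for requests to this subgraph.
    return new RemoteGraphQLDataSource({ url, fetcher });
  },
});
```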

@theJC (Contributor) commented May 20, 2022

@v3i1r4in -- please let us know what maxSockets value you change to, and also approximately how many unique host/port combos you have across all of the subgraphs your gateway connects to. That has a bearing on performance, since it determines how many total sockets can actually be created.
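
For example, something like the snippet below (with placeholder URLs) gives that count; assuming maxSockets is applied per host/port origin, as with Node's http.Agent, the total socket ceiling is roughly maxSockets × unique origins:

```ts
// Placeholder subgraph routing URLs -- substitute your real ones.
const subgraphUrls = [
  "https://api.internal.example:4001/accounts/graphql",
  "https://api.internal.example:4001/products/graphql",
  "https://reviews.internal.example:4002/graphql",
];

// Count unique host:port combinations across all subgraphs.
const origins = new Set(
  subgraphUrls.map((raw) => {
    const { hostname, port, protocol } = new URL(raw);
    return `${hostname}:${port || (protocol === "https:" ? "443" : "80")}`;
  }),
);

console.log(`unique host/port combos: ${origins.size}`); // 2 for the list above
```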

@benweatherman self-assigned this Jun 9, 2022
@smyrick (Member) commented Jun 16, 2022

I am linking this here for reference. There is still more investigation to be done, but there have been some noticeable perf differences between graphql-js v15 and v16, which also came in with Gateway 2.

One contributing factor may be the TS compilation target: graphql/graphql-js#3648

@benweatherman (Contributor) commented

Howdy all, just wanted to provide an update on what's been going on with the performance investigation in v2. It's worth noting that there will be different solutions depending on your version, subgraph topology, and schema & operation size. There are a few threads we're pulling on:

  1. There are performance differences between graphql@15 and graphql@16, and we're tracking down a few causes:
    1. Changing the TS compilation target from es5 to es2020 causes a slowdown in the graphql-js benchmarks ("Big performance drop due to switching TS output from ES5 to ES6", graphql/graphql-js#3648). In an unrelated investigation, an Apollo engineer saw the following (which we have not investigated further):

      Node.js was especially slow to initialize native class properties, which seems relevant here because class syntax was previously compiled to ES5 constructor functions, but is now native/uncompiled (with es2020).

    2. Pre-caching visit internals seems to cause a slowdown for certain operations. At least one characteristic of problematic operations appears to be usage of deeply nested fragments. It appears the addition of the enterLeaveMap might be the culprit. We're continuing to investigate what's happening.
  2. There are a number of performance improvements in federation, including a change for fed1 schemas ("Removes special case to type explosion with fed1 supergraphs", #1994) and an ongoing PR for fragment reuse ("Fix fragment reuse in subgraph fetches", #1911). The issues these address can both result in large query plans and subsequently large requests to subgraphs.
  3. Changing maxSockets from 15 to Infinity could be causing more connections to be maintained to subgraphs, rather than requests being queued (some background info from our friends on StackOverflow). This increase could also exacerbate the above issue by sending more large requests simultaneously.

We hope to get more clarity on the graphql-js changes soon, as well as concrete numbers for the performance improvements that have been slated for federation 2.1.x.

For now, my best advice for anyone investigating performance issues is:

  1. Try the latest 2.1.0 alpha (currently 2.1.0-alpha.1).
  2. Possibly change maxSockets as mentioned at the top of this thread.
  3. Possibly try the CodeSandbox build from "Fix fragment reuse in subgraph fetches" (#1911) if your operations use a lot of fragments.

@theJC (Contributor) commented Aug 1, 2022

Any chance this "FIXME: heavy-handed" mechanism of re-executing the query altogether here in the POST-PROCESSING could be exacerbating this issue as well?

@benweatherman (Contributor) commented

Yes, I believe it's related to my point 1 above, though I don't believe calling execute alone is the cause of a new slowdown, since it's called in recent 0.x versions of the gateway as well.

Of course, removing that execute and only doing what's necessary could speed things up.

@benweatherman (Contributor) commented

Apollo has published a tech note to help provide guidance on how to investigate performance issues: https://www.apollographql.com/docs/technotes/TN0009-gateway-performance/. We're still investigating some of the issues above, but that tech note is a good way to ensure everything is set up correctly.

@benweatherman removed their assignment Aug 25, 2022