Gateway 2.X performance issues #1861

Closed · v3i1y opened this issue May 16, 2022 · 12 comments

@v3i1y commented May 16, 2022

We are using Apollo Server 3.6.7 with Apollo Gateway 2.0.2. We are seeing longer event loop delays and generally fewer event loop iterations after the upgrade.

This is a graph of our runtime metrics before and after we shipped the upgrade.
[Screenshot: runtime metrics (event loop delay and event loop iterations) before and after the upgrade]

The performance of our graph took a hit after the upgrade, and I suspect this is the cause: our P99 doubled and our P95 is 1.3x longer.

Another observation from this graph (that I couldn't quite explain) is that before the upgrade, peaks in event loop delay correspond to dips in event loop iterations, which makes sense; but after the upgrade, dips in event loop delay correspond to even fewer event loop iterations.
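
For context on how this kind of metric can be collected: Node's built-in perf_hooks module can sample event loop delay in-process. The sketch below is a minimal example, not necessarily the metrics pipeline behind the graph above.

```ts
import { monitorEventLoopDelay } from "perf_hooks";

// Sample event loop delay with Node's built-in histogram (values are nanoseconds).
const histogram = monitorEventLoopDelay({ resolution: 20 });
histogram.enable();

// Log p50/p99/max delay every 10 seconds, then reset the window.
setInterval(() => {
  console.log({
    p50DelayMs: histogram.percentile(50) / 1e6,
    p99DelayMs: histogram.percentile(99) / 1e6,
    maxDelayMs: histogram.max / 1e6,
  });
  histogram.reset();
}, 10_000);
```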

@benweatherman (Contributor) commented

Howdy @v3i1r4in! Thanks for upgrading and filing this issue! Really sorry there are some performance problems. While we look into this, do you mind giving a few extra details?

  • What version of node are you running?
  • What version of the gateway are you upgrading from?
  • Can you share the code/config you use for starting the gateway?
  • Any more info you can give about subgraph setup and types of queries being run? We want to ensure our query plan caching is working as intended.

@v3i1y (Author) commented May 17, 2022

Here are the answers:

  • We are running Node 16.
  • We upgraded from 0.48.1. The gateway was the only lib that was upgraded; Apollo Server is running the same version as before.
  • The only config we are passing in is buildService, where we use a customized version of RemoteGraphQLDataSource (roughly like the sketch below).
  • I can't share these publicly, but we are an enterprise client of Apollo Studio. Maybe @lennyburdette can help here.
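
For reference, the overall shape of the setup is something like the sketch below; the class name and header are placeholders rather than our real code:

```ts
import { ApolloServer } from "apollo-server";
import { ApolloGateway, RemoteGraphQLDataSource } from "@apollo/gateway";

// Placeholder stand-in for the customized data source mentioned above.
class CustomDataSource extends RemoteGraphQLDataSource {
  willSendRequest({ request, context }: { request: any; context: any }) {
    // e.g. forward a header to every subgraph request
    request.http?.headers.set("x-request-id", context?.requestId ?? "unknown");
  }
}

const gateway = new ApolloGateway({
  // Managed federation: Apollo Studio supplies the supergraph via the
  // APOLLO_KEY / APOLLO_GRAPH_REF environment variables.
  buildService({ url }) {
    return new CustomDataSource({ url });
  },
});

const server = new ApolloServer({ gateway });
server.listen().then(({ url }) => console.log(`Gateway ready at ${url}`));
```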

@v3i1y (Author) commented May 17, 2022

Hey, I want to add another observation: the performance seems to improve a little bit when the service redeploys. This is really weird given all the caching that's happening; I would expect the reverse to be true.
[Screenshot: runtime metrics showing a brief performance improvement right after the service redeploys]

@benweatherman (Contributor) commented

I believe this is happening because fetching from subgraphs now has maxSockets: Infinity as a result of the fix for #1647. If you update the remote data source fetcher to set maxSockets, performance should improve. I used 100 below, but you may need to experiment to find the value that works best; 15 was the previous default.

import { RemoteGraphQLDataSource } from "@apollo/gateway";
import { Fetcher } from "@apollo/utils.fetcher";
import * as makeFetchHappen from "make-fetch-happen";

class CustomDataSource extends RemoteGraphQLDataSource {
  fetcher: Fetcher = makeFetchHappen.defaults({ maxSockets: 100 });
}

Investigation

During my testing, I found similar performance characteristics between 0.48.1, 2.0.1, and 2.0.2 when a maxSockets value was set. I could serve 3-4x the number of requests with 2.0.2, at the expense of per-request performance.

Stats for maxSockets: 15


➜ wrk --threads 25 --connections 250 --duration 30s --header 'Content-type: application/json' --script wrk-custom.lua --latency http://localhost:4000
Running 30s test @ http://localhost:4000
  25 threads and 250 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.46s   170.25ms   1.81s    84.67%
    Req/Sec    16.63     13.20    60.00     56.74%
  Latency Distribution
     50%    1.47s
     75%    1.56s
     90%    1.62s
     99%    1.74s
  5029 requests in 30.08s, 1.49MB read
  Socket errors: connect 0, read 117, write 0, timeout 0
Requests/sec:    167.18
Transfer/sec:     50.77KB

Stats for maxSockets: Infinity


➜ wrk --threads 25 --connections 250 --duration 30s --header 'Content-type: application/json' --script wrk-custom.lua --latency http://localhost:4000
Running 30s test @ http://localhost:4000
  25 threads and 250 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   287.68ms  316.44ms   2.00s    88.18%
    Req/Sec    36.40     19.12   111.00     68.49%
  Latency Distribution
     50%  155.45ms
     75%  356.14ms
     90%  664.58ms
     99%    1.51s
  26393 requests in 30.10s, 8.06MB read
  Socket errors: connect 0, read 100, write 3, timeout 313
Requests/sec:    876.71
Transfer/sec:    274.27KB

Why does performance improve on deploys?

I think the reason we're seeing improved performance during the deploy is that we either have more pods servicing requests or individual pods are serving fewer requests while the routing layer routes new requests to new pods. Does performance improve if you scale up the number of gateway pods?

@v3i1y (Author) commented May 18, 2022

Hey, thanks for the detailed investigation! That makes a lot of sense. We discovered yesterday that it might be something related to the fetcher: we were using our own customized HTTP agent for connection and keep-alive behavior and were passing it in through willSendRequest. However, it looks like the new version of RemoteGraphQLDataSource no longer passes that down to the fetcher (before vs after). We are rolling out a fix that passes the agent through the constructor; hopefully this will fix the perf issue.

We are currently perf testing on one pod, so no scale-up yet.

@benweatherman (Contributor) commented

The change to no longer allow extra stuff to be passed down to the HTTP agent (such as via willSendRequest) is also related to #1647, as we tried to standardize on some types that work everywhere: https://github.com/apollographql/apollo-utils/blob/af6ddf1ad25820a035fa22df8f92f686e90a935a/packages/fetcher/src/index.ts#L1-L11. Hopefully passing a fetcher to the constructor is an easy enough change, and hopefully it helps with this performance issue.
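
A rough sketch of what that change might look like (the maxSockets value is just an example to tune):

```ts
import { ApolloGateway, RemoteGraphQLDataSource } from "@apollo/gateway";
import { Fetcher } from "@apollo/utils.fetcher";
import * as makeFetchHappen from "make-fetch-happen";

// Configure connection behavior on the fetcher itself, instead of passing a
// custom http.Agent through willSendRequest.
const fetcher: Fetcher = makeFetchHappen.defaults({ maxSockets: 100 });

const gateway = new ApolloGateway({
  // ...plus your usual supergraph / managed federation configuration
  buildService({ url }) {
    // The constructor assigns config fields onto the instance, so this fetcher
    // is used for requests to this subgraph.
    return new RemoteGraphQLDataSource({ url, fetcher });
  },
});
```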

@theJC (Contributor) commented May 20, 2022

@v3i1r4in -- please let us know what maxSockets value you change to, and also approximately how many unique host/port combos you have across all of the subgraphs your gateway connects to. That has a bearing on performance, since it determines how many total sockets can actually be created.
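
For example, something like the snippet below (with placeholder URLs) gives that count; assuming maxSockets is applied per host/port origin, as with Node's http.Agent, the total socket ceiling is roughly maxSockets × unique origins:

```ts
// Placeholder subgraph routing URLs -- substitute your real ones.
const subgraphUrls = [
  "https://api.internal.example:4001/accounts/graphql",
  "https://api.internal.example:4001/products/graphql",
  "https://reviews.internal.example:4002/graphql",
];

// Count unique host:port combinations across all subgraphs.
const origins = new Set(
  subgraphUrls.map((raw) => {
    const { hostname, port, protocol } = new URL(raw);
    return `${hostname}:${port || (protocol === "https:" ? "443" : "80")}`;
  }),
);

console.log(`unique host/port combos: ${origins.size}`); // 2 for the list above
```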

@benweatherman self-assigned this Jun 9, 2022
@smyrick (Member) commented Jun 16, 2022

I am linking this here for reference. There is still more investigation to be done, but there have been some noticeable perf differences between graphql-js v15 and v16, which also came in with Gateway 2.

One contributing factor may be the TS compilation target: graphql/graphql-js#3648

@benweatherman (Contributor) commented

Howdy all, just wanted to provide an update on what's been going on with the performance investigation in v2. It's worth noting that there will be different solutions depending on your version, subgraph topology, and schema & operation size. There are a few threads we're pulling on:

  1. There are performance differences between graphql@15 and graphql@16, and we're tracking down a few causes:
    1. Changing the TS compilation target from es5 to es2020 causes a slowdown in the graphql-js benchmarks ("Big performance drop due to switching TS output from ES5 to ES6", graphql/graphql-js#3648). In an unrelated investigation, an Apollo engineer saw the following (which we have not investigated further):

      Node.js was especially slow to initialize native class properties, which seems relevant here because class syntax was previously compiled to ES5 constructor functions, but is now native/uncompiled (with es2020).

    2. Pre-caching visit internals seems to cause a slowdown for certain operations. At least one characteristic of problematic operations appears to be usage of deeply nested fragments. It appears the addition of the enterLeaveMap might be the culprit. We're continuing to investigate what's happening.
  2. There are a number of performance improvements in federation, including a change for fed1 schemas ("Removes special case to type explosion with fed1 supergraphs", #1994) and an ongoing PR for fragment reuse ("Fix fragment reuse in subgraph fetches", #1911). The issues these address can both result in large query plans and subsequently large requests to subgraphs.
  3. Changing maxSockets from 15 to Infinity could be causing more connections to be maintained to subgraphs, rather than requests being queued (some background info from our friends on StackOverflow). This increase could also exacerbate the above issue by sending more large requests simultaneously.

We hope to get more clarity on the graphql-js changes soon, as well as concrete numbers for the performance improvements that have been slated for federation 2.1.x.

For now, my best advice for anyone investigating performance issues is:

  1. Try the latest 2.1.0 alpha (currently 2.1.0-alpha.1).
  2. Possibly change maxSockets as mentioned at the top of this thread.
  3. Possibly try the CodeSandbox build from "Fix fragment reuse in subgraph fetches" (#1911) if your operations use a lot of fragments.

@theJC (Contributor) commented Aug 1, 2022

Any chance this "FIXME: heavy-handed" mechanism of re-executing the query altogether here in the POST-PROCESSING could be exacerbating this issue as well?

@benweatherman (Contributor) commented

Yes, I believe it's related to my point 1 above, though I don't believe calling execute alone is the cause of a new slowdown, since it's called in recent 0.x versions of the gateway as well.

Of course, removing that execute and only doing what's necessary could speed things up.

@benweatherman (Contributor) commented

Apollo has published a tech note to help provide guidance on how to investigate performance issues: https://www.apollographql.com/docs/technotes/TN0009-gateway-performance/. We're still investigating some of the issues above, but that tech note is a good way to ensure everything is set up correctly.

@benweatherman removed their assignment Aug 25, 2022