
Support upstream timeout #53

Closed
stefanprodan opened this issue Apr 2, 2019 · 25 comments
Assignees
Labels
Roadmap: Accepted We are planning on doing this work.

Comments

@stefanprodan

Allow the upstream timeout to be specified.
https://www.envoyproxy.io/docs/envoy/latest/api-v2/api/v2/route/route.proto#envoy-api-field-route-routeaction-timeout

@mhausenblas mhausenblas added the Roadmap: Proposed We are considering this for inclusion in the roadmap. label Apr 3, 2019
@mhausenblas mhausenblas changed the title [request]: Support upstream timeout Support upstream timeout Apr 3, 2019
@TrueBurn

TrueBurn commented Aug 8, 2019

Even when adding the "x-envoy-upstream-rq-timeout-ms" header to the request to override the default 15s timeout, requests still fail with an upstream request timeout.

https://www.envoyproxy.io/docs/envoy/latest/configuration/http_filters/router_filter#x-envoy-upstream-rq-timeout-ms

@stphivos

Overriding via the x-envoy-upstream-rq-timeout-ms header is ignored for me as well. As a result, under heavy load, all our slow responses get terminated at the gateway level. Would really appreciate an update on this :)

@SleeperSmith

Any update on this? We are blocked on a full roll out because of this.

@bigdefect
Contributor

bigdefect commented Sep 25, 2019

We're researching this right now to see why the header isn't being honored and, importantly, to clarify when it can be honored.

In the meantime, @stphivos @SleeperSmith, would you be able to tell us the following?

  1. Is the downstream requester behind an Envoy Proxy?
  2. Is the upstream responder behind Envoy?
  3. Is the communication between the downstream and the upstream modeled in App Mesh?

@shubharao shubharao added the Roadmap: Awaiting Customer Feedback We need to get more information in order to understand how we will implement this feature. label Sep 27, 2019
@SleeperSmith

SleeperSmith commented Oct 1, 2019

  1. Yes
  2. Yes
  3. Yes.

The downstream requester: Nginx as ingress controller -> Envoy proxy as sidecar.
The upstream responder: custom app service registered as a virtual service + virtual router + virtual node.
I am injecting the "x-envoy-upstream-rq-timeout-ms" header in nginx.

@bigdefect bigdefect assigned dastbe and unassigned bigdefect Oct 2, 2019
@dastbe
Contributor

dastbe commented Oct 3, 2019

A quick update on why this isn't working.

Broadly, the issue is an interaction between how Envoy determines whether a request is internal (which is what allows the use of Envoy control headers) and the use_remote_address setting.

Currently, App Mesh configuration turns use_remote_address off. The result is that the downstream Envoy doesn't consider the request internal (since no x-forwarded-for is set) and doesn't append to x-forwarded-for, and the upstream won't respect the control header for the same reason.

Looking forward, we'd like to support upstream timeouts on both the route (respected at the downstream) and on a VNode listener (respected at the upstream).
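For context, the behavior described in this comment hinges on a single HTTP connection manager field. The fragment below is a minimal, hypothetical Envoy (v2 API) filter config illustrating that setting; it is not the actual App Mesh-generated configuration:

```json
{
  "name": "envoy.http_connection_manager",
  "typed_config": {
    "@type": "type.googleapis.com/envoy.config.filter.network.http_connection_manager.v2.HttpConnectionManager",
    "stat_prefix": "ingress_http",
    "use_remote_address": false
  }
}
```

With use_remote_address false, Envoy classifies a request as internal based on the x-forwarded-for header; when that header is absent, the request is treated as external and control headers such as x-envoy-upstream-rq-timeout-ms are not honored, which matches what users in this thread are seeing.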

@shubharao shubharao added Roadmap: Accepted We are planning on doing this work. and removed Roadmap: Awaiting Customer Feedback We need to get more information in order to understand how we will implement this feature. Roadmap: Proposed We are considering this for inclusion in the roadmap. labels Oct 3, 2019
@stphivos

@efe-selcuk, @dastbe thanks. In our case a standard request is as follows:

[client] => [aws alb] => [nginx] => [flask-http-app] => [grpc-tcp-app] => [aws rds]

where nginx (vn), flask (vs+vn), and grpc (vs+vn) are all modeled in App Mesh and have the Envoy proxy sidecar. All were updated to support higher timeouts (3m), but when some responses are slow, nginx replies with 504. As a temporary workaround we removed the services from the mesh.

@SleeperSmith

@stphivos that sounds like the timeout is working? What settings / host headers did you set? I have a similar setup, and that should work well enough for me.

Can't you just set nginx timeout to something higher?

@stphivos

@SleeperSmith both aws alb and nginx gateway have timeouts set to 3m:

http {
    server {
        listen 80;
        server_name _;
        proxy_http_version 1.1;

        client_header_timeout  3m;
        client_body_timeout    3m;
        send_timeout           3m;
        keepalive_timeout      3m;

        gzip on;
        gzip_min_length  1100;
        gzip_buffers     4 8k;
        gzip_types       text/plain;

        location / {
            try_files $uri @app;
        }

        location @app {
            proxy_redirect  off;
            proxy_pass      http://backend;

            proxy_set_header   Host $host;
            proxy_set_header   X-Real-IP $remote_addr;
            proxy_set_header   X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header   X-Forwarded-Host $server_name;

            proxy_connect_timeout  70;
            proxy_send_timeout     180;
            proxy_read_timeout     180;            
        }
    }
}

The request times out at nginx while waiting for the flask app to respond. I can confirm that from X-Ray:

mymesh/service-gateway-vn AWS::AppMesh::Proxy
  => mymesh/service-gateway-vn  504 15.0 sec

mymesh/flask-app-vn AWS::ECS::Container
  => mymesh/flask-app-vn        200 16.4 sec

Also, when logging the HTTP headers at the flask app, I get:

  "X-Real-Ip": "127.0.0.1",
  "X-Forwarded-For": "<public-ip-addr>, 127.0.0.1",
  "X-Forwarded-Host": "_",
  "X-Envoy-Expected-Rq-Timeout-Ms": "15000"

I tried setting different headers from the http client but still no luck:

x-envoy-upstream-rq-timeout-ms          180000
x-envoy-upstream-rq-per-try-timeout-ms  180000
X-Envoy-Expected-Rq-Timeout-Ms          180000

X-Envoy-Expected-Rq-Timeout-Ms in the request is always set to 15000 and the client receives a 504: upstream request timeout.

I'm using the latest release of appmesh-inject (v0.2.0) which uses aws-appmesh-envoy:v1.11.2.0-prod. Thanks.

@gdowmont

gdowmont commented Nov 20, 2019

I have an application that streams the response back to the client. Due to the size, it can take more than a minute to finish. However, Envoy always terminates the connection after 15 seconds.

I have tried sending keep-alives, chunking, and streaming; it is always 15 seconds and the connection gets closed.

Are there any workarounds I could try? As others mentioned before, the headers are being ignored.
I have tried with the latest aws-appmesh-envoy:v1.12.1.0-prod.

@craiggoddenpayne

I have the same issue. I came across this thread after searching for some time, and x-envoy-expected-rq-timeout-ms does not seem to work for me either.

I also cannot find any mention of this in the documentation on the AWS site, only in the Envoy documentation. Does App Mesh support changing the timeout based on request headers as described in the Envoy docs? https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/router_filter

@SleeperSmith

@bcelenza so this has been accepted as being worked on, but I'm wondering why I don't see it on the project roadmap. Is this being worked on, or is it under a slightly different status?

@bcelenza
Contributor

bcelenza commented Dec 4, 2019

@SleeperSmith sorry about that, added it to the roadmap project. We are actively working on it now. 😄

@gary-cowell

Imagine my disappointment upon getting 29 meshed services deployed with CDK, only to find a slew of hard-limit 15s timeouts. This is not usable in this state. Awaiting a resolution.

@hanxu2050

Any update on this? We also just migrated two of our applications to App Mesh and are now facing timeout issues.

@SleeperSmith

Just to chime in here: we've withheld further rollout of AWS App Mesh specifically because of this too. Some 15 or so services.

@LancerRainier

Hey, just a quick update: this is still a work in progress, but we are getting more traction here. We had some delays in early 2020 and are now back on track. We are looking at launching this in Q2 2020 (most likely in May 2020) and will keep this thread updated if anything changes. Apologies for the delay; we are aware that this is a very high-priority item and are actively working on it. Thanks!

@shsahu

shsahu commented Apr 8, 2020

As part of providing support for timeouts, we intend to add configurable timeout fields (request timeout and idle timeout) at the Route, the Virtual Node listener, and probably the Virtual Gateway as well (Virtual Gateway is available in preview).

Current Implementation:
We currently support defining a timeout at the Route in the preview channel. There is a walkthrough available for this feature. Below is a sample spec of an HTTP route with timeouts:

{
  "virtualRouterName": "color-router",
  "routeName": "color-route",
  "spec": {
    "priority": 1,
    "httpRoute": {
      "match": {
        "prefix": "/"
      },
      "action": {
        "weightedTargets": [
          {
            "virtualNode": "color-node",
            "weight": 1
          }
        ]
      },
      "timeout" : {
        "perRequest": {
          "value" : 5,
          "unit" : "s"
        },
        "idle": {
          "value" : 10,
          "unit" : "s"
        }
      }
    }
  }
}

Limitation:
The current feature implementation only lets you decrease the default request timeout (15 secs). You can set the timeout to higher values, but the request will still time out at 15 secs because the default value of 15 secs still applies at the upstream.

Future Enhancements (WIP) :
To set the timeout to higher values (>15 secs), the value needs to be set at both the Route and the Virtual Node listener that the Route points to. We are working to add timeouts at the listener of the Virtual Node. Below is the sample config we will support for Virtual Nodes with a timeout defined at the listener:

{
    "virtualNodeName": "color-node",
    "spec": {
        "listeners": [
            {
                "healthCheck": {
                    "healthyThreshold": 2,
                    "intervalMillis": 5000,
                    "port": 8080,
                    "protocol": "http",
                    "path": "/ping",
                    "timeoutMillis": 2000,
                    "unhealthyThreshold": 3
                },
                "portMapping": {
                    "port": 8080,
                    "protocol": "http"
                },
                "timeout": {
                    "http": {
                        "timeout" : {
                            "perRequest": {
                                "value" : 20,
                                "unit" : "s"
                            },
                            "idle": {
                                "value" : 30,
                                "unit" : "s"
                            }
                        }
                    }
                }
            }
        ],
        "serviceDiscovery": {
            "awsCloudMap": {
                "namespaceName": "http.local",
                "serviceName": "color"
            }
        }
    }
}

We are also working to finalize the design for adding timeouts at the Virtual Gateway route.

Thanks.

@shsahu

shsahu commented May 30, 2020

Hi all, timeouts for Routes and Virtual Node listeners are available for testing in our preview channel. Check out the walkthrough to get started, and refer to the docs for Route and Virtual Node for more info.

Looking forward to your feedback.

@shsahu

shsahu commented Jun 19, 2020

Timeouts are now available in the App Mesh APIs, SDKs, and Console in all regions. The Kubernetes controller supports timeouts in the latest version, v1.0.0.

Timeout configuration is supported at the Route and at Virtual Node listeners. Refer to the API docs and the user guide for more information.

Please note that CloudFormation support is not released yet; it will be available soon.
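For Kubernetes users, the listener timeout is expressed in the controller's VirtualNode custom resource. The sketch below is a hedged example against the v1beta2 API (the names flask-app, the namespace, and the DNS hostname are hypothetical; verify the field names against the controller docs). kubectl accepts JSON manifests as well as YAML:

```json
{
  "apiVersion": "appmesh.k8s.aws/v1beta2",
  "kind": "VirtualNode",
  "metadata": { "name": "flask-app", "namespace": "prod" },
  "spec": {
    "podSelector": { "matchLabels": { "app": "flask-app" } },
    "listeners": [
      {
        "portMapping": { "port": 8080, "protocol": "http" },
        "timeout": {
          "http": {
            "perRequest": { "value": 60, "unit": "s" },
            "idle": { "value": 120, "unit": "s" }
          }
        }
      }
    ],
    "serviceDiscovery": { "dns": { "hostname": "flask-app.prod.svc.cluster.local" } }
  }
}
```

As noted earlier in the thread, pair the listener timeout with a matching perRequest timeout on the route so both the downstream and upstream sides allow the longer request.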

@herrhound
Contributor

Closing the issue, as the feature is now released.

@RafalMaleska

Hi, the issue was solved for Virtual Nodes, but what about Virtual Gateways?

Is there a way to set the timeout on them?

@RafalMaleska

OK, solved this by using a router and setting a timeout on one of the routes.

@visit1985

Even when setting timeouts at the VirtualRouter and VirtualNode level, there is still no timeout defined on the Virtual Gateway's internal self-redirect to port 15001. This should be set to the max of all timeouts across all possible routes, but it currently defaults to 15s.

@rajal-amzn
Contributor

@visit1985 Thank you for reporting this. We are aware of this regression where the timeouts are not getting configured at the Virtual Gateway when hostname rewrite is disabled. We are currently working on the fix, and it is being tracked here.
