Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Frequent cancelling the HTTP/2 requests will cause client never sends WINDOW_UPDATE frame #3915

Closed
jifang opened this issue Mar 6, 2018 · 40 comments
Labels
bug Bug in existing code
Milestone

Comments

@jifang
Copy link

jifang commented Mar 6, 2018

Description: When downloading a lot of medium-sized files from Nginx HTTP/2 server, if requests are frequently canceled (RST_STREAM), eventually the connection will timeout and not be able to send/receive any data from the server.

On the server side configure the Nginx server to support HTTP/2 with a self-signed certificate. And host about 100 files with size about 500k each.
On the client side, I built an Android test, using a single thread to download the file from file no.1 to file no.100 consequently. During each download, the request is canceled and then proceed to download next file. Eventually, the request will timeout. Depending on the setup the timeout may happen on different download. But once the testing environment is set, it always failed on the same file.

Server info: nginx/1.13.10
Client info: Android 6.0 with OKHttp 3.10.0

This issue is reproducible on multiple servers. For example, Go HTTP/2 server and AWS CloudFront.

A filtered (sid:75) server log shows that the header frame has been sent on 14:39:32 and then timed out after 60 seconds. Note, it is not always the header frame though. Sometimes it is the data frame not being sent out.

2018/03/05 14:39:32 [debug] 92282#0: *4 http2 frame type:1 f:5 l:14 sid:75
2018/03/05 14:39:32 [debug] 92282#0: *4 http2 HEADERS frame sid:75 depends on 0 excl:0 weight:16
2018/03/05 14:39:32 [debug] 92282#0: *4 http2 frame out: 00007FD4B082ABA0 sid:75 bl:1 len:122
2018/03/05 14:39:32 [debug] 92282#0: *4 http2 frame sent: 00007FD4B082ABA0 sid:75 bl:1 len:122
2018/03/05 14:40:32 [debug] 92282#0: *4 http2 send RST_STREAM frame sid:75, status:1

In the meantime, the client log shows the header is not received until 14:40:32

03-05 14:39:32.229 5979-5995/? V/VB_Network: Sending request https://10.0.0.186/files/37 on Connection{10.0.0.186:443, proxy=DIRECT hostAddress=/10.0.0.186:443 cipherSuite=TLS_RSA_WITH_AES_128_CBC_SHA protocol=h2}
                                             User-Agent: 
                                             Host: 10.0.0.186
                                             Connection: Keep-Alive
                                             Accept-Encoding: gzip
03-05 14:40:32.397 5979-5995/? V/VB_Network: Received response for https://10.0.0.186/files/37 (h2) in 60167.4ms
                                             server: nginx/1.13.10
                                             date: Mon, 05 Mar 2018 22:39:32 GMT
                                             content-type: application/octet-stream
                                             content-length: 562176
                                             last-modified: Sun, 04 Mar 2018 00:23:13 GMT
                                             etag: "5a9b3c71-89400"
                                             accept-ranges: bytes
03-05 14:40:32.400 5979-5995/? E/VB_Test: stream was reset: PROTOCOL_ERROR https://10.0.0.186/files/37

At first, I thought it is a server issue, so I filed a bug to the Nginx team. They replied that this is because the client never sends the WINDOW_UPDATE frame except after the initial connection.

Server Full Log
Client Full Log

The test project is under [email protected]:jifang/nginx_bug_repo1.git

@jifang jifang changed the title Frequent cancelling the HTTP/2 requests will cause server stop sending more data on the same connection Frequent cancelling the HTTP/2 requests will cause client never sends WINDOW_UPDATE frame Mar 6, 2018
@yschimke
Copy link
Collaborator

yschimke commented Mar 6, 2018

Thanks @jifang for including the github project to reproduce it. Not sure who is going to look at this first. n.b. Have you looked at the Http2 frame logging?

Excuse the kotlin

val activeLogger = getLogger(okhttp3.internal.http2.Http2::class.java.name)
activeLogger.level = Level.FINE

Should give output like

>> CONNECTION 505249202a20485454502f322e300d0a0d0a534d0d0a0d0a
>> 0x00000000     6 SETTINGS
>> 0x00000000     4 WINDOW_UPDATE
<< 0x00000000    18 SETTINGS
>> 0x00000000     0 SETTINGS      ACK
>> 0x00000003    65 HEADERS       END_STREAM|END_HEADERS
<< 0x00000000     0 SETTINGS      ACK
<< 0x00000003   209 HEADERS       END_HEADERS
<< 0x00000003 14181 DATA
<< 0x00000003  2829 DATA
<< 0x00000003 16375 DATA
<< 0x00000003  8505 DATA
<< 0x00000003  5667 DATA
<< 0x00000003 16384 DATA
<< 0x00000003 16366 DATA
<< 0x00000003 11343 DATA
<< 0x00000003  5667 DATA
<< 0x00000003  5083 DATA          END_STREAM
>> 0x00000000     8 GOAWAY

@jifang
Copy link
Author

jifang commented Mar 6, 2018

Thanks @yschimke for the tips.
Here is the client log filtered by Http2
The test project is also updated.

@swankjesse
Copy link
Collaborator

I have a theory.

In our HTTP/2 code we don't ack bytes until they're delivered to the application. It's insufficient to get them to the Http2Stream; they actually need to be read from the stream.

I'm guessing that when a steam is cancelled we need to do something with the leftover data.

@yschimke
Copy link
Collaborator

yschimke commented Mar 7, 2018

Nice. That theory should be reasonably easy to repro with a test if so.

@yschimke yschimke added the bug Bug in existing code label Mar 12, 2018
@yschimke yschimke added this to the 3.11 milestone Mar 12, 2018
@yschimke
Copy link
Collaborator

Tagging for 3.11 as this seems hi-pri

wowex pushed a commit to wowex/okhttp that referenced this issue Mar 14, 2018
wowex pushed a commit to wowex/okhttp that referenced this issue Mar 15, 2018
fixed cases when:
1. Client cancels stream and does not update window
2. Client receives RstStream and does not update window

square#3915
wowex pushed a commit to wowex/okhttp that referenced this issue Mar 16, 2018
fixed cases when:
1. Client cancels stream and does not update window for unread data
2. Client receives RstStream and does not update window for unread data
3. Client receives Data for closed (unknown) stream and does not
   calculate that data into window

square#3915
wowex pushed a commit to wowex/okhttp that referenced this issue Mar 18, 2018
Added test cases when:
1. Client cancels stream and should update window for unread data
2. Client receives Data for closed (unknown) stream and should
   calculate that data into window

square#3915
wowex pushed a commit to wowex/okhttp that referenced this issue Mar 18, 2018
fixed cases when:
1. Client cancels stream and does not update window for unread data
2. Client receives RstStream and does not update window for unread data
3. Client receives Data for closed (unknown) stream and does not
   calculate that data into window
4. Client receives more bytes that it can handle and closes stream

square#3915
wowex pushed a commit to wowex/okhttp that referenced this issue Mar 18, 2018
fixed cases when:
1. Client cancels stream and does not update window for unread data
2. Client receives RstStream and does not update window for unread data
3. Client receives Data for closed (unknown) stream and does not
   calculate that data into window
4. Client receives more bytes that it can handle and closes stream

square#3915

not fixed:
1. Push requests
2. One unknown case which causes failure of https:/wowex/okhttp-3915-test
wowex pushed a commit to wowex/okhttp that referenced this issue Mar 19, 2018
fixed cases when:
1. Client cancels stream and does not update window for unread data
2. Client receives RstStream and does not update window for unread data
3. Client receives Data for closed (unknown) stream and does not
   calculate that data into window
4. Client receives more bytes that it can handle and closes stream

square#3915

not fixed:
1. Push requests
wowex pushed a commit to wowex/okhttp that referenced this issue Mar 19, 2018
fixed case when:
1. Client closes stream and does not update window for unread data

square#3915

Test code:
  Call call = client.newCall(request);
  try (Response response = call.execute()) {
    response.body().source().read(new byte[1]);
  }
dave-r12 added a commit that referenced this issue Apr 7, 2018
We rely on the application layer to read the response body buffer
before sending WINDOW_UPDATE's. Previously we'd immediately throw a
StreamResetException. This prevented the application layer from reading
the buffer which in turn means we would not send WINDOW_UPDATE's. This
has potential to deplete the flow-control window.

#3915
dave-r12 added a commit that referenced this issue Apr 8, 2018
We rely on the application layer to read the response body buffer
before sending WINDOW_UPDATE's. Previously we'd immediately throw a
StreamResetException. This prevented the application layer from reading
the buffer which in turn means we would not send WINDOW_UPDATE's. This
has potential to deplete the flow-control window.

#3915
@natez0r
Copy link

natez0r commented Apr 11, 2018

I have been debugging an issue with ExoPlayer and OkHttp Http/2 connections that I now think may be the same underlying issue as this one. I have a test project attached to the issue that I can get 100% reproduction of the SocketTimeoutException if you run the project on a Genymotion android emulator with the network throttled to GPRS or Edge.

Would love to have an idea of when I could grab a snapshot to test a potential fix to this. I tried the latest 3.11.0-SNAPSHOT from today (April 11th).

@ojw28
Copy link

ojw28 commented May 9, 2018

Is there any update on this issue? Thanks!

@natez0r
Copy link

natez0r commented May 10, 2018

Also looking for an update, we're still seeing this in production, causing any connection to a domain to fail until the socket gets removed from the connection pool

@dave-r12
Copy link
Collaborator

Can you try the latest SNAPSHOT? Be sure to fully read the response body, that's our signal to notify the peer we are ready for more bytes.

If you're still seeing an issue, can you please provide some steps to reproduce? Or ideally a test case if that's possible.

@natez0r
Copy link

natez0r commented May 16, 2018

@ojw28 do you think it's feasible to have the loaders in exoplayer clean up cancelled requests by finishing reading the bytes?

As an aside @dave-r12, I personally don't know if it's reasonable to ask the application layer to do something like that when you don't need to do this with Http/1.1 connections and it doesn't break all subsequent connections to that domain. I will say that this is a huge issue for our video playback on android right now and I'm open to fixing it in the short term by doing a workaround but this is just my initial reaction.

If you're looking to reproduce, I have a test project attached to the issue that I can get 100% reproduction of the SocketTimeoutException if you run the project on a Genymotion android emulator with the network throttled to GPRS or Edge.

@natez0r
Copy link

natez0r commented May 21, 2018

bump, can I work with someone from square to figure out how to fix this outside of the application layer? this is a huge issue for us right now

@ojw28
Copy link

ojw28 commented May 22, 2018

I don't think it's reasonable to require the application layer to fully read the response body. What if it's a 2GB video or something?

@swankjesse
Copy link
Collaborator

@natez0r did you try the latest snapshot? If that fixes the problem I can cut a release.

@natez0r
Copy link

natez0r commented May 22, 2018

@swankjesse just tried now with (https://oss.sonatype.org/content/repositories/snapshots/com/squareup/okhttp3/okhttp/3.11.0-SNAPSHOT/okhttp-3.11.0-20180522.074215-92.pom) and unfortunately, I was still reproducing the issue in my test app

@dave-r12
Copy link
Collaborator

does that mean that all other connections to that domain will have to wait for the exhausting of the response to complete before they could be multiplexed over the same socket?

Connections or requests? Assuming you mean requests and the connection flow-control window is full, then yes the server should not send us any more data frames until we free up space.

Do we only have to read whatever's actually been received, or does that mean we need to read the entire response the server was planning to serve up?

Should only be what we have buffered so far, not the entire response. There should be no blocking on IO.

@ojw28
Copy link

ojw28 commented May 29, 2018

So I think where we are with this is:

  • The application should call response.body().close() (we already do this in ExoPlayer).
  • OkHttp should do whatever is necessary when that happens (e.g. read and discard any buffered-but-not-yet-consumed data). @dave-r12 - Does the commit referenced above do this, or is there more work to be done?

Thanks!

@dave-r12
Copy link
Collaborator

Yes, I believe the commit above does that (pending code review.)

@natez0r
Copy link

natez0r commented May 29, 2018

Awesome, thanks for the fix @dave-r12! much appreciated. If you'd like, I can run my test project on it once it is in a snapshot.

@dave-r12
Copy link
Collaborator

@natez0r run your test project on latest snapshot?

@natez0r
Copy link

natez0r commented May 31, 2018

I tried it, but i am still seeing a socket timeout. I'm trying to investigate a bit further, but wanted to let you know I'm working on it.

@dave-r12
Copy link
Collaborator

dave-r12 commented Jun 5, 2018

Shoot, ok. Let us know if we can help out at all. Ideally we can get an executable test case for this and then fix.

@baconz
Copy link

baconz commented Jun 23, 2018

Hey @natez0r let me know if you'd like a second set of hands on this! We're eager to get this resolved.

@jifang
Copy link
Author

jifang commented Jun 23, 2018

I am able to verify that calling response.body().close(); had fixed this issue with the latest snapshot. Thanks!

@natez0r
Copy link

natez0r commented Jun 28, 2018

I got a chance to debug this a bit more today. It looks like my issue can occur when we've 'opened' the connection but have yet to begin reading the response but decide to cancel it.

@natez0r
Copy link

natez0r commented Jun 28, 2018

@dave-r12 I guess my issue is slightly different than this, but I am trying to figure out what the actual cause could be:

If we've opened a connection (gotten the response code), but have yet to start reading the body of the response and decide to cancel the request, does the underlying Http2Stream need to write a reset code to the server. I connected the debugger and it appears that the unacknowledgedBytes are 0 when closing the stream, which does not fire a windowUpdate (which sounds reasonable), but I still get sockettimeouts when attempting to make a new connection to the same domain after i've closed the response body of my connection that is in the state described in my first message

@dave-r12
Copy link
Collaborator

Alright. I'll continue to stare at it but nothing is jumping out at me yet. If you could grab the HTTP/2 frame logs or can come up with an executable test case that will help as well. I'd really like to get to the root cause of this one 😄.

does the underlying Http2Stream need to write a reset code to the server.

That should already be happening if we haven't received all the data for the stream or there wasn't an existing error code for the stream.

@natez0r
Copy link

natez0r commented Jun 29, 2018 via email

@swankjesse swankjesse modified the milestones: 3.12, 3.11 Jul 5, 2018
@yschimke
Copy link
Collaborator

yschimke commented Jul 6, 2018

@natez0r Just a headsup about 3.11 getting cut soon. Is the current status, that 3 good fixes have gone in but there is still an edge case we need to reproduce and fix?

@natez0r
Copy link

natez0r commented Jul 6, 2018

@yschimke that sounds correct to me. The issue I am having could probably be broken off into a separate issue.

I just got back from vacation today, I am going to try to write a test case to repro the issue before EOD (fingers crossed)

@yschimke
Copy link
Collaborator

yschimke commented Jul 6, 2018

If you think it's different enough, then please split it off.

@natez0r
Copy link

natez0r commented Jul 6, 2018

so, i cannot get it to reproduce yet with the the unit tests, I think it probably can be split off.

The issue that I am seeing is:

EDGE connection speed

open one connection and get the content size, begin reading a chunk of data from that connection (i used 8192 block size).

During that read, try to open a new connection to a URL on the same domain. Regardless of if you close the old connection or not, the second connection will fail for me.

Anywho, I will file an issue on monday.

@dave-r12
Copy link
Collaborator

dave-r12 commented Jul 7, 2018

@natez0r I tried to write a test case to match what you found: #4127. What do you think?

@noamtamim
Copy link

@natez0r can you please share your test code that shows this still happens?

@natez0r
Copy link

natez0r commented Jan 7, 2019

@natez0r can you please share your test code that shows this still happens?

@noamtamim I was using google/ExoPlayer#4078 (comment) to reproduce it. I haven't used it in awhile, but the basic gist for the test project was to have two connections to the same domain and cancel one mid-request and then try to read the second

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Bug in existing code
Projects
None yet
Development

No branches or pull requests

8 participants