Connection stuck when "keep alive" is used #1439

Closed
idubrov opened this issue Feb 10, 2018 · 4 comments

Comments

@idubrov

idubrov commented Feb 10, 2018

I have a simple client that sends requests in chunks of 5, and after a while the client gets stuck.

The client looks like this:

  let mut core = tokio_core::reactor::Core::new().unwrap();
  let handle = core.handle();
  let items = ... // long list of Hyper requests
  let client = Client::new(&handle);
  let requests = stream::iter_result(items)
    .chunks(concurrency)
    .for_each(|reqs| {
      info!("Running {} tasks ...", concurrency);
      let vec = reqs
        .into_iter()
        .map(|(path, request)| process_request(&client, &path, request))
        .collect::<Vec<_>>();
      future::join_all(vec).map(|_results| ())
    });

  core.run(requests).unwrap();

The debugger shows that it is waiting for a mio event (it sits in mio::sys::unix::kqueue::Selector::select):

* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x0000000106b7bd96 libsystem_kernel.dylib`kevent + 10
    frame #1: 0x0000000104f94c03 piston`mio::sys::unix::kqueue::Selector::select::h5908d164b552abc8(self=0x0000000106e33020, evts=0x00007fff5bca9f48, awakener=(__0 = 18446744073709551615), timeout=<unavailable>) at kqueue.rs:84
    frame #2: 0x0000000104f8e2fa piston`mio::poll::Poll::poll2::h47cfb2893a485481(self=0x0000000106e33020, events=0x00007fff5bca9f48, timeout=Option<std::time::duration::Duration> @ 0x00007fff5bca8ce8) at poll.rs:1161
    frame #3: 0x0000000104f8e026 piston`mio::poll::Poll::poll::h1c5a4889decc5479(self=0x0000000106e33020, events=0x00007fff5bca9f48, timeout=Option<std::time::duration::Duration> @ 0x00007fff5bca91a0) at poll.rs:1125
    frame #4: 0x0000000104f6a90b piston`tokio_core::reactor::Core::poll::h355d296d2c61bccc(self=0x00007fff5bca9f48, max_wait=Option<std::time::duration::Duration> @ 0x00007fff5bca9e18) at mod.rs:276
    frame #5: 0x0000000104354499 piston`tokio_core::reactor::Core::run::h3e432d0bd23cd991(self=0x00007fff5bca9f48, f=<unavailable>) at mod.rs:241
    frame #6: 0x000000010481faad piston`piston::submit::execute::h35469783a8e181fc(matches=0x0000000106e7c018) at submit.rs:231

Analyzing the logs shows that the following events seem to be causing the issue:

// First, I see that the token of interest receives a Readable | Hup event (the actual token number is 0, since Token(2) is x*2+2):
TRACE tokio_core::reactor               > event Readable | Hup Token(2)
// I added some debugging to tokio-core's dispatch_io to print the status of io.reader/io.writer; here you can see that the reader task is not none:
DEBUG tokio_core::reactor > TKIO: is readable writer.is_none() == true, reader.is_none() == false
// The client writes some data to the connection (token 0 is the same as Token(2) above, as (2-2)/2 = 0):
DEBUG tokio_core::net::tcp > DBG: WRITE BUFS: (token 0) Ok(63817) <-- this one I added to AsyncWrite.write_buf
DEBUG hyper::proto::h1::io > flushed 63817 bytes
// For contrast, this is a normal token; it works just fine (there is more logging output for it after this line):
TRACE tokio_core::reactor > event Writable Token(6)
// A reader task is present for this token:
DEBUG tokio_core::reactor > TKIO: is writable writer.is_none() == true, reader.is_none() == false
// Finally, this is where the error is returned from mio (error code 54, ECONNRESET): the token of interest receives an Error event, and this is the last line in the logs related to it.
TRACE tokio_core::reactor > event Readable | Writable | Error | Hup Token(2)
// No reader task to notify? -- reader.is_none() is true.
DEBUG tokio_core::reactor > TKIO: is writable writer.is_none() == true, reader.is_none() == true

I am still researching it, but it does look like it is related to the "keep alive" feature.

It seems like Hyper does not like it when the remote side closes the socket around the time the connection is taken from the "idle" pool and used for the next request.
@idubrov
Author

idubrov commented Feb 10, 2018

OK, this seems to be a test case that reproduces this behavior:

#[test]
fn client_keep_alive_connreset() {
    use std::sync::mpsc;
    extern crate pretty_env_logger;
    let _ = pretty_env_logger::try_init();
    
    let server = TcpListener::bind("127.0.0.1:0").unwrap();
    let addr = server.local_addr().unwrap();
    let mut core = Core::new().unwrap();
    
    // This one seems to hang forever
    let client = client(&core.handle());
    // This one works as expected (it fails, because the second connection is not handled by the server)
    //let client = Client::configure()
    //  .keep_alive(false)
    //  .build(&core.handle());
    
    let (tx1, rx1) = oneshot::channel();
    let (tx2, rx2) = mpsc::channel();
    thread::spawn(move || {
        let mut sock = server.accept().unwrap().0;
        //sock.set_read_timeout(Some(Duration::from_secs(5))).unwrap();
        //sock.set_write_timeout(Some(Duration::from_secs(5))).unwrap();
        let mut buf = [0; 4096];
        sock.read(&mut buf).expect("read 1");
        sock.write_all(b"HTTP/1.1 200 OK\r\nContent-Length: 0\r\n\r\n").expect("write 1");
        
        // Wait for the client to indicate it is done processing the first request.
        // This is what seems to trigger the race condition -- without it, the client notices
        // the connection is closed while processing the first request.
        let _ = rx2.recv();
        let _r = sock.shutdown(std::net::Shutdown::Both);
        
        // Let client know it can try to reuse the connection
        let _ = tx1.send(());

        println!("we are ready to receive the second connection");
        let mut _sock = server.accept().unwrap().0;
        println!("accepted second connection");
    });

    
    let res = client.get(format!("http://{}/a", addr).parse().unwrap());
    core.run(res).unwrap();
    
    let _ = tx2.send(());

    let rx = rx1.map_err(|_| hyper::Error::Io(io::Error::new(io::ErrorKind::Other, "thread panicked")));
    core.run(rx).unwrap();

    println!("connecting for the second time -- hangs");
    let res = client.get(format!("http://{}/b", addr).parse().unwrap());   
    core.run(res).unwrap();
}

P.S. It seems like Hyper tries to re-use a pooled connection that was closed on the remote side. For some reason, the write to the socket on the client side succeeds (it doesn't return EPIPE), so Hyper hangs forever (if the error were reported, though, I would expect to get "pooled connection was not ready, this is a hyper bug"). I don't quite understand why the "event Readable | Hup" event received earlier does not trigger the connection to be evicted from the pool.

P.P.S. I think there are two issues here:

  1. Connections are not checked when taken from the pool, so closed connections are reused (see the sketch after this list).
  2. When a connection is reused, it is not in exactly the same state as a new one. It looks to me like the important difference is the "read_blocked" flag being set to "true" on a reused connection, since it blocks the connection from doing an "empty" read while it is idle. Hacking the code to reset this flag does make the client fail when it tries to reuse a "bad" connection (which is not good, but better than simply waiting forever).

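A minimal sketch of the kind of check issue 1 is getting at, using a plain blocking std::net::TcpStream rather than hyper's internal types (the peer_closed and checkout helpers are hypothetical names, purely for illustration): a non-blocking peek distinguishes a peer that has already sent FIN or RST from one that is merely idle.

```rust
use std::io::ErrorKind;
use std::net::TcpStream;

/// Hypothetical check run when a connection is taken out of the idle pool:
/// returns true only if the remote side has already closed (or reset) it.
fn peer_closed(sock: &TcpStream) -> bool {
    // Temporarily non-blocking, so an idle-but-healthy socket doesn't block us.
    if sock.set_nonblocking(true).is_err() {
        return true;
    }
    let mut byte = [0u8; 1];
    let closed = match sock.peek(&mut byte) {
        Ok(0) => true,  // EOF: the remote sent FIN while the connection sat idle
        Ok(_) => false, // unread data is waiting, the connection is still alive
        Err(ref e) if e.kind() == ErrorKind::WouldBlock => false, // idle but open
        Err(_) => true, // ECONNRESET or similar
    };
    let _ = sock.set_nonblocking(false);
    closed
}

/// Hypothetical "take from the pool" step: skip connections the peer already closed.
fn checkout(idle: &mut Vec<TcpStream>) -> Option<TcpStream> {
    while let Some(sock) = idle.pop() {
        if !peer_closed(&sock) {
            return Some(sock); // safe to reuse for the next request
        }
        // Dropping sock here closes the dead connection instead of reusing it.
    }
    None
}
```
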
@seanmonstar
Member

Hey @idubrov, thanks so much for the detailed report, and a test case!

> For some reason, write to the socket on the client side succeeds (doesn't return EPIPE)

The first write can sometimes succeed, since it's just being put in the sndbuf, and then the kernel can try to send it on the wire. Networking is fun!
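
A minimal, self-contained way to see this with plain std sockets (the sleeps and the exact error kind are timing- and platform-dependent, so treat this as a sketch rather than a guaranteed outcome): the first write into a peer-closed connection typically reports success because the bytes only reach the local send buffer, and the failure surfaces on a later write or read, after the peer's RST has come back.

```rust
use std::io::Write;
use std::net::{Shutdown, TcpListener, TcpStream};
use std::thread;
use std::time::Duration;

fn main() -> std::io::Result<()> {
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let addr = listener.local_addr()?;

    // "Server": accept one connection and close it right away, like an HTTP
    // server dropping an idle keep-alive connection.
    let server = thread::spawn(move || {
        let (sock, _) = listener.accept().unwrap();
        sock.shutdown(Shutdown::Both).unwrap();
    });

    let mut client = TcpStream::connect(addr)?;
    server.join().unwrap();
    thread::sleep(Duration::from_millis(100)); // let the peer's FIN arrive

    // The first write usually reports success: the bytes only reach the local sndbuf.
    println!("first write:  {:?}", client.write(b"GET /a HTTP/1.1\r\n\r\n"));

    thread::sleep(Duration::from_millis(100)); // give the peer's RST time to come back

    // Only now does the failure (BrokenPipe / ConnectionReset) surface.
    println!("second write: {:?}", client.write(b"GET /b HTTP/1.1\r\n\r\n"));
    Ok(())
}
```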

Let's see what we can do to fix this:

  1. They sort of are. If the proto::h1::Conn gets closed while idle, the handle in the pool should notice that it's been closed (seeing status() == Disabled). Are you suggesting that when taking the handle from the pool, the connection should have read() called on it again?
  2. It does seem like, regardless of better detection for 1, it's all-around better to error than to hang forever, so this should probably be done.

@idubrov
Author

idubrov commented Feb 12, 2018

> Are you suggesting that when taking the handle from the pool, the connection should have read() called on it again?

Yes, that's what I think. It looks to me like a new connection does a read as the very first thing ("try_empty_read"), whereas a reused one does not.

On a related note, shouldn't the pool itself have a future associated with it to do the bookkeeping of the connections? Otherwise, won't closed connections sit in the pool until they are reused?

@seanmonstar
Member

There should be a fix for this on master, including your test case passing!

> On a related note, shouldn't the pool itself have a future associated with it to do the bookkeeping of the connections?

Most likely. However, its absence doesn't mean closed sockets will be kept around; those will be dropped and closed. What stays in the pool is just a sender handle. Still, it'd be better to periodically clean those up...
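
For what it's worth, a minimal sketch of that kind of periodic cleanup, as a thread-based sweeper over a Mutex-guarded list of idle std sockets, reusing the same peek-based liveness check as the earlier sketch (hyper's actual pool holds sender handles and is futures-based, so this is only an illustration of the idea, with hypothetical names):

```rust
use std::io::ErrorKind;
use std::net::TcpStream;
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

/// Hypothetical idle pool: keep-alive sockets waiting to be reused.
type IdlePool = Arc<Mutex<Vec<TcpStream>>>;

/// True if the remote side has already closed or reset the connection.
fn peer_closed(sock: &TcpStream) -> bool {
    let _ = sock.set_nonblocking(true);
    let mut byte = [0u8; 1];
    let closed = match sock.peek(&mut byte) {
        Ok(0) => true,                                            // FIN received
        Ok(_) => false,                                           // data pending
        Err(ref e) if e.kind() == ErrorKind::WouldBlock => false, // idle but open
        Err(_) => true,                                           // reset, etc.
    };
    let _ = sock.set_nonblocking(false);
    closed
}

/// Background sweeper: every `interval`, drop idle connections whose peer is gone,
/// so they don't linger in the pool until someone tries to reuse them.
fn spawn_sweeper(pool: IdlePool, interval: Duration) -> thread::JoinHandle<()> {
    thread::spawn(move || loop {
        thread::sleep(interval);
        pool.lock().unwrap().retain(|sock| !peer_closed(sock));
    })
}
```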
