Connection stuck when "keep alive" is used #1439

Closed
idubrov opened this issue Feb 10, 2018 · 4 comments

Comments

@idubrov

idubrov commented Feb 10, 2018

I have a simple client that sends requests in chunks of 5, and after a while the client gets stuck.

The client looks like this:

  let mut core = tokio_core::reactor::Core::new().unwrap();
  let handle = core.handle();
  let items = ... // long list of Hyper requests
  let client = Client::new(&handle);
  let requests = stream::iter_result(items)
    .chunks(concurrency)
    .for_each(|reqs| {
      info!("Running {} tasks ...", concurrency);
      let vec = reqs
        .into_iter()
        .map(|(path, request)| process_request(&client, &path, request))
        .collect::<Vec<_>>();
      future::join_all(vec).map(|_results| ())
    });

  core.run(requests).unwrap();

The debugger shows that it is waiting for a mio event (it sits in mio::sys::unix::kqueue::Selector::select):

* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x0000000106b7bd96 libsystem_kernel.dylib`kevent + 10
    frame #1: 0x0000000104f94c03 piston`mio::sys::unix::kqueue::Selector::select::h5908d164b552abc8(self=0x0000000106e33020, evts=0x00007fff5bca9f48, awakener=(__0 = 18446744073709551615), timeout=<unavailable>) at kqueue.rs:84
    frame #2: 0x0000000104f8e2fa piston`mio::poll::Poll::poll2::h47cfb2893a485481(self=0x0000000106e33020, events=0x00007fff5bca9f48, timeout=Option<std::time::duration::Duration> @ 0x00007fff5bca8ce8) at poll.rs:1161
    frame #3: 0x0000000104f8e026 piston`mio::poll::Poll::poll::h1c5a4889decc5479(self=0x0000000106e33020, events=0x00007fff5bca9f48, timeout=Option<std::time::duration::Duration> @ 0x00007fff5bca91a0) at poll.rs:1125
    frame #4: 0x0000000104f6a90b piston`tokio_core::reactor::Core::poll::h355d296d2c61bccc(self=0x00007fff5bca9f48, max_wait=Option<std::time::duration::Duration> @ 0x00007fff5bca9e18) at mod.rs:276
    frame #5: 0x0000000104354499 piston`tokio_core::reactor::Core::run::h3e432d0bd23cd991(self=0x00007fff5bca9f48, f=<unavailable>) at mod.rs:241
    frame #6: 0x000000010481faad piston`piston::submit::execute::h35469783a8e181fc(matches=0x0000000106e7c018) at submit.rs:231

Analyzing the logs shows that the following events seem to be causing the issue:

// First, I see that the token of interest receives a Readable | Hup event (the actual token number is 0, since Token(2) is x*2+2):
TRACE tokio_core::reactor               > event Readable | Hup Token(2)
// I added some debugging to tokio-core's dispatch_io to print the status of io.reader/io.writer; here you can see that the reader task is not none:
DEBUG tokio_core::reactor > TKIO: is readable writer.is_none() == true, reader.is_none() == false
// The client writes some data to the connection (token 0 is the same as Token(2) above, as (2-2)/2 = 0):
DEBUG tokio_core::net::tcp > DBG: WRITE BUFS: (token 0) Ok(63817) <-- this one I added to AsyncWrite.write_buf
DEBUG hyper::proto::h1::io > flushed 63817 bytes
// For contrast, this is a normal token; it works just fine (there is more logging output for it after this line):
TRACE tokio_core::reactor > event Writable Token(6)
// A reader task is present for this token:
DEBUG tokio_core::reactor > TKIO: is writable writer.is_none() == true, reader.is_none() == false
// Finally, this is where the error is returned from mio (error code 54, ECONNRESET): the token of interest receives an Error event, and this is the last line in the logs related to it.
TRACE tokio_core::reactor > event Readable | Writable | Error | Hup Token(2)
// No reader task to notify? -- reader.is_none() is true.
DEBUG tokio_core::reactor > TKIO: is writable writer.is_none() == true, reader.is_none() == true

I am still researching it, but it does look like it is related to the "keep alive" feature.

It seems like Hyper does not like it when the remote side closes the socket around the time the connection is taken from the "idle" pool and used for the next request.
@idubrov
Author

idubrov commented Feb 10, 2018

OK, this seems to be a test case that reproduces this behavior:

#[test]
fn client_keep_alive_connreset() {
    use std::sync::mpsc;
    extern crate pretty_env_logger;
    let _ = pretty_env_logger::try_init();
    
    let server = TcpListener::bind("127.0.0.1:0").unwrap();
    let addr = server.local_addr().unwrap();
    let mut core = Core::new().unwrap();
    
    // This one seems to hang forever
    let client = client(&core.handle());
    // This one works as expected (it fails, because the second connection is not handled by the server)
    //let client = Client::configure()
    //  .keep_alive(false)
    //  .build(&core.handle());
    
    let (tx1, rx1) = oneshot::channel();
    let (tx2, rx2) = mpsc::channel();
    thread::spawn(move || {
        let mut sock = server.accept().unwrap().0;
        //sock.set_read_timeout(Some(Duration::from_secs(5))).unwrap();
        //sock.set_write_timeout(Some(Duration::from_secs(5))).unwrap();
        let mut buf = [0; 4096];
        sock.read(&mut buf).expect("read 1");
        sock.write_all(b"HTTP/1.1 200 OK\r\nContent-Length: 0\r\n\r\n").expect("write 1");
        
        // Wait for the client to indicate it is done processing the first request.
        // This is what seems to trigger the race condition -- without it, the client notices
        // the connection is closed while processing the first request.
        let _ = rx2.recv();
        let _r = sock.shutdown(std::net::Shutdown::Both);
        
        // Let client know it can try to reuse the connection
        let _ = tx1.send(());

        println!("we are ready to receive the second connection");
        let mut _sock = server.accept().unwrap().0;
        println!("accepted second connection");
    });

    
    let res = client.get(format!("http://{}/a", addr).parse().unwrap());
    core.run(res).unwrap();
    
    let _ = tx2.send(());

    let rx = rx1.map_err(|_| hyper::Error::Io(io::Error::new(io::ErrorKind::Other, "thread panicked")));
    core.run(rx).unwrap();

    println!("connecting for the second time -- hangs");
    let res = client.get(format!("http://{}/b", addr).parse().unwrap());   
    core.run(res).unwrap();
}

P.S. It seems like Hyper tries to re-use a pooled connection that was closed on the remote side. For some reason, the write to the socket on the client side succeeds (it doesn't return EPIPE), so Hyper hangs forever (if the error were reported, though, I would expect to get "pooled connection was not ready, this is a hyper bug"). I don't quite understand why the "event Readable | Hup" event received earlier does not trigger the connection to be evicted from the pool.

P.P.S. I think there are two issues here:

  1. Connections are not checked when taken from the pool, so closed connections are reused (see the sketch after this list).
  2. When a connection is reused, it is not in exactly the same state as a new one. It looks to me like the important difference is the "read_blocked" flag being set to "true" on a reused connection, since it blocks the connection from doing an "empty" read while it is idle. Hacking the code to reset this flag does make the client fail when it tries to reuse a "bad" connection (which is not good, but better than simply waiting forever).

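A minimal sketch of the kind of check issue 1 is getting at, using a plain blocking std::net::TcpStream rather than hyper's internal types (the peer_closed and checkout helpers are hypothetical names, purely for illustration): a non-blocking peek distinguishes a peer that has already sent FIN or RST from one that is merely idle.

```rust
use std::io::ErrorKind;
use std::net::TcpStream;

/// Hypothetical check run when a connection is taken out of the idle pool:
/// returns true only if the remote side has already closed (or reset) it.
fn peer_closed(sock: &TcpStream) -> bool {
    // Temporarily non-blocking, so an idle-but-healthy socket doesn't block us.
    if sock.set_nonblocking(true).is_err() {
        return true;
    }
    let mut byte = [0u8; 1];
    let closed = match sock.peek(&mut byte) {
        Ok(0) => true,  // EOF: the remote sent FIN while the connection sat idle
        Ok(_) => false, // unread data is waiting, the connection is still alive
        Err(ref e) if e.kind() == ErrorKind::WouldBlock => false, // idle but open
        Err(_) => true, // ECONNRESET or similar
    };
    let _ = sock.set_nonblocking(false);
    closed
}

/// Hypothetical "take from the pool" step: skip connections the peer already closed.
fn checkout(idle: &mut Vec<TcpStream>) -> Option<TcpStream> {
    while let Some(sock) = idle.pop() {
        if !peer_closed(&sock) {
            return Some(sock); // safe to reuse for the next request
        }
        // Dropping sock here closes the dead connection instead of reusing it.
    }
    None
}
```
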
@seanmonstar
Member

Hey @idubrov, thanks so much for the detailed report, and a test case!

> For some reason, write to the socket on the client side succeeds (doesn't return EPIPE)

The first write can sometimes succeed, since it's just being put in the sndbuf, and then the kernel can try to send it on the wire. Networking is fun!
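
A minimal, self-contained way to see this with plain std sockets (the sleeps and the exact error kind are timing- and platform-dependent, so treat this as a sketch rather than a guaranteed outcome): the first write into a peer-closed connection typically reports success because the bytes only reach the local send buffer, and the failure surfaces on a later write or read, after the peer's RST has come back.

```rust
use std::io::Write;
use std::net::{Shutdown, TcpListener, TcpStream};
use std::thread;
use std::time::Duration;

fn main() -> std::io::Result<()> {
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let addr = listener.local_addr()?;

    // "Server": accept one connection and close it right away, like an HTTP
    // server dropping an idle keep-alive connection.
    let server = thread::spawn(move || {
        let (sock, _) = listener.accept().unwrap();
        sock.shutdown(Shutdown::Both).unwrap();
    });

    let mut client = TcpStream::connect(addr)?;
    server.join().unwrap();
    thread::sleep(Duration::from_millis(100)); // let the peer's FIN arrive

    // The first write usually reports success: the bytes only reach the local sndbuf.
    println!("first write:  {:?}", client.write(b"GET /a HTTP/1.1\r\n\r\n"));

    thread::sleep(Duration::from_millis(100)); // give the peer's RST time to come back

    // Only now does the failure (BrokenPipe / ConnectionReset) surface.
    println!("second write: {:?}", client.write(b"GET /b HTTP/1.1\r\n\r\n"));
    Ok(())
}
```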

Let's see what we can do to fix this:

  1. They sort of are. If the proto::h1::Conn gets closed while idle, the handle in the pool should notice that it's been closed (seeing status() == Disabled). Are you suggesting that when taking the handle from the pool, the connection should have read() called on it again?
  2. It does seem like, regardless of better detection for 1, it's all-around better to error than to hang forever, so this should probably be done.

@idubrov
Author

idubrov commented Feb 12, 2018

> Are you suggesting that when taking the handle from the pool, the connection should have read() called on it again?

Yes, that's what I think. It looks to me like a new connection does a read as the very first thing ("try_empty_read"), whereas a reused one does not.

On a related note, shouldn't the pool itself have a future associated with it to do the bookkeeping of the connections? Otherwise, won't closed connections sit in the pool until they are reused?

@seanmonstar
Member

There should be a fix for this on master, including your test case passing!

> On a related note, shouldn't the pool itself have a future associated with it to do the bookkeeping of the connections?

Most likely. However, its absence doesn't mean closed sockets will be kept around; those will be dropped and closed. What stays in the pool is just a sender handle. Still, it'd be better to periodically clean those up...
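
For what it's worth, a minimal sketch of that kind of periodic cleanup, as a thread-based sweeper over a Mutex-guarded list of idle std sockets, reusing the same peek-based liveness check as the earlier sketch (hyper's actual pool holds sender handles and is futures-based, so this is only an illustration of the idea, with hypothetical names):

```rust
use std::io::ErrorKind;
use std::net::TcpStream;
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

/// Hypothetical idle pool: keep-alive sockets waiting to be reused.
type IdlePool = Arc<Mutex<Vec<TcpStream>>>;

/// True if the remote side has already closed or reset the connection.
fn peer_closed(sock: &TcpStream) -> bool {
    let _ = sock.set_nonblocking(true);
    let mut byte = [0u8; 1];
    let closed = match sock.peek(&mut byte) {
        Ok(0) => true,                                            // FIN received
        Ok(_) => false,                                           // data pending
        Err(ref e) if e.kind() == ErrorKind::WouldBlock => false, // idle but open
        Err(_) => true,                                           // reset, etc.
    };
    let _ = sock.set_nonblocking(false);
    closed
}

/// Background sweeper: every `interval`, drop idle connections whose peer is gone,
/// so they don't linger in the pool until someone tries to reuse them.
fn spawn_sweeper(pool: IdlePool, interval: Duration) -> thread::JoinHandle<()> {
    thread::spawn(move || loop {
        thread::sleep(interval);
        pool.lock().unwrap().retain(|sock| !peer_closed(sock));
    })
}
```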
