Keep reading when there's space in the buffer for tokio::io::copy #3694

Closed · Darksonn opened this issue Apr 12, 2021 · 23 comments · Fixed by #5066
Labels
A-tokio (Area: The main tokio crate) · M-io (Module: tokio/io)

Comments

@Darksonn
Contributor

Darksonn commented Apr 12, 2021

This is the current poll method for tokio::io::copy and tokio::io::copy_bidirectional.

```rust
pub(super) fn poll_copy<R, W>(
    &mut self,
    cx: &mut Context<'_>,
    mut reader: Pin<&mut R>,
    mut writer: Pin<&mut W>,
) -> Poll<io::Result<u64>>
where
    R: AsyncRead + ?Sized,
    W: AsyncWrite + ?Sized,
{
    loop {
        // If our buffer is empty, then we need to read some data to
        // continue.
        if self.pos == self.cap && !self.read_done {
            let me = &mut *self;
            let mut buf = ReadBuf::new(&mut me.buf);
            ready!(reader.as_mut().poll_read(cx, &mut buf))?;
            let n = buf.filled().len();
            if n == 0 {
                self.read_done = true;
            } else {
                self.pos = 0;
                self.cap = n;
            }
        }

        // If our buffer has some data, let's write it out!
        while self.pos < self.cap {
            let me = &mut *self;
            let i = ready!(writer.as_mut().poll_write(cx, &me.buf[me.pos..me.cap]))?;
            if i == 0 {
                return Poll::Ready(Err(io::Error::new(
                    io::ErrorKind::WriteZero,
                    "write zero byte into writer",
                )));
            } else {
                self.pos += i;
                self.amt += i as u64;
            }
        }

        // If we've written all the data and we've seen EOF, flush out the
        // data and finish the transfer.
        if self.pos == self.cap && self.read_done {
            ready!(writer.as_mut().poll_flush(cx))?;
            return Poll::Ready(Ok(self.amt));
        }
    }
}
```

However, as long as self.cap < 2048, it would be possible to attempt to read more data into the buffer. We should do this. One might also consider copying data from the end of the buffer to the start to make space for further reading.
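A rough sketch of what that could look like, written against the CopyBuffer fields used above (buf, pos, cap, read_done). The helper name try_read_more is hypothetical, and this is not the change that eventually landed in #5066, just an illustration of the idea:

```rust
// Hypothetical method on the same struct as poll_copy above. It first
// compacts the buffer so the free space at the end is contiguous, then does
// at most one opportunistic read into that space, ignoring Pending.
fn try_read_more<R>(
    &mut self,
    cx: &mut Context<'_>,
    mut reader: Pin<&mut R>,
) -> io::Result<()>
where
    R: AsyncRead + ?Sized,
{
    // Copy the unwritten tail to the start of the buffer so that all the
    // spare capacity sits at the end.
    if self.pos > 0 {
        self.buf.copy_within(self.pos..self.cap, 0);
        self.cap -= self.pos;
        self.pos = 0;
    }

    // If there is spare capacity and we haven't seen EOF, try one extra read.
    if self.cap < self.buf.len() && !self.read_done {
        let mut buf = ReadBuf::new(&mut self.buf[self.cap..]);
        if let Poll::Ready(res) = reader.as_mut().poll_read(cx, &mut buf) {
            res?;
            let n = buf.filled().len();
            if n == 0 {
                self.read_done = true;
            } else {
                self.cap += n;
            }
        }
    }
    Ok(())
}
```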

See also #3572.

@Darksonn added the E-help-wanted (Call for participation: help is requested to fix this issue), A-tokio (Area: The main tokio crate), M-io (Module: tokio/io), and E-medium (Call for participation: experience needed to fix: medium / intermediate) labels on Apr 12, 2021
@Rustin170506
Contributor

I would like to try this.

@Darksonn
Contributor Author

Go ahead!

@Rustin170506
Contributor

Rustin170506 commented Apr 22, 2021

@cssivision Can't you see I'm trying to fix this?

Shouldn't you have asked before doing it? Don't waste my time, right?

It doesn't matter. Whatever.

@Darksonn
Contributor Author

Please be more careful when you take on PRs that someone has already claimed. If you are concerned that it has stalled, you should ask whether they are still working on it.

@Rustin170506
Contributor

Rustin170506 commented Apr 22, 2021

Yes, you are right.

@cssivision You don't have to close your PR. I'm just saying you should have asked me in advance instead of just submitting a PR, which ends up wasting everyone's time. Please feel free to reopen your PR (if you don't want to, that's fine). Thanks!

@rneswold

Is this a real problem that needs to be handled? I find it hard to believe that, with modern drives and network speeds, the 2k buffer isn't filled all the time. I started a conversation that resulted in #3722 because I found reading and writing 2k at a time severely limited throughput.

I've found that reducing the number of calls into the kernel can increase performance, since context switches are fairly expensive operations. Checking whether a socket is readable or writable probably results in a kernel call (via ioctl), and being added to the underlying poll is another call, albeit one made for all the handles the scheduler blocks on. Making multiple such trips into the kernel for a 2k buffer seems like the wrong direction.

Unless you're targeting tiny machines (I don't know whether Rust compiles code for embedded micros), 2k is a tiny amount of RAM (barely bigger than an MTU and much smaller than a GigE jumbo packet!).

If this feature is to be added, I'm asking that it be benchmarked to see how it performs when reading from or writing to a gigabit Ethernet link.

@rneswold

rneswold commented Apr 26, 2021

we should not poll that resource again during this poll

That's good to know. However, I believe my other concerns are still valid. io::copy is a valuable building block but, if it doesn't transfer data quickly, everyone will be writing their own performant version.

@Darksonn
Contributor Author

Yeah, we should probably change the default buffer size too. The one in std uses 8k, so I guess that would be reasonable?

@rneswold

8k should definitely be the minimum, since jumbo frames may become more common as GigE becomes the "slowest" network interface. It'll cut the number of kernel calls to a quarter of what they are now, at least.

Another thing to consider is TCP's NO_DELAY option. Some network traffic is chit-chatty (e.g. ssh and telnet.) By default, TCP tries to buffer up more data to fill the MTU. But for login sessions or protocols with short request/reply messages, you don't want to wait for a full buffer to send because more data may be held off until a reply is received. So some applications turn on the NO_DELAY option for the socket so that small packets are sent immediately.

Trying to fill the buffer defeats that option. So I'd avoid trying to fill the 8k buffer; just send what was received. This covers short exchanges that don't fill the buffer and large data transfers that completely fill it.

(in other words, short messages won't see a benefit of 8k but large data transfers will.)

@Darksonn
Contributor Author

When it comes to NO_DELAY, that's not something io::copy controls. You can set it on the socket when you create it.
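For instance, with a tokio TcpStream it is just a setter on the socket (a minimal sketch; the address is only a placeholder):

```rust
use tokio::net::TcpStream;

// TCP_NODELAY is a property of the socket, not of io::copy.
async fn connect_no_delay(addr: &str) -> std::io::Result<TcpStream> {
    let stream = TcpStream::connect(addr).await?;
    stream.set_nodelay(true)?;
    Ok(stream)
}
```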

As for filling the buffer, the design I had in mind would only try to read more data into the buffer if the AsyncWrite is not able to send all of the data immediately. So it seems to me that this would not affect the case where the IO resources are extremely fast. It's still worth testing it of course.

@Darksonn
Contributor Author

Darksonn commented Apr 26, 2021

That is, the idea is that if we are not able to write anything right now, we might as well read some more data from the socket instead if we have space for it. It would not involve postponing writing any data in the cases where we are able to write data for the sake of buffering many smaller reads together.

@rneswold

When it comes to NO_DELAY, that's not something io::copy controls. You can set it on the socket when you create it.

Sorry, I wasn't being clear. If I set NO_DELAY on the socket, but you're waiting to fill the 2k buffer before sending it on, then you've defeated the purpose of NO_DELAY.

But your later response indicates you're only filling when you can't write, which I'm more in favor of. Especially if you increase the buffer size.

@Darksonn
Contributor Author

Darksonn commented Apr 26, 2021

Right, I understand now what you meant with NO_DELAY. To confirm, I am not suggesting that we wait for the buffer to be filled before we try to send, rather I am suggesting that if the send returns WouldBlock, then we try to do another read if there is space in the buffer.
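As a sketch of the shape this could take in the write loop shown above (not the actual patch; try_read_more is the hypothetical helper sketched in the issue description):

```rust
while self.pos < self.cap {
    let me = &mut *self;
    match writer.as_mut().poll_write(cx, &me.buf[me.pos..me.cap]) {
        Poll::Ready(Ok(0)) => {
            return Poll::Ready(Err(io::Error::new(
                io::ErrorKind::WriteZero,
                "write zero byte into writer",
            )));
        }
        Poll::Ready(Ok(i)) => {
            self.pos += i;
            self.amt += i as u64;
        }
        Poll::Ready(Err(e)) => return Poll::Ready(Err(e)),
        Poll::Pending => {
            // The writer returned WouldBlock/Pending. Instead of just
            // parking, use the wait to pull more data into the buffer if
            // there is still room and the reader hasn't hit EOF.
            if !self.read_done {
                self.try_read_more(cx, reader.as_mut())?;
            }
            return Poll::Pending;
        }
    }
}
```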

@b-naber
Contributor

b-naber commented Jan 10, 2022

the design I had in mind would only try to read more data into the buffer if the AsyncWrite is not able to send all of the data immediately

@Darksonn This sounds to me like the condition i < self.cap, where i = ready!(writer.as_mut().poll_write(cx, &me.buf[me.pos..me.cap]))?, but in the current implementation we already get back into the reading part if that condition holds. Or do you mean to try to read more data if poll_write returns Pending? Can you clarify what you mean by 'not able to send all of the data'?

@CthulhuDen

Hey! It seems like this issue has gotten a little stale; do you think I could offer my take on it?

@Noah-Kennedy
Contributor

@CthulhuDen go ahead!

@farnz
Contributor

farnz commented Sep 23, 2022

Is anyone currently actively working on this (@CthulhuDen perhaps)? If not, I'd like to tackle it.

@CthulhuDen

CthulhuDen commented Sep 23, 2022

@farnz Please go ahead! Sorry, I don't have numbers on hand to back this up, but I found the fix changed nothing for me: we're always bottlenecked on either the input or the output, and either way the change did not affect transfer throughput. But maybe you will arrive at positive results. Good luck!

@rneswold

Definitely do some benchmarks before and after to make sure your changes improve things. I know it's not easy writing benchmarks, since they don't necessarily reflect the real world, but throughput can be affected in unexpected ways.

In a (C++) project, we wrapped a helper function around getting the system time. It was so easy to use, it got called in several places every time the program handled an I/O event. Getting the system time requires entering the kernel, which is pretty expensive. We changed the program to get the system time when it woke up and saved it. The helper function simply returned the cached value for the current I/O event. That easily doubled our throughput.
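A hypothetical Rust version of that trick, just to illustrate the pattern (the names here are made up):

```rust
use std::time::Instant;

// Read the clock once per wakeup / I/O event and hand out the cached value,
// instead of calling into the clock at every call site.
struct EventContext {
    now: Instant,
}

impl EventContext {
    fn for_current_event() -> Self {
        Self { now: Instant::now() } // one clock read per event
    }

    fn now(&self) -> Instant {
        self.now // every consumer of "now" during this event reuses it
    }
}
```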

In another instance, I called an ioctl to see how much space was in the socket buffer to decide whether to do a send. This meant every successful send required two entries into the kernel. Instead, the socket was set to return an error if there wasn't enough space. That change gave us a substantial bump in throughput.

I know these approaches don't translate well to async coding because futures need to be cancelable so, sometimes, you need to make the extra system calls to meet async requirements. But if you can make the code "Correct" while reducing system calls, it'll be a win.

@farnz
Contributor

farnz commented Sep 28, 2022

The changes to actually do extra reads while waiting for the write to become possible aren't that complex; I can put them up for review, but I'd prefer to finish the benchmarks to show that they're worthwhile first. I'm working now on benchmarks showing that:

  1. When doing memory-to-memory "I/O", there's no regression caused - this is the best case for the existing code, since neither reader nor writer ever block, and thus any slowdown here indicates that we're slowing things down in the fast case.
  2. That there is a case where the change helps - this is likely to need a reader supplying data in tiny blocks (on the order of 2 KiB, so that we can get multiple blocks into the buffer), and a writer that is IOPS-bound, not throughput-bound (so it can handle very large writes, but only lets you make a small number of writes per second); see the sketch after this list.
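For reference, the kind of writer described in point 2 could be simulated with something like this (a hypothetical benchmark helper, not code from the branch): it accepts arbitrarily large writes but enforces a cooldown between writes, so it is IOPS-bound rather than throughput-bound.

```rust
use std::{
    future::Future,
    io,
    pin::Pin,
    task::{Context, Poll},
    time::Duration,
};
use tokio::io::AsyncWrite;
use tokio::time::{sleep, Sleep};

/// Hypothetical benchmark helper: each write is accepted in full, but only
/// one write is allowed per `interval`.
struct IopsThrottledWriter {
    interval: Duration,
    cooldown: Option<Pin<Box<Sleep>>>,
}

impl AsyncWrite for IopsThrottledWriter {
    fn poll_write(
        mut self: Pin<&mut Self>,
        cx: &mut Context<'_>,
        buf: &[u8],
    ) -> Poll<io::Result<usize>> {
        // Wait out the cooldown left over from the previous write, if any.
        if let Some(cooldown) = self.cooldown.as_mut() {
            if cooldown.as_mut().poll(cx).is_pending() {
                return Poll::Pending;
            }
        }
        // Accept the entire buffer as one "operation" and start the next
        // cooldown.
        let interval = self.interval;
        self.cooldown = Some(Box::pin(sleep(interval)));
        Poll::Ready(Ok(buf.len()))
    }

    fn poll_flush(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<io::Result<()>> {
        Poll::Ready(Ok(()))
    }

    fn poll_shutdown(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<io::Result<()>> {
        Poll::Ready(Ok(()))
    }
}
```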

@farnz
Contributor

farnz commented Sep 29, 2022

So, benchmarks are done, and I've been able to create one case in which the change speeds things up and four in which it doesn't affect things.

The no-change cases are:

  • Straight fast-as-possible memory to memory copy
  • Memory to HDD simulation copy
  • Slow network simulation to memory copy
  • Slow network simulation to HDD simulation copy

And the one that benefits (significantly):

  • Slow network simulation to IOPS throttled unbuffered device copy.

However, I cannot see a case in which an IOPS throttled unbuffered device is a realistic target. If you're sending over a network, the network stack adds buffering; if it's local, the kernel will normally add buffering. The only time this could matter is if you've implemented AsyncWrite yourself for a custom device, and it's badly implemented, and you can't afford the memory cost of adding a BufWriter with a small buffer around your AsyncWrite.
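(For completeness, wrapping such a device is a one-liner with tokio's BufWriter; this is just an illustrative sketch, and the 16 KiB capacity is arbitrary.)

```rust
use tokio::io::{AsyncWrite, BufWriter};

// Coalesce many small writes into fewer, larger writes before they reach a
// write-size-sensitive AsyncWrite implementation.
fn buffered<W: AsyncWrite>(device: W) -> BufWriter<W> {
    BufWriter::with_capacity(16 * 1024, device)
}
```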

My changes (with benchmark numbers on my aging Xeon E3-1245v2) are at master...farnz:tokio:copy_buffer - I'm happy to turn them into a proper pull request if there's still interest in this, because they do at least make it slightly harder to shoot yourself in the foot.

@farnz
Contributor

farnz commented Sep 30, 2022

I've come up with a realistic benchmark that shows improvements, and I'm going to force-push my branch to get it ready for a pull request. farnz@382c0bc preserves the old benchmarks from before the force push, for historical reference (see my previous comment).

The "trick" is to make the HDD stall randomly as the buffer fills up. I believe this is a realistic behaviour, since it's what a real NAS would do if it was under overload scenario, and you were unlucky enough to come in just after the NAS assigns buffer space to another writer.

@farnz
Contributor

farnz commented Sep 30, 2022

And with a bit more thinking, I've been able to write a realistic benchmark that does show improvement - I'm going to force push my branch and use it for the pull request, but farnz@382c0bc has the old benchmarks.

The missing piece in my thinking was that concurrent users introduce a random chance of blocking because all buffers are full; it's reasonable to model this as a chance of blocking proportional to how full your own buffer is (since your buffer empties regularly, this represents a greater chance of blocking as the underlying HDD buffer fills up).
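The stall decision that model implies can be as simple as the following check each time the simulated device is polled (a hypothetical sketch, assuming the rand crate; not the code in the branch):

```rust
// The fuller the simulated device's buffer, the more likely a write is to
// stall this time around.
fn should_stall(filled: usize, capacity: usize) -> bool {
    rand::random::<f64>() < filled as f64 / capacity as f64
}
```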

With this model in the benchmark, I see wins copying from a slow source to an overloaded HDD simulation.

@Darksonn removed the E-help-wanted (Call for participation: help is requested to fix this issue) and E-medium (Call for participation: experience needed to fix: medium / intermediate) labels on Oct 3, 2022