fd performance decreases with increasing cores #616

Closed
lakshayg opened this issue Jun 10, 2020 · 13 comments
lakshayg commented Jun 10, 2020

On a fresh clone of the chromium repo, if I try to list files using fd, here's the result I get:

$ time fd --color=never --hidden --type f | wc -l
410215
fd --color=never --hidden --type f  1.39s user 1.03s system 496% cpu 0.487 total

When I use rg for the same purpose

$ time rg --color=never --hidden --no-messages --files | wc -l
410215
rg --color=never --hidden --no-messages --files  1.86s user 0.67s system 737% cpu 0.343 total

Notice how the %cpu used in the case of rg is higher than with fd. My CPU has 8 cores, and the usage reported for rg is what I would expect if it were using all my cores. I suspected that fd might not be detecting the number of cores on my machine properly, so I tried using the -j flag to specify the number of cores manually. Here are some interesting results:

$ for threads in `seq 1 8`; do time fd -j "$threads" --color=never --hidden --type f | wc -l; done
$ for threads in `seq 1 8`; do time rg -j "$threads" --color=never --hidden --no-messages --files | wc -l; done
| Threads | fd CPU utilization | rg CPU utilization |
|--------:|-------------------:|-------------------:|
| 1       | 133%               | 99%                |
| 2       | 256%               | 239%               |
| 3       | 371%               | 352%               |
| 4       | 485%               | 457%               |
| 5       | 548%               | 560%               |
| 6       | 577%               | 663%               |
| 7       | 498%               | 750%               |
| 8       | 519%               | 747%               |

The user time seems to be correlated with the CPU utilization. This also results in fd being slightly slower than rg on my machine. Is there an explanation for this behaviour of fd?

OS: Ubuntu 14.04
CPU: 8 x Intel(R) Core(TM) i7-9700K CPU @ 3.60GHz
RAM: 64G

$ fd --version
fd 8.1.1

$ rg --version
ripgrep 12.1.0
-SIMD -AVX (compiled)
+SIMD +AVX (runtime)
sharkdp (Owner) commented Jun 11, 2020

Ok, a couple of things:

  • Where can I get the chromium repo to reproduce this? I cloned something from GitHub but I get a completely different file count.
  • You are running fd and rg without the --no-ignore option. This means that the tools will respect .fdignore and .rgignore files on your system. For me, the two file counts differed because I have a ~/.rgignore file. Even if the file counts are the same for you, that's worth checking to keep the benchmark fair.
  • I can't tell if you are properly running your benchmarks. Measuring the execution time of I/O heavy programs is not trivial. (1) Results are highly dependent on whether or not the disk cache is warm. If you run program A first, then B, you will see better times for B because the cache is warm. Did you correct for that? (2) I/O heavy programs are influenced by all kinds of processes running in the background. Are you making sure that your system is quiet and the comparison is fair?

Notice how the %cpu used in case of rg is higher than with fd. My CPU has 8 cores and the usage reported for rg is what I would expect if it is using all my cores.

That might very well be a caching effect. fd is typically I/O-bound, not CPU-bound.

Here are some interesting results

What exactly is interesting about these results? I don't see any error bars on these numbers, so I can't tell how representative they are. Could this just be random fluctuation around j × 100%?

I have performed the following warm-cache benchmark with hyperfine:

hyperfine \
    --parameter-scan threads 1 10 \
    --warmup 3 \
    --export-json results-fd.json \
    --export-markdown results-fd.md \
    "fd -j {threads} --color=never --hidden --type f"

hyperfine \
    --parameter-scan threads 1 10 \
    --warmup 3 \
    --export-json results-rg.json \
    --export-markdown results-rg.md \
    "rg -j {threads} --color=never --hidden --no-messages --files"

# plot results:
hyperfine/scripts/plot_parametrized.py \
  results-fd.json \
  results-rg.json \
  --titles "fd,rg" \
  --parameter-name "number of threads"

This leads to the following results on my Laptop (8 x Intel(R) Core(TM) i7-4700MQ CPU @ 2.40GHz, same rg and fd versions as you):

[plot "fd-vs-rg": mean runtime of fd and rg as a function of thread count]

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|---:|---:|---:|---:|
| `fd -j 1 --color=never --hidden --type f` | 1.378 ± 0.009 | 1.364 | 1.390 | 3.48 ± 0.02 |
| `fd -j 2 --color=never --hidden --type f` | 0.750 ± 0.004 | 0.744 | 0.754 | 1.90 ± 0.01 |
| `fd -j 3 --color=never --hidden --type f` | 0.543 ± 0.012 | 0.519 | 0.563 | 1.37 ± 0.03 |
| `fd -j 4 --color=never --hidden --type f` | 0.462 ± 0.023 | 0.436 | 0.518 | 1.17 ± 0.06 |
| `fd -j 5 --color=never --hidden --type f` | 0.427 ± 0.005 | 0.421 | 0.437 | 1.08 ± 0.01 |
| `fd -j 6 --color=never --hidden --type f` | 0.412 ± 0.008 | 0.405 | 0.432 | 1.04 ± 0.02 |
| `fd -j 7 --color=never --hidden --type f` | 0.396 ± 0.001 | 0.394 | 0.398 | 1.00 |
| `fd -j 8 --color=never --hidden --type f` | 0.411 ± 0.010 | 0.399 | 0.432 | 1.04 ± 0.02 |
| `fd -j 9 --color=never --hidden --type f` | 0.418 ± 0.012 | 0.395 | 0.431 | 1.06 ± 0.03 |
| `fd -j 10 --color=never --hidden --type f` | 0.427 ± 0.019 | 0.404 | 0.472 | 1.08 ± 0.05 |

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|---:|---:|---:|---:|
| `rg -j 1 --color=never --hidden --no-messages --files` | 1.456 ± 0.012 | 1.439 | 1.477 | 3.17 ± 0.03 |
| `rg -j 2 --color=never --hidden --no-messages --files` | 0.947 ± 0.005 | 0.937 | 0.956 | 2.06 ± 0.02 |
| `rg -j 3 --color=never --hidden --no-messages --files` | 0.696 ± 0.017 | 0.671 | 0.725 | 1.52 ± 0.04 |
| `rg -j 4 --color=never --hidden --no-messages --files` | 0.574 ± 0.021 | 0.546 | 0.619 | 1.25 ± 0.05 |
| `rg -j 5 --color=never --hidden --no-messages --files` | 0.525 ± 0.018 | 0.507 | 0.549 | 1.14 ± 0.04 |
| `rg -j 6 --color=never --hidden --no-messages --files` | 0.485 ± 0.005 | 0.480 | 0.496 | 1.06 ± 0.01 |
| `rg -j 7 --color=never --hidden --no-messages --files` | 0.460 ± 0.002 | 0.457 | 0.465 | 1.00 |
| `rg -j 8 --color=never --hidden --no-messages --files` | 0.460 ± 0.002 | 0.458 | 0.466 | 1.00 ± 0.01 |
| `rg -j 9 --color=never --hidden --no-messages --files` | 0.465 ± 0.008 | 0.459 | 0.485 | 1.01 ± 0.02 |
| `rg -j 10 --color=never --hidden --no-messages --files` | 0.473 ± 0.010 | 0.460 | 0.486 | 1.03 ± 0.02 |

fd is consistently 5%-20% faster. Both programs reach their minimum execution time at j=7 threads, which could be explained by the additional receiver thread that collects all results and prints them. Since this benchmark lists every file, a lot of printing has to be done, and that thread can be quite busy, making j = N - 1 the optimal number of worker threads.

I'm happy to investigate further if you can clearly show that something is not as fast as expected in your setting.

lakshayg (Author) commented Jun 12, 2020

Thanks for the quick response

Where can I get the chromium repo to reproduce this? I cloned something from GitHub but I get a completely different file count

Sorry, my bad: I just realized that I was running the commands outside the chromium repo, so other files were included. But re-running inside the repo yields similar results.

You are running fd and rg without the --no-ignore options. This means that both tools will look for .rgignore or .fdignore files on your system. For me, the two file counts differed because I have a ~/.rgignore file. Even if the file counts are the same for you, that's something worth checking out to make the benchmark fair.

My .rgignore and .fdignore files are identical; they contain .git and .cache.

(1) Results are highly dependent on whether or not the disk cache is warm
(2) I/O heavy programs are influenced by all kinds of processes running in the background. Are you making sure that your system is quiet and the comparison is fair

Agreed. I didn't know what would be a good way to eliminate the impact of these factors, so I ran both commands several times (>20) in different orders, but yes, my approach wasn't very systematic. My system did have some browser windows open, but there weren't any I/O-intensive processes running at the time. I can't prove the comparison was fair, but I would expect it to be, since I didn't start any other processes while testing. All the measurements were taken at the same time.

That might very well be a caching effect. fd is typically I/O-bound, not CPU bound.

Wouldn't that be the case for rg too?

What exactly is interesting about these results?

I was referring to the observation that the CPU utilization was not monotonically increasing for fd.


I took a look at hyperfine; very cool! I used it to run some benchmarks and found something interesting.

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `fd -j 1 --color=never --hidden --type f` | 994.2 ± 11.9 | 980.8 | 1014.8 | 3.23 ± 0.06 |
| `fd -j 2 --color=never --hidden --type f` | 615.3 ± 39.6 | 561.7 | 677.1 | 2.00 ± 0.13 |
| `fd -j 3 --color=never --hidden --type f` | 517.9 ± 5.6 | 509.9 | 526.8 | 1.68 ± 0.03 |
| `fd -j 4 --color=never --hidden --type f` | 421.7 ± 3.2 | 418.7 | 428.8 | 1.37 ± 0.02 |
| `fd -j 5 --color=never --hidden --type f` | 361.4 ± 5.1 | 352.4 | 367.4 | 1.17 ± 0.02 |
| `fd -j 6 --color=never --hidden --type f` | 327.4 ± 3.9 | 322.5 | 333.5 | 1.06 ± 0.02 |
| `fd -j 7 --color=never --hidden --type f` | 307.8 ± 4.4 | 302.2 | 313.9 | 1.00 |
| `fd -j 8 --color=never --hidden --type f` | 308.3 ± 3.1 | 302.8 | 314.0 | 1.00 ± 0.02 |

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|---:|---:|---:|---:|
| `fd -j 1 --color=never --hidden --type f \| wc -l` | 1.135 ± 0.079 | 1.045 | 1.246 | 2.97 ± 0.21 |
| `fd -j 2 --color=never --hidden --type f \| wc -l` | 0.723 ± 0.007 | 0.712 | 0.733 | 1.89 ± 0.03 |
| `fd -j 3 --color=never --hidden --type f \| wc -l` | 0.515 ± 0.004 | 0.510 | 0.521 | 1.35 ± 0.02 |
| `fd -j 4 --color=never --hidden --type f \| wc -l` | 0.413 ± 0.002 | 0.409 | 0.416 | 1.08 ± 0.02 |
| `fd -j 5 --color=never --hidden --type f \| wc -l` | 0.383 ± 0.005 | 0.374 | 0.390 | 1.00 ± 0.02 |
| `fd -j 6 --color=never --hidden --type f \| wc -l` | 0.382 ± 0.005 | 0.373 | 0.393 | 1.00 |
| `fd -j 7 --color=never --hidden --type f \| wc -l` | 0.470 ± 0.007 | 0.452 | 0.476 | 1.23 ± 0.03 |
| `fd -j 8 --color=never --hidden --type f \| wc -l` | 0.472 ± 0.011 | 0.459 | 0.489 | 1.23 ± 0.03 |

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|---:|---:|---:|---:|
| `rg -j 1 --color=never --hidden --no-messages --files` | 1.018 ± 0.006 | 1.008 | 1.027 | 3.11 ± 0.06 |
| `rg -j 2 --color=never --hidden --no-messages --files` | 0.723 ± 0.033 | 0.684 | 0.794 | 2.21 ± 0.11 |
| `rg -j 3 --color=never --hidden --no-messages --files` | 0.618 ± 0.006 | 0.607 | 0.623 | 1.89 ± 0.04 |
| `rg -j 4 --color=never --hidden --no-messages --files` | 0.489 ± 0.004 | 0.484 | 0.496 | 1.50 ± 0.03 |
| `rg -j 5 --color=never --hidden --no-messages --files` | 0.410 ± 0.003 | 0.403 | 0.415 | 1.25 ± 0.03 |
| `rg -j 6 --color=never --hidden --no-messages --files` | 0.360 ± 0.003 | 0.353 | 0.363 | 1.10 ± 0.02 |
| `rg -j 7 --color=never --hidden --no-messages --files` | 0.327 ± 0.006 | 0.315 | 0.336 | 1.00 |
| `rg -j 8 --color=never --hidden --no-messages --files` | 0.331 ± 0.004 | 0.326 | 0.337 | 1.01 ± 0.02 |

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|---:|---:|---:|---:|
| `rg -j 1 --color=never --hidden --no-messages --files \| wc -l` | 1.028 ± 0.004 | 1.023 | 1.034 | 3.10 ± 0.06 |
| `rg -j 2 --color=never --hidden --no-messages --files \| wc -l` | 0.897 ± 0.016 | 0.871 | 0.928 | 2.71 ± 0.07 |
| `rg -j 3 --color=never --hidden --no-messages --files \| wc -l` | 0.632 ± 0.008 | 0.618 | 0.644 | 1.91 ± 0.04 |
| `rg -j 4 --color=never --hidden --no-messages --files \| wc -l` | 0.490 ± 0.004 | 0.482 | 0.495 | 1.48 ± 0.03 |
| `rg -j 5 --color=never --hidden --no-messages --files \| wc -l` | 0.411 ± 0.003 | 0.406 | 0.414 | 1.24 ± 0.03 |
| `rg -j 6 --color=never --hidden --no-messages --files \| wc -l` | 0.363 ± 0.005 | 0.351 | 0.369 | 1.10 ± 0.03 |
| `rg -j 7 --color=never --hidden --no-messages --files \| wc -l` | 0.331 ± 0.007 | 0.318 | 0.339 | 1.00 |
| `rg -j 8 --color=never --hidden --no-messages --files \| wc -l` | 0.332 ± 0.007 | 0.318 | 0.340 | 1.00 ± 0.03 |

fd becomes slower when its output is piped to wc, but rg seems to be unaffected by that. Without wc, fd performs slightly better than rg.

I can try to run more systematic benchmarks if you can guide me on how to do that.

sharkdp (Owner) commented Jun 12, 2020

That is interesting indeed!

I can reproduce this:
[plot "fd-vs-rg-with-wc": fd vs. rg runtime as a function of thread count, with output piped to wc -l]

I have a feeling this might have something to do with the way we write output to the terminal (buffered vs unbuffered). We should definitely take a look at that. Thank you for reporting this and for the detailed analysis!

Wouldn't that be the case for rg too?

Yes.

aswild (Contributor) commented Jun 16, 2020

In a vacuum, it looks like wrapping stdout in a BufWriter can make a huge difference. I wrote this test program that prints "Hello World" a number of times.

use std::io::{self, stdout, BufWriter, Write};

fn main() -> io::Result<()> {
    let args: Vec<String> = std::env::args().collect();
    // "b" selects buffered output, "u" selects unbuffered (direct) output
    let mut output: Box<dyn Write> = match args[1].as_str() {
        "b" => Box::new(BufWriter::new(stdout())),
        "u" => Box::new(stdout()),
        _ => panic!("invalid arg"),
    };
    let count: usize = args[2].parse().unwrap();

    let message = b"Hello World\n";
    for _ in 0..count {
        // write_all retries on partial writes instead of silently dropping bytes
        output.write_all(message)?;
    }

    Ok(())
}

According to a couple of hyperfine benchmarks, the buffered version is about 50 times faster than the unbuffered one. I'm not sure why adding extra threads would slow things down; perhaps there's a lot of lock contention happening.

It looks like fd currently prints using the object returned by io::stdout().lock() without any extra buffering added.

That could be changed to a Box<dyn Write> trait object that is set up based on whether stdout is a TTY, or fd could support a --count option that prints the number of files found, replacing the fd ... | wc -l usage pattern.

(I'd be happy to implement one or both of these changes)
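A minimal sketch of the TTY-based selection, assuming a current Rust toolchain (the std `IsTerminal` trait was stabilized well after this thread; at the time, the `atty` crate filled the same role). The helper name and the sample paths are illustrative only:

```rust
use std::io::{self, BufWriter, IsTerminal, Write};

/// Write each result line through the given writer. Factored out so the
/// same code path serves both the buffered and the unbuffered writer.
fn emit_results<W: Write>(out: &mut W, results: &[&str]) -> io::Result<()> {
    for r in results {
        writeln!(out, "{}", r)?;
    }
    out.flush()
}

fn main() -> io::Result<()> {
    let stdout = io::stdout();
    // Interactive terminal: write directly so results appear immediately.
    // Piped/redirected: add a block buffer to cut down on write syscalls.
    let mut out: Box<dyn Write> = if stdout.is_terminal() {
        Box::new(stdout.lock())
    } else {
        Box::new(BufWriter::with_capacity(64 * 1024, stdout.lock()))
    };
    emit_results(&mut out, &["src/main.rs", "Cargo.toml"])
}
```

Because the output type is erased behind `Box<dyn Write>`, the rest of the program never needs to know which buffering strategy was chosen.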

sharkdp (Owner) commented Jun 17, 2020

@aswild Thank you for investigating. That's exactly what I was thinking.

I'm not sure why adding extra threads would slow things down, perhaps there's a lot of lock contention happening.

That's the unsolved question, yes. There shouldn't be any lock contention. Note that the printing only happens in the "receiver" thread, not in any of the worker/sender/searcher threads. I think we should even be able to acquire the lock only once.
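The sender/receiver split described above can be sketched roughly like this (a simplified illustration of the channel-plus-single-receiver pattern, not fd's actual code; `print_results` and the path strings are made up for the example):

```rust
use std::io::{self, BufWriter, Write};
use std::sync::mpsc;
use std::thread;

/// Receiver side: drain the channel into one writer. The stdout lock is
/// taken once by the caller, so there is no per-line lock traffic.
fn print_results<W: Write>(rx: mpsc::Receiver<String>, out: &mut W) -> io::Result<usize> {
    let mut count = 0;
    for path in rx {
        writeln!(out, "{}", path)?;
        count += 1;
    }
    out.flush()?;
    Ok(count)
}

fn main() -> io::Result<()> {
    let (tx, rx) = mpsc::channel();

    // Worker/sender threads only push paths into the channel; they never
    // touch stdout, so printing cannot contend with the traversal.
    let workers: Vec<_> = (0..4)
        .map(|id| {
            let tx = tx.clone();
            thread::spawn(move || {
                for n in 0..3 {
                    tx.send(format!("worker-{}/file-{}", id, n)).unwrap();
                }
            })
        })
        .collect();
    drop(tx); // drop the original sender so the channel closes when workers finish

    // The single receiver acquires the stdout lock once, buffered.
    let stdout = io::stdout();
    let mut out = BufWriter::new(stdout.lock());
    print_results(rx, &mut out)?;

    for w in workers {
        w.join().expect("worker thread panicked");
    }
    Ok(())
}
```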

Adding a BufWriter for non-interactive use cases sounds good. However, I would like to know what kind of drawbacks this could have. What if we pipe the fd output to other programs where we would be interested in immediate feedback? Do we know what find does?

or fd could support a --count option which prints the number of files found to replace the fd ... | wc -l usage paradigm.

A --count option has never been requested so far. If it's not strictly necessary, I'd like to avoid adding it.

lakshayg (Author) commented:

Thanks @aswild

fd could support a --count option

I agree with @sharkdp on avoiding the --count option since counting isn't really the problem.

What if we pipe the fd output to other programs where we would be interested in immediate feedback

I have seen other Unix tools with a -u or --unbuffered flag that forces them to write without buffering (alternatively, some applications take an --interactive flag); perhaps fd could take the same approach.

It might be too early to jump to solutions, though, since we don't yet understand the root cause of the problem.

aswild (Contributor) commented Jun 17, 2020

What if we pipe the fd output to other programs where we would be interested in immediate feedback?

Good point. If I ran fd foo | tee results.txt I'd still want to see output line-by-line, and fd wouldn't be able to tell the difference.
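One middle ground for exactly this case is line buffering. A sketch using std's `LineWriter`, which flushes whenever a newline passes through, so a consumer like `tee` sees each result as soon as its line is complete (the sample paths are illustrative):

```rust
use std::io::{self, LineWriter, Write};

fn main() -> io::Result<()> {
    let stdout = io::stdout();
    // LineWriter flushes its internal buffer on every '\n', giving
    // line-by-line feedback while still batching the bytes of each
    // line into a single underlying write.
    let mut out = LineWriter::new(stdout.lock());
    for path in ["a.txt", "b.txt", "c.txt"] {
        writeln!(out, "{}", path)?;
    }
    Ok(())
}
```

Note that Rust's `Stdout` handle is itself line-buffered internally, so the practical decision is really between line buffering (immediate feedback) and block buffering (fewer syscalls, higher throughput).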

the printing only happens in the "receiver" thread, not in any of the worker/sender/searcher threads.

Ah, ok. I haven't looked at fd's internals that deeply yet.

It sounds like some profiling might be in order to figure out where the hot spots are.

sharkdp (Owner) commented Jul 5, 2020

How do other tools solve this? I get that there are some programs where the behavior can be chosen with a flag. But what would be a reasonable default? Or is there a way to improve the situation without the drawbacks?

tmccombs (Collaborator) commented Jul 5, 2020

Although I can't remember which one off the top of my head, I remember there being at least one program that detects whether stdout is a TTY as a heuristic, but has a flag to override how its output is buffered.

Spaceface16518 (Contributor) commented:

Was this ever implemented? I'd be interested in working on this.

sharkdp (Owner) commented Oct 25, 2020

@Spaceface16518 It's not clear to me (yet) what exactly should be implemented. I think we need further investigation first.


arriven commented Oct 3, 2022

@sharkdp is this issue still relevant? It seems like the receiver thread is already using a buffered writer (see code). It even holds the stdout lock for the full duration to avoid contention, and hyperfine reports the expected performance results. It seems like this issue should be closed.

sharkdp (Owner) commented Nov 19, 2022

@arriven Thank you for the reminder. Indeed, I cannot reproduce this anymore:

hyperfine \
    --parameter-scan threads 1 10 \
    --warmup 3 \
    --export-json results-fd.json \
    --export-markdown results-fd.md \
    "fd -j {threads} --color=never --hidden --type f | wc -l"

hyperfine \
    --parameter-scan threads 1 10 \
    --warmup 3 \
    --export-json results-rg.json \
    --export-markdown results-rg.md \
    "rg -j {threads} --color=never --hidden --no-messages --files | wc -l"

hyperfine/scripts/plot_parametrized.py \
  results-fd.json \
  results-rg.json \
  --titles "fd,rg"

[plot: fd vs. rg runtime as a function of thread count, with output piped to wc -l]
