fd performance decreases with increasing cores #616
Ok, a couple of things:
That might very well be a caching effect.
What exactly is interesting about these results? I don't see any error bars on these numbers, so I can't tell how representative they are. This could be just random fluctuations around …

I have performed the following warm-cache benchmark with hyperfine:

```bash
hyperfine \
    --parameter-scan threads 1 10 \
    --warmup 3 \
    --export-json results-fd.json \
    --export-markdown results-fd.md \
    "fd -j {threads} --color=never --hidden --type f"

hyperfine \
    --parameter-scan threads 1 10 \
    --warmup 3 \
    --export-json results-rg.json \
    --export-markdown results-rg.md \
    "rg -j {threads} --color=never --hidden --no-messages --files"

# plot results:
hyperfine/scripts/plot_parametrized.py \
    results-fd.json \
    results-rg.json \
    --titles "fd,rg" \
    --parameter-name "number of threads"
```

This leads to the following results on my laptop (8 x Intel(R) Core(TM) i7-4700MQ CPU @ 2.40GHz): [results plot omitted]
I'm happy to investigate further if you can clearly show that something is not as fast as expected in your setting. |
Thanks for the quick response
Sorry, my bad, I just realized that I was running the commands outside the chromium repo so there were other files too. But re-running on the repo yields similar results.
My rgignore and fdignore files are identical; they contain …
Agreed. I didn't know what would be a good way to eliminate the impact of these factors, so I ran both commands several times (> 20) in different orders, but yes, my approach wasn't very systematic. My system did have some browser windows open, but there weren't any IO-intensive processes running at that time. I don't know how to tell if the comparison was fair, but I would expect it to be, because I didn't start any other processes while I was testing the performance. All the measurements were taken at the same time.
> Wouldn't that be the case for …

I was referring to the observation that the CPU utilization was not monotonically increasing for `fd`.

I took a look at hyperfine, very cool! I used it to run some benchmarks and found something interesting.
I can try to do more systematic benchmarks if you can guide me on how to do that |
That is interesting indeed! I have a feeling this might have something to do with the way we write output to the terminal (buffered vs unbuffered). We should definitely take a look at that. Thank you for reporting this and for the detailed analysis!
yes. |
In a vacuum, it looks like wrapping stdout in a `BufWriter` can make a huge difference. I wrote this test program that prints "Hello World" a bunch:

```rust
use std::io::{self, stdout, Write, BufWriter};

fn main() -> io::Result<()> {
    let args: Vec<_> = std::env::args().collect();
    // Select buffered ("b") or unbuffered ("u") output.
    let mut output: Box<dyn Write> = match args[1].as_str() {
        "b" => Box::new(BufWriter::new(stdout())),
        "u" => Box::new(stdout()),
        _ => panic!("invalid arg"),
    };
    let count: usize = args[2].parse().unwrap();
    let message = b"Hello World\n";
    for _ in 0..count {
        // write_all (rather than write) guarantees the whole message is written.
        output.write_all(message)?;
    }
    Ok(())
}
```

According to a couple of hyperfine benchmarks, the buffered version is about 50 times faster than the unbuffered version. I'm not sure why adding extra threads would slow things down; perhaps there's a lot of lock contention happening. It looks like … That could be changed to a … (I'd be happy to implement one or both of these changes) |
@aswild Thank you for investigating. That's exactly what I was thinking.
That's the unsolved question, yes. There shouldn't be any lock contention. Note that the printing only happens in the "receiver" thread, not in any of the worker/sender/searcher threads. I think we should even be able to acquire the lock only once. Adding a …

A … |
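A minimal sketch of the "acquire the lock only once" idea discussed above (my own illustration, not fd's actual code): take the stdout lock a single time for the whole output phase and wrap it in a `BufWriter`, so each printed entry costs an in-memory copy rather than a lock acquisition plus a write syscall.

```rust
use std::io::{self, BufWriter, Write};

fn main() -> io::Result<()> {
    let stdout = io::stdout();
    // Lock stdout once up front; the BufWriter then batches entries in
    // memory and only issues write syscalls when its buffer fills or on flush.
    let mut out = BufWriter::new(stdout.lock());
    for i in 0..3 {
        writeln!(out, "entry {}", i)?;
    }
    out.flush()
}
```

Dropping the `BufWriter` would also flush it, but flushing explicitly surfaces any I/O error instead of silently discarding it.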
Thanks @aswild
I agree with @sharkdp on avoiding the …
I have seen other unix tools with a … It might be too early to jump to solutions, though, since we don't understand the root cause of the problem. |
Good point. If I ran …
Ah, ok. I haven't looked at fd's internals that deeply yet. It sounds like some profiling might be in order to figure out where the hot spots are. |
How do other tools solve this? I get that there are some programs where the behavior can be chosen with a flag. But what would be a reasonable default? Or is there a way to improve the situation without the drawbacks? |
Although I can't remember which off the top of my head, I recall there being at least one program that detects whether stdout is a tty as a heuristic, but has a flag to override how output is buffered. |
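As a hedged sketch of that tty heuristic (illustrative only, not taken from fd or any of the tools mentioned): the standard library's `IsTerminal` trait (stable since Rust 1.70) can drive the choice, block-buffering when stdout is a pipe or file and flushing per line when it is an interactive terminal.

```rust
use std::io::{self, BufWriter, IsTerminal, Write};

fn main() -> io::Result<()> {
    let stdout = io::stdout();
    // Heuristic: a human watching a terminal wants results promptly;
    // a pipe (e.g. `fd | wc -l`) only cares about total throughput.
    let interactive = stdout.is_terminal();
    let mut out = BufWriter::new(stdout.lock());
    for i in 0..3 {
        writeln!(out, "entry {}", i)?;
        if interactive {
            out.flush()?; // line-buffered behaviour for terminals
        }
    }
    out.flush() // final flush covers the fully buffered (piped) case
}
```

A command-line flag overriding this default would simply replace the `is_terminal()` call with the user's choice.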
Was this ever implemented? I'd be interested in working on this. |
@Spaceface16518 It's not clear to me (yet) what exactly should be implemented. I think we need further investigation first. |
@arriven Thank you for the reminder. Indeed, I can not reproduce this anymore:

```bash
hyperfine \
    --parameter-scan threads 1 10 \
    --warmup 3 \
    --export-json results-fd.json \
    --export-markdown results-fd.md \
    "fd -j {threads} --color=never --hidden --type f | wc -l"

hyperfine \
    --parameter-scan threads 1 10 \
    --warmup 3 \
    --export-json results-rg.json \
    --export-markdown results-rg.md \
    "rg -j {threads} --color=never --hidden --no-messages --files | wc -l"

# plot results:
hyperfine/scripts/plot_parametrized.py \
    results-fd.json \
    results-rg.json \
    --titles "fd,rg"
```
|
On a fresh clone of the chromium repo, if I try to list files using `fd`, here's the result I get: [screenshot omitted]

When I use `rg` for the same purpose: [screenshot omitted]

Notice how the %CPU used in the case of `rg` is higher than with `fd`. My CPU has 8 cores, and the usage reported for `rg` is what I would expect if it is using all my cores. I suspected that `fd` might not be detecting the number of cores in my machine properly, so I tried using the `-j` flag to specify the number of cores manually. Here are some interesting results: [screenshot omitted]

The user time seems to be correlated with the CPU utilization. This also ends up in `fd` being slightly slower than `rg` on my machine. Is there an explanation for this behaviour of `fd`?

OS: Ubuntu 14.04
CPU: 8 x Intel(R) Core(TM) i7-9700K CPU @ 3.60GHz
RAM: 64G