fd is allocation heavy, which can lead to heavy lock contention on some malloc implementations, especially hardened ones #710

ericonr · 2020-12-31T19:49:58Z

EDIT: Ah, happy new year!

Describe the bug you encountered:

When comparing the speed of fd with musl vs glibc, they are very comparable with -j1, but when the core count goes above a certain threshold, musl gets diminishing returns, to the point that with -j12, on a 6-core system (with hyperthreading), it's actually slower than with -j4.

With fd + musl:

~ ➜ time fd -j1 freebsd ~/ >/dev/null

________________________________________________________
Executed in    3.17 secs   fish           external
   usr time    1.81 secs    0.00 micros    1.81 secs
   sys time    1.36 secs  599.00 micros    1.36 secs

~ took 3s ➜ time fd -j4 freebsd ~/ >/dev/null

________________________________________________________
Executed in    2.11 secs   fish           external
   usr time    5.40 secs  621.00 micros    5.40 secs
   sys time    2.47 secs   46.00 micros    2.47 secs

~ took 2s ➜ time fd -j8 freebsd ~/ >/dev/null

________________________________________________________
Executed in    2.62 secs   fish           external
   usr time   11.58 secs  476.00 micros   11.58 secs
   sys time    7.39 secs   35.00 micros    7.39 secs

~ took 2s ➜ time fd -j12 freebsd ~/ >/dev/null

________________________________________________________
Executed in    2.98 secs   fish           external
   usr time   16.22 secs    0.00 micros   16.22 secs
   sys time   15.78 secs  559.00 micros   15.78 secs

With fd + musl + pre-loaded jemalloc:

~ took 2s ➜ time LD_PRELOAD=/usr/lib/libjemalloc.so.2 fd -j1 freebsd ~/ >/dev/null

________________________________________________________
Executed in    2.75 secs   fish           external
   usr time    1.43 secs    0.00 micros    1.43 secs
   sys time    1.32 secs  913.00 micros    1.32 secs

~ ➜ time LD_PRELOAD=/usr/lib/libjemalloc.so.2 fd -j4 freebsd ~/ >/dev/null

________________________________________________________
Executed in  790.87 millis    fish           external
   usr time    1.75 secs  930.00 micros    1.75 secs
   sys time    1.37 secs    0.00 micros    1.37 secs

~ ➜ time LD_PRELOAD=/usr/lib/libjemalloc.so.2 fd -j8 freebsd ~/ >/dev/null

________________________________________________________
Executed in  552.14 millis    fish           external
   usr time    2.40 secs    0.00 micros    2.40 secs
   sys time    1.89 secs  882.00 micros    1.89 secs

~ ➜ time LD_PRELOAD=/usr/lib/libjemalloc.so.2 fd -j12 freebsd ~/ >/dev/null

________________________________________________________
Executed in  539.76 millis    fish           external
   usr time    3.42 secs  828.00 micros    3.42 secs
   sys time    2.76 secs    0.00 micros    2.76 secs

With fd + glibc:

[chroot] /home/ericonr > time fd -j1 freebsd ~/ >/dev/null

________________________________________________________
Executed in    2,73 secs   fish           external
   usr time    1,28 secs  488,00 micros    1,28 secs
   sys time    1,35 secs  141,00 micros    1,35 secs

[chroot] /home/ericonr > time fd -j4 freebsd ~/ >/dev/null

________________________________________________________
Executed in  854,58 millis    fish           external
   usr time    1,75 secs    0,00 micros    1,75 secs
   sys time    1,61 secs  585,00 micros    1,61 secs

[chroot] /home/ericonr > time fd -j8 freebsd ~/ >/dev/null

________________________________________________________
Executed in  601,37 millis    fish           external
   usr time    2,25 secs  439,00 micros    2,25 secs
   sys time    2,35 secs  144,00 micros    2,35 secs

[chroot] /home/ericonr > time fd -j12 freebsd ~/ >/dev/null

________________________________________________________
Executed in  558,01 millis    fish           external
   usr time    3,10 secs  612,00 micros    3,10 secs
   sys time    3,15 secs    0,00 micros    3,15 secs

With perf, I can track the issue (apparently) down to __lock:

  Children      Self  Command  Shared Object     Symbol
+   29.83%    29.58%  fd       libc.so           [.] __lock
+   27.29%     0.00%  fd       libc.so           [.] a_cas (inlined)

This matches with what was observed: with more threads, you get more lock contention, and can end up spending way too much time locked. I'd expect a similar effect to be observable with hardened_malloc from GrapheneOS and possibly OpenBSD's malloc. Theoretically speaking, rust might not require a hardened malloc, but you usually don't want to carry your own malloc for each application either, so it'd be nice if this could perform better "out of the box", by simply performing less allocations (which helps by removing most of the dependency on malloc performance). That said, I don't know how feasible fixing this is.

Describe what you expected to happen:

I'd expected all versions of the test to show similar growth in execution speed. Note that this is in my ~/ directory, because otherwise execution is short enough that you can't spot the issue (the numbers with a tarball of the linux kernel were too close).

What version of fd are you using?

fd 8.2.1

Which operating system / distribution are you on?

Linux (kernel 5.10.2), with musl 1.2.1

glibc used for testing was 2.30

jemalloc used for LD_PRELOAD was 5.2.1

The text was updated successfully, but these errors were encountered:

sharkdp · 2021-01-02T19:59:22Z

Thank you very much for the detailed report and analysis!

I have a couple of questions and remarks:

Are these benchmark results reliable? In the sense of: statistically significant?
Did you account for I/O caching effects (i.e.: either reset disk caches between runs or perform warmup runs)?
How many results does the fd freebsd search yield in your home directory? (how many results need to be written to /dev/null?)
How can we be sure that this is related to lock contention on malloc? Couldn't it be another shared resource that is the bottleneck? Like .. the file system?

I tried to reproduce your results by running:

SEARCH_ROOT="$HOME/some-large-folder"

benchmark() {
    libc="$1"
    hyperfine \
        --parameter-scan threads 1 8 \
        --warmup 5 \
        --export-json "fd-threads-$libc.json" \
        --export-markdown "fd-threads-$libc.md" \
        "./fd-v8.2.1-$libc -j {threads} nonexisting $SEARCH_ROOT"
}

benchmark gnu
benchmark musl

hyperfine/scripts/plot_parametrized.py \
  fd-threads-gnu.json \
  fd-threads-musl.json \
  --titles gnu,musl

And I see the following results (I have a 4-core system with hyperthreading):

Some remarks:

I tried to run another set of benchmarks with a LD_PRELOADed jemalloc for the musl version, but I see no difference in timing to the plain musl version. It seems like LD_PRELOAD=/usr/lib/libjemalloc.so.2 doesn't work for me (the library does exist). Do I need a special version of libjemalloc for that? My mistake. I was using a statically linked version of fd-musl 🙈
The difference between GNU libc and musl is huge. From this experience I would have expected a slow-down of 30%, but not a factor of 2-3 😦.
I can't seem to reproduce your results. There doesn't seem to be a (significant) slowdown for j=$(nproc)=8 threads.

Possibly related: #616 ?

ericonr · 2021-01-03T01:59:12Z

Are these benchmark results reliable? In the sense of: statistically significant?

I'm not sure what you'd consider statistically relevant here. They are reproducible and the difference between runs is small. I tried hyperfine below.

With musl:

~ took 22s ➜ hyperfine -i 'fd -j1 akjdakljd'
Benchmark #1: fd -j1 akjdakljd
  Time (mean ± σ):      3.540 s ±  0.045 s    [User: 1.978 s, System: 1.538 s]
  Range (min … max):    3.494 s …  3.621 s    10 runs

~ took 35s ➜ hyperfine -i 'fd -j4 akjdakljd'
Benchmark #1: fd -j4 akjdakljd
  Time (mean ± σ):      2.241 s ±  0.070 s    [User: 5.783 s, System: 2.644 s]
  Range (min … max):    2.125 s …  2.341 s    10 runs

~ took 22s ➜ hyperfine -i 'fd -j8 akjdakljd'
Benchmark #1: fd -j8 akjdakljd
  Time (mean ± σ):      3.285 s ±  0.125 s    [User: 14.704 s, System: 9.194 s]
  Range (min … max):    3.058 s …  3.459 s    10 runs

~ took 32s ➜ hyperfine -i 'fd -j12 akjdakljd'
Benchmark #1: fd -j12 akjdakljd
  Time (mean ± σ):      3.493 s ±  0.208 s    [User: 19.171 s, System: 18.874 s]
  Range (min … max):    3.130 s …  3.705 s    10 runs

With glibc:

[chroot] /home/ericonr > hyperfine -i 'fd -j1 akjdakljd'
Benchmark #1: fd -j1 akjdakljd
  Time (mean ± σ):      2.724 s ±  0.023 s    [User: 1.326 s, System: 1.380 s]
  Range (min … max):    2.699 s …  2.762 s    10 runs

[chroot] /home/ericonr > hyperfine -i 'fd -j4 akjdakljd'
Benchmark #1: fd -j4 akjdakljd
  Time (mean ± σ):     914.6 ms ±  33.4 ms    [User: 1.823 s, System: 1.781 s]
  Range (min … max):   886.5 ms … 979.2 ms    10 runs

[chroot] /home/ericonr > hyperfine -i 'fd -j8 akjdakljd'
Benchmark #1: fd -j8 akjdakljd
  Time (mean ± σ):     725.3 ms ±  12.7 ms    [User: 2.865 s, System: 2.755 s]
  Range (min … max):   705.4 ms … 741.8 ms    10 runs

[chroot] /home/ericonr > hyperfine -i 'fd -j12 akjdakljd'
Benchmark #1: fd -j12 akjdakljd
  Time (mean ± σ):     718.1 ms ±  18.9 ms    [User: 3.969 s, System: 3.936 s]
  Range (min … max):   675.7 ms … 738.4 ms    10 runs

Did you account for I/O caching effects (i.e.: either reset disk caches between runs or perform warmup runs)?

I'm always working with a hot cache.

How many results does the fd freebsd search yield in your home directory? (how many results need to be written to /dev/null?)

It gives me 450 results; the search from the tests above have no results, so the issue is unlikely to be locking when writing to stdout (assuming that's what you were thinking about).

How can we be sure that this is related to lock contention on malloc? Couldn't it be another shared resource that is the bottleneck? Like .. the file system?

I couldn't get perf to show me a backtrace with __lock being called by malloc, so I can't say with absolute certainty that it was the culprit. I can say, however, that I have good evidence for it, since:

on the same machine, kernel and files, glibc doesn't have the issue
the issue vanished when I pre-loaded a different malloc implementation

The difference between GNU libc and musl is huge. From this experience I would have expected a slow-down of 30%, but not a factor of 2-3 frowning.

The jemalloc (on musl) performance was quite similar to glibc, so I think the outlier here is musl.

I can't seem to reproduce your results. There doesn't seem to be a (significant) slowdown for j=$(nproc)=8 threads.

It definitely seems reproducible here (see tests above)... Do you know what version of musl you're using? Given the general theme here, that just looks like the present malloc issue can get even worse; so if we can fix that, I'm reasonably confident the issue would resolve itself.

sharkdp · 2021-01-03T21:04:32Z

I'm not sure what you'd consider statistically relevant here. They are reproducible and the difference between runs is small

That's what I wanted to know 😄 👍

How many results does the fd freebsd search yield in your home directory? (how many results need to be written to /dev/null?)

It gives me 450 results; the search from the tests above have no results, so the issue is unlikely to be locking when writing to stdout (assuming that's what you were thinking about).

Ok, so that's ruled out. Yes - that's what I was thinking about.

I couldn't get perf to show me a backtrace with __lock being called by malloc, so I can't say with absolute certainty that it was the culprit. I can say, however, that I have good evidence for it, since:
* on the same machine, kernel and files, glibc doesn't have the issue

* the issue vanished when I pre-loaded a different malloc implementation

My mistake. I accidentally looked at the "usr time" results for the LD_PRELOAD benchmark and didn't see the huge difference.

Do you know what version of musl you're using?

I actually don't :-o. I don't have musl installed. It seems to come with the Rust toolchain? How did you even manage to build a dynamically linked musl version of fd? (rust-lang/rust#34987)

A perf profile looks quite different for me:

  Children      Self  Command  Shared Object     Symbol
+   39,68%    39,68%  fd       [unknown]         [k] 0xffffffff8c000163                                            ◆
+   18,30%     0,14%  fd       fd                [.] fstatat                                                       ▒
+   17,75%     0,46%  fd       fd                [.] core::ops::function::FnOnce::call_once{{vtable-shim}}         ▒
+   17,73%     3,91%  fd       fd                [.] ignore::walk::Worker::run_one                                 ▒
+   17,64%     0,00%  fd       fd                [.] start                                                         ▒
+   17,64%     0,00%  fd       fd                [.] std::sys::unix::thread::Thread::new::thread_start             ▒
+   17,64%     0,00%  fd       fd                [.] <alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call▒
+   17,64%     0,00%  fd       fd                [.] core::ops::function::FnOnce::call_once{{vtable-shim}}         ▒
+   17,64%     0,00%  fd       fd                [.] std::sys_common::backtrace::__rust_begin_short_backtrace      ▒
+   16,47%    16,37%  fd       fd                [.] __bin_chunk                                                   ▒
+   11,42%     0,21%  fd       fd                [.] readdir                                                       ▒
+    9,28%     0,03%  fd       fd                [.] sccp                                                          ▒
+    6,35%     0,32%  fd       fd                [.] ignore::gitignore::Gitignore::matched_stripped                ▒
+    6,28%     6,28%  fd       fd                [.] malloc                                                        ▒

Given the general theme here, that just looks like the present malloc issue can get even worse; so if we can fix that, I'm reasonably confident the issue would resolve itself.

How would we go about this? I can certainly imagine that the recursive directory walker is quite heavy on allocations (readdir calls). I was actually thinking about this in the past. It would be a nice experiment to try out a custom allocator that is optimized for the directory traversal (i.e. for struct dirent allocations?).

We should probably find some way to count/measure all kinds of allocations (heaptrack?).

A quick fix would be to enable jemalloc for musl builds. For some reason, I deactivated them originally (#481), but we could try to re-activate this:

fd/src/main.rs

Lines 35 to 44 in 6a3aae5

 // We use jemalloc for performance reasons, see https:/sharkdp/fd/pull/481 

 // FIXME: re-enable jemalloc on macOS, see comment in Cargo.toml file for more infos 

 #[cfg(all( 

  not(windows), 

  not(target_os = "android"), 

  not(target_os = "macos"), 

  not(target_env = "musl") 

 ))] 

 #[global_allocator] 

 static ALLOC: jemallocator::Jemalloc = jemallocator::Jemalloc;

sharkdp · 2021-01-03T21:08:36Z

Indeed: most of the allocations are caused by opendir:

ericonr · 2021-01-03T21:35:24Z

I actually don't :-o. I don't have musl installed. It seems to come with the Rust toolchain? How did you even manage to build a dynamically linked musl version of fd? (rust-lang/rust#34987)

Ah, that'd make sense. I have no idea what Rust is currently shipping, then; they might be stuck with older musl temporarily, since musl 1.2.x changed time_t to a 64-bit integer for all platforms (including 32-bit ones). Distros like Alpine and Void Linux patch the system rustc to use dynamic linking on musl targets (Alpine defines an alpine target, Void just changes the default for the normal musl target).

It would be a nice experiment to try out a custom allocator that is optimized for the directory traversal (i.e. for struct dirent allocations?).

Not sure that's possible; the allocation happens for the DIR struct, which is completely opaque. The struct in turn contains a buffer that stores the struct dirent read and returns them with readdir.

A quick fix would be to enable jemalloc for musl builds. For some reason, I deactivated them originally (#481)

I remember reading reports about jemalloc + musl being broken for Rust (or even in general), so this might not work at all and/or be unstable in some way. I don't know the current state.

richfelker · 2021-01-03T21:41:30Z

It's not plausible that allocations by opendir are a performance bottleneck. Each opendir is an open syscall which is multiple orders of magnitude more costly than malloc. If a large performance of overall time is being spent in malloc contention, the calls are from elsewhere.

Moreover, readdir does not allocate anything; the object returned is part of the open DIR object and its lifetime (from the perspective of the caller, per the spec) ends the next time readdir is called (or closedir) on the same DIR object. But perhaps Rust's POSIX glue shims readdir with an allocation for memory-safety reasons..?

ericonr · 2021-01-03T21:49:45Z

I'm not sure on the allocation, but it does seem to use a function that will usually require locking: https:/rust-lang/rust/blob/ddafcc0b66e04ac4793361d3bd604c8c8a9d16a8/library/std/src/sys/unix/fs.rs#L493

I will see if I can get a backtrace on the lock, since it might not be due to malloc after all...

Or maybe this (from right above it)

            let mut ret = DirEntry { entry: mem::zeroed(), dir: Arc::clone(&self.inner) };
            let mut entry_ptr = ptr::null_mut();

does allocations?

richfelker · 2021-01-03T21:59:57Z

I don't actually know Rust, but I think the quoted code necessitates allocation. It should probably be reusing internal storage or if there's an API contract to keep the object valid after the next ReadDir call on the same stream, allocating arrays of them in advance and returning a member of the array.

Hello71 · 2021-01-03T22:49:39Z

I don't actually know Rust, but I think the quoted code necessitates allocation. It should probably be reusing internal storage or if there's an API contract to keep the object valid after the next ReadDir call on the same stream, allocating arrays of them in advance and returning a member of the array.

although I am not overly familiar with Rust, the impl Iterator for ReadDir does not appear to require allocation. all of the allocations are on the stack or RVO. furthermore, the posted screenshot claims that the allocations are made in __alloc_dir in libc, not from any Rust code. actually, this screenshot does not appear to support the claim that excessive allocations are slowing down the program: 23848 allocations is not a substantial amount, assuming that 23848 directories are being scanned.

richfelker · 2021-01-04T00:04:30Z

That raises another issue, that Rust's API seems to impose an assumption that the platform has a NAME_MAX...

Hello71 · 2021-01-04T01:25:04Z

this is only on non-solaris/fuchsia/redox/illumos. on those platforms, the name is copied into a heap-allocated buffer:

https:/rust-lang/rust/blob/ddafcc0b66e04ac4793361d3bd604c8c8a9d16a8/library/std/src/sys/unix/fs.rs#L467

sharkdp · 2021-01-04T08:45:27Z

Or maybe this (from right above it)
let mut ret = DirEntry { entry: mem::zeroed(), dir: Arc::clone(&self.inner) };
let mut entry_ptr = ptr::null_mut();
does allocations?

I don't actually know Rust, but I think the quoted code necessitates allocation.

I don't think so. let mut ret = DirEntry { … } creates a new DirEntry object on the stack. The dir field is a Arc<InnerReadDir>, i.e. a shared pointer to InnerReadDir. Calling Arc::clone(&self.inner) clones the shared pointer (and increases the refcount), but does not perform an allocation.

It's not plausible that allocations by opendir are a performance bottleneck. Each opendir is an open syscall which is multiple orders of magnitude more costly than malloc. If a large performance of overall time is being spent in malloc contention, the calls are from elsewhere.

actually, this screenshot does not appear to support the claim that excessive allocations are slowing down the program: 23848 allocations is not a substantial amount, assuming that 23848 directories are being scanned.

Exactly 23848 directories are being scanned, yes. No additional allocations seem to happen.

Note that the heaptrack run was performed on the GNU libc version (as I don't have a dynamically linked musl version of fd). From the stacktrace, I can see that __alloc_dir it is called from __opendir_tail. From looking at the source, this is called from opendir and opendirat.

I don't actually know Rust, but I think the quoted code necessitates allocation. It should probably be reusing internal storage or if there's an API contract to keep the object valid after the next ReadDir call on the same stream, allocating arrays of them in advance and returning a member of the array.

I will see if I can get a backtrace on the lock, since it might not be due to malloc after all...

That would be great. But as you mentioned, your LD_PRELOAD experiment with jemalloc seems to indicate that this is caused by malloc.

ericonr · 2021-01-04T19:17:11Z

I had forgotten to grab proper debug info. With it available, I can better show what's going on. The flame graphs above are, respectively, the 1, 4, 8 and 12 thread runs.

From a 12 thread run, the main culprits for calling __lock seem to be:

Samples: 139K of event 'cycles:u', Event count (approx.): 40416534270
  Children      Self  Command  Shared Object     Symbol
-   46.41%    45.97%  fd       libc.so           [.] __lock                                   ◆
   - 14.00% 0xffffffffffffffff                                                                ▒
      - 6.46% std::path::PathBuf::_push                                                       ▒
         - realloc                                                                            ▒
              5.84% __lock                                                                    ▒
            - 0.62% malloc                                                                    ▒
                 __lock                                                                       ▒
      - 3.11% regex::dfa::Fsm::cached_state                                                   ▒
           2.82% __lock                                                                       ▒
      - 1.43% ignore::gitignore::GitignoreBuilder::build                                      ▒
           1.30% __lock                                                                       ▒
      - 1.13% <alloc::vec::Vec<T> as core::clone::Clone>::clone                               ▒
           1.03% __lock                                                                       ▒
      - 1.06% <thread_local::ThreadLocal<T> as core::default::Default>::default               ▒
           0.96% __lock                                                                       ▒
   - 3.07% 0x7f905a2d35ef                                                                     ▒
      - std::ffi::os_str::OsStr::to_os_string                                                 ▒
           2.73% __lock                                                                       ▒
     1.60% 0x8                                                                                ▒
   - 1.39% 0x682e                                                                             ▒
      - std::ffi::os_str::OsStr::to_os_string                                                 ▒
           1.23% __lock                                                                       ▒
   - 0.85% 0xf423f                                                                            ▒
        0x7f905af91d3f                                                                        ▒
   - 0.66% 0x94939291908f8e8c                                                                 ▒
      - globset::glob::char_to_escaped_literal                                                ▒
           0.59% __lock                                                                       ▒
   - 0.58% 0x7                                                                                ▒
        0.52% std::ffi::os_str::OsStr::to_os_string                                           ▒
     0.53% 0x7f905afdabbf

Using perf report --call-graph=callee, I get:

  Children      Self  Command  Shared Object     Symbol
-   46.41%    45.97%  fd       libc.so           [.] __lock                                   ◆
   - __lock                                                                                   ▒
      - 9.61% std::ffi::os_str::OsStr::to_os_string                                           ▒
           2.73% 0x7f905a2d35ef                                                               ▒
           1.23% 0x682e                                                                       ▒
      - 9.35% realloc                                                                         ▒
         - 8.91% std::path::PathBuf::_push                                                    ▒
              5.84% 0xffffffffffffffff                                                        ▒
      - 4.37% malloc                                                                          ▒
           1.28% std::ffi::os_str::OsStr::to_os_string                                        ▒
         - 0.92% realloc                                                                      ▒
            - 0.88% std::path::PathBuf::_push                                                 ▒
                 0.68% 0xffffffffffffffff                                                     ▒
      - 3.86% <thread_local::ThreadLocal<T> as core::default::Default>::default               ▒
           0.96% 0xffffffffffffffff                                                           ▒
      - 3.02% regex::dfa::Fsm::cached_state                                                   ▒
           2.82% 0xffffffffffffffff                                                           ▒
      - 2.73% calloc                                                                          ▒
           2.31% globset::GlobSet::matches_candidate_into                                     ▒
      - 2.07% globset::glob::char_to_escaped_literal                                          ▒
           0.59% 0x94939291908f8e8c                                                           ▒
        1.89% <&[u8] as std::ffi::c_str::CString::new::SpecIntoVec>::into_vec                 ▒
      - 1.41% <alloc::vec::Vec<T> as core::clone::Clone>::clone                               ▒
           1.03% 0xffffffffffffffff                                                           ▒
      - 1.30% ignore::gitignore::GitignoreBuilder::build                                      ▒
           1.30% 0xffffffffffffffff                                                           ▒
      - 1.18% ignore::walk::Worker::run_one                                                   ▒
           0.81% 0x7f905af91d3f                                                               ▒
        0.77% free                                                                            ▒
        0.62% regex::re_unicode::escape                                                       ▒
+   20.38%     0.00%  fd       [unknown]         [.] 0xffffffffffffffff                       ▒
+   19.22%     3.77%  fd       fd                [.] ignore::walk::Worker::run_one            ▒
-   17.98%     1.04%  fd       libc.so           [.] realloc                                  ▒
   - realloc                                                                                  ▒
      - 17.05% std::path::PathBuf::_push                                                      ▒
           8.19% 0xffffffffffffffff                                                           ▒
           0.76% 0xc                                                                          ▒
+   17.94%     0.34%  fd       fd                [.] std::path::PathBuf::_push                ▒
-   16.58%     0.26%  fd       fd                [.] core::ops::function::FnOnce::call_once{{v▒
   - core::ops::function::FnOnce::call_once{{vtable-shim}}                                    ▒
      - 16.33% std::sys_common::backtrace::__rust_begin_short_backtrace                       ▒
           core::ops::function::FnOnce::call_once{{vtable-shim}}                              ▒
           std::sys::unix::thread::Thread::new::thread_start                                  ▒
           start                                                                              ▒
           __clone

ghost · 2021-03-05T11:55:58Z

strace -c fd -uuu . / > /dev/null on my system reports the futex syscall using 99% of (system) time.
Would it be any use to have a single producer multi consumer queue sending back the PathBuf to each thread for reuse?

sharkdp · 2021-03-14T21:31:52Z

Would it be any use to have a single producer multi consumer queue sending back the PathBuf to each thread for reuse?

Could you please go into a bit more detail?

ericonr · 2021-12-03T19:04:48Z

I'm assuming it was caused by #863, the 8.3.0 release seems to have improved performance on musl. For a search with many results, where musl was taking about ~2x the time glibc was, musl is now taking approximately ~1.25x the time.

Unfortunately, when there aren't result matches, which means no printing, musl is still taking ~3x the time that glibc is.

sharkdp · 2021-12-03T19:13:50Z

@tavianator FYI

tavianator · 2021-12-03T22:41:04Z

Well, I can reproduce this:

Benchmark 1: ./fd-glibc -j24 . ~/code/linux
  Time (mean ± σ):     127.2 ms ±   2.8 ms    [User: 809.6 ms, System: 1858.4 ms]
  Range (min … max):   120.2 ms … 133.0 ms    23 runs
 
Benchmark 2: ./fd-musl -j24 . ~/code/linux
  Time (mean ± σ):      1.047 s ±  0.020 s    [User: 5.097 s, System: 14.196 s]
  Range (min … max):    1.011 s …  1.063 s    10 runs
 
Summary
  './fd-glibc -j24 . ~/code/linux' ran
    8.23 ± 0.24 times faster than './fd-musl -j24 . ~/code/linux'

There is a lot of allocation contention as this bug report suggests. I don't think this can be easily worked around in fd, the musl malloc() implementation just doesn't really scale.

Well, I can think of one workaround:

diff --git a/Cargo.toml b/Cargo.toml
index 203b98e..d8864f0 100644
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -64,7 +64,7 @@ libc = "0.2"
 # FIXME: Re-enable jemalloc on macOS
 # jemalloc is currently disabled on macOS due to a bug in jemalloc in combination with macOS
 # Catalina. See https:/sharkdp/fd/issues/498 for details.
-[target.'cfg(all(not(windows), not(target_os = "android"), not(target_os = "macos"), not(target_os = "freebsd"), not(target_env = "musl")))'.dependencies]
+[target.'cfg(all(not(windows), not(target_os = "android"), not(target_os = "macos"), not(target_os = "freebsd")))'.dependencies]
 jemallocator = "0.3.0"
 
 [dev-dependencies]
diff --git a/src/main.rs b/src/main.rs
index 4f4ab14..371a1ce 100644
--- a/src/main.rs
+++ b/src/main.rs
@@ -39,7 +39,6 @@ use crate::regex_helper::{pattern_has_uppercase_char, pattern_matches_strings_wi
     not(target_os = "android"),
     not(target_os = "macos"),
     not(target_os = "freebsd"),
-    not(target_env = "musl")
 ))]
 #[global_allocator]
 static ALLOC: jemallocator::Jemalloc = jemallocator::Jemalloc;

Much better:

Benchmark 1: ./fd-glibc -j24 . ~/code/linux
  Time (mean ± σ):     126.5 ms ±   3.4 ms    [User: 808.9 ms, System: 1836.1 ms]
  Range (min … max):   120.9 ms … 131.6 ms    22 runs
 
Benchmark 2: ./fd-musl -j24 . ~/code/linux
  Time (mean ± σ):     135.7 ms ±   2.5 ms    [User: 840.1 ms, System: 2140.8 ms]
  Range (min … max):   129.8 ms … 141.8 ms    21 runs
 
Summary
  './fd-glibc -j24 . ~/code/linux' ran
    1.07 ± 0.03 times faster than './fd-musl -j24 . ~/code/linux

richfelker · 2021-12-03T23:09:49Z

I don't know what to add at this point except that, if malloc is such a bottleneck here, you'd get much better performance on glibc too by eliminating the gratuitous malloc/free cycles...

ericonr · 2021-12-03T23:22:35Z

I might be mistaken, but it is my understanding that all Rust APIs that end up making a syscall, because they receive a OsStr, will end up allocating in order to generate a CStr to pass to the libc functions. Idk if removing that occurrence of allocations would help these results much.

tavianator · 2021-12-03T23:23:54Z

@richfelker That's probably true but likely impractical for this project, as the paths and whatever else are dynamically allocated by other libraries that we don't control.

If I'm not mistaken, the current implementation of mallocng takes an exclusive lock even for reads. Is it easy to convert it to a real rwlock? That might help a bit.

richfelker · 2021-12-04T00:32:20Z

The design facilitates rwlock but was found not to be of any benefit, because the critical section is tiny and the cost is the memory-synchronizing operation itself, not exclusion due to the lock. As far as I can tell this is fundamental to global consistency, not any deficiency of implementation. It's vaguely possible that a more specialized rwlock (vs the general purpose pthread primitive) could do slightly better, but based on everything I've seen so far, I doubt it.

thestinger · 2022-01-09T09:50:48Z

Traditional thread caching is incompatible with the consistency checking in hardened_malloc and musl's malloc-ng. They could use batched allocation for malloc by having a thread-local array for the smallest size classes with a small number (4-16) of allocations to amortize the cost of locking for malloc but they can't put allocations back into it on free like glibc and jemalloc. That means there's inherently going to be the cost of locking/unlocking for each call to malloc (at least without batching) and free. Even an uncontended lock is expensive. hardened_malloc may add optional batched allocation support.

It's worth noting that despite being a hardened allocator focused on security properties, hardened_malloc can be used in a lightweight configuration with comparable performance to glibc malloc and jemalloc with their thread caches disabled. If you disable all the optional security features other than slab canaries, it still provides better security properties than those in every way. There's an inherently high cost to most of the other optional features that are enabled by default (zero on free, write after free check, frequent guard slabs and especially the slab quarantines) and the optional slot randomization currently has a fair bit of overhead but we expect that we can optimize it to be quite cheap when built specifically for modern x86_64 CPUs via the PDEP feature.

hardened_malloc supports jemalloc-style arenas so it can scale well to multiple cores. It has round-robin assignment of threads to arenas similar to jemalloc. The non-Android configuration uses 4 arenas by default but we use 1 arena on Android by default. Enabling more arenas makes it scale better to more threads but the security-focused approach in hardened_malloc needs to reserve a massive amount of address space for each arena. The typical 47-bit userspace address space (48 total) on x86_64 isn't enough for a lot more than 4 arenas. Linux on mobile arm64 typically has only a 38-bit userspace address space (39 total) due to using 3-level page tables as an optimization, but it can be configured to have 47-bit like x86_64.

The arena approach fits nicely into hardened_malloc but it doesn't really play well with the optional slab quarantine feature. Arenas multiply the worst case cached / quarantined memory of the allocator since they're separate instances with their own cached/quarantined memory. The quarantine feature purposely holds onto freed allocations and can't easily purge any memory for them. Multiplying that for each arena makes it much worse. We may end up making the quarantines much smaller by default and not quarantining the largest slab allocation sizes by default but until then it's a strong pressure against using many arenas. We use only 1 arena on GrapheneOS since we use both of the slab quarantine features. Making the slab quarantine features lighter is an open problem but there's unlikely to be any substantial improvement beyond choosing a different security vs. memory usage balance for the default configuration which is more slanted towards lighter memory usage.

OpenBSD malloc also now supports a similar arena approach, although it doesn't require lots of address space since they don't provide strict isolated regions and use a hash table to find the metadata for a slab or large allocation rather than addresses. OpenBSD malloc is far slower than hardened_malloc though. The whole point of hardened_malloc was making a replacement for OpenBSD malloc on 64-bit with better performance and also security via isolated regions without address space reused between them and making the optional security features fast enough to have them all enabled by default.

musl could use arenas via inline metadata specifying the arena (which would end up being verified through the other checks) but multiplying the cached memory goes against the lightweight memory approach. It could be an option though and wouldn't be that complicated. It would probably only be around 50 lines of code. I don't really see that happening since musl is focused on being lightweight on memory usage and simplicity. Perhaps it's worth trying it to see how much benefit it would offer though. It's also not quite as bad as multiplying the worst case cached memory in practice as long as threads are statically assigned to arenas. It depends on the workload.

thestinger · 2022-01-09T09:58:38Z

This was the initial arena implementation for hardened_malloc (with w=1 as a URL parameter to hide whitespace-only changes):

GrapheneOS/hardened_malloc@c5e9114?w=1

It was subsequently switched to using round-robin static assignment of threads to arenas instead of the placeholder random static assignment of threads to arenas and also micro-optimized. It hasn't changed significantly from the initial implementation in terms of approach. The difference for malloc-ng would be needing to find the arena from inline metadata instead of address range. It's really fairly trivial to implement. I just don't knows how well it fits with what @richfelker is trying to for musl's malloc-ng since it raises memory usage. It would be pretty lightweight if it was only 2-4 arenas though. In jemalloc, they make 4 arenas for every logical thread which makes it quite heavyweight, and in hardened_malloc it's slab quarantines that make it heavyweight when those are enabled, since they're heavyweight themselves.

thestinger · 2022-01-09T10:02:55Z

Also worth noting: hardened_malloc has entirely separate allocators for each small (slab) allocation size class within each arena with separate locks. It has global locking for large allocations and then a lock for each per-size-class-per-arena allocator. There are no other locks. It has no shared state since it doesn't share anything across size classes for security reasons as part of that isolated size class design. I don't think that per-size-class locking reduces contention that much though. We wouldn't have gone out of the way to do it. It makes sense because the ONLY cost for us to provide it is needing a ChaCha8 CSPRNG state struct for each size class allocator rather than one per arena. It doesn't actually increase memory usage due to how metadata is split up anyway. It's not like it matters in a typical security-focused configuration using slab quarantines since those are so heavy...

Fixes sharkdp#710. Fixes sharkdp#980.

Partial fix for sharkdp#710 and sharkdp#980.

Partial fix for #710 and #980.

jirutka · 2022-11-19T22:21:32Z

Could someone please benchmark fd built with mimalloc instead of jemalloc?

sharkdp · 2022-11-21T20:30:55Z

Should be easy to do using https://docs.rs/mimalloc/latest/mimalloc/. We have a benchmark suite here: https:/sharkdp/fd-benchmarks (see regression.sh)

ericonr added the bug label Dec 31, 2020

sharkdp added performance and removed bug labels Feb 15, 2021

tavianator mentioned this issue Mar 8, 2022

3x~10x Performance regression between 7.2.0 and >7.3.0 on large folder #980

Open

tavianator added a commit to tavianator/fd that referenced this issue Jul 12, 2022

Enable jemallocator on musl builds

9bea057

Fixes sharkdp#710. Fixes sharkdp#980.

tavianator mentioned this issue Jul 12, 2022

Enable jemallocator on 64-bit musl builds #1062

Merged

tavianator added a commit to tavianator/fd that referenced this issue Jul 13, 2022

Enable jemallocator on musl builds

b1c3468

Fixes sharkdp#710. Fixes sharkdp#980.

tavianator added a commit to tavianator/fd that referenced this issue Sep 16, 2022

Enable jemallocator on 64-bit musl builds

bc438f4

Partial fix for sharkdp#710 and sharkdp#980.

sharkdp pushed a commit that referenced this issue Sep 19, 2022

Enable jemallocator on 64-bit musl builds

e46d402

Partial fix for #710 and #980.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fd is allocation heavy, which can lead to heavy lock contention on some malloc implementations, especially hardened ones #710

fd is allocation heavy, which can lead to heavy lock contention on some malloc implementations, especially hardened ones #710

ericonr commented Dec 31, 2020 •

edited

Loading

sharkdp commented Jan 2, 2021 •

edited

Loading

ericonr commented Jan 3, 2021

sharkdp commented Jan 3, 2021

sharkdp commented Jan 3, 2021 •

edited

Loading

ericonr commented Jan 3, 2021 •

edited

Loading

richfelker commented Jan 3, 2021

ericonr commented Jan 3, 2021 •

edited

Loading

richfelker commented Jan 3, 2021

Hello71 commented Jan 3, 2021

richfelker commented Jan 4, 2021

Hello71 commented Jan 4, 2021

sharkdp commented Jan 4, 2021

ericonr commented Jan 4, 2021 •

edited

Loading

ghost commented Mar 5, 2021 •

edited by ghost

Loading

sharkdp commented Mar 14, 2021

ericonr commented Dec 3, 2021

sharkdp commented Dec 3, 2021

tavianator commented Dec 3, 2021

richfelker commented Dec 3, 2021

ericonr commented Dec 3, 2021

tavianator commented Dec 3, 2021

richfelker commented Dec 4, 2021 •

edited

Loading

thestinger commented Jan 9, 2022

thestinger commented Jan 9, 2022

thestinger commented Jan 9, 2022

jirutka commented Nov 19, 2022

sharkdp commented Nov 21, 2022

fd is allocation heavy, which can lead to heavy lock contention on some malloc implementations, especially hardened ones #710

fd is allocation heavy, which can lead to heavy lock contention on some malloc implementations, especially hardened ones #710

Comments

ericonr commented Dec 31, 2020 • edited Loading

sharkdp commented Jan 2, 2021 • edited Loading

ericonr commented Jan 3, 2021

sharkdp commented Jan 3, 2021

sharkdp commented Jan 3, 2021 • edited Loading

ericonr commented Jan 3, 2021 • edited Loading

richfelker commented Jan 3, 2021

ericonr commented Jan 3, 2021 • edited Loading

richfelker commented Jan 3, 2021

Hello71 commented Jan 3, 2021

richfelker commented Jan 4, 2021

Hello71 commented Jan 4, 2021

sharkdp commented Jan 4, 2021

ericonr commented Jan 4, 2021 • edited Loading

ghost commented Mar 5, 2021 • edited by ghost Loading

sharkdp commented Mar 14, 2021

ericonr commented Dec 3, 2021

sharkdp commented Dec 3, 2021

tavianator commented Dec 3, 2021

richfelker commented Dec 3, 2021

ericonr commented Dec 3, 2021

tavianator commented Dec 3, 2021

richfelker commented Dec 4, 2021 • edited Loading

thestinger commented Jan 9, 2022

thestinger commented Jan 9, 2022

thestinger commented Jan 9, 2022

jirutka commented Nov 19, 2022

sharkdp commented Nov 21, 2022

ericonr commented Dec 31, 2020 •

edited

Loading

sharkdp commented Jan 2, 2021 •

edited

Loading

sharkdp commented Jan 3, 2021 •

edited

Loading

ericonr commented Jan 3, 2021 •

edited

Loading

ericonr commented Jan 3, 2021 •

edited

Loading

ericonr commented Jan 4, 2021 •

edited

Loading

ghost commented Mar 5, 2021 •

edited by ghost

Loading

richfelker commented Dec 4, 2021 •

edited

Loading