
Allow multiple target features in is_target_feature_detected #348

Open
gnzlbg opened this issue Mar 7, 2018 · 10 comments

Comments

@gnzlbg
Contributor

gnzlbg commented Mar 7, 2018

We should support:

is_target_feature_detected!("avx2,sse4.2,lzcnt,popcnt")

so that users don't need to write

is_target_feature_detected!("avx2") && 
is_target_feature_detected!("sse4.2") && 
is_target_feature_detected!("lzcnt") &&
is_target_feature_detected!("popcnt")

cc @Cocalus

@alexcrichton
Member

This is unfortunately going to be really hard to do without writing a full-blown procedural macro, and that's one that would most likely have to live in rustc itself.

@gnzlbg
Contributor Author

gnzlbg commented Mar 7, 2018

A perma-unstable macro that splits strings at compile-time (the opposite of concat!) should be enough to implement this.

@Cocalus

Cocalus commented Mar 7, 2018

It might be enough for now to mention in the docs that, among the interfaces #[cfg(target_feature = ..)], is_target_feature_detected!, and #[target_feature(enable = "")], only the latter accepts (requires?) multiple inputs.

Another possible advantage of supporting multiple features is that it would make it easier to minimize the cost of the dynamic check. It appears that each is_target_feature_detected! invocation is run separately, but they could be merged into a single check against the features cache. Though maybe some extra inlines would let the compiler merge those checks without code changes. Alternatively, each is_target_feature_detected! could have its own cache, which would be a single check.

If the multiple checks are merged into a single mask check, then it might be worth merging the two 32-bit caches in the x86 detect code into a single 64-bit one when compiling on x86_64.

@gnzlbg
Contributor Author

gnzlbg commented Mar 8, 2018

Another possible advantage to using multiple is that it'll make it easier to minimize the cost of the dynamic check.

Hm, I hadn't thought of this. Right now, each time we test the cache we perform a relaxed load. AFAICT, LLVM is able to merge multiple consecutive relaxed loads into a single load. For example, check this out: https://godbolt.org/g/jgi6R3 . Here, example::FOO@GOTPCREL is only read once in each function, even though the last two functions try to read from it multiple times (one reads it twice, the other reads it in a loop).

Alternately each is_target_feature_detected! could have it's own cache which would be a single check.

What does that buy us? That multiplies the memory requirements of the SIMD run-time by the number of features (~64x increase on x86), and would require the compiler to emit M loads for M feature tests. Right now it emits a single load, independently of how many features are being tested (except maybe on x86, where it emits a maximum of two loads, but one can't do better there anyway).

If the multiple checks are merged into a single mask check, then it might be worth merging the two 32-bit caches in the x86 detect code into a single 64-bit one when compiling on x86_64.

That is how it is done, or what am I missing? On x86_64 the cache is a single AtomicU64, while on x86 the cache is a pair of AtomicU32s.


It might be enough for now to mention in the docs that, among the interfaces #[cfg(target_feature = ..)], is_target_feature_detected!, and #[target_feature(enable = "")], only the latter accepts (requires?) multiple inputs.

@alexcrichton we can add an example to the docs that uses two target features to show this. This is probably obvious to those who have used cfg! before, but I can see how it is non-obvious to those whose first exposure to cfg is via the stdsimd crate. Thoughts?

@Cocalus

Cocalus commented Mar 8, 2018

What does that buy us? That multiplies the memory requirements of the SIMD run-time by the number of features (~64x increase on x86), and would require the compiler to emit M loads for M feature tests. Right now it emits a single load, independently of how many features are being tested (except maybe on x86, where it emits a maximum of two loads, but one can't do better there anyway).

I didn't mean a load per feature but one per is_target_feature_detected! call. So an AtomicBool, which would be a single load on x86. The current system could still be used for singular features, which I think is already a single load even on x86, with an AtomicBool for multiple features. Or even a global AtomicBool per used set of features. While that is a potentially exponential number of feature sets, it's one bool per is_target_feature_detected! call site, which already uses more memory for the generated instructions.

But since it appears that LLVM can merge the relaxed loads and statically construct the feature mask, it's probably not a gain beyond maybe one fewer load on x86. I now expect the cache inefficiency would make it detrimental. I was originally more worried about wasting branch predictor capacity, but it already appears to compile to a single branch.

Technically even that branch can be removed in some cases by using an AtomicUsize as a function pointer; I've seen an example macro for this somewhere for the older simd crate. Alex Crichton might even have written it. The basic idea is that you have a set of functions with identical signatures and you want to select the fastest version at run time. Initially the function pointer points to a detection function, which changes the function pointer to the fastest version and then calls that version. On clang/gcc that type of check can even be done at load time with the ifunc attribute.

Note the reason I'm focusing on the cost of is_target_feature_detected! is that when the check was used in a ~20-instruction function, the dynamic checks showed up in perf half as much as the original function, though it was still cheaper than without SIMD. So I'm curious how much of that overhead can be reduced. Of course, on my end, moving the check out of the hot loop to a higher level is the performance-optimal solution.

Now that I think about it, I probably shouldn't see "std::stdsimd::arch::detect::check_for" in perf to begin with. The atomic check would ideally be inlined into the caller of is_target_feature_detected!, but that doesn't seem to be the case.

That is how it is done, or what am I missing? On x86_64 the cache is a single AtomicU64, while on x86 the cache is a pair of AtomicU32s.

Oops, I just missed the 64-bit cache code when scanning through the code.

@gnzlbg
Contributor Author

gnzlbg commented Mar 8, 2018

I didn't mean a load per feature but one per is_target_feature_detected! call. So an AtomicBool, which would be a single load on x86. The current system could still be used for singular features, which I think is already a single load even on x86, with an AtomicBool for multiple features. Or even a global AtomicBool per used set of features. While that is a potentially exponential number of feature sets, it's one bool per is_target_feature_detected! call site, which already uses more memory for the generated instructions.

I really don't understand what you mean here. Could you ping me on IRC? Maybe it's easier if we just discuss this.

since it appears that LLVM can [...] statically construct the feature mask,

Can LLVM do this? I haven't checked that.

Technically even that branch can be removed in some cases by using an AtomicUsize as a function pointer; I've seen an example macro for this somewhere for the older simd crate. Alex Crichton might even have written it. The basic idea is that you have a set of functions with identical signatures and you want to select the fastest version at run time. Initially the function pointer points to a detection function, which changes the function pointer to the fastest version and then calls that version. On clang/gcc that type of check can even be done at load time with the ifunc attribute.

This is something that can easily be built on top of the current system without overhead. RFC 2045 had an example that builds this, IIRC, and there are some crates that implement this with a proc-macro, but I don't know if those have been migrated to use stdsimd yet.

The atomic check would ideally be inlined into the caller of is_target_feature_detected, but that doesn't seem to be the case.

Do you have a minimal working example of this?

@alexcrichton
Member

@gnzlbg

@alexcrichton ... Thoughts?

Definitely! I'm all for adding more examples :)


FWIW how much do we have to gain from optimizing these checks much more? I was under the impression that if they're called regularly throughout SIMD code it destroys performance anyway?

@gnzlbg
Contributor Author

gnzlbg commented Mar 8, 2018

FWIW how much do we have to gain from optimizing these checks much more?

It is unclear to me that this is possible. @Cocalus mentioned:

I didn't mean a load per feature but one per is_target_feature_detected! call. So an AtomicBool, which would be a single load on x86. The current system could still be used for singular features, which I think is already a single load even on x86, with an AtomicBool for multiple features.

So IIUC (which I am not sure I do), @Cocalus proposes using the current system for calls to is_feature_detected! that query a single feature, but using one AtomicBool per combination of features.

That is, is_feature_detected!("foo,bar"), instead of reading one atomic once and then doing two bit tests (which LLVM might be able to merge into a single one using a mask, because the bit indices are compile-time constants), would instead read an AtomicBool that contains the precomputed result for this combination "somehow" (using a hashmap?).

Either we would store these AtomicBools using dynamic memory (which is probably a bad idea), or we would need to store them at compile time. @Cocalus recognizes that:

While that is a potentially exponential number of feature sets, it's one bool per is_target_feature_detected!

First, it's not one bool, it's an AtomicBool. Second, even if we treat foo,bar and bar,foo as the same "feature", we would still need ~2^64 AtomicBools in the worst case. That is, dynamic memory would be our only sound option AFAICT, and even then, if we run out of memory, it's game over anyway.

Then @Cocalus states:

then it's probably not a gain outside maybe one fewer load on x86.

On x86 one only gets two loads if one actually reads features from both AtomicU32s. If we sorted the features a bit (e.g. putting all SIMD-related features in one of the atomics and all crypto/bit-manipulation features in the other), we could probably reduce this to one load in practice.

In any case, if each load of an AtomicBool is a cache miss, none of these performance considerations matter, because then you are adding 100-1000 ns to each check. One advantage of having everything in a single memory location is that the chances of it being in the L1 cache or a register are high.

So... while I think that there might be some potential for improvement, I don't think this potential is high. A judicious use of is_target_feature_detected will probably have a larger impact than anything we can do here.


@alexcrichton

I was under the impression that if they're called regularly throughout SIMD code it destroys performance anyway?

As long as LLVM can hoist and merge the checks they should be cheap or free. Within the crate boundary this should happen often, across crates it would depend whether the relevant parts of the crate interface are #[inline] or whether one gets lucky with ThinLTO/LTO.

@Cocalus

Cocalus commented Mar 8, 2018

I think that if the relaxed load and check could be inlined, it would speed things up for code with lots of dynamic checks. The other ideas are probably detrimental.

The atomic check would ideally be inlined into the caller of is_target_feature_detected, but that doesn't seem to be the case.

Do you have a minimal working example of this?

I might have time to take a swing at it, but I'm not having luck pointing an example at a local clone of stdsimd master.

[dependencies]
rand = "0.4.2"
stdsimd = {path = "../stdsimd"}
error: failed to load source for a dependency on `stdsimd`

Caused by:
  Unable to update file:///data/home/mwilkinson/rust/stdsimd

Caused by:
  found a virtual manifest at `/data/home/mwilkinson/rust/stdsimd/Cargo.toml` instead of a package manifest

also if I try to run

cargo test

on stdsimd I get

error: custom attribute panicked
  --> crates/coresimd/src/../../../coresimd/x86/rdtsc.rs:59:5
   |
59 |     #[simd_test = "sse2"]
   |     ^^^^^^^^^^^^^^^^^^^^^
   |
   = help: message: called `Result::unwrap()` on an `Err` value: NotPresent

error: Could not compile `coresimd`.
warning: build failed, waiting for other jobs to finish...
error: build failed

This is on x86_64 ubuntu 14.04 linux, rustc 1.26.0-nightly (2789b067d 2018-03-06)

@gnzlbg
Contributor Author

gnzlbg commented Mar 9, 2018

I might have time to take a swing at it, but I'm not having luck pointing an example at a local clone of stdsimd master.

I've noticed this as well. I have fixed it locally, will send a PR to fix this soon.

The temporary workaround is to export a TARGET=... environment variable containing your target (e.g. TARGET=x86_64-unknown-linux-gnu cargo test --all).
