Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make HashMap use a per-process or per-thread seed instead of a per-instance seed #27243

Closed
Gankra opened this issue Jul 23, 2015 · 8 comments
Closed
Assignees
Labels
E-easy Call for participation: Easy difficulty. Experience needed to fix: Not much. Good first issue. E-mentor Call for participation: This issue has a mentor. Use #t-compiler/help on Zulip for discussion.

Comments

@Gankra
Copy link
Contributor

Gankra commented Jul 23, 2015

Currently every call to HashMap::new makes a new RandomState that makes two new calls to the thread_rng to create its seed. This is, to my knowledge cryptographic overkill, as there is no clear distinction between a single HashMap with a seed and many HashMaps with the same seed, as long as that seed is not hardcoded (and therefore able to be attacked offline).

There is no argument to be had on minimizing the risk of a single compromise, as the SipHash security model is designed to make it impossible to compromise a single map. If our users disagree with this, we may consider exposing a constructor for RandomState that explicitly reseeds it.

This can be implemented via Once or through a Thread Local Static. Whichever is more efficient.

CC @alexcrichton

@Gankra Gankra added A-libs E-easy Call for participation: Easy difficulty. Experience needed to fix: Not much. Good first issue. E-mentor Call for participation: This issue has a mentor. Use #t-compiler/help on Zulip for discussion. labels Jul 23, 2015
@Gankra Gankra self-assigned this Jul 23, 2015
@Gankra
Copy link
Contributor Author

Gankra commented Jul 23, 2015

Willing to mentor

@apasel422
Copy link
Contributor

@gankro I'd be willing to take a look at this.

@Gankra
Copy link
Contributor Author

Gankra commented Jul 23, 2015

Awesome! Go for it!

@alexcrichton
Copy link
Member

Note that this is done today for ease of implementation, if a global seed
is painful to add I wouldn't bother
On Thu, Jul 23, 2015 at 12:22 PM Alexis Beingessner <
[email protected]> wrote:

Awesome! Go for it!


Reply to this email directly or view it on GitHub
#27243 (comment).

@frankmcsherry
Copy link
Contributor

Just a quick comment. There are definitely other reasons to value a seeded HashMap other than DoS attacks, if that is interesting. The most common one I deal with is when I partition work among worker threads, possibly using a hash function, and then hope that when they arrive they will not have rotted out some bits of their hashes and make HashMap perf all wonky. The "per-thread" version might not suffer from this much, but you could easily imagine other ways to shoot yourself in the foot if you re-use seeded hashes too often.

I need cluster-wide consistency, so I actually seed all my own stuff and personally rot out the bits, but ... Just thought I'd mention it!

@Gankra
Copy link
Contributor Author

Gankra commented Jul 23, 2015

@frankmcsherry could you elaborate on this idea of bits rotting out?

@frankmcsherry
Copy link
Contributor

Sure, sorry. It may be tricky to make a version using only HashMap, so perhaps the answer is ... "bah!" :)

Let's say you have a bunch of workers thinking they'd like to count the words in all the strings they have. So, they distribute the words among themselves using their favorite hash function. Once the words next arrive at each worker, each puts their words in a HashMap to count them up. If the hash function they used to distribute the strings is related to the one in the HashMap (e.g. the same) the low order bits will be fixed, and bad perf can happen.

So, you may say: "hey well, you don't get to use HashMap's hash function to do the distribution", and I totally don't plan on trying that, but if anything bad happens because of how the keys are laid out (say I pre-aggregate all my word counts, and ship them, using HashMap.drain()) you can get accidentally crummy performance.

Random stuff like this happens, and you beat your head against the wall trying to understand what went wrong, and then you learn about number theory and that you and the HashMap implementor chose the same prime number somewhere, and ...

So, probably don't sweat it. :)

@Gankra
Copy link
Contributor Author

Gankra commented Jul 23, 2015

Ah, neat! Thanks!

huonw added a commit to huonw/rust that referenced this issue Feb 2, 2016
This reduces how much rand is required in std, and makes creating
hash-maps noticably faster. The first invocation of HashMap::new() in a
thread goes from ~130_000 ns to ~1500 ns (i.e. 86x faster) and later
invocations go from 10-20 ns to a more consistent 1.8-2.0ns.

The mean for many invocations in a loop, on a single thread, goes from
~77ns to ~1.9ns, and the standard deviation drops from 2800ns to
33ns (the *massive* variance of the old scheme comes from the occasional
reseeding of the thread rng).

These new numbers are only slightly higher than creating the
`RandomState` outside the loop and calling `with_state` in it (i.e.
only measuring the non-random parts of hashmap initialisation): 1.3 and
18 ns respectively.

This new scheme has the slight downside of being consistent on a single
thread, so users may unintentionally rely on the order being
fixed (e.g. two hashmaps with the same contents).

Closes rust-lang#27243.
alexcrichton added a commit to alexcrichton/rust that referenced this issue May 19, 2016
This is a rebase and extension of rust-lang#31356 where we cache the keys in thread local
storage. This should give us a nice speed bost in creating hash maps along with
mostly retaining the property that all maps have a nondeterministic iteration
order.

Closes rust-lang#27243
bors added a commit that referenced this issue May 20, 2016
std: Cache HashMap keys in TLS

This is a rebase and extension of #31356 where we not only cache the keys in
thread local storage but we also bump each key every time a new `HashMap` is
created. This should give us a nice speed bost in creating hash maps along with
retaining the property that all maps have a nondeterministic iteration order.

Closes #27243
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
E-easy Call for participation: Easy difficulty. Experience needed to fix: Not much. Good first issue. E-mentor Call for participation: This issue has a mentor. Use #t-compiler/help on Zulip for discussion.
Projects
None yet
4 participants