Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

allocate_vec returns uninitialised memory #139

Closed
kaedroho opened this issue May 9, 2017 · 6 comments
Closed

allocate_vec returns uninitialised memory #139

kaedroho opened this issue May 9, 2017 · 6 comments
Milestone

Comments

@kaedroho
Copy link
Contributor

kaedroho commented May 9, 2017

https:/tantivy-search/tantivy/blob/0dad02791c74e572ec005abb1f75e3601bd25db5/src/common/mod.rs#L31-L41

This could lead to a number of problems if not used carefully but most rust developers would not expect this behaviour from a "safe" function.

I really think this should be replaced with something safer, or at least marked as unsafe

@fulmicoton
Copy link
Collaborator

Interesting...

A bit of background there.

Rust's Vec big initialization is dead slow in debug mode. I started using the unsafe set_len to have unit test run in a manageable time.

This could lead to a number of problems if not used carefully but most rust developers would not expect this behaviour from a "safe" function.

This is unsafe because it is by definition "undefined behavior". In my experience, it can be on the contrary more dangerous to zero out vectors that you use for instance as a memory arena. In C++ it is one of the common cause for code to fail in optimized mode and seem to work perfectly (until when though?) in debug mode.

I also see how it can be a security issue (possibly exposing within the same process)...

@kaedroho and @BurntSushi, I am very interested if you could educate me on the subject. My opinion at the moment is really "In practise, I don't care." and "zero-initialization actually have hidden some bug from me in the past, I would have prefer things to fail fast".

BTW tantivy has a lot of much more unsafe code in its indexer, for performance reasons. If you have time for it, I'd be glad to discuss with you if you can help me figure out alternatives.

@BurntSushi
Copy link

@fulmicoton To clear the air: allocate_vec as linked in this issue is wrong. Its implementation either needs to be completely safe or it needs to be marked as an unsafe function. Leaving it as-is is just about the worst sin one can commit in Rust. (That is, marking something that's safe that isn't actually safe.)

Consider this example: https://is.gd/ipfyEu --- If you run it, you'll silently get nonsense data. This is because your main function contains undefined behavior, and the fact that there is no unsafe there is really bad. You should never have undefined behavior in safe code. If you do (and you do), it's a bug.

Rust's Vec big initialization is dead slow in debug mode. I started using the unsafe set_len to have unit test run in a manageable time.

You might consider turning on optimizations for running tests. It will increase compile times, of course.

Popping up a level, I will say this: I think justifying unsafe with debug mode improvements is not worth it, even if you made allocate_vec safe. I'd encourage you to either write your test to use smaller data, or to run that test with optimizations on.

BTW tantivy has a lot of much more unsafe code in its indexer, for performance reasons. If you have time for it, I'd be glad to discuss with you if you can help me figure out alternatives.

Can you point to the files? I see src/datastruct/stacker/heap.rs, and that appears to have some interesting uses of unsafe.

I will say these things also:

  • I try very hard to have an actual benchmark that justifies each use of unsafe.
  • I try very hard to document every use of unsafe with an argument of why it is safe. Example.

@fulmicoton
Copy link
Collaborator

Leaving it as-is is just about the worst sin one can commit in Rust.

Well... This is not a public API. Surely having it as a public API would be an ever greater sin. :)

@BurntSushi thanks for the explanation.

I'll see how I can improve that. I will -at the very least- mark the function as unsafe, and
reduce its usage to Vec<u8>.

I think justifying unsafe with debug mode improvements is not worth it, even if you made allocate_vec safe. I'd encourage you to either write your test to use smaller data, or to run that test with optimizations on.

I guess I am not very patient as a person.

  • src/datastruct/stacker/* is the most complicated one yes. I am in the middle of simplifying it but it will keep a lot of unsafe code for the memory arena.

The role of the memory arena is both about performance (bump allocation, scratch for unalloc) and functionality.
Each thread works "in memory" until a limit. In the first versions of tantivy, the user had to express the limit in term of "number of documents". Tantivy now allows to define a limit in number of MB of anonymous memory used. (This amount of memory is then split accross the indexing threads).

Another usage is in the BitPacker reader. There, there is 2 usages of unsafe that are probably
uncalled for.
https:/tantivy-search/tantivy/blob/master/src/common/bitpacker.rs

I use a *const u8 pointer to the buffer. If I recall correctly, I ended up doing that as a workaround over the "self borrowing struct problem".
I also use read a *const u64 to do the bit unpacking.

I have no elegant solution to the first part of the problem, and the solution to the second problem is probably somewhere in the byteorder crate.

I try very hard to have an actual benchmark that justifies each use of unsafe.
I try very hard to document every use of unsafe with an argument of why it is safe.

This sounds like great practise. I'll try to do the same but I don't think I'll ever reach your degree of discipline / QofCode.

@kaedroho
Copy link
Contributor Author

kaedroho commented May 9, 2017

If you want to go down a safer route, it might be possible to refactor all the places that use allocate_vec(size) to use vec![0u8; size] instead. This syntax benefits from an upcoming optimisation which might solve the performance issue in debug mode (rust-lang/rust#40409).

@fulmicoton
Copy link
Collaborator

fulmicoton commented May 10, 2017

@kaedroho @BurntSushi I found out very recently #136 two horrible design choices that were extremely expensive in term of memory (the ez to explain one, is that there is currently one hashmap per-field).

Once this is fixed, the minimal memory usage in the unit test should be much smaller, and it should be fine to do buffer allocation the "safe way".

@kaedroho Thanks for linking #40409, I didn't know a fix was on the way this is awesome.

@BurntSushi
Copy link

@fulmicoton Thanks for being receptive. :-) It's crucial that awesome projects like this don't abuse unsafe! :-)

fulmicoton added a commit that referenced this issue May 11, 2017
Also.

Cleaned up the code to make sure that the logic
is only in one place.
Removed allocate_vec

Closes #141
Closes #139
Closes #142
Closes #138
fulmicoton added a commit that referenced this issue May 11, 2017
Also.

Cleaned up the code to make sure that the logic
is only in one place.
Removed allocate_vec

Closes #141
Closes #139
Closes #142
Closes #138
fulmicoton added a commit that referenced this issue May 11, 2017
Also.

Cleaned up the code to make sure that the logic
is only in one place.
Removed allocate_vec

Closes #141
Closes #139
Closes #142
Closes #138
@fulmicoton fulmicoton added this to the 0.4.0 milestone Jul 13, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants