
Handle case when file size is too small for sample size #16

Merged · 2 commits · Aug 27, 2024

Conversation

kalafut
Owner

@kalafut kalafut commented Aug 26, 2024

Issue

Behavior can be unpredictable if the file size is less than ~2x the sample size and sampling is used. This can result in hash values that differ between imohash implementations, or even a crash (see #15).

Affected Users

This affects users who:

  • Set custom parameters
  • Have 2*sample_size > sample_threshold
  • Are processing files in the size range sample_threshold to 2*sample_size
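The affected window can be expressed as a quick sanity check. The helper below is a hypothetical illustration of the conditions above, not part of imohash's API:

```python
def in_affected_range(file_size: int, sample_size: int, sample_threshold: int) -> bool:
    """Hypothetical helper (not imohash API): the old behavior sampled any file
    at or above the threshold, but sampling is only well-defined once the file
    is roughly 2*sample_size bytes or larger."""
    return sample_threshold <= file_size < 2 * sample_size

# Custom parameters where 2*sample_size > sample_threshold expose the window:
assert in_affected_range(1500, sample_size=1024, sample_threshold=1000)
# Files at or past 2*sample_size are safe again:
assert not in_affected_range(2048, sample_size=1024, sample_threshold=1000)
```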

Analysis

imohash samples at the beginning, middle, and end of the file. If the sample size is s, the file needs to be at least 2s-1 bytes to sample the middle without hitting EOF. Furthermore, if s is larger than the whole file, a seek() error can occur when trying to seek s bytes back from the end of the file.
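The two failure conditions reduce to simple size arithmetic. This sketch assumes the middle read starts at offset L//2 (as discussed later in this thread); both boundary checks below are illustrative, not imohash code:

```python
def middle_read_fits(L: int, s: int) -> bool:
    # the middle sample is read starting at offset L//2 and must end by EOF
    return L // 2 + s <= L

def end_seek_ok(L: int, s: int) -> bool:
    # seeking s bytes back from the end fails if s exceeds the file size
    return L >= s

s = 16384
assert middle_read_fits(2 * s - 1, s)       # 2s-1 is the smallest safe size
assert not middle_read_fits(2 * s - 2, s)   # one byte less and the middle read hits EOF
assert not end_seek_ok(s - 1, s)            # file smaller than s -> seek error
```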

The original spec didn't correctly address this. There is a check described relating the sample size and sample threshold, but it doesn't really make sense. What matters is the actual file size and the sample size. Furthermore, the check was never implemented anyway!

Since the default sample size is 0.125 of the default threshold, the problem condition is never hit when using defaults.
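With the widely documented defaults (16 KiB sample size, 128 KiB threshold, so s = 0.125·t), any file large enough to be sampled is already 8s, well above the 2s-1 minimum:

```python
SAMPLE_SIZE = 16 * 1024        # imohash default s (16 KiB)
SAMPLE_THRESHOLD = 128 * 1024  # imohash default t (128 KiB)

assert SAMPLE_SIZE / SAMPLE_THRESHOLD == 0.125
# sampling only kicks in at L >= t = 8*s, comfortably above the 2*s - 1 minimum
assert SAMPLE_THRESHOLD >= 2 * SAMPLE_SIZE - 1
```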

Fix

The spec and code have been updated to check for sample size relative to file size and choose the correct mode.

If you're using custom parameters in the range described above and saving the hash, the hash post-upgrade will differ for files within the affected size range. In practice, this is a rare case. Most uses are for synchronization, and the hashes are ephemeral. Technically, this is a breaking change. However, given the narrow nature of this issue, I'm merely going to v1.1.0 with it and not a full v2.

Fixes #15

This fixes the case where hashing may crash when using custom parameters
if the sample size is too large for the file.

- Update the spec and code to correctly constrain the mode based on file size
- Update tests
- Minor lint fixes
@kalafut
Owner Author

kalafut commented Aug 26, 2024

cc @oertl @hiql @Hakkin @schollz for any feedback/concerns

@Hakkin

Hakkin commented Aug 26, 2024

I might just be missing something or not understand the algorithm correctly, but I'm not sure I quite understand where 2s-1 comes from. Wouldn't it be 3s? If we're reading a full buffer of s length from the start, middle and end of the reader, that would be 3s length. Currently it seems like if L is between 2s-1 and 3s, the hash will read overlapping data. If that is intentional, then I guess that's fine, but it seems a little odd to me. It seems like it would make more sense if a full hash was done if L < t || L <= 3s, otherwise do a sampled hash.

@Hakkin

Hakkin commented Aug 26, 2024

Oh wait, I think I realize my mistake. I confused myself by thinking the "middle" read would start directly after the "start" read in the case of 3s, but obviously that doesn't make sense since the "middle" read starts at L/2, not L/2-(s/2). I suppose if you wanted no overlapping reads, it would actually be 4s...? Maybe it does make more sense to just go with the bare minimum that doesn't error the reader. The above confusion aside, this looks good to me, in that it fixes the linked issue.
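The overlap reasoning above can be checked numerically. A small sketch (assuming the middle read starts at L//2, per the comment) showing that 4s is where the three reads stop overlapping:

```python
def reads(L: int, s: int):
    # (start, end) byte ranges of the three samples; middle begins at L//2
    return [(0, s), (L // 2, L // 2 + s), (L - s, L)]

def any_overlap(L: int, s: int) -> bool:
    r = sorted(reads(L, s))
    return any(a_end > b_start for (_, a_end), (b_start, _) in zip(r, r[1:]))

s = 16
assert not any_overlap(4 * s, s)   # at L = 4s the three reads are disjoint
assert any_overlap(3 * s, s)       # at L = 3s the middle and end reads overlap
```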

@kalafut
Owner Author

kalafut commented Aug 26, 2024

@Hakkin You raise a good point, and funnily enough I noticed 4s show up as a check in the Java port:

https://github.com/dynatrace-oss/hash4j/blob/83f97b36f4bed60c6b9edc5674fba6971c462c1f/src/main/java/com/dynatrace/hash4j/file/Imohash1_0_2.java#L71

Since this PR effectively defines new behavior anyway, I'm coming around to the idea of 4s since, in almost all cases, it will be less actual I/O. Plus, it's just simpler to comprehend than 2s-1, which seems overly specific for no real benefit.

@hiql

hiql commented Aug 27, 2024

> cc @oertl @hiql @Hakkin @schollz for any feedback/concerns

Checked and published a new rust version 😀

@oertl
Contributor

oertl commented Aug 27, 2024

As an alternative, one could require that t >= 3*s and hash the following ranges (using Python slice notation, end is exclusive): [0 : s], [(L-s)/2 : s+(L-s)/2], and [L-s : L]. This minimizes the required threshold. However, this proposal is incompatible with previous releases.
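A sketch of the centered-middle scheme proposed above (illustrative only, not the implemented behavior): at the minimum threshold t = 3s, the three samples tile the file exactly, with no gaps or overlap:

```python
def centered_ranges(L: int, s: int):
    # alternative scheme: center the middle sample on the file's midpoint
    m = (L - s) // 2
    return [(0, s), (m, m + s), (L - s, L)]

# at L == 3*s the three samples cover the whole file back-to-back:
s = 16
assert centered_ranges(3 * s, s) == [(0, 16), (16, 32), (32, 48)]
```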

@kalafut
Owner Author

kalafut commented Aug 27, 2024

I want to maintain stability as much as possible for the well-defined part and keep the selection pattern the same. The change now is defining previously undefined behavior, so there is a little more latitude. BTW, this topic will also be revisited when I implement a user-defined number of chunks in the future.

For now, I like the cutoff being: full mode if L < 4*s. It is objectively more efficient than what I initially pushed, and is cleaner.
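The resulting mode selection can be sketched as follows (a hedged illustration of the rule described above, not imohash source, using hypothetical custom parameters):

```python
def use_sampled_mode(L: int, s: int, t: int) -> bool:
    # rule after this PR, as described above: hash the whole file
    # when L < t or L < 4*s; otherwise use sampling
    return L >= t and L >= 4 * s

s, t = 1024, 1000   # hypothetical custom parameters
assert not use_sampled_mode(3 * s, s, t)  # too small: falls back to full hashing
assert use_sampled_mode(4 * s, s, t)      # large enough: sampled mode
```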

I've updated this PR with the change. And will get this released as v1.1.0.

Thank you @hiql for the prompt update. Would you mind changing the cutoff and test vectors one more time? 🙏
