
Handle case when file size is too small for sample size #16

Merged · 2 commits · Aug 27, 2024

Conversation

kalafut
Owner

@kalafut kalafut commented Aug 26, 2024

Issue

Behavior can be unpredictable if the file size is less than ~2x the sample size and sampling is used. This can result in hash values that differ between imohash implementations, or even a crash (see #15).

Affected Users

This affects users who:

  • Set custom parameters
  • Have 2*sample_size > sample_threshold
  • Are processing files in the size range sample_threshold to 2*sample_size
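The affected window can be expressed as a quick sanity check. The helper below is a hypothetical illustration of the conditions above, not part of imohash's API:

```python
def in_affected_range(file_size: int, sample_size: int, sample_threshold: int) -> bool:
    """Hypothetical helper (not imohash API): the old behavior sampled any file
    at or above the threshold, but sampling is only well-defined once the file
    is roughly 2*sample_size bytes or larger."""
    return sample_threshold <= file_size < 2 * sample_size

# Custom parameters where 2*sample_size > sample_threshold expose the window:
assert in_affected_range(1500, sample_size=1024, sample_threshold=1000)
# Files at or past 2*sample_size are safe again:
assert not in_affected_range(2048, sample_size=1024, sample_threshold=1000)
```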

Analysis

imohash samples at the beginning, middle, and end of the file. If the sample size is s, the file needs to be at least 2s-1 bytes to sample the middle without hitting EOF. Furthermore, if s is larger than the whole file, a seek() error can occur when trying to seek s bytes back from the end of the file.
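The two failure conditions reduce to simple size arithmetic. This sketch assumes the middle read starts at offset L//2 (as discussed later in this thread); both boundary checks below are illustrative, not imohash code:

```python
def middle_read_fits(L: int, s: int) -> bool:
    # the middle sample is read starting at offset L//2 and must end by EOF
    return L // 2 + s <= L

def end_seek_ok(L: int, s: int) -> bool:
    # seeking s bytes back from the end fails if s exceeds the file size
    return L >= s

s = 16384
assert middle_read_fits(2 * s - 1, s)       # 2s-1 is the smallest safe size
assert not middle_read_fits(2 * s - 2, s)   # one byte less and the middle read hits EOF
assert not end_seek_ok(s - 1, s)            # file smaller than s -> seek error
```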

The original spec didn't correctly address this. There is a check described relating the sample size and sample threshold, but it doesn't really make sense. What matters is the actual file size and the sample size. Furthermore, the check was never implemented anyway!

Since the default sample size is 0.125 of the default threshold, the problem condition is never hit when using defaults.
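With the widely documented defaults (16 KiB sample size, 128 KiB threshold, so s = 0.125·t), any file large enough to be sampled is already 8s, well above the 2s-1 minimum:

```python
SAMPLE_SIZE = 16 * 1024        # imohash default s (16 KiB)
SAMPLE_THRESHOLD = 128 * 1024  # imohash default t (128 KiB)

assert SAMPLE_SIZE / SAMPLE_THRESHOLD == 0.125
# sampling only kicks in at L >= t = 8*s, comfortably above the 2*s - 1 minimum
assert SAMPLE_THRESHOLD >= 2 * SAMPLE_SIZE - 1
```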

Fix

The spec and code have been updated to check for sample size relative to file size and choose the correct mode.

If you're using custom parameters in the range described above and saving the hash, the hash post-upgrade will differ for files within the affected size range. In practice, this is a rare case. Most uses are for synchronization, and the hashes are ephemeral. Technically, this is a breaking change. However, given the narrow nature of this issue, I'm merely going to v1.1.0 with it and not a full v2.

Fixes #15

This fixes the case where hashing may crash when using custom parameters
if the sample size is too large for the file.

- Update the spec and code to correctly constrain the mode based on file size
- Update tests
- Minor lint fixes
@kalafut
Owner Author

kalafut commented Aug 26, 2024

cc @oertl @hiql @Hakkin @schollz for any feedback/concerns

@Hakkin

Hakkin commented Aug 26, 2024

I might just be missing something or not understand the algorithm correctly, but I'm not sure I quite understand where 2s-1 comes from. Wouldn't it be 3s? If we're reading a full buffer of s length from the start, middle and end of the reader, that would be 3s length. Currently it seems like if L is between 2s-1 and 3s, the hash will read overlapping data. If that is intentional, then I guess that's fine, but it seems a little odd to me. It seems like it would make more sense if a full hash was done if L < t || L <= 3s, otherwise do a sampled hash.

@Hakkin

Hakkin commented Aug 26, 2024

Oh wait, I think I realize my mistake. I confused myself by thinking the "middle" read would start directly after the "start" read in the case of 3s, but obviously that doesn't make sense since the "middle" read starts at L/2, not L/2-(s/2). I suppose if you wanted no overlapping reads, it would actually be 4s...? Maybe it does make more sense to just go with the bare minimum that doesn't error the reader. The above confusion aside, this looks good to me, in that it fixes the linked issue.
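The overlap reasoning above can be checked numerically. A small sketch (assuming the middle read starts at L//2, per the comment) showing that 4s is where the three reads stop overlapping:

```python
def reads(L: int, s: int):
    # (start, end) byte ranges of the three samples; middle begins at L//2
    return [(0, s), (L // 2, L // 2 + s), (L - s, L)]

def any_overlap(L: int, s: int) -> bool:
    r = sorted(reads(L, s))
    return any(a_end > b_start for (_, a_end), (b_start, _) in zip(r, r[1:]))

s = 16
assert not any_overlap(4 * s, s)   # at L = 4s the three reads are disjoint
assert any_overlap(3 * s, s)       # at L = 3s the middle and end reads overlap
```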

@kalafut
Owner Author

kalafut commented Aug 26, 2024

@Hakkin You raise a good point, and funnily enough I noticed 4s show up as a check in the Java port:

https://github.com/dynatrace-oss/hash4j/blob/83f97b36f4bed60c6b9edc5674fba6971c462c1f/src/main/java/com/dynatrace/hash4j/file/Imohash1_0_2.java#L71

Since this PR effectively defines new behavior anyway, I'm coming around to the idea of 4s since, in almost all cases, it will be less actual I/O. Plus, it's just simpler to comprehend than 2s-1, which seems overly specific for no real benefit.

@hiql

hiql commented Aug 27, 2024

> cc @oertl @hiql @Hakkin @schollz for any feedback/concerns

Checked and published a new rust version 😀

@oertl
Contributor

oertl commented Aug 27, 2024

As an alternative, one could require that t >= 3*s and hash the following ranges (using Python slice notation, end is exclusive): [0 : s], [(L-s)/2 : s+(L-s)/2], and [L-s : L]. This minimizes the required threshold. However, this proposal is incompatible with previous releases.
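A sketch of the centered-middle scheme proposed above (illustrative only, not the implemented behavior): at the minimum threshold t = 3s, the three samples tile the file exactly, with no gaps or overlap:

```python
def centered_ranges(L: int, s: int):
    # alternative scheme: center the middle sample on the file's midpoint
    m = (L - s) // 2
    return [(0, s), (m, m + s), (L - s, L)]

# at L == 3*s the three samples cover the whole file back-to-back:
s = 16
assert centered_ranges(3 * s, s) == [(0, 16), (16, 32), (32, 48)]
```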

@kalafut
Owner Author

kalafut commented Aug 27, 2024

I want to maintain stability as much as possible for the well-defined part and keep the selection pattern the same. The change now is defining previously undefined behavior, so there is a little more latitude. BTW, this topic will also be revisited when I implement a user-defined number of chunks in the future.

For now, I like the cutoff being: full mode if L < 4*s. It is objectively more efficient than what I initially pushed, and is cleaner.
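The resulting mode selection can be sketched as follows (a hedged illustration of the rule described above, not imohash source, using hypothetical custom parameters):

```python
def use_sampled_mode(L: int, s: int, t: int) -> bool:
    # rule after this PR, as described above: hash the whole file
    # when L < t or L < 4*s; otherwise use sampling
    return L >= t and L >= 4 * s

s, t = 1024, 1000   # hypothetical custom parameters
assert not use_sampled_mode(3 * s, s, t)  # too small: falls back to full hashing
assert use_sampled_mode(4 * s, s, t)      # large enough: sampled mode
```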

I've updated this PR with the change. And will get this released as v1.1.0.

Thank you @hiql for the prompt update. Would you mind changing the cutoff and test vectors one more time? 🙏
