Handle case when file size is too small for sample size #16
Issue
Behavior can be unpredictable if the file size is less than ~2x the sample size and sampling is used. This could include hash values that differ between imohash implementations, or even crashing (see #15).
Affected Users
This affects users who:
- use custom parameters with sample_size > 2*sample_threshold
- hash files in the size range sample_threshold --> 2*sample_size
Analysis
imohash samples at the beginning, middle, and end of the file. If the sample size is s, the file needs to be at least 2s-1 bytes to sample the middle without hitting EOF. Furthermore, if s is larger than the whole file, then there can be a seek() error when trying to sample s bytes back from the end of the file.

The original spec didn't correctly address this. There is a check described relating the sample size and sample threshold, but it doesn't really make sense. What matters is the actual file size and the sample size. Furthermore, the check was never implemented anyway!
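As a rough illustration of the arithmetic above, here is a minimal Python sketch. It assumes the middle sample starts at file_size // 2 and the end sample starts at file_size - sample_size; the function names are hypothetical and not part of any imohash implementation.

```python
def sample_windows(file_size, sample_size):
    """Return (name, start, end) for the three sample windows; end is exclusive.

    Assumption: middle starts at file_size // 2, end starts at
    file_size - sample_size. Real implementations may differ.
    """
    middle = file_size // 2
    return [
        ("beginning", 0, sample_size),
        ("middle", middle, middle + sample_size),
        ("end", file_size - sample_size, file_size),
    ]


def sampling_problems(file_size, sample_size):
    """List the windows that would seek before the start or read past EOF."""
    problems = []
    for name, start, end in sample_windows(file_size, sample_size):
        if start < 0:
            problems.append(f"{name}: negative seek offset {start}")
        elif end > file_size:
            problems.append(f"{name}: reads {end - file_size} byte(s) past EOF")
    return problems
```

With s = 100, a 199-byte file (2s-1) reports no problems, a 198-byte file overruns the middle window by one byte, and a 50-byte file additionally produces a negative offset for the end window.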
Since the default sample size is 0.125 of the default threshold, the problem condition is never hit when using defaults.
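A quick check of that claim, as a sketch only: the concrete default values below (16 KiB sample size, 128 KiB threshold) are assumptions chosen to match the 0.125 ratio stated above, and the reasoning assumes sampling is only used once the file size reaches the threshold.

```python
# Assumed defaults, consistent with the 0.125 ratio described above.
DEFAULT_SAMPLE_SIZE = 16 * 1024
DEFAULT_SAMPLE_THRESHOLD = 128 * 1024

ratio = DEFAULT_SAMPLE_SIZE / DEFAULT_SAMPLE_THRESHOLD

# Smallest file size that can be sampled safely (2s-1 from the analysis).
min_safe_size = 2 * DEFAULT_SAMPLE_SIZE - 1

# Any file large enough to be sampled is already above the safe minimum.
defaults_are_safe = DEFAULT_SAMPLE_THRESHOLD >= min_safe_size
```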
Fix
The spec and code have been updated to check for sample size relative to file size and choose the correct mode.
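The updated check could look roughly like the following Python sketch. It is an illustration under assumptions, not the actual imohash code: the function name and the "full"/"sampled" labels are hypothetical, and it assumes files below the threshold are always hashed in full.

```python
def choose_mode(file_size, sample_size, sample_threshold):
    """Pick a hashing mode from the actual file size, not just the parameters.

    Hypothetical helper: returns "full" when the whole file should be hashed
    and "sampled" when three sample windows fit without overrunning the file.
    """
    if file_size < sample_threshold:
        return "full"  # small file: sampling is not used at all
    if file_size < 2 * sample_size - 1:
        return "full"  # too small for the middle sample to fit before EOF
    return "sampled"
```

For example, with sample_size=100 and sample_threshold=128, a 150-byte file falls back to a full hash, while a 199-byte file can be sampled.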
If you're using custom parameters in the range described above and saving the hash, the hash post-upgrade will differ for files within the affected size range. In practice, this is a rare case. Most uses are for synchronization, and the hashes are ephemeral. Technically, this is a breaking change. However, given the narrow nature of this issue, I'm merely going to v1.1.0 with it and not a full v2.
Fixes #15