This repository has been archived by the owner on Apr 16, 2020. It is now read-only.

Make the Rabin Chunker perform well, or document why it's not fixable #142

Open
4 tasks
flyingzumwalt opened this issue Feb 23, 2017 · 2 comments

Comments

@flyingzumwalt

Based on the tests in #137, the Rabin chunker isn't providing any real deduplication benefit. It's also really slow.

  • Identify why the Rabin chunker is not providing much deduplication benefit
  • Make some repeatable tests for benchmarking the chunker
  • Improve the chunker or document why it's not fixable
  • If this is a matter of Rabin being appropriate for some content and NOT for other content, document the scenarios where it does and doesn't apply
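
A repeatable deduplication test along these lines might look like the sketch below. It is only an illustration under stated assumptions: `content_chunks` is a toy content-defined chunker that uses CRC32 over a sliding window rather than a real Rabin fingerprint, and every function name and parameter here is hypothetical, not taken from the project's chunker. The test stores an original blob and a copy with one byte inserted at the front, then counts how many bytes a content-addressed store would keep for each chunker.

```python
import hashlib
import random
import zlib

def fixed_chunks(data, size=64):
    """Baseline: cut every `size` bytes regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def content_chunks(data, window=16, mask=0x3F, min_size=16, max_size=1024):
    """Toy content-defined chunker (CRC32 of a sliding window, NOT a real
    Rabin fingerprint): cut where the window hash has its low bits all zero."""
    chunks, start = [], 0
    for end in range(1, len(data) + 1):
        size = end - start
        if size < min_size:
            continue
        hit = (zlib.crc32(data[max(0, end - window):end]) & mask) == 0
        if hit or size >= max_size:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes after the last cut
    return chunks

def stored_bytes(chunker, versions):
    """Bytes a content-addressed store would keep after adding every version."""
    seen = {}
    for data in versions:
        for chunk in chunker(data):
            seen[hashlib.sha256(chunk).digest()] = len(chunk)
    return sum(seen.values())

rng = random.Random(0)  # fixed seed so the run is repeatable
original = bytes(rng.getrandbits(8) for _ in range(20_000))
edited = b"X" + original  # a one-byte insertion at the front

total = len(original) + len(edited)
for name, chunker in [("fixed", fixed_chunks), ("content-defined", content_chunks)]:
    kept = stored_bytes(chunker, [original, edited])
    print(f"{name}: stored {kept} of {total} bytes")
```

Swapping the real chunker in for `content_chunks` and real file pairs in for the random data would turn this into a benchmark; timing each chunker call would cover the speed complaint as well.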
@DonaldTsang

DonaldTsang commented Feb 26, 2018

@flyingzumwalt a good way to try this is to create file-format-specific chunking rules.

  • Audio and video should be chunked by frame
    • use chunk headers to find boundaries
  • Text and ebooks should be chunked by sentence or paragraph
    • EPUB is a ZIP container (entries may be stored or deflated), so each sub-file should be chunked separately
    • Mobi/Azw is a container with "Records", so each "Record" should be chunked separately
    • DjVu is based on IFF, so it should be chunked at FORM: and other chunk markers
    • PDF and PS should be dealt with as well
    • File headers and files within containers should be chunked independently
  • Source code should be chunked by object or function blocks
    • Each programming language needs its own chunking algorithm
    • The chunking algorithm should make each chunk longer than 44 bytes (the length of a base64-encoded SHA-256 digest)
      • If moving to SHA512, BLAKE2b or Skein512, the minimum would have to be 88 bytes
    • The safest bet is to chunk on { }, since braces delimit functions in C-like languages
      • chunking on ; is tricky, as JS automatic semicolon insertion can treat a newline as one
      • chunking on newlines is also tricky, as Java allows newlines almost anywhere
      • chunking on semicolons would also create very small chunks
    • Python, Ruby, Lua, VB and Haskell should be chunked by indentation
      • Ruby, VB and Lua have end or a similar keyword to mark the end of a block
      • Python and Haskell have syntax-implied indentation
  • HTML, XML and JSON should be chunked by element blocks
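
As a rough illustration of the brace-based idea for C-like source, here is a minimal sketch. The function name `chunk_by_braces` and the `MIN_CHUNK` constant are made up for this example, and it naively counts every `{` and `}`, so braces inside strings or comments would confuse it. It cuts after each closing brace that returns to nesting depth zero, but only once the chunk has reached the 44-byte minimum; a short trailing piece is merged into the previous chunk.

```python
MIN_CHUNK = 44  # minimum chunk size suggested above for SHA-256

def chunk_by_braces(source: bytes, min_size: int = MIN_CHUNK) -> list:
    """Cut after each top-level '}' once the chunk is at least min_size bytes.

    Naive sketch: braces inside string literals or comments are not skipped.
    """
    chunks, start, depth = [], 0, 0
    for i, byte in enumerate(source):
        if byte == ord("{"):
            depth += 1
        elif byte == ord("}"):
            depth = max(depth - 1, 0)
            if depth == 0 and i + 1 - start >= min_size:
                chunks.append(source[start:i + 1])
                start = i + 1
    tail = source[start:]
    if tail:
        if chunks and len(tail) < min_size:
            chunks[-1] += tail  # avoid emitting an undersized final chunk
        else:
            chunks.append(tail)
    return chunks

f1 = b"int add(int first, int second) { return first + second; }"
f2 = b"int sub(int first, int second) { return first - second; }"
code = f1 + b"\n" + f2 + b"\n"

for chunk in chunk_by_braces(code):
    print(len(chunk), chunk)
```

With this rule, editing the body of one function changes only that function's chunk, which is the property the format-specific approach is after. Indentation-based languages would need a different boundary test in place of the brace counter.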

Keywords: Content Defined, Chunking, Deduplication

@DonaldTsang

DonaldTsang commented Feb 26, 2018

It might be good to do some research on FastCDC and Asymmetric Extremum, both of which have low computational overhead.
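
For a sense of why Asymmetric Extremum is cheap, here is a simplified sketch as I understand the idea: scan forward while tracking the largest value seen since the chunk started, and cut once `window` bytes have passed without a new maximum. No rolling hash is needed, only comparisons. This version compares single bytes for simplicity, and the name `ae_chunks` and the default `window` are assumptions for illustration, not a reference implementation.

```python
def ae_chunks(data: bytes, window: int = 32) -> list:
    """Simplified Asymmetric Extremum chunking over single-byte values.

    A cut is placed `window` bytes after the most recent strict maximum,
    so every chunk except possibly the last is at least window + 1 bytes.
    """
    chunks, start, n = [], 0, len(data)
    while start < n:
        max_val, max_pos = data[start], start
        cut = n  # default: no cut point found, take the rest
        for i in range(start + 1, n):
            if data[i] > max_val:
                max_val, max_pos = data[i], i  # new extremum restarts the window
            elif i == max_pos + window:
                cut = i + 1  # byte i is the last byte of this chunk
                break
        chunks.append(data[start:cut])
        start = cut
    return chunks

sample = bytes((i * 37 + (i >> 3)) % 251 for i in range(5_000))
pieces = ae_chunks(sample)
print(len(pieces), "chunks, smallest non-final:", min(len(p) for p in pieces[:-1]))
```

Because the cut position depends only on where a local maximum falls, an insertion early in a file should shift later boundaries along with the content rather than invalidating them, though that would need to be measured against the Rabin chunker on real data.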
