Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Function to rechunk single variables or batch variables from existing MDIO #368

Merged
merged 9 commits into from
Mar 12, 2024

Conversation

tasansal
Copy link
Collaborator

@tasansal tasansal commented Mar 12, 2024

This PR allows users to rechunk MDIO datasets when needed.

Also added examples and documentation.

The MDIO API has been enhanced with support for additional file operation modes ('w' for rechunking) and a new rechunking feature. The 'copy_mdio' function now accepts strongly typed arguments, and 'rechunk' functions have been added to efficiently resize chunks for large datasets, with progress tracking and error handling improvements.
@tasansal tasansal added the enhancement New feature or request label Mar 12, 2024
@tasansal tasansal self-assigned this Mar 12, 2024
Expanded the docstrings in `convenience.py` to include examples illustrating how to use `rechunk` and `rechunk_batch` functions for clarity and ease of use for developers.
The new section "Convenience Functions" has been added to the reference documentation. It specifically includes documentation for the `mdio.api.convenience` module, excluding `create_rechunk_plan` and `write_rechunked_values`.
The rechunk operations in the MDIO API now accept an optional compressor parameter, allowing users to specify a custom data compression codec. The default compressor, Blosc('zstd'), is set if none is provided, ensuring backward compatibility.
Inserted a TODO comment in `rechunk` function for writing tests, referencing the relevant issue.
Removed an extraneous newline and introduced a constant MAX_BUFFER to handle buffer size for chunking. Updated the create_rechunk_plan function's docstring to include the buffer size details, making it clearer how the buffer size can be adjusted by altering the MAX_BUFFER variable. This change enhances code readability and maintainability.
Added a new demo `rechunking.ipynb` to demonstrate how to optimize access patterns using rechunking and lossy compression. The notebook includes detailed steps and code snippets to create optimized, compressed copies for different access patterns, enhancing read performance.
Streamlined the Jupyter notebook by resetting execution counts and cleaning up metadata fields. This provides a fresh state for the execution environment and a more structured document for other developers to follow.
Corrected the reference from 'notebook' to 'page', added a parenthetical clarification to a section heading, and updated the performance benchmark outputs. These changes improve document clarity and provide the latest performance metrics.
@tasansal tasansal merged commit d157465 into main Mar 12, 2024
20 checks passed
@tasansal tasansal deleted the add-rechunker branch March 12, 2024 03:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant