Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ORC-1280:[C++] Implement block-based buffer(Part I) #1271

Merged
merged 12 commits into from
Oct 12, 2022

Conversation

coderex2522
Copy link
Contributor

What changes were proposed in this pull request?

This patch implements a block-based buffer allocator. Provides smarter memory management for the buffer of BufferedOutputStream class.

Why are the changes needed?

This patch implements smarter memory management, which can effectively solve the issue-1240. And replacing DataBuffer with BlockBuffer in BufferedOutputStream class will be implemented in another patch.

How was this patch tested?

The UTs in TestBlockBuffer.cc can test this patch.

@coderex2522 coderex2522 changed the title ORC-1280:[C++][Part I] Implement block-based buffer ORC-1280:[C++] Implement block-based buffer(Part I) Oct 6, 2022
#ifndef ORC_MEMORYPOOL_IMPL_HH
#define ORC_MEMORYPOOL_IMPL_HH

#include "orc/MemoryPool.hh"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a little confusing making another MemoryPool.hh like this. Could you create a new file instead of MemoryPool.hh file?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I rename src/MemoryPool.hh into src/BlockBuffer.hh.

@dongjoon-hyun dongjoon-hyun added this to the 1.9.0 milestone Oct 6, 2022
@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented Oct 6, 2022

Thank you for making a PR. I have one comment about file structure, #1271 (comment).

cc @wgtmac , @stiga-huang

@@ -0,0 +1,90 @@

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Redundant space.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -0,0 +1,82 @@

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Redundant space.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented Oct 7, 2022

It seems that the GitHub Action labeler is broken. Let me fix it.

@github-actions github-actions bot added the CPP label Oct 7, 2022
Copy link
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @coderex2522 for the improvement! Would be great to equip the new class with more comment in detail.

Block() : data(nullptr), size(0) {}
Block(char* _data, uint64_t _size) : data(_data), size(_size) {}
Block(const Block& block) = default;
~Block() {}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

~Block() = default;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

void resize(uint64_t size);
void reserve(uint64_t capacity);
};
} // namespace
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// namespace orc

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

/**
* Get the Block object
*/
Block getBlock(uint64_t blockIndex);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need some detail explanation with regard to the mechanism and behavior of the BlockBuffer. Especially concepts like block, block index and block number.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some description for block, blockbuffer, block index and block number.

/**
* Get the Block object
*/
Block getBlock(uint64_t blockIndex);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we provide mutable and immutable overloads?

Copy link
Contributor Author

@coderex2522 coderex2522 Oct 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because the getBlock function does not modify class member variables, the const attribute is added to the function.

}

void resize(uint64_t size);
void reserve(uint64_t capacity);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Insert a new line before reserve function. Better to explain what will happen in the reserve function

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some comments to explain the reserve function.

return currentCapacity;
}

void resize(uint64_t size);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we support shrinkToFit parameter?

Copy link
Contributor Author

@coderex2522 coderex2522 Oct 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Class DataBuffer only recycles memory during destructor, so class BlockBuffer does not support shrink function, which can be further optimized later.

* Block points to a section of memory allocated by BlockBuffer,
* containing the corresponding physical memory address and size.
*/
struct Block {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it better to make it nested class of BlockBuffer?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As it stands struct Block only serves class BlockBuffer, so it makes reasonable to treat it as a nested class of class BlockBuffer.

/**
* Get a empty block or a new block if the buffer is exhausted.
* If the last allocated block size is less than blockSize,
* the empty block size is blockSize - lastBlockSize.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is the lastBlockSize?
It would be more user-friendly to get the available size of the returned block via output parameter.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lastBlockSize means the size of the last allocated block. The comment has been modified here.
The returned block is used to describe the section of memory that can be read or written. And I add comment to the size in struct Block. So I don't recommend adding extra output parameter.

* Otherwise, the empty block size is blockSize.
* @return a empty block object
*/
Block getEmptyBlock();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename to getNextBlock or getBlockToWrite?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Rename to getNextBlock.

c++/src/BlockBuffer.hh Show resolved Hide resolved
Block getEmptyBlock();

/**
* Get the number of allocated blocks
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Get the number of blocks that are fully or partially occupied

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have left an inline comment. Overall looks good.

Block emptyBlock(
blocks[currentSize / blockSize] + currentSize % blockSize,
blockSize - currentSize % blockSize);
currentSize = (currentSize / blockSize + 1) * blockSize;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we provide a function to update currentSize to reflect the actual size written? Maybe something setSize() or backup(). This looks weird when the returned block is written partially meaning that currentSize is larger than used

To provide better usability, we may provide a function like void append(const char data, size_t size)* to append data to the buffer and manage the blocks internally. This can be a separate patch.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If user only uses part of the block, the size of BlockBuffer can currently be set via the resize() function. BlockBuffer class is temporarily used to replace DataBuffer in BufferedOutputStream, and the existing functions conform to BufferedOutputStream's usage behavior.

The append function can be added additionally later if there are usage scenarios.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the flush() function of class BufferedOutputStream needs to access all allocated memory blocks, it needs class BlockBuffer to provide an interface in order for BufferedOutputStream to access all allocated memory blocks.

Copy link
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 LGTM

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM, too.

@wgtmac wgtmac merged commit 8cf1047 into apache:main Oct 12, 2022
@wgtmac
Copy link
Member

wgtmac commented Oct 12, 2022

I've submitted and closed it. Thanks @coderex2522 and @dongjoon-hyun

cxzl25 pushed a commit to cxzl25/orc that referenced this pull request Jan 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants