[GOBBLIN-2159] Adding support for partition level copy in Iceberg distcp #4058
base: master
b4f6369 to d8356e1
this is a great start! mostly suggestions to leverage a bit more of the existing classes (rather than creating near clones) and also to simplify some interfaces (esp. for the partition filter predicates) to take in specific params, rather than Properties. given the latter may hold just about anything, the API "contract" they define is weaker than we'd want.
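To make the point about typed params vs. Properties concrete, here is a minimal sketch. The class name PartitionValueFilter and its constructor arguments are hypothetical (they are not from this PR), and it assumes the predicate is handed a data file's partition values as a plain list of strings.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical sketch: the filter's inputs are explicit, typed constructor params,
// so the compiler enforces the contract instead of a lookup in a Properties bag.
final class PartitionValueFilter implements Predicate<List<String>> {
  private final int partitionColumnIndex;
  private final Set<String> acceptedValues;

  PartitionValueFilter(int partitionColumnIndex, Set<String> acceptedValues) {
    if (partitionColumnIndex < 0) {
      throw new IllegalArgumentException("partitionColumnIndex must be >= 0");
    }
    this.partitionColumnIndex = partitionColumnIndex;
    // Defensive copy; HashSet tolerates contains(null) for null partition values.
    this.acceptedValues = new HashSet<>(acceptedValues);
  }

  @Override
  public boolean test(List<String> partitionValues) {
    // A row with fewer partition values than expected never matches.
    if (partitionColumnIndex >= partitionValues.size()) {
      return false;
    }
    return acceptedValues.contains(partitionValues.get(partitionColumnIndex));
  }
}
```

With a signature like this, a missing or misspelled setting becomes a compile-time or constructor-time error rather than a silent null read from Properties.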
...main/java/org/apache/gobblin/data/management/copy/iceberg/IcebergPartitionDatasetFinder.java
Outdated
...main/java/org/apache/gobblin/data/management/copy/iceberg/IcebergPartitionDatasetFinder.java
Outdated
...a-management/src/main/java/org/apache/gobblin/data/management/copy/iceberg/IcebergTable.java
...a-management/src/main/java/org/apache/gobblin/data/management/copy/iceberg/IcebergTable.java
Outdated
...a-management/src/main/java/org/apache/gobblin/data/management/copy/iceberg/IcebergTable.java
Outdated
...t/src/main/java/org/apache/gobblin/data/management/copy/iceberg/IcebergPartitionDataset.java
Outdated
...t/src/main/java/org/apache/gobblin/data/management/copy/iceberg/IcebergPartitionDataset.java
Outdated
CopyableFile fileEntity = CopyableFile.fromOriginAndDestination(
    actualSourceFs, srcFileStatus, targetFs.makeQualified(destPath), copyConfig)
    .fileSet(fileSet)
    .datasetOutputPath(targetFs.getUri().getPath())
    .build();
you skip first doing this, like in IcebergDataset:
// preserving ancestor permissions till root path's child between src and dest
List<OwnerAndPermission> ancestorOwnerAndPermissionList =
    CopyableFile.resolveReplicatedOwnerAndPermissionsRecursively(actualSourceFs,
        srcPath.getParent(), greatestAncestorPath, copyConfig);
is that intentional? do you feel it's not necessary or actually contra-indicated?
In IcebergDataset the table paths are exactly the same, since the table UUID is the same on source and destination. Here they can differ, so I believe copying permissions is not necessary, at least in the first draft.
Even if it is needed, we would have to make sure the ancestor path and parent path are the ones we want; that's why I have removed it for now.
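For readers unfamiliar with the ancestor walk being discussed, the sketch below shows the general idea using plain java.nio.file.Path. It is not Gobblin's resolveReplicatedOwnerAndPermissionsRecursively, and the class and method names (AncestorPaths, ancestorsBelow) are hypothetical. It only illustrates that the walk is anchored on a shared "greatest ancestor", which is the assumption that becomes shaky when source and destination layouts differ.

```java
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: lists the directories between a file and a chosen ancestor.
final class AncestorPaths {
  // Returns the ancestors of `file`, starting at its parent and walking upward,
  // stopping before `greatestAncestor` (exclusive).
  // Returns an empty list when `file` is not located under `greatestAncestor`.
  static List<Path> ancestorsBelow(Path file, Path greatestAncestor) {
    List<Path> result = new ArrayList<>();
    if (!file.startsWith(greatestAncestor)) {
      return result;
    }
    for (Path p = file.getParent(); p != null && !p.equals(greatestAncestor); p = p.getParent()) {
      result.add(p);
    }
    return result;
  }
}
```

When the destination file lives under a different directory structure, the list computed from the source path has no guaranteed one-to-one counterpart on the destination, which is the concern raised in the reply above.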
// Adding this check to avoid adding post publish step when there are no files to copy.
if (CollectionUtils.isNotEmpty(destDataFiles)) {
  copyEntities.add(createPostPublishStep(destDataFiles));
}
I agree this is one difference with IcebergDataset::generateCopyEntities, which always wants to add its post-publish step. (but it shouldn't be hard to refactor to isolate this difference)
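One possible shape for isolating that difference is a small overridable hook in the base class. This is only a sketch with hypothetical stand-in classes (BaseDatasetSketch, PartitionDatasetSketch) and strings in place of real copy entities; it is not the refactoring actually adopted in the PR.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the base dataset: owns the shared generation flow.
class BaseDatasetSketch {
  final List<String> generateCopyEntities(List<String> filesToCopy) {
    List<String> entities = new ArrayList<>(filesToCopy);
    // The only varying decision is delegated to a hook.
    if (shouldAddPostPublishStep(filesToCopy)) {
      entities.add("post-publish-step");
    }
    return entities;
  }

  // Base behavior: the post-publish step is always added.
  protected boolean shouldAddPostPublishStep(List<String> filesToCopy) {
    return true;
  }
}

// Hypothetical stand-in for the partition-level dataset: skips the step when nothing is copied.
class PartitionDatasetSketch extends BaseDatasetSketch {
  @Override
  protected boolean shouldAddPostPublishStep(List<String> filesToCopy) {
    return !filesToCopy.isEmpty();
  }
}
```

The derived class then overrides one small predicate instead of the whole generation method.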
 * @throws IOException if an I/O error occurs
 */
@Override
Collection<CopyEntity> generateCopyEntities(FileSystem targetFs, CopyConfiguration copyConfig) throws IOException {
this impl is really, really similar to the one it's based on in its base class. deriving from a class and then overriding methods w/ only small changes is pretty nearly cut-and-paste code. sometimes it's inevitable, but let's avoid when we can. in this case, could we NOT override this method, but only GetFilePathsToFileStatusResult getFilePathsToFileStatus(...), so this derived class's version runs the new code instead:
IcebergTable srcIcebergTable = getSrcIcebergTable();
List<DataFile> srcDataFiles = srcIcebergTable.getPartitionSpecificDataFiles(this.partitionFilterPredicate);
List<DataFile> destDataFiles = getDestDataFiles(srcDataFiles);
Configuration defaultHadoopConfiguration = new Configuration();
for (FilePathsWithStatus filePathsWithStatus : getFilePathsStatus(srcDataFiles, destDataFiles, this.sourceFs)) {
  ...
Let me list my reasons here:
- The IcebergDataset implementation assumes that srcPath and destPath are the same, which is not the case here. Where that code uses srcPath and srcFileStatus, here it needs to use destPath and srcFileStatus, both for readability and for maintainability.
- Currently I have added just ReplacePartitionStep as the post-publish step, but IcebergRegisterStep also needs to be added for the schema validation scenario. I will raise that as a separate PR, because it needs proper validation so that we do not corrupt data files on the dest table.
- I am not fully convinced that copying ancestor permissions is even required. I did try to make it work by changing the ancestor path and parent path, but it wasn't working, so removing it is a must for now.
- If I override just GetFilePathsToFileStatusResult getFilePathsToFileStatus(...), then we need to override the data class GetFilePathsToFileStatusResult too, as we need the data files along with destPath and srcFileStatus.
To conclude:
- the reader should understand whether it is actually srcPath or destPath when creating the copyable file
- we need to add the replace-partition commit step along with the register step (based on a condition)
- and copying permissions is removed for now.
overall looking good. part 1 of 2 done on this re-review... will return
...main/java/org/apache/gobblin/data/management/copy/iceberg/IcebergPartitionDatasetFinder.java
Outdated
...a-management/src/main/java/org/apache/gobblin/data/management/copy/iceberg/IcebergTable.java
Outdated
...a-management/src/main/java/org/apache/gobblin/data/management/copy/iceberg/IcebergTable.java
Outdated
...che/gobblin/data/management/copy/iceberg/predicates/IcebergPartitionFilterPredicateUtil.java
...ain/java/org/apache/gobblin/data/management/copy/iceberg/IcebergOverwritePartitionsStep.java
Outdated
...ain/java/org/apache/gobblin/data/management/copy/iceberg/IcebergOverwritePartitionsStep.java
Outdated
...ain/java/org/apache/gobblin/data/management/copy/iceberg/IcebergOverwritePartitionsStep.java
Outdated
...ain/java/org/apache/gobblin/data/management/copy/iceberg/IcebergOverwritePartitionsStep.java
...t/src/main/java/org/apache/gobblin/data/management/copy/iceberg/IcebergPartitionDataset.java
...t/src/main/java/org/apache/gobblin/data/management/copy/iceberg/IcebergPartitionDataset.java
Outdated
...a-management/src/main/java/org/apache/gobblin/data/management/copy/iceberg/IcebergTable.java
Outdated
} catch (IOException e) {
  log.warn("Failed to read manifest file: {} " , manifestFile.path(), e);
}
iceberg is atomic/transactional, so I really don't agree w/ swallowing exceptions and still proceeding onward when the table is corrupted. that has the potential for us to lay even more corruption on top of that...
please explain if you see a genuine argument for ignoring errors.
Yeah, completely agree with your suggestion. Somehow I missed it; let me correct it by failing the copy with proper logging.
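A fail-fast version of the manifest loop could look roughly like the sketch below. The ManifestReader interface and the class ManifestScanSketch are hypothetical placeholders for whatever actually reads a manifest, and the real fix in the PR may differ; the point is only that the IOException is rethrown with context instead of being logged and ignored.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

final class ManifestScanSketch {
  // Hypothetical abstraction standing in for reading the data files of one manifest.
  interface ManifestReader {
    List<String> readDataFiles(String manifestPath) throws IOException;
  }

  // Collects data files from every manifest; aborts on the first unreadable manifest
  // so that a partially read table is never used as the basis for a copy.
  static List<String> collectDataFiles(List<String> manifestPaths, ManifestReader reader)
      throws IOException {
    List<String> dataFiles = new ArrayList<>();
    for (String manifestPath : manifestPaths) {
      try {
        dataFiles.addAll(reader.readDataFiles(manifestPath));
      } catch (IOException e) {
        // Add the manifest path for diagnosis, keep the cause, and propagate.
        throw new IOException("Failed to read manifest file: " + manifestPath, e);
      }
    }
    return dataFiles;
  }
}
```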
 * @return the index of the partition column if found, otherwise -1
 * @throws IllegalArgumentException if the partition transform is not supported
 */
public static int getPartitionColumnIndex(
this single static seems closely related enough to IcebergMatchesAnyPropNamePartitionFilterPredicate that it could reasonably live there as a public static (eliminating the need for an additional separate class).
Currently it looks like that, but in the future we will need more filter predicates, and every filter will need the partition column index, so I believe keeping it separate for now should be fine. Maybe we can put this in the factory class itself, or convert this class into a factory class.
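For context, the contract described in the javadoc above (index if found, -1 otherwise, IllegalArgumentException for an unsupported transform) can be sketched without any Iceberg types. Everything here is a hypothetical stand-in: PartitionFieldSketch models a partition field as a name plus a transform string, and the set of supported transforms is passed in as an assumption rather than taken from the PR.

```java
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a partition field: column name plus transform name.
final class PartitionFieldSketch {
  final String name;
  final String transform;

  PartitionFieldSketch(String name, String transform) {
    this.name = name;
    this.transform = transform;
  }
}

final class PartitionColumnIndexSketch {
  // Returns the position of `columnName` among `fields`, or -1 when absent.
  // Throws IllegalArgumentException when the matching field uses an unsupported transform.
  static int getPartitionColumnIndex(
      String columnName, List<PartitionFieldSketch> fields, Set<String> supportedTransforms) {
    for (int i = 0; i < fields.size(); i++) {
      PartitionFieldSketch field = fields.get(i);
      if (field.name.equals(columnName)) {
        if (!supportedTransforms.contains(field.transform)) {
          throw new IllegalArgumentException(
              "Unsupported partition transform '" + field.transform + "' for column " + columnName);
        }
        return i;
      }
    }
    return -1;
  }
}
```

Because the helper depends only on the partition fields and not on any one predicate, it can be shared by several filter predicates, which is the argument made above for keeping it in its own class or a factory.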
Dear Gobblin maintainers,
Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!
JIRA
Description
Tests
- testGetPartitionSpecificDataFiles()
- testReplacePartitions()
Commits