Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Add PdfWriter.remove_objects_from_page(page: PageObject, to_delete: ObjectDeletionFlag) #1648

Merged
merged 35 commits into from
Feb 26, 2023

Conversation

pubpub-zz
Copy link
Collaborator

@pubpub-zz pubpub-zz commented Feb 20, 2023

fix remove_text to set contents as indirect_objects (iaw PDF Specification)

wipe out text also in XObject forms

Same issues + refactoring for remove_images()

closes #1644
closes #1650

@codecov
Copy link

codecov bot commented Feb 20, 2023

Codecov Report

Base: 92.25% // Head: 92.40% // Increases project coverage by +0.15% 🎉

Coverage data is based on head (b70986a) compared to base (14b61ea).
Patch coverage: 91.50% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1648      +/-   ##
==========================================
+ Coverage   92.25%   92.40%   +0.15%     
==========================================
  Files          33       33              
  Lines        6431     6468      +37     
  Branches     1286     1287       +1     
==========================================
+ Hits         5933     5977      +44     
  Misses        315      315              
+ Partials      183      176       -7     
Impacted Files Coverage Δ
pypdf/_writer.py 85.72% <91.08%> (+1.05%) ⬆️
pypdf/__init__.py 100.00% <100.00%> (ø)
pypdf/generic/_base.py 99.65% <100.00%> (+<0.01%) ⬆️
pypdf/generic/_annotations.py 97.75% <0.00%> (+0.13%) ⬆️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@pubpub-zz pubpub-zz mentioned this pull request Feb 21, 2023
@pubpub-zz
Copy link
Collaborator Author

I've completed the test file with some Form data
ANNOTED completed.pdf

Fix wrongly removed Forms
prevent text to remain hidden
add test
@pubpub-zz
Copy link
Collaborator Author

@MartinThoma
All yours. 😊
Happy to be back over 92% of test coverage

@MartinThoma MartinThoma changed the title ROB : Remove text fixed and improved ROB: fix and improve remove_text Feb 25, 2023
@MartinThoma MartinThoma added the is-robustness-issue From a users perspective, this is about robustness label Feb 25, 2023
pypdf/_writer.py Outdated
@@ -1808,85 +1808,101 @@ def removeLinks(self) -> None: # deprecated
deprecation_with_replacement("removeLinks", "remove_links", "3.0.0")
return self.remove_links()

def remove_img_or_text(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather make this a private method:

Suggested change
def remove_img_or_text(
def _remove_img_or_text(

(that needs some adjustments in other parts as well)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer to leave it public : currently remove_images() and remove_text() are cleaning the full document : this will provide a solution to just clean one page.
However I agree the name is too tricky : I propose to rename the function into remove_image_or_text_from_page()

Your opinion ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I understand! So it's actually adding a new feature :-)

I don't like the image_or_text part of the method name. Maybe we can change the signature to this:

class ObjectDeletionFlag(enum.IntFlag):
    TEXT = enum.auto()
    IMAGES = enum.auto()

...

    def remove_objects_from_page(
        self, page: Union[PageObject, DictionaryObject], to_delete: ObjectDeletionFlag
    ) -> None:

I see some advantages in the proposed interface:

  1. We could add the capability to delete annotations (maybe even specific annotation types) in the same method
  2. We could allow users to specify multiple flags at one method call
  3. Personal preference: When I see an _or_ in a method name, I tend to think that those should be two methods 😄
  4. Personal preference: I think that the current use of del_image=False => delete text is not intuitive. I would like to avoid that.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

    def remove_objects_from_page(
        self, page: Union[PageObject, DictionaryObject], to_delete: ObjectDeletionFlag
    ) -> None:

Sounds good to me.

I see some advantages in the proposed interface:
1. We could add the capability to delete annotations (maybe even specific annotation types) in the same method

Text and Images requires to remove to parse content, that's why they are specific, however from a interface point of view we can merge them.

2. We could allow users to specify multiple flags at one method 

good idea implemented.

3. Personal preference: When I see an `_or_` in a method name, I tend to think that those should be two methods 😄
 not iaw my rules...😉🤣🤣🤣🤣 
4. Personal preference: I think that the current use of `del_image=False` => delete text is not intuitive. I would like to avoid that.

btw I hate the name of the function remove_links() which name is not describing it properly. Shouldn't we have remove_annots() and remove_links() - for links only?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to look it up - https://pypdf.readthedocs.io/en/latest/modules/PdfWriter.html#pypdf.PdfWriter.remove_links

Yes, that is exactly this kind of unexpected dual-purpose which I don't like -.-

I want to keep the amount of breaking changes low for this year. Give people a chance to adapt. But we should make a list of such things to be cleaned-up in future.

Additionally, the proposed remove_objects_from_page could support links and annotations as well (just by adding it to the ObjectDeletionFlag / letting users provide that value.

pypdf/_writer.py Outdated Show resolved Hide resolved
pypdf/_writer.py Outdated Show resolved Hide resolved
pypdf/_writer.py Outdated Show resolved Hide resolved
@MartinThoma MartinThoma changed the title ROB: fix and improve remove_text ROB: Fix and improve remove_text Feb 25, 2023
@pubpub-zz
Copy link
Collaborator Author

for extra tests
GeoBaseWithComments.pdf

pypdf/_writer.py Outdated Show resolved Hide resolved
pypdf/_writer.py Outdated Show resolved Hide resolved
def remove_objects_from_page(
self,
page: Union[PageObject, DictionaryObject],
to_delete: Union[ObjectDeletionFlag, Iterable[ObjectDeletionFlag]],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should make our lives here easier and not allow the Iterables (tuple / list). It is a flag and thus people should use it.

If they don't know it, they will just do multiple calls (which is fine as well)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not see how to change keeping the code just affecting annotations/comments...

pypdf/_writer.py Outdated Show resolved Hide resolved
pypdf/_writer.py Outdated Show resolved Hide resolved
@MartinThoma MartinThoma added the soon PRs that are almost ready to be merged, issues that get solved pretty soon label Feb 26, 2023
@MartinThoma
Copy link
Member

@pubpub-zz I would like to get this merged in the release today. Would it be OK for you if I take care of the remaining (rather stylistic) changes within your PR?

pubpub-zz and others added 11 commits February 26, 2023 15:34
Co-authored-by: Martin Thoma <[email protected]>
Co-authored-by: Martin Thoma <[email protected]>
Co-authored-by: Martin Thoma <[email protected]>
Co-authored-by: Martin Thoma <[email protected]>
Co-authored-by: Martin Thoma <[email protected]>
Co-authored-by: Martin Thoma <[email protected]>
Co-authored-by: Martin Thoma <[email protected]>
Co-authored-by: Martin Thoma <[email protected]>
Co-authored-by: Martin Thoma <[email protected]>
Co-authored-by: Martin Thoma <[email protected]>
@pubpub-zz
Copy link
Collaborator Author

last point remaining is about LIST/TUPLES

tests/test_writer.py Outdated Show resolved Hide resolved
@pubpub-zz
Copy link
Collaborator Author

@MartinThoma All yours 😊

@MartinThoma MartinThoma changed the title ROB: Fix and improve remove_text ENH: Add PdfWriter.remove_objects_from_page(page: PageObject, to_delete: ObjectDeletionFlag) Feb 26, 2023
@MartinThoma MartinThoma merged commit 67b085b into py-pdf:main Feb 26, 2023
@MartinThoma
Copy link
Member

Very nice PR! I've just merged it and will release it today :-)

Good work @pubpub-zz 👏

@MartinThoma MartinThoma added is-feature A feature request and removed soon PRs that are almost ready to be merged, issues that get solved pretty soon labels Feb 26, 2023
MartinThoma added a commit that referenced this pull request Feb 26, 2023
New Features (ENH)
-  Add reader.attachments public interface (#1611, #1661)
-  Add PdfWriter.remove_objects_from_page(page: PageObject, to_delete: ObjectDeletionFlag) (#1648)
-  Allow free-text annotation to have transparent border/background (#1664)

Bug Fixes (BUG)
-  Allow decryption with empty password for AlgV5 (#1663)
-  Let PdfWriter.pages return PageObject after calling `clone_document_from_reader()` (#1613)
-  Invalid font pointed during merge_resources (#1641)

Robustness (ROB)
-  Cope with invalid objects in IndirectObject.clone (#1637)
-  Improve tolerance to invalid Names/Dests (#1658)
-  Decode encoded values in get_fields (#1636)
-  Let PdfWriter.merge cope with missing "/Fields" (#1628)

[Full Changelog](3.4.1...3.5.0)
@pubpub-zz pubpub-zz deleted the remove_text branch June 24, 2023 08:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
is-feature A feature request is-robustness-issue From a users perspective, this is about robustness
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Broken compatibility remove_text not working
2 participants