remove_text not working #1644

Closed
Jmuccigr opened this issue Feb 19, 2023 · 5 comments · Fixed by #1648
Labels
is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF is-feature A feature request soon PRs that are almost ready to be merged, issues that get solved pretty soon

Comments

@Jmuccigr

Using remove_text to remove the text in a PDF as I assemble it, but the text remains.

Environment

$ python3 -m platform
macOS-12.4-arm64-arm-64bit

$ python3 -c "import pypdf;print(pypdf.__version__)"
3.4.1

Code + PDF

#!/opt/homebrew/bin/python3
import sys
from pypdf import PdfWriter, PdfReader
output = PdfWriter()

filename = str(sys.argv[1])
outputname = str(sys.argv[2])

ipdf = PdfReader(open(filename, 'rb'))

for i in range(len(ipdf.pages)):
    page = ipdf.pages[i]
    output.add_page(page)

output.remove_text

with open(outputname, 'wb') as f:
    output.write(f)

So I go through the pages in the input PDF (ipdf), then add them sequentially to a new pdf and remove the text. If I move the remove_text out of the loop and apply it to the output after it's assembled, same result.

Not quite sure that I'm doing this right, TBH, but it seems like it should work. I just updated from PyPDF2 recently, but I thought the updated script was working.

@Frisso92

Frisso92 commented Feb 19, 2023

Hi, I was tinkering with the code and looked into the method itself. I tried setting the 'ignore_byte_string_object' parameter to True and it seems to work OK, but I think it has some trouble with underlines and some fonts.
Also, be careful to put the parentheses () after the method name: in this case, .remove_text()

#!/opt/homebrew/bin/python3
import sys
from pypdf import PdfWriter, PdfReader
output = PdfWriter()

filename = str(sys.argv[1])
outputname = str(sys.argv[2])

ipdf = PdfReader(open(filename, 'rb'))

for i in range(len(ipdf.pages)):
    page = ipdf.pages[i]
    output.add_page(page)

output.remove_text(ignore_byte_string_object=True)

with open(outputname, 'wb') as f:
    output.write(f)
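
The parentheses matter because, in Python, a bare `output.remove_text` only looks up the bound method and discards it; nothing is executed. A minimal sketch of that behavior, using a hypothetical `FakeWriter` class defined here purely for illustration (it is not part of pypdf):

```python
class FakeWriter:
    """Hypothetical stand-in for a PDF writer; not part of pypdf."""

    def __init__(self):
        self.text_removed = False

    def remove_text(self):
        # Record that the method body actually ran.
        self.text_removed = True


writer = FakeWriter()

# Attribute access only: this evaluates to a bound method object,
# which is immediately discarded. The method body does not run.
writer.remove_text
after_reference = writer.text_removed

# An actual call: the method body runs.
writer.remove_text()
after_call = writer.text_removed

print(after_reference, after_call)
```

This is why the original script produced a readable file with the text still present: the statement without parentheses is valid Python but has no effect on the writer.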

@Jmuccigr
Author

I'm afraid that doesn't work for me. First, if I include parentheses without any enclosed text, I get a file that's larger than the original and unreadable, according to my apps. If I include the parameter with either True or False as the value, I still get an unreadable and larger file.

No parentheses at least gets me a readable file, though with text still included.

@pubpub-zz
Collaborator

pubpub-zz commented Feb 20, 2023

For me the code is working and not working 😆
I've used this code:

import pypdf

w = pypdf.PdfWriter()
# loaded from https://static.tp-link.com/2020/202004/20200422/1910012794_TL-WA850RE_UG_REV7.0.1.pdf
w.append("1910012794_TL-WA850RE_UG_REV7.0.1.pdf")
w.remove_text()
w.write("tt.pdf")

The file can be read with pdf.js, and the text has been wiped out with the exception of the header (that text is stored in an XObject, and XObjects are currently not processed).
@Jmuccigr, can you provide the PDF where you are facing this issue, so we can confirm that it is the same issue?

When I open the file in Acrobat, an error is reported.
(edit: the root cause has been identified: the code was setting the content stream directly in the page object, which is not in accordance with the PDF specification)
I've started a PR, and this last issue is now solved.

I will add the code to wipe out the text from the XObject.
(edit: the code is complete; it is just waiting on a test file to get proper test coverage)

PS: I have some trouble understanding the parameter (the only difference between False and True is the removal of hexadecimal strings; I see no difference from a PDF point of view). @MartinThoma / @MasterOdin, should we clean up the code?

@MartinThoma
Member

I guess you are referring to ignoreByteStringObject / ignore_byte_string_object. I also don't understand why people would ever set that to True.

Let's see:

  • Stackoverflow: Only https://stackoverflow.com/q/51107368/562769 - and that seems to be just some copy-pasted code
  • GitHub:
    • It seems to have been added originally in "Add method ignoreImage" (#59). The comment "I have not been able to get them working in Python version 3 or higher, even when ignoreByteStringObject = True" makes it sound a bit as if it was added to work around an issue within pypdf. That would be a strong reason to remove it (if we have since solved that issue).
  • Google: Nothing except for copies of those functions

Let's deprecate that parameter. Let's also NOT consider this a breaking change.

@pubpub-zz pubpub-zz added is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF is-feature A feature request soon PRs that are almost ready to be merged, issues that get solved pretty soon labels Feb 26, 2023
MartinThoma pushed a commit that referenced this issue Feb 26, 2023
…te: ObjectDeletionFlag) (#1648)

This fixes remove_text to set contents as indirect_objects in accordance with the PDF specification.
It wipes out text in XObject forms as well.

The same issues were fixed for remove_images()

Finally, the new method
    PdfWriter.remove_objects_from_page(page: PageObject, to_delete: ObjectDeletionFlag)
was created. This allows more fine-grained control over what to delete. It is also easy to extend via the to_delete flag.

Closes #1644
Closes #1650
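
To illustrate why a flag-typed `to_delete` parameter is easy to extend, here is a minimal sketch built on the standard library's `enum.IntFlag`. The `DeletionFlag` class, its member names, and the `wants` helper are hypothetical stand-ins defined only for this example; they are not pypdf's actual `ObjectDeletionFlag` definition.

```python
from enum import IntFlag, auto


class DeletionFlag(IntFlag):
    """Hypothetical stand-in; not pypdf's ObjectDeletionFlag."""

    TEXT = auto()
    IMAGES = auto()
    # A new kind of object could be supported by adding one more member here.


def wants(to_delete: DeletionFlag, kind: DeletionFlag) -> bool:
    """Return True if `kind` is among the flags set in `to_delete`."""
    return bool(to_delete & kind)


# Flags combine with bitwise OR, so one argument can request several removals.
both = DeletionFlag.TEXT | DeletionFlag.IMAGES

print(wants(both, DeletionFlag.TEXT))
print(wants(DeletionFlag.TEXT, DeletionFlag.IMAGES))
```

With this design, callers that pass a single flag keep working unchanged when new members are added later.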
@Jmuccigr
Author

Jmuccigr commented Mar 3, 2023

Late to the game here, but this seems to be working as expected now. Thanks.
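
For reference, a sketch of the original script with the call corrected. It assumes a pypdf release that includes the fix from #1648; the function name `strip_text` is introduced here for illustration only.

```python
def strip_text(src_path: str, dst_path: str) -> None:
    """Copy every page of src_path into a new PDF and remove its text."""
    # Imported inside the function so that merely defining it
    # does not require pypdf to be installed.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(src_path)
    writer = PdfWriter()

    # Iterate over the pages directly instead of indexing by position.
    for page in reader.pages:
        writer.add_page(page)

    # Note the parentheses: this actually calls the method.
    writer.remove_text()

    with open(dst_path, "wb") as f:
        writer.write(f)
```

It could be called as `strip_text("input.pdf", "output.pdf")`, where both file names are placeholders.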
