remove_text not working #1644

Jmuccigr · 2023-02-19T17:58:31Z

I'm using remove_text to remove the text from a PDF as I assemble it, but the text remains.

Environment

$ python3 -m platform
macOS-12.4-arm64-arm-64bit

$ python3 -c "import pypdf;print(pypdf.__version__)"
3.4.1

Code + PDF

#!/opt/homebrew/bin/python3
import sys
from pypdf import PdfWriter, PdfReader
output = PdfWriter()

filename = str(sys.argv[1])
outputname = str(sys.argv[2])

ipdf = PdfReader(open(filename, 'rb'))

for i in range(len(ipdf.pages)):
    page = ipdf.pages[i]
    output.add_page(page)

output.remove_text

with open(outputname, 'wb') as f:
    output.write(f)

So I go through the pages in the input PDF (ipdf), add them sequentially to a new PDF, and remove the text. If I move the remove_text out of the loop and apply it to the output after it's assembled, I get the same result.

Not quite sure that I'm doing this right, TBH, but it seems like it should work. I just updated from PyPDF2 recently, but I thought the updated script was working.

Frisso92 · 2023-02-19T22:25:49Z

Hi, I was tinkering with the code and looked into the method itself. I tried setting the 'ignore_byte_string_object' parameter to True and it seems to work OK, but I think it has some trouble with underlines and some fonts.
Also, be careful to put the parentheses () after the method name: in this case, .remove_text()

#!/opt/homebrew/bin/python3
import sys
from pypdf import PdfWriter, PdfReader
output = PdfWriter()

filename = str(sys.argv[1])
outputname = str(sys.argv[2])

ipdf = PdfReader(open(filename, 'rb'))

for i in range(len(ipdf.pages)):
    page = ipdf.pages[i]
    output.add_page(page)

output.remove_text(ignore_byte_string_object=True)

with open(outputname, 'wb') as f:
    output.write(f)

Jmuccigr · 2023-02-19T23:25:31Z

I'm afraid that doesn't work for me. First, if I include parentheses without any enclosed text, I get a file that's larger than the original and unreadable, according to my apps. If I include the parameter with either True or False as the value, I still get an unreadable and larger file.

No parentheses at least gets me a readable file, though with text still included.
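
A note on why the version without parentheses leaves the file untouched: in Python, writing output.remove_text only looks up the method and discards it, so nothing is executed. The sketch below illustrates this with a hypothetical stand-in class (FakeWriter is an assumption for illustration, not part of pypdf):

```python
class FakeWriter:
    """Hypothetical stand-in for a PDF writer; not part of pypdf."""

    def __init__(self):
        self.text = "some page text"

    def remove_text(self):
        # Only runs when the method is actually called.
        self.text = ""


writer = FakeWriter()

writer.remove_text  # attribute lookup only: the method is never executed
text_after_lookup = writer.text

writer.remove_text()  # the parentheses trigger the call
text_after_call = writer.text

print(repr(text_after_lookup))
print(repr(text_after_call))
```

So a readable file that still contains text is what one would expect when the parentheses are missing: the writer simply copies the pages unchanged.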

pubpub-zz · 2023-02-20T18:46:54Z

For me the code is working and not working 😆
I've used this code:

import pypdf

w = pypdf.PdfWriter()
# loaded from https://static.tp-link.com/2020/202004/20200422/1910012794_TL-WA850RE_UG_REV7.0.1.pdf
w.append("1910012794_TL-WA850RE_UG_REV7.0.1.pdf")
w.remove_text()
w.write("tt.pdf")

The file can be read with pdf.js, and the text has been wiped out with the exception of the header (that text is stored in an XObject, and XObjects are currently not processed).
@Jmuccigr, can you provide the PDF where you are facing this issue, to confirm that it is the same issue?

When I open the file in Acrobat, an error is reported.
(edit: the root cause has been identified: the code was setting the content stream directly in the page object, which is not in accordance with the PDF specification)
I've started a PR, and this last issue is now solved.

I will add the code to wipe out the text from the XObject.
(edit: the code is complete, just waiting on a test file to get proper test coverage)

PS: I have some trouble understanding the parameter. The difference between False and True is whether hexadecimal strings are removed; I see no difference from a PDF point of view. @MartinThoma / @MasterOdin, should we clean up the code?

closes py-pdf#1644

MartinThoma · 2023-02-25T05:38:28Z

I guess you are referring to ignoreByteStringObject / ignore_byte_string_object. I also don't understand why people would ever set that to True.

Let's see:

- Stack Overflow: only https://stackoverflow.com/q/51107368/562769, and that seems to be just some copy-pasted code
- GitHub: it seems it was originally added in "Add method ignoreImage" #59. The comment "I have not been able to get them working in Python version 3 or higher, even when ignoreByteStringObject = True" makes it sound a bit as if it was added to avoid an issue within pypdf. That would be a strong reason to remove it (if we have solved that issue in the meantime)
- Google: nothing except for copies of those functions

Let's deprecate that parameter. Let's also NOT consider this a breaking change.

…te: ObjectDeletionFlag) (#1648)

This fixes remove_text to set contents as indirect_objects in accordance with the PDF specification. It wipes out text in XObject forms as well. The same issues were fixed for remove_images().

Finally, the new method PdfWriter.remove_objects_from_page(page: PageObject, to_delete: ObjectDeletionFlag) was created. This allows a more fine-granular control of what to delete. It also is easy to expand via the to_delete flag.

Closes #1644
Closes #1650

Jmuccigr · 2023-03-03T11:01:04Z

Late to the game here, but this seems to be working as expected now. Thanks.

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue Feb 20, 2023

ROB : Remove Text not working in all cases

d27adda

closes py-pdf#1644

pubpub-zz mentioned this issue Feb 20, 2023

ENH: Add PdfWriter.remove_objects_from_page(page: PageObject, to_delete: ObjectDeletionFlag) #1648

Merged

pubpub-zz added the labels is-bug (From a user's perspective, this is a bug - a violation of the expected behavior with a compliant PDF), is-feature (A feature request), and soon (PRs that are almost ready to be merged, issues that get solved pretty soon) Feb 26, 2023

MartinThoma closed this as completed in #1648 Feb 26, 2023
