remove_text not working #1644
Comments
Hi, I was tinkering with the code and looked into the method itself. I tried setting the 'ignore_byte_string_object' parameter to True and it seems to work OK, but I think it has some trouble with underlines and some fonts.
#!/opt/homebrew/bin/python3
import sys
from pypdf import PdfWriter, PdfReader

output = PdfWriter()
filename = str(sys.argv[1])
outputname = str(sys.argv[2])
ipdf = PdfReader(open(filename, 'rb'))
for i in range(len(ipdf.pages)):
    page = ipdf.pages[i]
    output.add_page(page)
    output.remove_text(ignore_byte_string_object=True)
with open(outputname, 'wb') as f:
    output.write(f)
I'm afraid that doesn't work for me. First, if I include parentheses without any enclosed text, I get a file that's larger than the original and unreadable, according to my apps. If I include the parameter with either True or False as the value, I still get an unreadable and larger file. No parentheses at least gets me a readable file, though with text still included.
For me the code is both working and not working 😆
import pypdf
w = pypdf.PdfWriter()
# loaded from https://static.tp-link.com/2020/202004/20200422/1910012794_TL-WA850RE_UG_REV7.0.1.pdf
w.append("1910012794_TL-WA850RE_UG_REV7.0.1.pdf")
w.remove_text()
w.write("tt.pdf")
The file can be read with pdf.js, and the text has been wiped out with the exception of the header (that text is stored in an XObject, and XObjects are currently not processed). When I open the file in Acrobat, an error is reported. I will add the code to wipe out the text from XObjects as well.
PS: I have some trouble understanding the parameter. The only difference between False and True is whether hexadecimal strings are removed, and I see no difference from a PDF point of view. @MartinThoma / @MasterOdin, should we clean up the code?
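As background for the hexadecimal-string question above: PDF allows the same string to be written either as a literal string in parentheses or as a hexadecimal string in angle brackets. The sketch below is a standard-library illustration only, not pypdf's parser. The helper names `decode_literal` and `decode_hex` are hypothetical, and the literal decoder handles only escaped parentheses and backslashes.

```python
import re


def decode_literal(token: bytes) -> bytes:
    """Decode a simple PDF literal string such as b"(Hello)".

    Simplified: only \\(, \\) and \\\\ escapes are handled; octal and
    other escapes are deliberately left out of this sketch.
    """
    if not (token.startswith(b"(") and token.endswith(b")")):
        raise ValueError("not a literal string")
    return re.sub(rb"\\([()\\])", rb"\1", token[1:-1])


def decode_hex(token: bytes) -> bytes:
    """Decode a PDF hexadecimal string such as b"<48656C6C6F>"."""
    if not (token.startswith(b"<") and token.endswith(b">")):
        raise ValueError("not a hexadecimal string")
    digits = b"".join(token[1:-1].split())  # whitespace inside <...> is ignored
    if len(digits) % 2:
        digits += b"0"  # an odd trailing digit is treated as if followed by 0
    return bytes.fromhex(digits.decode("ascii"))


# The two notations carry the same bytes to a text-showing operator like Tj.
same = decode_literal(b"(Hello)") == decode_hex(b"<48656C6C6F>")
print(same)
```

Since both forms decode to identical bytes, a text-removal pass has little reason to treat them differently, which is consistent with the suggestion to clean up the parameter.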
I guess you are referring to Let's see:
Let's deprecate that parameter. Let's also NOT consider this a breaking change.
…te: ObjectDeletionFlag) (#1648)
This fixes remove_text to set contents as indirect_objects in accordance with the PDF specification. It also wipes out text in XObject forms. The same issues were fixed for remove_images().
Finally, the new method PdfWriter.remove_objects_from_page(page: PageObject, to_delete: ObjectDeletionFlag) was created. It allows finer-grained control over what to delete and is easy to extend via the to_delete flag.
Closes #1644
Closes #1650
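A minimal sketch of how the new method might be called, assuming a pypdf release that includes this change and exports `ObjectDeletionFlag` from the top-level package. The wrapper `strip_page_objects` and its parameters are hypothetical names used for illustration; the flag member `TEXT` is assumed to exist as in current pypdf releases.

```python
def strip_page_objects(src_path, dst_path, flags=None):
    """Copy src_path to dst_path, removing the selected object kinds per page.

    Hypothetical helper, not part of pypdf. pypdf is imported inside the
    function so that this module still loads where pypdf is not installed.
    """
    from pypdf import ObjectDeletionFlag, PdfWriter

    if flags is None:
        # Assumption: TEXT is the flag that corresponds to remove_text().
        flags = [ObjectDeletionFlag.TEXT]

    writer = PdfWriter()
    writer.append(src_path)  # same assembly call as in the comment above
    for page in writer.pages:
        for flag in flags:
            # One flag per call, to avoid relying on combined-flag handling.
            writer.remove_objects_from_page(page, flag)
    writer.write(dst_path)
```

Calling `strip_page_objects("in.pdf", "out.pdf")` would then be roughly equivalent to assembling the writer and calling `remove_text()` once, with the difference that the caller chooses which flags to apply.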
Late to the game here, but this seems to be working as expected now. Thanks.
Using remove_text to remove the text in a PDF as I assemble it, but the text remains.
Environment
$ python3 -m platform
macOS-12.4-arm64-arm-64bit
$ python3 -c "import pypdf;print(pypdf.__version__)"
3.4.1
Code + PDF
So I go through the pages in the input PDF (ipdf), then add them sequentially to a new pdf and remove the text. If I move the remove_text out of the loop and apply it to the output after it's assembled, same result.
Not quite sure that I'm doing this right, TBH, but it seems like it should work. I just updated from PyPDF2 recently, but I thought the updated script was working.