(NOTE: Using Python 3.8, so I'm stuck on ocrmypdf v14.4.0 at the moment)
First off, thanks to the developers for this amazing tool! I have been using it a lot recently, as I'm trying to build a FOSS OCR workflow that can mimic most of the behavior of ABBYY FineReader.
Before integrating OCRmyPDF into my workflow, I was doing things the old-fashioned way: extracting the images from the PDF and running tesseract-ocr on each one. What I noticed when using Tesseract directly is that it often guessed the DPI incorrectly (usually defaulting to 70 DPI). When that happened, the output would often have line breaks between words that sit on the same line. For example, if I wanted to search for "See dog run" but the OCR text is effectively "See\ndog\nrun", the search fails. My solution was to calculate the DPI manually (usually around 200-300 for the scans I saw) and pass it to Tesseract with `--dpi 300`.
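For reference, the manual calculation I mean is roughly the following. This is only a minimal sketch: it assumes the physical page width in inches is known (8.5 in the example), and the function names `estimate_dpi` and `tesseract_command` are made up for illustration. The only Tesseract feature it relies on is the `--dpi` command-line option.

```python
from typing import List


def estimate_dpi(pixel_width: int, page_width_inches: float) -> int:
    """Estimate scan resolution as pixels per inch across the page width."""
    if pixel_width <= 0 or page_width_inches <= 0:
        raise ValueError("pixel width and page width must be positive")
    return round(pixel_width / page_width_inches)


def tesseract_command(image_path: str, output_base: str, dpi: int) -> List[str]:
    """Build the argument list for a Tesseract run with an explicit DPI."""
    return ["tesseract", image_path, output_base, "--dpi", str(dpi)]


# Hypothetical example: a 2550-pixel-wide scan of an 8.5-inch-wide page.
dpi = estimate_dpi(2550, 8.5)
cmd = tesseract_command("page-001.png", "page-001", dpi)
print(" ".join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` for each extracted page image.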
My current issue is with some books OCR'd with OCRmyPDF: nothing in the manual or in `--help` suggests that I can pass the `--dpi 300` option through to Tesseract. There is upscale and downscale, yes, but that's not quite what I want. Even when I tried upscaling to 300, the OCR took up most of my CPU (especially with 12 jobs going at once!).
Can anyone recommend a workaround or solution that doesn't involve going back to the image stack?
Thanks!