Skip to content

zacharysyoung/extract_highlighted_text

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Extract highlighted text from PDF

I've written two command-line programs that are equivalent. They both read some number of PDFs that you list, locate highlighted text per page, and then print out CSV of the PDF, page number, and the highlighted text.

For both the PyMuPDF (Python) and the UniPDF (Go) code examples, I'll use these two reference PDFs:

one_page.pdf two_pages.pdf
page 1 one_page.pdf page one two_pages.pdf page one
page 2 two_pages.pdf page two

main.py

./main.py one_page.pdf two_pages.pdf 
Filename,Page_num,Highlighted_text
one_page.pdf,1,word1
one_page.pdf,1,d2 word3 wo
two_pages.pdf,1,"“and what is the use of a book,” thought Alice “without pictures or conversations?”"
two_pages.pdf,2,when suddenly a White Rabbit with pink eyes ran close by her.

and

main.go

go run main.go one_page.pdf two_pages.pdf 
Filename,Page_num,Highlighted_text
one_page.pdf,1,word1
one_page.pdf,1,d2 word3 wor
two_pages.pdf,1,"“and what is the use of a book,” thought Alice “without pictures or conversations?”"
two_pages.pdf,2,when suddenly a White Rabbit with pink eyes ran close by her.

The Python program uses the PyMuPDF library, which is being actively maintained and updated. Consult the docs for getting the library installed, PyMuPDF: Installation.

The Go program uses the UniPDF library, which has a freemium tier where 100 document reads/writes are free per month:

Both programs also have a VISUALIZE switch in the source code that you can turn on to visualize what the highlight rectangles look like, which can be very handy if you ever start getting weird results. The highlights don't contain the text, they're just graphical objects that are drawn over text. Both programs get the highlight rectangles, then use other APIs in their respective libraries to query a page for text in the rectangle, but the different libraries have different ideas about what text is inside or outside of a particular highlighted region (rectangle):

  • For PyMuPDF I had to grow the highlight rectangle just ever so slightly to get any text.

  • For UniPDF I had to shrink the highlight rectangle vertically, and substantially, to avoid extraneous text (outside of the intended highlight region).

If you're new to using Go, you can download Go for your system (Linux) and then you can run the program interactively with go run main.go pdf1 ..., or you can build an executable and install it on your path and then call it like any command-line utility:

go build -o listhltext main.go
cp listhltext [SOMEWHERE ON YOUR PATH]

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published