Page Label Access Fails when PDF PageLabels Number Tree Does Not Contain "/S" #1560

Closed
jonahmajumder opened this issue Jan 18, 2023 · 5 comments · Fixed by #1562
Labels
is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF

Comments

@jonahmajumder
Contributor

Explanation

The PDF spec supports page labeling (distinct from page indexing), and I am happy to see that this has been incorporated into the PdfReader class via the page_labels property.

However, the current implementation raises a KeyError in the edge case where "/S" is not defined (in which case the label should contain no numeric portion, only the prefix, if any). The current implementation is also incomplete: it ignores the "/P" (label prefix) and "/St" (starting number) keys in page label dictionaries.

Environment

$ python -m platform
macOS-10.16-x86_64-i386-64bit

$ python --version
Python 3.8.5

$ python -c "import pypdf;print(pypdf.__version__)"
3.2.1

Code + PDF

This is a minimal, complete example that shows the issue:

from pypdf import PdfReader

reader = PdfReader('Numerical Mathematics.pdf') # my pdf file with somewhat strange (but legal) page labeling

print(reader.page_labels)

PDF file:
Numerical Mathematics.pdf

Traceback

This is the complete Traceback I see:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "~/.virtualenvs/pdf/lib/python3.8/site-packages/pypdf/_reader.py", line 1067, in page_labels
    return [page_index2page_label(self, i) for i in range(len(self.pages))]
  File "~/.virtualenvs/pdf/lib/python3.8/site-packages/pypdf/_reader.py", line 1067, in <listcomp>
    return [page_index2page_label(self, i) for i in range(len(self.pages))]
  File "~/.virtualenvs/pdf/lib/python3.8/site-packages/pypdf/_page_labels.py", line 164, in index2label
    return m[value["/S"]](index - start_index + 1)
  File "~/.virtualenvs/pdf/lib/python3.8/site-packages/pypdf/generic/_data_structures.py", line 274, in __getitem__
    return dict.__getitem__(self, key).get_object()
KeyError: '/S'
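
Until the library handles this case, one possible stopgap is to catch the KeyError and fall back to plain 1-based page numbers. This is only a sketch: safe_page_labels is a hypothetical helper name (not part of pypdf), and it assumes nothing about the reader beyond the page_labels and pages attributes used above. The fallback discards the document's real labels, so it hides the problem rather than fixing it.

```python
from typing import Any, List


def safe_page_labels(reader: Any) -> List[str]:
    """Return reader.page_labels, or "1", "2", ... if label parsing fails.

    `reader` is expected to expose `page_labels` and `pages`, as
    pypdf.PdfReader does. This helper is hypothetical, not a pypdf API.
    """
    try:
        return list(reader.page_labels)
    except KeyError:
        # Raised here when a page label dictionary has no "/S" entry.
        return [str(i + 1) for i in range(len(reader.pages))]
```

Called as safe_page_labels(PdfReader('Numerical Mathematics.pdf')), this would return plain page numbers for the attached file instead of raising.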

Proposed Solution

The two modifications I would propose are:

  1. In pypdf._page_labels, on line 164, the dictionaries corresponding to page indices in the PageLabels number tree should be parsed using the PDF spec defaults: S = value.get("/S"), P = value.get("/P", ""), and St = value.get("/St", 1). The label should then be built from these values whether or not their keys were present: return P + m[S](index - start_index + St)

  2. In pypdf._page_labels, in the dictionary m (defined starting on line 153), an entry should be added corresponding to the case where the "/S" key is not included, i.e.: None: lambda n: '' so that when "/S" is not included, the page index is simply ignored.
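
The two modifications above can be sketched as a standalone function. This is an illustration under stated assumptions, not the pypdf implementation: the names STYLES, to_roman, to_letters, and label_for are hypothetical, the entry is modelled as a plain dict rather than a pypdf dictionary object, and offset stands for index - start_index.

```python
from typing import Callable, Dict, Optional


def to_roman(n: int) -> str:
    """Convert a positive integer to an uppercase Roman numeral."""
    pairs = [
        (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
        (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
        (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I"),
    ]
    out = []
    for value, symbol in pairs:
        count, n = divmod(n, value)
        out.append(symbol * count)
    return "".join(out)


def to_letters(n: int) -> str:
    """1 -> A, 26 -> Z, 27 -> AA, 28 -> BB (the letter is repeated)."""
    repeat, position = divmod(n - 1, 26)
    return chr(ord("A") + position) * (repeat + 1)


# Hypothetical stand-in for the dictionary `m`; the None entry covers
# a label dictionary without "/S", where no numeric portion is emitted.
STYLES: Dict[Optional[str], Callable[[int], str]] = {
    None: lambda n: "",
    "/D": str,
    "/R": to_roman,
    "/r": lambda n: to_roman(n).lower(),
    "/A": to_letters,
    "/a": lambda n: to_letters(n).lower(),
}


def label_for(entry: dict, offset: int) -> str:
    """Build the label for the page `offset` pages into a label range."""
    style = entry.get("/S")        # no default style: prefix-only label
    prefix = entry.get("/P", "")   # default prefix is empty
    start = entry.get("/St", 1)    # default starting number is 1
    return prefix + STYLES[style](start + offset)
```

For example, label_for({"/S": "/D", "/P": "A-", "/St": 8}, 0) gives "A-8", while label_for({"/P": "Cover"}, 0) gives "Cover" because no numbering style is defined.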

It would also be great to see page labeling incorporated into the PdfWriter class to support writing PDFs with custom page labeling, but I realize that is more work and probably lower priority than fixing the PdfReader page labeling functionality.

@MartinThoma
Member

@jonahmajumder Thank you for the detailed error description and for providing an example 🙏

What do you think about #1562? (Please leave a comment there; we can then continue discussing the details of the fix there.)

I've added you as a co-author as you have pretty much already solved the issue - good work 👍 :-)

@MartinThoma
Member

As a side note: I love Python 3.11 for its new tracebacks

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/moose/.pyenv/versions/3.11.1/lib/python3.11/site-packages/pypdf/_reader.py", line 997, in page_labels
    return [page_index2page_label(self, i) for i in range(len(self.pages))]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/moose/.pyenv/versions/3.11.1/lib/python3.11/site-packages/pypdf/_reader.py", line 997, in <listcomp>
    return [page_index2page_label(self, i) for i in range(len(self.pages))]
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/moose/.pyenv/versions/3.11.1/lib/python3.11/site-packages/pypdf/_page_labels.py", line 155, in index2label
    return m[value["/S"]](index - start_index + 1)
             ~~~~~^^^^^^
  File "/home/moose/.pyenv/versions/3.11.1/lib/python3.11/site-packages/pypdf/generic/_data_structures.py", line 266, in __getitem__
    return dict.__getitem__(self, key).get_object()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
KeyError: '/S'

MartinThoma added a commit that referenced this issue Jan 18, 2023
@MartinThoma
Member

Congratulations @jonahmajumder - your change is now in main and will be on PyPI on Sunday :-)

If you want, I can add you to https://pypdf.readthedocs.io/en/latest/meta/CONTRIBUTORS.html :-)

@jonahmajumder
Contributor Author

jonahmajumder commented Jan 18, 2023 via email

@MartinThoma
Member

d942a49 - added :-)

MartinThoma added the is-bug label Mar 25, 2023