Page Label Access Fails when PDF PageLabels Number Tree Does Not Contain "/S" #1560

jonahmajumder · 2023-01-18T15:01:16Z

Explanation

The PDF spec supports page labeling (distinct from page indexing), and I am happy to see that this has been incorporated into the PdfReader class via the page_labels property.

However, the current implementation raises a KeyError in the edge case where "/S" is not defined (in which case the label should have no numeric portion at all). The current implementation is also incomplete in that it ignores the "/P" (prefix) and "/St" (start value) keys in page label dictionaries.
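To make the structure concrete, here is a minimal sketch of how a page index maps to its page label dictionary. It assumes the number tree has already been flattened into a single list of alternating keys and values (as in a "/Nums" array at the root); real files may nest the tree through "/Kids", which this sketch does not handle. The function name `find_label_range` and the sample data are hypothetical and not part of pypdf.

```python
from bisect import bisect_right
from typing import Any, Dict, List, Tuple


def find_label_range(nums: List[Any], index: int) -> Tuple[int, Dict[str, Any]]:
    """Return (start_index, label_dict) for the range that covers `index`.

    `nums` is a flat list [key0, dict0, key1, dict1, ...] with ascending keys.
    Each dictionary applies from its key up to (not including) the next key.
    """
    starts = nums[0::2]
    dicts = nums[1::2]
    # Rightmost range whose start is <= index.
    pos = bisect_right(starts, index) - 1
    if pos < 0:
        raise ValueError(f"no page label range covers page index {index}")
    return starts[pos], dicts[pos]


# Hypothetical labeling: a cover page with only a prefix (no "/S"),
# then lowercase roman numerals, then decimal numbers.
example_nums = [0, {"/P": "Cover"}, 1, {"/S": "/r"}, 5, {"/S": "/D"}]

start, label_dict = find_label_range(example_nums, 0)
print(start, label_dict)  # the first range has no "/S" key
```

The first range in this sample is the kind of entry that triggers the error: it is legal, but it has no "/S" key.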

Environment

$ python -m platform
macOS-10.16-x86_64-i386-64bit

$ python --version
Python 3.8.5

$ python -c "import pypdf;print(pypdf.__version__)"
3.2.1

Code + PDF

This is a minimal, complete example that shows the issue:

from pypdf import PdfReader

reader = PdfReader('Numerical Mathematics.pdf') # my pdf file with somewhat strange (but legal) page labeling

print(reader.page_labels)

PDF file:
Numerical Mathematics.pdf

Traceback

This is the complete Traceback I see:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "~/.virtualenvs/pdf/lib/python3.8/site-packages/pypdf/_reader.py", line 1067, in page_labels
    return [page_index2page_label(self, i) for i in range(len(self.pages))]
  File "~/.virtualenvs/pdf/lib/python3.8/site-packages/pypdf/_reader.py", line 1067, in <listcomp>
    return [page_index2page_label(self, i) for i in range(len(self.pages))]
  File "~/.virtualenvs/pdf/lib/python3.8/site-packages/pypdf/_page_labels.py", line 164, in index2label
    return m[value["/S"]](index - start_index + 1)
  File "~/.virtualenvs/pdf/lib/python3.8/site-packages/pypdf/generic/_data_structures.py", line 274, in __getitem__
    return dict.__getitem__(self, key).get_object()
KeyError: '/S'

Proposed Solution

The two modifications I would propose are:

1. In pypdf._page_labels, on line 164, the dictionaries corresponding to page indices in the PageLabels number tree should be parsed in a way that incorporates PDF spec defaults: S = value.get("/S"), P = value.get("/P", ""), and St = value.get("/St", 1). These should then be used for labeling regardless of whether their keys were present: return P + m[S](index - start_index + St)
2. In pypdf._page_labels, in the dictionary m (defined starting on line 153), an entry should be added for the case where the "/S" key is not included, i.e. None: lambda n: '' so that when "/S" is not included, the page index is simply ignored.
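The two modifications can be sketched together as a standalone function. This is an illustration of the proposal under stated assumptions, not the code that was merged into pypdf: the names `format_label`, `STYLES`, `to_roman`, and `to_letters` are hypothetical, and the label dictionary is modeled as a plain `dict` with string values rather than pypdf's generic objects.

```python
from typing import Any, Callable, Dict, Optional


def to_roman(n: int) -> str:
    """Uppercase roman numeral for a positive integer."""
    pairs = [
        (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
        (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
        (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I"),
    ]
    out = []
    for value, symbol in pairs:
        count, n = divmod(n, value)
        out.append(symbol * count)
    return "".join(out)


def to_letters(n: int) -> str:
    """A..Z for 1..26, then AA..ZZ for 27..52, and so on."""
    count, offset = divmod(n - 1, 26)
    return chr(ord("A") + offset) * (count + 1)


# Maps the "/S" value to a formatter. The None entry covers a missing
# "/S" key: the label then has no numeric portion.
STYLES: Dict[Optional[str], Callable[[int], str]] = {
    None: lambda n: "",
    "/D": str,
    "/R": to_roman,
    "/r": lambda n: to_roman(n).lower(),
    "/A": to_letters,
    "/a": lambda n: to_letters(n).lower(),
}


def format_label(label_dict: Dict[str, Any], index: int, start_index: int) -> str:
    """Build the label for page `index` in a range that begins at `start_index`."""
    style = label_dict.get("/S")        # default: no numeric portion
    prefix = label_dict.get("/P", "")   # default: empty prefix
    first = label_dict.get("/St", 1)    # default: numbering starts at 1
    return prefix + STYLES[style](index - start_index + first)


print(format_label({"/P": "Cover"}, 0, 0))                      # prefix only
print(format_label({"/S": "/D", "/P": "A-", "/St": 5}, 2, 0))   # prefix plus offset number
```

Using `.get()` with defaults means every key is optional, so a dictionary with only "/P", or an empty dictionary, no longer raises a KeyError.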

It would also be great to see page labeling incorporated into the PdfWriter class to support writing PDFs with custom page labeling, but I realize that is more work and probably lower priority than fixing the PdfReader page labeling functionality.

Fixes #1560 Co-authored-by: jonahmajumder <[email protected]>

MartinThoma · 2023-01-18T18:29:20Z

@jonahmajumder Thank you for the detailed error description and for providing an example 🙏

What do you think about #1562? (Please leave a comment there, so we can continue discussing the details of the fix in one place.)

I've added you as a co-author as you have pretty much already solved the issue - good work 👍 :-)

MartinThoma · 2023-01-18T18:29:41Z

As a side note: I love Python 3.11 for its new tracebacks

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/moose/.pyenv/versions/3.11.1/lib/python3.11/site-packages/pypdf/_reader.py", line 997, in page_labels
    return [page_index2page_label(self, i) for i in range(len(self.pages))]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/moose/.pyenv/versions/3.11.1/lib/python3.11/site-packages/pypdf/_reader.py", line 997, in <listcomp>
    return [page_index2page_label(self, i) for i in range(len(self.pages))]
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/moose/.pyenv/versions/3.11.1/lib/python3.11/site-packages/pypdf/_page_labels.py", line 155, in index2label
    return m[value["/S"]](index - start_index + 1)
             ~~~~~^^^^^^
  File "/home/moose/.pyenv/versions/3.11.1/lib/python3.11/site-packages/pypdf/generic/_data_structures.py", line 266, in __getitem__
    return dict.__getitem__(self, key).get_object()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
KeyError: '/S'

MartinThoma · 2023-01-18T20:04:25Z

Congratulations @jonahmajumder - your change is now in main and will be on PyPI on Sunday :-)

If you want, I can add you to https://pypdf.readthedocs.io/en/latest/meta/CONTRIBUTORS.html :-)

jonahmajumder · 2023-01-18T20:19:38Z

Sure, sounds good! I use the package a lot, so I will continue to bring up any issues/features I find.

MartinThoma · 2023-01-18T20:30:24Z

d942a49 - added :-)

MartinThoma added a commit that referenced this issue Jan 18, 2023

BUG: Fix dictionary access of optional page label keys

2f689a7

Fixes #1560 Co-authored-by: jonahmajumder <[email protected]>

MartinThoma mentioned this issue Jan 18, 2023

BUG: Fix dictionary access of optional page label keys #1562

Merged

MartinThoma closed this as completed in #1562 Jan 18, 2023

MartinThoma added a commit that referenced this issue Jan 18, 2023

BUG: Fix dictionary access of optional page label keys (#1562)

c293b95

Fixes #1560 Co-authored-by: jonahmajumder <[email protected]>

MartinThoma added the is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF label Mar 25, 2023
