ROB: Rebuild xref table if one entry is invalid #2528

Merged
merged 10 commits into from
Mar 24, 2024

pubpub-zz
Collaborator

@pubpub-zz pubpub-zz commented Mar 17, 2024

closes #2516

Cope with cases where the xref entries do not point to valid object headers.

fixes py-pdf#2523
Situations encountered:
* the length field is not correct
* the xref may contain stream data that is not in order
* the xref contains some free entries (i.e. entries that do not contain a stream offset)
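
The idea described above can be illustrated with a minimal, standalone sketch. It is not pypdf's actual implementation: the names `HEADER_RE`, `entry_is_valid`, `rebuild_xref` and `check_or_rebuild` are hypothetical, the xref table is modelled as a plain dict holding in-use entries only (free entries are assumed to be left out), and a naive scan like this could also pick up header-like bytes inside stream data.

```python
import re

# Matches an indirect object header such as b"12 0 obj".
# The lookbehind rejects a match that starts in the middle of a number.
HEADER_RE = re.compile(rb"(?<![0-9])(\d+)\s+(\d+)\s+obj\b")


def entry_is_valid(buf: bytes, idnum: int, generation: int, offset: int) -> bool:
    """Return True if a header for (idnum, generation) starts exactly at offset."""
    m = HEADER_RE.match(buf, offset)
    return m is not None and (int(m.group(1)), int(m.group(2))) == (idnum, generation)


def rebuild_xref(buf: bytes) -> dict:
    """Scan the whole buffer and map (idnum, generation) to the header offset."""
    table = {}
    for m in HEADER_RE.finditer(buf):
        # A later definition of the same object overrides an earlier one.
        table[(int(m.group(1)), int(m.group(2)))] = m.start()
    return table


def check_or_rebuild(buf: bytes, xref: dict) -> dict:
    """Keep xref if every entry is valid; otherwise rebuild it from the buffer."""
    if all(entry_is_valid(buf, i, g, off) for (i, g), off in xref.items()):
        return xref
    return rebuild_xref(buf)
```

In this sketch a single invalid entry causes the whole table to be rebuilt, which mirrors the PR title; whether the real code validates every entry in the same way is not stated in this conversation.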

codecov bot commented Mar 17, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 94.48%. Comparing base (c4641d1) to head (553165c).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2528      +/-   ##
==========================================
- Coverage   94.52%   94.48%   -0.04%     
==========================================
  Files          49       49              
  Lines        8178     8181       +3     
  Branches     1659     1660       +1     
==========================================
  Hits         7730     7730              
- Misses        277      280       +3     
  Partials      171      171              
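
The drop in the table follows from the line counts alone: hits stay at 7730 while the number of lines grows by 3. A small sketch of that arithmetic is given below, assuming coverage is hits divided by lines and that the displayed value is truncated to two decimals. The truncation is an assumption chosen because it reproduces the figures shown, not a documented Codecov rule, and `coverage_basis_points` is a hypothetical helper name.

```python
def coverage_basis_points(hits: int, lines: int) -> int:
    """Coverage in hundredths of a percent, truncated (not rounded)."""
    return hits * 10_000 // lines


base = coverage_basis_points(7730, 8178)  # base commit c4641d1
head = coverage_basis_points(7730, 8181)  # head commit: 3 more lines, same hits
print(f"{base / 100:.2f}% -> {head / 100:.2f}%")  # 94.52% -> 94.48%
```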


@stefan6419846
Collaborator

I am going to hold off on reviewing these changes until #2526 is merged, as #2528 already incorporates the changes from #2526.

@pubpub-zz
Collaborator Author

Test file:
iss2516.pdf

@stefan6419846 stefan6419846 changed the title ROB: rebuild xref table if one entry is invalid ROB: Rebuild xref table if one entry is invalid Mar 18, 2024
Review thread on tests/test_reader.py (outdated, resolved)
@stefan6419846
Collaborator

It seems that this small change has a noticeable impact on coverage, as

pypdf/pypdf/_reader.py

Lines 1277 to 1301 in c4641d1

except Exception:
    if hasattr(self.stream, "getbuffer"):
        buf = bytes(self.stream.getbuffer())
    else:
        p = self.stream.tell()
        self.stream.seek(0, 0)
        buf = self.stream.read(-1)
        self.stream.seek(p, 0)
    m = re.search(
        rf"\s{indirect_reference.idnum}\s+{indirect_reference.generation}\s+obj".encode(),
        buf,
    )
    if m is not None:
        logger_warning(
            f"Object ID {indirect_reference.idnum},{indirect_reference.generation} ref repaired",
            __name__,
        )
        self.xref[indirect_reference.generation][
            indirect_reference.idnum
        ] = (m.start(0) + 1)
        self.stream.seek(m.start(0) + 1)
        idnum, generation = self.read_object_header(self.stream)
    else:
        idnum = -1  # exception will be raised below
if idnum != indirect_reference.idnum and self.xref_index:
is no longer covered by the tests. Is there anything we can do about this without ignoring it or excluding the error handling from coverage?
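
For readers unfamiliar with the quoted fallback, the sketch below isolates its core step as a standalone function: read the whole stream, search for the `idnum generation obj` header, and take the match start plus one as the repaired offset. `find_object_offset` is a hypothetical name, not a pypdf function, and the sketch leaves out the `getbuffer` shortcut, the warning and the xref update of the original. One possible way to exercise logic like this in a test, under these assumptions, is an in-memory `io.BytesIO` stream.

```python
import io
import re
from typing import BinaryIO, Optional


def find_object_offset(stream: BinaryIO, idnum: int, generation: int) -> Optional[int]:
    """Return the offset of the "idnum generation obj" header, or None."""
    pos = stream.tell()
    stream.seek(0, 0)
    buf = stream.read(-1)
    stream.seek(pos, 0)  # restore the caller's position
    m = re.search(rf"\s{idnum}\s+{generation}\s+obj".encode(), buf)
    # The pattern begins with one whitespace byte, so the header itself
    # starts one byte after the match start.
    return None if m is None else m.start(0) + 1


data = b"%PDF-1.7\n1 0 obj\n<<>>\nendobj\n12 0 obj\n<<>>\nendobj\n"
stream = io.BytesIO(data)
print(find_object_offset(stream, 12, 0) == data.index(b"12 0 obj"))  # True
```

The leading `\s` is what keeps a search for object 2 from matching inside `12 0 obj`; it also means a header at the very first byte of the buffer would not be found.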

@pubpub-zz
Collaborator Author

This is the best I can propose

@stefan6419846 stefan6419846 merged commit f8edf3c into py-pdf:main Mar 24, 2024
14 of 15 checks passed
stefan6419846 added a commit that referenced this pull request Apr 7, 2024
REL: 4.2.0

## What's new

### New Features (ENH)
- Allow multiple charsets for NameObject.read_from_stream (#2585) by @pubpub-zz
- Add support for /Kids in page labels (#2562) by @stefan6419846
- Allow to update fields on many pages (#2571) by @pubpub-zz
- Tolerate PDF with invalid xref pointed objects (#2335) by @pubpub-zz
- Add Enforce from PDF2.0 in viewer_preferences (#2511) by @pubpub-zz
- Add += and -= operators to ArrayObject (#2510) by @pubpub-zz

### Bug Fixes (BUG)
- Fix merge_page sometimes generating unknown operator 'QQ' (#2588) by @rfotino
- Fix fields update where annotations are kids of field (#2570) by @pubpub-zz
- Process CMYK images without a filter correctly (#2557) by @pubpub-zz
- Extract text in layout mode without finding resources (#2555) by @pubpub-zz
- Prevent recursive loop in some PDF files (#2505) by @pubpub-zz

### Robustness (ROB)
- Tolerate "truncated" xref (#2580) by @pubpub-zz
- Replace error by warning for EOD in RunLengthDecode/ASCIIHexDecode (#2334) by @pubpub-zz
- Rebuild xref table if one entry is invalid (#2528) by @pubpub-zz
- Robustify stream extraction (#2526) by @pubpub-zz

### Documentation (DOC)
- Update release process for latest changes (#2564) by @stefan6419846
- Encryption/decryption: Clone document instead of copying all pages (#2546) by @redfast00
- Minor improvements (#2542) by @j-t-1
- Update annotation list (#2534) by @j-t-1
- Update references and formatting (#2529) by @j-t-1
- Correct threads reference, plus minor changes (#2521) by @j-t-1
- Minor readability increases (#2515) by @j-t-1
- Simplify PaperSize examples (#2504) by @j-t-1
- Minor improvements (#2501) by @j-t-1

### Developer Experience (DEV)
- Remove unused dependencies (#2572) by @stefan6419846
- Remove page labels PR link from message (#2561) by @stefan6419846
- Fix changelog generator regarding whitespace and handling of "Other" group (#2492) by @stefan6419846
- Add REL to known PR prefixes (#2554) by @stefan6419846
- Release using the REL commit instead of git tag (#2500) by @MartinThoma
- Unify code between PdfReader and PdfWriter (#2497) by @pubpub-zz
- Bump softprops/action-gh-release from 1 to 2 (#2514) by @dependabot[bot]

### Maintenance (MAINT)
- Ressources → Resources (and internal name childs) (#2550) by @pubpub-zz
- Fix typos found by codespell (#2549) by @stefan6419846
- Update Read the Docs configuration (#2538) by @j-t-1
- Add root_object, _info and _ID to PdfReader (#2495) by @pubpub-zz

### Testing (TST)
- Allow loading truncated images if required (#2586) by @stefan6419846
- Fix download issues from #2562 (#2578) by @pubpub-zz
- Improve test_get_contents_from_nullobject to show real use-case (#2524) by @stefan6419846
- Add missing test annotations (#2507) by @stefan6419846

[Full Changelog](4.1.0...4.2.0)
Development

Successfully merging this pull request may close these issues.

"/Pages" might be undefined
2 participants