Malicious PDF documents test suite #1147

Open
bitsgalore opened this issue Feb 26, 2021 · 9 comments
Labels: feature (New functionality to be developed), fixed-in-dev, P2 (Medium priority issues to be scheduled in a future release)

@bitsgalore

Earlier this week, researchers at Ruhr University Bochum published a conference paper on insecure features in PDF, based on a systematic review of the full format spec:

https://www.ndss-symposium.org/wp-content/uploads/ndss2021_1B-2_23109_paper.pdf

There's a good summary in this blog post:

https://web-in-security.blogspot.com/2021/01/insecure-features-in-pdfs.html

They've also released a suite of malicious test files, which includes the helper scripts they used to generate these:

https://pdf-insecurity.org/download/pdf-dangerous-paths/exploits-and-helper-scripts.zip

As some of those files might be of interest for VeraPDF testing (if only to make sure that VeraPDF doesn't get caught up in some infinite loop), I'm just dropping the link here.

@bdoubrov
Contributor

Thanks a lot, Johan! We have already been fixing a number of issues to prevent runtime exceptions on similar malicious PDFs, so this is indeed an excellent stability test for veraPDF.

@bitsgalore
Author

In addition to this, one of the Apache Tika developers pointed me to their "stressful PDF corpus", which I think would be useful for stability testing as well. See this post for a description:

https://www.pdfa.org/a-new-stressful-pdf-corpus/

Here's the link to the corpus:

https://corpora.tika.apache.org/base/docs/bug_trackers

Packaged downloads here:

https://corpora.tika.apache.org/base/packaged/pdfs/
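
A minimal sketch of how such a corpus could be used for a stability run: it launches a command once per PDF with a time limit and records timeouts (a possible infinite loop) and non-zero exit codes. The function name `stability_run`, the 60-second default, and the result labels are illustrative assumptions, not part of veraPDF or Tika. How a given veraPDF version maps validation failures or crashes to exit codes is not covered in this thread, so non-zero results need interpreting by hand.

```python
import subprocess
from pathlib import Path


def stability_run(command_prefix, corpus_dir, timeout_s=60):
    """Run command_prefix + [pdf] on every *.pdf under corpus_dir.

    Returns a dict mapping each file path (as str) to "ok", "timeout",
    or "exit <code>" for a non-zero exit status.
    """
    results = {}
    for pdf in sorted(Path(corpus_dir).rglob("*.pdf")):
        try:
            proc = subprocess.run(
                command_prefix + [str(pdf)],
                capture_output=True,  # keep the tool's output off the terminal
                timeout=timeout_s,    # guards against a run that never finishes
            )
        except subprocess.TimeoutExpired:
            results[str(pdf)] = "timeout"
            continue
        if proc.returncode == 0:
            results[str(pdf)] = "ok"
        else:
            results[str(pdf)] = f"exit {proc.returncode}"
    return results
```

For instance, `stability_run(["verapdf", "-f", "ua1"], "corpus/")` would try PDF/UA-1 validation on every PDF under a local `corpus/` directory (a hypothetical path). The timeout only terminates the process started directly by `subprocess.run`.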

@AlainVagner

AlainVagner commented Apr 2, 2021

Hi, there is also the corpus of PDFs from the pdfium project:

https://pdfium.googlesource.com/pdfium/+/refs/heads/master/testing/resources

I tried PDF/UA validation on this corpus, and veraPDF crashed on some files without producing valid XML output.
For example, when calling:
verapdf -f ua1 bug_113.pdf

I get the following XML output:

<?xml version="1.0" encoding="utf-8"?>
<report>
  <buildInformation>
    <releaseDetails id="core" version="1.19.18" buildDate="2021-03-29T12:47:00+02:00"></releaseDetails>
    <releaseDetails id="validation-model" version="1.19.50" buildDate="2021-03-30T15:18:00+02:00"></releaseDetails>
    <releaseDetails id="gui" version="1.19.53" buildDate="2021-03-30T15:25:00+02:00"></releaseDetails>
  </buildInformation>
  <jobs% 

Edit: I just saw that the pdfium corpus is included in the Apache Tika one.

@bdoubrov
Contributor

bdoubrov commented Apr 2, 2021

Thanks a lot for bringing this corpus to our attention. We are currently working on stabilizing veraPDF on various collections of malformed documents, so this one will certainly be covered as well by the next official release.

@bdoubrov
Contributor

bdoubrov commented Apr 2, 2021

@AlainVagner in fact, having checked this particular test file, bug_113.pdf, I see that veraPDF does catch the error correctly and generates the XML report. The issue is that with the plain command line you are using, stdout and stderr are both written to the terminal and get interleaved. If you redirect stderr elsewhere, the remaining XML looks well-formed. You can use
verapdf -f ua1 bug_113.pdf 2>/dev/null
or
verapdf -f ua1 bug_113.pdf 2>NUL
if you are on Windows.

A related issue on mixed stdout and stderr has already been reported here: #1155
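
The same check can be scripted. The following is a minimal sketch, assuming the report is written to stdout and diagnostics go to stderr, as described above. It captures the two streams separately and tests whether stdout alone parses as XML. The helper name `run_and_check_report` is hypothetical and not part of veraPDF.

```python
import subprocess
import xml.etree.ElementTree as ET


def run_and_check_report(cmd):
    """Run cmd with stdout and stderr captured separately.

    Returns (well_formed, stdout_bytes, stderr_text), where well_formed
    says whether stdout on its own parses as an XML document.
    """
    proc = subprocess.run(cmd, capture_output=True)
    try:
        # Parsing raw bytes lets the parser honour the XML encoding declaration.
        ET.fromstring(proc.stdout)
        well_formed = True
    except ET.ParseError:
        # Covers truncated output, empty output, and non-XML noise.
        well_formed = False
    stderr_text = proc.stderr.decode("utf-8", errors="replace")
    return well_formed, proc.stdout, stderr_text
```

Calling `run_and_check_report(["verapdf", "-f", "ua1", "bug_113.pdf"])` applies this to the command line from this thread, provided `verapdf` is on the PATH. A `False` first element together with the captured stderr text would show whether the report itself is truncated or whether only the terminal display was interleaved.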

@AlainVagner

@bdoubrov thanks for checking! I tried on my side and I still have the issue when redirecting stderr to /dev/null. I am on macOS, using veraPDF 1.19.53. I should probably test on the latest build.

@bdoubrov
Contributor

bdoubrov commented Jun 3, 2021

Hi @AlainVagner

Sorry, I might have missed your latest comment. Could you please clarify what exactly doesn't work for you on Mac with the command line: verapdf -f ua1 bug_113.pdf 2>/dev/null ?

Is the generated XML report still not well-formed? Could you send the terminal output together with the XML report in this case?

@carlwilson carlwilson added feature New functionality to be developed P2 Medium priority issues to be scheduled in a future release labels Feb 1, 2022
@carlwilson carlwilson added this to the 1.22 milestone Feb 1, 2022
@bdoubrov bdoubrov modified the milestones: 1.22, 1.26 Jul 4, 2023
@sneakers-the-rat

may i say thank you for the malicious PDF document set in this issue, it is a wonderful and terrifying format and malicious PDFs are a true, underappreciated net art format

@MaximPlusov MaximPlusov modified the milestones: 1.26, 1.28 May 22, 2024
@bdoubrov
Contributor

bdoubrov commented Oct 9, 2024

All known issues in these files are covered. Further performance improvements have been made to handle very large files in these collections.

6 participants