Default use of PDF/A-1B profile for regular (non-PDF/A) PDFs #1040

Open
bitsgalore opened this issue Sep 11, 2019 · 11 comments
Labels: feature (New functionality to be developed), fixed-in-dev, P2 (Medium priority issues to be scheduled in a future release)

Comments

@bitsgalore

I just ran VeraPDF on a "regular" (i.e. non-PDF/A) PDF with automatic flavour detection:

verapdf text_only_fontsEmbeddedAll.pdf > test.xml

(File can be found here: text_only_fontsEmbeddedAll.pdf)

From the profileName attribute in the validationReport I see this activates the PDF/A-1B profile. It seems this is used as a fallback value if no flavour can be detected.

This seems pretty arbitrary, and it might also result in false positives if one uses VeraPDF to identify the PDF/A flavour (if any) of a document. See also the Twitter discussion here.

A better approach might be to report profileName as unknown (or something similar), and skip the validation altogether for these cases. The -f option would still enable a user to validate such files against any of the available profiles.

@bdoubrov
Contributor

Thanks for reporting this issue. The only concern is that a user may currently run veraPDF on a large collection of PDFs to assess their archival quality, even if the majority do not contain a PDF/A identifier.

If we implement your suggestion as is, the above scenario would no longer be possible. And if the user does specify the profile explicitly (e.g., PDF/A-1b), they still risk that this profile differs from the document's own PDF/A identification.

So it looks to me like another option is needed: the profile to use for validating non-PDF/A files (or none, in which case an error is reported). If this sounds reasonable, we can include it in the next official release of veraPDF.
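
The option discussed above boils down to a precedence rule. The following is only a minimal sketch of that rule, not veraPDF code: the class `FlavourResolution`, the method `resolve`, and the use of plain strings for profile names are all hypothetical, and the ordering (explicit `-f` choice first, then the flavour declared in the document, then the configured fallback for non-PDF/A files) is an assumption based on this discussion.

```java
import java.util.Optional;

/** Hypothetical sketch; none of these names come from veraPDF. */
final class FlavourResolution {

    /**
     * Picks the profile to validate against.
     * Assumed order: explicit user choice, then the flavour declared in the
     * document, then the fallback configured for non-PDF/A files.
     * An empty result means "no profile", i.e. skip validation or report an error.
     */
    static Optional<String> resolve(Optional<String> forced,
                                    Optional<String> detected,
                                    Optional<String> nonPdfaFallback) {
        if (forced.isPresent()) {
            return forced;
        }
        if (detected.isPresent()) {
            return detected;
        }
        return nonPdfaFallback;
    }

    public static void main(String[] args) {
        // Regular PDF, no fallback configured: nothing to validate against.
        System.out.println(resolve(Optional.empty(), Optional.empty(), Optional.empty()));
        // Regular PDF, fallback configured: behaves like the current default.
        System.out.println(resolve(Optional.empty(), Optional.empty(), Optional.of("1b")));
    }
}
```

With an empty fallback, a bulk run would flag regular PDFs as "no profile" instead of silently validating them as PDF/A-1b, while a configured fallback keeps the bulk-assessment scenario possible.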

@bitsgalore
Author

Yes, you're right. Adding an option that defines the profile for non-PDF/A looks like a good solution. In addition it would be helpful to explain the behaviour for non-PDF/A files (including the default profile that is used for these cases) in the documentation, as I couldn't find any info on this (or perhaps I looked in the wrong place).

@ghost assigned carlwilson on Oct 24, 2019
@ghost added the documentation label on Oct 24, 2019
@bdoubrov assigned MaximPlusov and unassigned carlwilson on Feb 10, 2021
@bdoubrov added the feature (New functionality to be developed) and P2 (Medium priority issues to be scheduled in a future release) labels on Feb 10, 2021
@bdoubrov
Contributor

Fix released in v1.20

@runeflobakk

I am a bit unsure how to configure the parser to not assume any default flavour if none is detected in the PDF. I have tried creating a parser with

PDFAParser parser = veraPDFFoundry.createParser(
        inputStream, PDFAFlavour.NO_FLAVOUR, PDFAFlavour.NO_FLAVOUR);

But I still get "1b" from parser.getFlavour() when I expected to get PDFAFlavour.NO_FLAVOUR from parsing a non-PDF/A (and with no "claim" of being PDF/A). I have tried to track down the changes of this feature, and I get the impression that PDFAFlavour.NO_FLAVOUR is used to preserve the existing behavior, i.e. that "no flavour" means nothing specified, and "1b" is still used as the default flavour if none is detected.

I can, however, create the parser with null as the default flavour, and parser.getFlavour() will indeed return null if no flavour is detected, so I can check for null to determine whether the PDF is a "regular" PDF that does not claim to be a PDF/A. But I wonder whether this is the intended behavior or accidental.

I got the impression that you, @bitsgalore, had the same expectation that I have: that a non-PDF/A should not be "detected" arbitrarily as "PDF/A-1b", but should instead be reported as neither being nor claiming to be a PDF/A at all. Did you manage to configure the parser in this way, to get PDFAFlavour.NO_FLAVOUR returned from PDFAParser.getFlavour()?

I hope you don't mind if I mention you as well @bdoubrov . Is this perhaps documented somewhere, and I have just not been able to find it? :)

I use Verapdf Core v1.24.1. Thank you!

@bdoubrov reopened this on Apr 8, 2024
@bdoubrov
Contributor

bdoubrov commented Apr 8, 2024

@runeflobakk thanks for raising this question again. I'm not sure, though, what the expected behavior is in your case: would you like to skip validation of PDF documents that don't identify themselves as PDF/A? veraPDF is not able to do this as of now, at least not directly.

The workaround would be to first run feature extraction to get the document metadata, check for the presence of the PDF/A identification package via some XPath tool, and skip validation if it is not present.
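
The XPath check in this workaround might look roughly like the sketch below. It is a minimal illustration using only the JDK's XML APIs, under several assumptions: the layout of veraPDF's feature report is not shown in this thread, so the code works on a raw XMP snippet passed in as a string; the class `PdfaIdCheck` and the method `declaresPdfa` are hypothetical names; and the check only looks for the `part` property in the PDF/A identification namespace (`http://www.aiim.org/pdfa/ns/id/`), written either as an element or as an attribute.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Hypothetical helper; not part of veraPDF. */
final class PdfaIdCheck {

    // Namespace of the PDF/A identification schema used in XMP metadata.
    static final String PDFAID_NS = "http://www.aiim.org/pdfa/ns/id/";

    // True if pdfaid:part occurs anywhere, as an element or as an attribute.
    private static final String HAS_PART =
        "boolean(//*[namespace-uri()='" + PDFAID_NS + "' and local-name()='part']"
        + " | //@*[namespace-uri()='" + PDFAID_NS + "' and local-name()='part'])";

    /** Returns true if the given metadata XML carries a PDF/A identification. */
    static boolean declaresPdfa(String metadataXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed for namespace-uri() to match
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(metadataXml)));
            return (Boolean) XPathFactory.newInstance().newXPath()
                .evaluate(HAS_PART, doc, XPathConstants.BOOLEAN);
        } catch (ParserConfigurationException | SAXException
                 | IOException | XPathExpressionException e) {
            throw new IllegalArgumentException("Could not evaluate metadata XML", e);
        }
    }

    public static void main(String[] args) {
        String xmp = "<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'>"
            + "<rdf:Description rdf:about='' xmlns:pdfaid='" + PDFAID_NS + "'>"
            + "<pdfaid:part>1</pdfaid:part><pdfaid:conformance>B</pdfaid:conformance>"
            + "</rdf:Description></rdf:RDF>";
        System.out.println(declaresPdfa(xmp));
    }
}
```

A caller would run validation only when `declaresPdfa` returns true. Matching on namespace URI rather than on the `pdfaid:` prefix keeps the check independent of whichever prefix the producing tool chose.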

@runeflobakk

@bdoubrov Thanks for your very swift answer :)

Yes, I was hoping to skip validation, if the parser does not detect any PDF/A-flavour.
My code currently looks like this (simplified), and it seems to work, but I suspect this may not be the intended use of the API, and may be a NullPointerException waiting to happen on future upgrades of VeraPDF 😅

boolean isValidPdfA(InputStream documentContent) {
    try (VeraPDFFoundry pdfFoundry = Foundries.defaultInstance()) {
        try (PDFAParser parser = pdfFoundry.createParser(documentContent, PDFAFlavour.NO_FLAVOUR, (PDFAFlavour)null)) {
            PDFAFlavour detectedFlavour = parser.getFlavour();
            if (detectedFlavour == null) {
                return false;
            }
            return pdfFoundry
                .createValidator(detectedFlavour, false)
                .validate(parser)
                .isCompliant();
        }
    }
}

(Omitted various exception handling for brevity)

I intended the method to return false both when the given PDF does not identify as a PDF/A and when it is a PDF/A that is not compliant.

Previously I used the "autodetect" variant, createParser(InputStream), and noticed that the detected flavour for non-PDF/As (those not identifying themselves as such) was "1b". It would probably work to just keep doing that, since a non-PDF/A would, I guess, fail validation as PDF/A-1b in any case, but it would be nice to be able to short-circuit and not even validate if the parser does not detect any PDF/A identifier at all.

@bdoubrov
Contributor

I see the point. We'll extend the interface of the veraPDF parser to report all conformance declarations present in the document. There might be none, or more than one (e.g. PDF/A, PDF/UA and now also WTPDF). For the moment, the logic for choosing the flavour used for validation is mostly hardcoded. This new interface will allow implementing custom logic, as in your code example.

@bdoubrov
Contributor

bdoubrov commented Aug 2, 2024

@runeflobakk we have extended the veraPDF API (not released yet, available in the latest dev builds) to retrieve all conformance claims in the document metadata and programmatically select 0 or more of them for validation: #1414

So, a sample code might look like:

VeraGreenfieldFoundryProvider.initialise();
try (PDFAParser parser = Foundries.defaultInstance().createParser(new FileInputStream("mydoc.pdf"))) {
	List<PDFAFlavour> detectedFlavours = parser.getFlavours();
	List<PDFAFlavour> flavours = new LinkedList<>();
	for (PDFAFlavour flavour : detectedFlavours) {
		// iterate through all detected flavours and pick up PDF/A and PDF/UA ones for validation
		if (PDFFlavours.isFlavourFamily(flavour, PDFAFlavour.SpecificationFamily.PDF_A) || 
				PDFFlavours.isFlavourFamily(flavour, PDFAFlavour.SpecificationFamily.PDF_UA)) {
			flavours.add(flavour);
		}
	}
	PDFAValidator validator = Foundries.defaultInstance().createValidator(flavours);
	List<ValidationResult> results = validator.validateAll(parser);
	for (ValidationResult result : results) {
		if (result.isCompliant()) {
			// File complies to flavour
		} else {
			// File doesn't comply to flavour
		}
	}
} catch (IOException | ValidationException | ModelParsingException | EncryptedPdfException exception) {
	// Exception during validation
} 

@runeflobakk

runeflobakk commented Aug 2, 2024

@bdoubrov Thank you! I will check it out when I get some time.

In the code above, conformance claims are first collected in List<PDFAFlavour> flavours, which is then used to create an applicable validator using Foundries.defaultInstance().createValidator(flavours). If flavours ends up being empty (no conformance claims were found), is it still ok to create the PDFAValidator? Will that validator simply be a "no-op" validator, since it is specified to validate "no flavours"? Or should the entire validation step be skipped (i.e. handled according to how your application needs to respond to "no conformance claims") if the PDF is not claiming any PDF/A conformance?

In any case, my application needs to handle a PDF that does not claim any PDF/A conformance, since I am parsing PDFs which may or may not be PDF/A, or may not even claim to be. But I am mostly curious as to how the validator will behave if configured to validate no flavours of PDF/A, which may happen in your example code. I can of course test this myself, but wanted to mention it as it was the first thing I noticed. Is it even sensible to create a PDFAValidator set up to validate no flavours of PDF/A?

@MaximPlusov
Contributor

MaximPlusov commented Aug 2, 2024

@runeflobakk It's possible to create a PDFAValidator with empty flavours, but that validator will not be a "no-op".
The following code example is better suited to your goals:

VeraGreenfieldFoundryProvider.initialise();
try (PDFAParser parser = Foundries.defaultInstance().createParser(new FileInputStream("mydoc.pdf"))) {
	List<PDFAFlavour> detectedFlavours = parser.getFlavours();
	PDFAFlavour pdfaFlavour = null;
	for (PDFAFlavour flavour : detectedFlavours) {
		// iterate through all detected flavours and pick up PDF/A one for validation
		if (PDFFlavours.isFlavourFamily(flavour, PDFAFlavour.SpecificationFamily.PDF_A)) {
			pdfaFlavour = flavour;
			break;
		}
	}
	if (pdfaFlavour != null) {
		PDFAValidator validator = Foundries.defaultInstance().createValidator(pdfaFlavour);
		ValidationResult result = validator.validate(parser);
		if (result.isCompliant()) {
			// File complies to PDF/A flavour
		} else {
			// File doesn't comply to PDF/A flavour
		}
	} else {
		// File doesn't contain PDF/A Identification schema
	}
} catch (IOException | ValidationException | ModelParsingException | EncryptedPdfException exception) {
	// Exception during parsing or validation
} 
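
The selection loop in the example above can also be written so that "no PDF/A claim" is an explicit empty result rather than a null. The following is only a sketch of that idea with stand-in types, so that it runs without veraPDF on the classpath: the enums `Family` and `Flavour`, the class `ClaimSelection`, and the method `firstPdfaClaim` are all hypothetical and merely mimic the roles of `PDFAFlavour` and its `SpecificationFamily`.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

/** Hypothetical sketch with stand-in types; not the veraPDF API. */
final class ClaimSelection {

    // Stand-in for PDFAFlavour.SpecificationFamily.
    enum Family { PDF_A, PDF_UA }

    // Stand-in for PDFAFlavour, reduced to a few values.
    enum Flavour {
        PDFA_1_B(Family.PDF_A), PDFA_2_B(Family.PDF_A), PDFUA_1(Family.PDF_UA);

        final Family family;

        Flavour(Family family) {
            this.family = family;
        }
    }

    /** First PDF/A claim in declaration order, or empty if there is none. */
    static Optional<Flavour> firstPdfaClaim(List<Flavour> detected) {
        return detected.stream()
            .filter(flavour -> flavour.family == Family.PDF_A)
            .findFirst();
    }

    public static void main(String[] args) {
        List<Flavour> claims = Arrays.asList(Flavour.PDFUA_1, Flavour.PDFA_2_B);
        // Validate only when a PDF/A claim exists; otherwise skip validation.
        Optional<Flavour> choice = firstPdfaClaim(claims);
        if (choice.isPresent()) {
            System.out.println("validate against " + choice.get());
        } else {
            System.out.println("no PDF/A claim, skipping validation");
        }
    }
}
```

Returning an `Optional` forces the caller to handle the "regular PDF" case before a validator is ever created, which avoids building a validator from an empty flavour list.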

@runeflobakk

@MaximPlusov Thank you very much! 👍
