Default use of PDF/A-1B profile for regular (non-PDF/A) PDFs #1040

bitsgalore · 2019-09-11T11:32:07Z

I just ran VeraPDF on a "regular" (i.e. non-PDF/A) PDF with automatic flavour detection:

verapdf text_only_fontsEmbeddedAll.pdf > test.xml

(File can be found here: text_only_fontsEmbeddedAll.pdf)

From the profileName attribute in the validationReport I see this activates the PDF/A-1B profile. It seems this is used as a fallback value if no flavour can be detected.

This seems pretty arbitrary, and it might also result in false positives if one uses VeraPDF to identify the PDF/A flavour (if any) of a document. See also the Twitter discussion here.

A better approach might be to report profileName as unknown (or something similar), and skip the validation altogether for these cases. The -f option would still enable a user to validate such files against any of the available profiles.


bdoubrov · 2019-09-12T09:28:53Z

Thanks for reporting this issue. One concern: currently a user may run veraPDF on a large collection of PDFs to assess their archival quality, even if most of them do not contain a PDF/A identifier.

If we implement your suggestion as is, that scenario would no longer be possible. And if the user does specify the profile explicitly (e.g., PDF/A-1b), there is still a risk that this profile differs from the document's own PDF/A identification.

So, to me this looks like another option is needed: the profile to use for validating non-PDF/A files (or none, in which case an error is reported). If this sounds reasonable, we can then include it in the next official release of veraPDF.

bitsgalore · 2019-09-12T10:25:33Z

Yes, you're right. Adding an option that defines the profile for non-PDF/A looks like a good solution. In addition it would be helpful to explain the behaviour for non-PDF/A files (including the default profile that is used for these cases) in the documentation, as I couldn't find any info on this (or perhaps I looked in the wrong place).

bdoubrov · 2022-01-24T11:29:56Z

Fix released in v1.20

runeflobakk · 2024-04-08T10:22:01Z

I am a bit unsure how to configure the parser to not assume any default flavour if none is detected in the PDF. I have tried creating a parser with

PDFAParser parser = veraPDFFoundry.createParser(
        inputStream, PDFAFlavour.NO_FLAVOUR, PDFAFlavour.NO_FLAVOUR);

But I still get "1b" from parser.getFlavour() when I expected to get PDFAFlavour.NO_FLAVOUR from parsing a non-PDF/A (and with no "claim" of being PDF/A). I have tried to track down the changes of this feature, and I get the impression that PDFAFlavour.NO_FLAVOUR is used to preserve the existing behavior, i.e. that "no flavour" means nothing specified, and "1b" is still used as the default flavour if none is detected.

I can, however, create the parser with null as the default flavour, and parser.getFlavour() will then indeed return null if no flavour is detected, so I can check for null to determine whether the PDF is a "regular" PDF that does not claim to be a PDF/A. But I wonder whether this is the intended behavior or accidental.

I got the impression that you, @bitsgalore, had the same expectation that I have: that a non-PDF/A should not be "detected" arbitrarily as "PDF/A-1b", but should instead be reported as not being, nor claiming to be, a PDF/A at all. Did you manage to configure the parser in this way, to get PDFAFlavour.NO_FLAVOUR returned from PDFAParser.getFlavour()?

I hope you don't mind if I mention you as well @bdoubrov . Is this perhaps documented somewhere, and I have just not been able to find it? :)

I use Verapdf Core v1.24.1. Thank you!

bdoubrov · 2024-04-08T11:45:27Z

@runeflobakk thanks for raising this question again. I'm not sure, though, what the expected behavior is in your case: would you like to skip validation of PDF documents that don't identify themselves as PDF/A? veraPDF is not able to do this as of now, at least not directly.

The workaround would be to first run feature extraction to get the document metadata, check for the presence of the PDF/A identification package via some XPath tool, and skip validation if it is not present.
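
A minimal sketch of such a pre-check in plain Java, assuming the XMP metadata has already been obtained as an XML string (how it is pulled out of the feature extraction output is not shown here). The class name PdfaIdCheck and the method claimsPdfA are hypothetical and not part of veraPDF; the only fixed parts are the PDF/A identification namespace and its "part" property, which XMP may serialize either as a child element or as an attribute.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not a veraPDF class.
final class PdfaIdCheck {

    // Namespace of the PDF/A identification schema (conventional prefix: pdfaid).
    static final String PDFAID_NS = "http://www.aiim.org/pdfa/ns/id/";

    // Matches pdfaid:part whether it is written as an element or as an attribute.
    private static final String HAS_PART =
            "boolean(//*[namespace-uri()='" + PDFAID_NS + "' and local-name()='part']"
            + " | //@*[namespace-uri()='" + PDFAID_NS + "' and local-name()='part'])";

    /** Returns true if the given XMP XML declares a PDF/A part number. */
    static boolean claimsPdfA(String xmpXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // namespace-uri() only works on a namespace-aware DOM.
            factory.setNamespaceAware(true);
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xmpXml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            return (Boolean) xpath.evaluate(HAS_PART, doc, XPathConstants.BOOLEAN);
        } catch (Exception e) {
            // Unparseable metadata is reported to the caller rather than treated as "no claim".
            throw new IllegalArgumentException("Could not read XMP metadata", e);
        }
    }
}
```

A caller would run validation only when claimsPdfA(...) returns true. Note that this only detects the claim; it says nothing about whether the claim is well-formed or the document is compliant.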

runeflobakk · 2024-04-08T12:28:49Z

@bdoubrov Thanks for your very swift answer :)

Yes, I was hoping to skip validation, if the parser does not detect any PDF/A-flavour.
My code currently looks like this (simplified), and this seems to work, but I suspect this may not be intended use of the API, and is just a NullPointerException waiting to happen maybe on future upgrades of VeraPDF 😅

boolean isValidPdfA(InputStream documentContent) {
    try (VeraPDFFoundry pdfFoundry = Foundries.defaultInstance()) {
        try (PDFAParser parser = pdfFoundry.createParser(documentContent, PDFAFlavour.NO_FLAVOUR, (PDFAFlavour) null)) {
            // With null as the default flavour, getFlavour() has been observed to
            // return null when the document declares no PDF/A flavour.
            PDFAFlavour detectedFlavour = parser.getFlavour();
            if (detectedFlavour == null) {
                return false;
            }
            return pdfFoundry
                .createValidator(detectedFlavour, false)
                .validate(parser)
                .isCompliant();
        }
    }
}

(Omitted various exception handling for brevity)

I intended the method to return false both when the given PDF does not identify as a PDF/A and when it is a PDF/A that is not compliant.

Previously I used the "autodetect"-variant of createParser(InputStream), and noticed that the detected flavour for non-PDF/As (not identifying themselves as such) was "1b". It would probably work to just keep doing that, as validating a non-PDF/A I guess would in any case fail validation as PDF/A-1b, but it would be nice to be able to short-circuit and not even validate if the parser does not detect any PDF/A-identifier at all.

bdoubrov · 2024-04-12T08:32:50Z

I see the point. We'll extend the interface of the veraPDF parser to report all conformance declarations present in the document. There might be none, or more than one (e.g. PDF/A, PDF/UA and now also WTPDF). For the moment the logic for choosing the flavour used for validation is mostly hardcoded. This new interface will allow implementing custom logic, as in your code example.
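
To illustrate the kind of custom logic meant here, a minimal sketch in plain Java that is independent of veraPDF. Every type in it (ClaimSelection, Family, Claim, Outcome) is a hypothetical stand-in and not veraPDF API; the validator is passed in as a predicate so that the selection step can be shown on its own.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

// Hypothetical model of "pick zero or one declared claim, then validate"; not veraPDF API.
final class ClaimSelection {

    /** Stand-in for a specification family. */
    enum Family { PDF_A, PDF_UA, OTHER }

    /** Stand-in for one conformance declaration found in a document. */
    static final class Claim {
        final Family family;
        final String label;

        Claim(Family family, String label) {
            this.family = family;
            this.label = label;
        }
    }

    /** Three outcomes, so "no claim" is not confused with "claim but not compliant". */
    enum Outcome { NO_PDFA_CLAIM, COMPLIANT, NOT_COMPLIANT }

    /** Picks the first PDF/A declaration, if any; an empty list yields an empty result. */
    static Optional<Claim> firstPdfAClaim(List<Claim> declared) {
        return declared.stream()
                .filter(claim -> claim.family == Family.PDF_A)
                .findFirst();
    }

    /** Runs the validator only when a PDF/A claim exists; otherwise short-circuits. */
    static Outcome check(List<Claim> declared, Predicate<Claim> validator) {
        return firstPdfAClaim(declared)
                .map(claim -> validator.test(claim) ? Outcome.COMPLIANT : Outcome.NOT_COMPLIANT)
                .orElse(Outcome.NO_PDFA_CLAIM);
    }
}
```

The three-valued Outcome is a design choice of this sketch: a caller that only needs a boolean can map both NO_PDFA_CLAIM and NOT_COMPLIANT to false, while still being able to log which of the two occurred.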

bdoubrov · 2024-08-02T10:03:43Z

@runeflobakk we have extended the veraPDF API (not released yet, available in the latest dev builds) to retrieve all conformance claims in the document metadata and programmatically select 0 or more of them for validation: #1414

So, a sample code might look like:

VeraGreenfieldFoundryProvider.initialise();
try (PDFAParser parser = Foundries.defaultInstance().createParser(new FileInputStream("mydoc.pdf"))) {
	List<PDFAFlavour> detectedFlavours = parser.getFlavours();
	List<PDFAFlavour> flavours = new LinkedList<>();
	for (PDFAFlavour flavour : detectedFlavours) {
		// iterate through all detected flavours and pick up PDF/A and PDF/UA ones for validation
		if (PDFFlavours.isFlavourFamily(flavour, PDFAFlavour.SpecificationFamily.PDF_A) || 
				PDFFlavours.isFlavourFamily(flavour, PDFAFlavour.SpecificationFamily.PDF_UA)) {
			flavours.add(flavour);
		}
	}
	PDFAValidator validator = Foundries.defaultInstance().createValidator(flavours);
	List<ValidationResult> results = validator.validateAll(parser);
	for (ValidationResult result : results) {
		if (result.isCompliant()) {
			// File complies to flavour
		} else {
			// File doesn't comply to flavour
		}
	}
} catch (IOException | ValidationException | ModelParsingException | EncryptedPdfException exception) {
	// Exception during validation
}

runeflobakk · 2024-08-02T10:18:34Z

@bdoubrov Thank you! I will check it out when I get some time.

In the code above, conformance claims are first collected in List<PDFAFlavour> flavours, which is then used to create an applicable validator via Foundries.defaultInstance().createValidator(flavours). If flavours ends up being empty (no conformance claims were found), is it still OK to create the PDFAValidator? Will that validator simply be a "no-op" validator, since it is specified to validate "no flavours"? Or should the entire validation step be skipped (i.e. handled according to how your application needs to respond to "no conformance claims") if the PDF does not claim any PDF/A conformance?

In any case, my application needs to handle PDFs that do not claim any PDF/A conformance, since the PDFs I parse may or may not be PDF/A, or may not even claim to be. But I am mostly curious about how the validator will behave if configured to validate no flavours of PDF/A, which may happen in your example code. I can of course test this myself, but wanted to mention it as it was the first thing I noticed. Is it even sensible to create a PDFAValidator set up to validate no flavours of PDF/A?

MaximPlusov · 2024-08-02T11:39:05Z

@runeflobakk It's possible to create a PDFAValidator with empty flavours, but that validator will not be a "no-op".
The following code example is better suited to your goals:

VeraGreenfieldFoundryProvider.initialise();
try (PDFAParser parser = Foundries.defaultInstance().createParser(new FileInputStream("mydoc.pdf"))) {
	List<PDFAFlavour> detectedFlavours = parser.getFlavours();
	PDFAFlavour pdfaFlavour = null;
	for (PDFAFlavour flavour : detectedFlavours) {
		// iterate through all detected flavours and pick up the first PDF/A one for validation
		if (PDFFlavours.isFlavourFamily(flavour, PDFAFlavour.SpecificationFamily.PDF_A)) {
			pdfaFlavour = flavour;
			break;
		}
	}
	if (pdfaFlavour != null) {
		PDFAValidator validator = Foundries.defaultInstance().createValidator(pdfaFlavour);
		ValidationResult result = validator.validate(parser);
		if (result.isCompliant()) {
			// File complies to PDF/A flavour
		} else {
			// File doesn't comply to PDF/A flavour
		}
	} else {
		// File doesn't contain PDF/A Identification schema
	}
} catch (IOException | ValidationException | ModelParsingException | EncryptedPdfException exception) {
	// Exception during parsing or validation
}

runeflobakk · 2024-08-02T11:41:29Z

@MaximPlusov Thank you very much! 👍

ghost assigned carlwilson Oct 24, 2019

ghost added the documentation label Oct 24, 2019

carlwilson removed the documentation label Dec 18, 2019

bdoubrov assigned MaximPlusov and unassigned carlwilson Feb 10, 2021

bdoubrov added the feature (New functionality to be developed) and P2 (Medium priority issues to be scheduled in a future release) labels Feb 10, 2021

MaximPlusov added the fixed-in-dev label Jun 3, 2021

bdoubrov closed this as completed Jan 24, 2022

bdoubrov reopened this Apr 8, 2024

bdoubrov removed the fixed-in-dev label May 24, 2024

bdoubrov added the fixed-in-dev label Aug 2, 2024
