Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

check that the manifest only lists publication resources #1452

Closed
rdeltour opened this issue Dec 8, 2022 · 7 comments · Fixed by #1465
Closed

check that the manifest only lists publication resources #1452

rdeltour opened this issue Dec 8, 2022 · 7 comments · Fixed by #1465
Assignees
Labels
status: has PR The issue is being processed in a pull request type: false-negative This issue is about invalid content being incorrectly accepted
Milestone

Comments

@rdeltour
Copy link
Member

rdeltour commented Dec 8, 2022

The spec says:

the manifest MUST only list publication resources

I don't think this is covered by EPUBCheck.

For each manifest item, we could lookup the reference registry to see if a reference to the resource was found, and issue an error if not.
We would issue a mere usage message if the EPUB contains scripted content, as the item could potentially be a legit publication resource used in scripting.

@rdeltour rdeltour added status: accepted Ready to be further processed type: false-negative This issue is about invalid content being incorrectly accepted labels Dec 8, 2022
@rdeltour rdeltour added this to the v5.0.0-rc milestone Dec 8, 2022
@rdeltour rdeltour self-assigned this Dec 15, 2022
@rdeltour
Copy link
Member Author

For each manifest item, we could lookup the reference registry to see if a reference to the resource was found, and issue an error if not. We would issue a mere usage message if the EPUB contains scripted content, as the item could potentially be a legit publication resource used in scripting.

@mattgarrish do you think that check would be useful and/or reasonable?

Existing EPUBs may list things in the manifest that are not strictly speaking publication resources (e.g. unused images). As we did not check this in the past, so I'm wondering if it's opening a can of worms…

@rdeltour rdeltour added status: in discussion The issue is being discussed by the development team and removed status: accepted Ready to be further processed labels Dec 15, 2022
@mattgarrish
Copy link
Member

The requirement is only there to keep linked resources and resources travelling in the container from littering the manifest, since reading systems don't need these (I guess it might keep them from being extracted for reading). It was never meant to trap former publication resources that are still listed but have stopped being referenced, so I agree this does seem like a can of worms to implement now.

It is useful to know there are unused resources still kicking around in the publication from a cleanup perspective, but I don't know this is worth the headache of making errors. And I don't think there's any simple fix, either here or for the spec. We're just too far along in EPUB 3.

What if we deliberately ignore the spec and make it a usage message so at least publishers will know if a resource is still used but without adding new errors?

@rdeltour
Copy link
Member Author

What if we deliberately ignore the spec and make it a usage message so at least publishers will know if a resource is still used but without adding new errors?

Yes, it looks like a good compromise, I was thinking the same.

The requirement is only there to keep linked resources and resources travelling in the container from littering the manifest, since reading systems don't need these

IMO the requirement is too strong. We already say that publication resources MUST be listed; we don't need to add that others MUST NOT, especially when this cannot be checked.

Also, EPUBCheck reports files present in the container but not listed in the manifest as a WARNING (OPF-003). It has for a looong time. I know it's contrary to the spec, but I think it was added back in the day to warn about littering system files (.DSStore and the like).
But the current state of affair is the EPUBCheck implicitly enforces (by a warning) a spec violation 🙃.

In any case, I for one always have a little difficulty understanding what is a publication resource.

For instance, when the exemption section says:

This exemption allows EPUB creators to include resources in the EPUB container that are not for use by EPUB reading systems.

if a resource is not for use by RS, is it even a publication resource, and in that case is it even relevant to create exemption rules for it?

@mattgarrish
Copy link
Member

if a resource is not for use by RS, is it even a publication resource

That's what makes wading into this so complicated. We've carved out a lot of niche cases over the years for resources that don't quite conform to the standard.

But I was forgetting last night that resources travelling in the container still have to be listed in the manifest. It's just that they're exempt from fallback requirements. How do you differentiate a resource travelling in the container from one that used to be a publication resource and is no longer in use, though? That still remains the problem I don't see a way of writing around.

But that particular problem aside, the only resources in the container that shouldn't be in the manifest are linked resources (the ones that aren't also publication resources!) and the stuff in META-INF.

It's not clear why you don't list files in META-INF in the manifest, though -- there should probably be a statement that because they're reserved files for processing the OCF, they don't need listing.

If we add that statement, then I think we could take the problematic "must not" here and turn it into a note that just mentions the resources that don't get listed (or even just remove it altogether, as at that point the manifest requirements would cover all cases).

That would only leave the problem of former publication resources / resources travelling in the container, but I think that's a legit case for a usage message.

@rdeltour
Copy link
Member Author

Ok, so assuming w3c/epub-specs#2506 is merged, that leaves us with the following EPUBCheck behavior:

  • do not report files that are known publication resources, with references found in content documents
  • report as errors the following if listed in the manifest
    • mimetype file
    • META-INF/ files
    • linked resources, when not also in the spine or embedded in content documents
  • warn or usage about other resources, for which no reference were found

About the last point: EPUBCheck historically warns about those (since #58, which was released in v1.1, 12 years ago).
According to the clarified spec, this warning would become a mere usage message. This was also discussed a few years back in w3c/epub-specs#563 (comment):

@mattgarrish

the working group resolved on the July 6 [2016] call not to change the rules but that epubcheck should be changed to only emit a info message about resources that are not in the manifest, linked or "manifest free" (rendition mapping doc)

I'm still slightly hesitant to change this, based on @dauwhe's original comment in #58 (and the following discussion):

A major new player in the eBook marketplace is now checking this, and
rejecting titles with "unmanifested" files.

@mattgarrish
Copy link
Member

I'm still slightly hesitant to change this

Ya, I thought it was a new warning, but if it's always been around...

But, the problem is the new "allowed to travel in the container without being used" resources (data sets, etc.). We can't make one camp happy without annoying the other.

We probably should have added a manifest property for the new case to make clear they've been added intentionally. What do you think @iherman? Doesn't seem like too big an ask, and it's not like there's an established base using the container to carry files around that we need to worry about yet.

I just have no idea what we'd name the property. <item ... properties="data">, maybe?

@iherman
Copy link
Member

iherman commented Dec 17, 2022

I had two worries:

  1. Are we creating a new normative feature
  2. Are the results of testing such that we may have to remove (or weaken) the fallback feature from the spec; it is the explanation of fallback that led to the introduction of this category of resources in the first place... (although I realize that even without fallbacks the issue would arise).

It seems that both of my worries are moot, so I am fine with this change. I have added some additional (minor) comments to w3c/epub-specs#2507.

B.t.w., if the group decides it is not a good time to make the change, we can postpone it for later. It would not create any practical problems.

rdeltour added a commit that referenced this issue Dec 22, 2022
EPUBCheck used to report any resource found in the the container but not
listed in the manifest as a warning (since #58 was fixed, in v1.1). But
the EPUB specification does not require that.

This commit downgrades the severity of `OPF-003` to a usage report.

See also #1452
See also w3c/epub-specs#563
rdeltour added a commit that referenced this issue Dec 22, 2022
This PR introduces a new usage message (`OPF-097`) reported when the
manifest includes a resource for which no reference was found in content.

This is legit for instance for exempt resources (like data files included
in the container). But it is likely an author error otherwise, so reporting
a usage message can help to detect that.

We do not report an error since:
- references can be added via scripting, so the check can only be
  informative for scripted content
- EPUBCheck historically required all container resources to be listed
  in the manifest (not doing so was reported as a warning, `OPF-003`),
  which is conflicting with a requirement to **not** list unsused
  resources

Close #1452
@rdeltour rdeltour added status: has PR The issue is being processed in a pull request and removed status: in discussion The issue is being discussed by the development team labels Dec 23, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
status: has PR The issue is being processed in a pull request type: false-negative This issue is about invalid content being incorrectly accepted
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants