Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SVG DTD not accepted #1114

Closed
iherman opened this issue Mar 16, 2020 · 32 comments
Closed

SVG DTD not accepted #1114

iherman opened this issue Mar 16, 2020 · 32 comments
Labels
spec: EPUB 3.x Impacting the support of EPUB 3.x specifications status: completed Work completed, can be closed type: spec The issue is related to a Specification update
Milestone

Comments

@iherman
Copy link
Member

iherman commented Mar 16, 2020

(This seems to be related to #173 (comment), although not to rdf:RDF, which is the original topic of that issue.)

The W3C SVG Logo is not accepted by epubcheck, complaining about "External DTD entities are not allowed.". That is problematic.

The SVG file itself does not include any external DTD entity. However, the Logo file does contain:

<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">

Indeed, the DTD file itself contains loads of entities. It is also correct that the latest version of SVG (SVG2) has removed the reference to a DTD (see change section). But flagging this is as an error does create backward incompatibility issues with bona fide SVG1.1 content which did rely on the DTD (e.g., all examples in SVG1.1 include that DTD reference, but also all SVG content produced by earlier versions of Adobe Illustrator, to take just this example). Note that all browsers accept such SVG 1.1 content without further ado.

I would think that epubcheck should simply ignore the SVG DTD and shouldn't report any error or warning on that one.

@iherman
Copy link
Member Author

iherman commented Mar 16, 2020

B.t.w., formally, the latest Recommendation is still SVG 1.1. SVG2 is a Candidate Recommendation.

@rdeltour
Copy link
Member

I guess this issue should be discussed in the EPUB CG?

EPUBCheck conforms to the EPUB spec here. I personally agree the rule is probably too strict, especially for SVG used as images and not as content docs. But any deviation from the spec should be approved by the CG 😊

@rdeltour rdeltour added spec: EPUB 3.x Impacting the support of EPUB 3.x specifications status: in discussion The issue is being discussed by the development team type: spec The issue is related to a Specification update labels Mar 16, 2020
@mattgarrish
Copy link
Member

This would have to be a change to the core spec, as external references are strictly forbidden from DTDs per https://www.w3.org/publishing/epub3/epub-spec.html#confreq-xml-extmarkupdecl

It was put in at least in part for security reasons, as I recall, but perhaps there's a case to allow specific external references outside of HTML files. Definitely needs discussion.

@iherman
Copy link
Member Author

iherman commented Mar 17, 2020

A similar issue may actually come up with MathML, if the DTD is used: https://www.w3.org/Math/DTD/mathml3/mathml3.dtd has similar constructions.

I would think that the spec (and check) should relax on the standard, official DTD-s for the accepted content types. I am not sure why a reading system would even look at those DTDs in the first place.

@murata2makoto
Copy link
Contributor

Have any publishers complained? If no publishers complain, I do not want to fix EPUB3.

@iherman
Copy link
Member Author

iherman commented Mar 17, 2020

I am not a publisher:-), but I stumbled on this issue through a script that turns W3C recommendations into EPUB.

I realize that is a small and not-very-important example. However, we see (at last!) a number of tools producing EPUB files from, e.g., popular editors (Apple Pages, Google Doc, LibreOffice, soon MS Word). Which is of course great, it makes EPUB becoming some sort of commodity; production of EPUB is not, and should not, be done only by bona fide publishers. However, alongside this evolution, we will also see, for example, more SVG content coming into EPUB from tools like Adobe Illustrator. (The specific SVG file that hit me was an Illustrator output.) That means such issues will come.

B.t.w., such a change can be done in a very mild way, without affecting any deployed content. The essence of the restriction may remain in place; the only exception would be to ignore the official DTD-s of contents that are listed in the EPUB spec (MathML, SVG, SMIL, ...). I do not see any harm doing that.

@dauwhe
Copy link
Contributor

dauwhe commented Apr 28, 2020

It was put in at least in part for security reasons, as I recall, but perhaps there's a case to allow specific external references outside of HTML files. Definitely needs discussion.

Is this also a matter of making EPUBs self-contained? If you need a network request to parse an XML file, how would the EPUB work in the proverbial train tunnel?

The security implications seem non-trivial:

https://en.wikipedia.org/wiki/XML_external_entity_attack
https://en.wikipedia.org/wiki/Billion_laughs_attack
https://insinuator.net/2015/03/xxe-injection-in-apache-batik-library-cve-2015-0250/

@mattgarrish
Copy link
Member

The security implications seem non-trivial:

Ya, but the solution in epub tends to be. Reading systems should be the ones not resolving external entity references. There's not much value in restricting authoring of such references, as if they're being used maliciously are you really running them through epubcheck?

@iherman
Copy link
Member Author

iherman commented Apr 29, 2020

It was put in at least in part for security reasons, as I recall, but perhaps there's a case to allow specific external references outside of HTML files. Definitely needs discussion.

Is this also a matter of making EPUBs self-contained? If you need a network request to parse an XML file, how would the EPUB work in the proverbial train tunnel?

I think maintaining the restriction may be a good idea. However, there may be a set of well known, and standard DTD, whose parsing is probably unnecessary anyway, ie, that serve only some sort of identification. That should include the DTD-s for the content documents that are allowed in EPUB in the first place, ie, SVG, MathML, or SMIL (I did not check whether all of them depend on entities).

@rdeltour
Copy link
Member

Is this also a matter of making EPUBs self-contained? If you need a network request to parse an XML file, how would the EPUB work in the proverbial train tunnel?

In EPUBCheck, we use a URL resolver and local copies of DTDs and other references. I guess a similar approach would be taken by a reading system to prevent unnecessary network requests.

As for security implications, they're real indeed, but again this is the kind of things that are better dealt with at the software level than at the spec level, IMO.
The one benefit of having such restrictions in the spec is that it becomes harder for malicious content to enter curated distribution channels which run EPUBCheck as a gatekeeper, but it doesn't mean RS wouldn't have safeguards anyways (for non-curated content).

For instance, a few years ago, EPUBCheck was reported a vulnerability to XML external entity attack (CVE-2016-9487), which we fixed in version 4.0.2. The spec didn't prevent this happening.

@murata2makoto
Copy link
Contributor

murata2makoto commented Apr 29, 2020

This is a very old and annoying issue to me. The original XML people tried to solve it, but it still bites us.

Allowing external DTD subsets might open a can of worms, but we can probably avoid them.

First, the standard DTD for SVG 1.1 does not define parsed entities (unless people customize the DTD). Thus, XML processors that do not examine the external DTD subset have no problems in parsing SVG documents.

Second, this is not true for MathML. A number of parsed entities (such as &harr; for U+2194 (LEFT RIGHT ARROW)) are defined or borrowed by the MathML DTD. But the restriction in EPUB 3.2 prohibits the use of such parsed entities in EPUB, unless people define parsed entities in the internal DTD subset So, we can safely assume that people use Unicode characters rather than parsed entities of MathML. Is this true? > @TzviyaSiegman

Here is my proposal.

  1. Lift the current restriction for SVG and MathML.
  2. Prohibits the use of references to entities with the exception of those parsed entities (such as &lt;) defined in the XML specification.

P.S. Some XML geeks might use the internal DTD subset for defining parsed entities. We might want to require a declaration in the internal DTD subset for each of the parsed entities used in the XML document (unless they are defined in the XML recommendation.)

@iherman
Copy link
Member Author

iherman commented Apr 29, 2020

@murata2makoto

So, we can safely assume that people do not use parsed entities of MathML and that they use Unicode characters rather than such entities. Is this true?

I would certainly hope so... MathML 3, which is the latest recommendation of MathML, does not require the use of DTD-s (just as SVG1.2 will also do away with a required DTD). As you said, the specific mathematical characters can be taken care of by their Unicode equivalents these days.

(What I do not know, however, is what various converters, like LaTeX->MathML, do. I would think that most of the authors provide their equations in LaTeX, which is then converted into MathML. Some of those tools may be older and rely on DTD-s, just like the SVG files generated by Adobe Illustrator...)

@iherman
Copy link
Member Author

iherman commented Apr 29, 2020

Lift the current restriction for SVG and MathML.

SMIL also uses DTD-s with entities. Do we allow explicit SMIL files in an EPUB?

As far as I can see SSML does not require DTD-s :-)

@murata2makoto
Copy link
Contributor

SMIL also uses DTD-s with entities. Do we allow explicit SMIL files in an EPUB?

Media Overlay uses SMIL documents. But MO defines a very small subset of SMIL, which does not contain entities.

@iherman
Copy link
Member Author

iherman commented Apr 29, 2020

SMIL also uses DTD-s with entities. Do we allow explicit SMIL files in an EPUB?

Media Overlay uses SMIL documents. But MO defines a very small subset of SMIL, which does not contain entities.

But the DTD does... Ie, even if a small subset is used but the specific SMIL file includes the DTD then we have a problem. It is exactly the same problem as with mine with SVG which started this thread :-(

@rdeltour
Copy link
Member

@murata2makoto

Here is my proposal.

  1. Lift the current restriction for SVG and MathML.
  2. Prohibits the use of references to entities with the exception of those parsed entities (such as <) defined in the XML specification.

We could also adopt an approach similar to the HTML Standard’s XML parsing rules:

  1. define a set of well-known public identifiers, and specify exactly what URL they correspond to (for all the relevant specs we could see if it makes sense to use the one specified in the HTML spec, containing the list of well-known named character references).
  2. say that user agents should not attempt to retrieve external entity's content.

@murata2makoto
Copy link
Contributor

@iherman

It is exactly the same problem as with mine with SVG which started this thread :-(

But this problem is theoretical, since nobody would like to create SMIL documents containing document type declarations. I also think that references to parsed entities are not needed in MO documents.

@rdeltour

define a set of well-known public identifiers

Makes sense.

We could also adopt an approach similar to the HTML Standard’s XML parsing rules:

Are you proposing to allow references to entities defined or borrowed by the MathML DTD? If we allow them, XML processors that do not retrieve external DTD subsets cannot handle them.

say that user agents should not attempt to retrieve external entity's content

I also think that XML processors that retrieve external DTD subsets are not very useful for EPUB. But I am not sure if we have to say "should not attempt to retrieve" them, since somebody might want to validate SVG or MathML using DTDs.

@murata2makoto
Copy link
Contributor

Should this issue be addressed by publishing errata to EPUB 3.2 or by the next version of EPUB 3 (by the upcoming WG)?

@iherman
Copy link
Member Author

iherman commented May 1, 2020

Should this issue be addressed by publishing errata to EPUB 3.2 or by the next version of EPUB 3 (by the upcoming WG)?

If there is an erratum for EPUB 3.2, it ought to be taken up by the upcoming WG anyway. I do not think there is an urgency...

@llemeurfr
Copy link

llemeurfr commented Oct 10, 2020

I've discovered this old issue yesterday and may be missing something, but:

for instance for excluding this:

<!DOCTYPE html [ 
<!ENTITY new-release SYSTEM "https://bookseller.com/new-release.html"> 
]> 
<html> 
<body>
&news-release; 
</body>
</html> 
  • in the SVG case presented in this thread, the DTD is referenced in a standard manner; the content document does not present any external entity. The fact that the DTD itself presents external entities (a manner of modularizing DTDs) does not break the EPUB XML conformance constraint IMO.

  • Note that if the XML parser does not validate the SVG structure (and EPUB conformant processors don't need to validate EPUB content), the parser will not even encounter these external entities (nested in the DTD). Only EPUBcheck validates EPUB (+ SVG + MathML) content, for the sake of guarantying a better interoperability.

In conclusion, I don't see the point of this issue.

@iherman
Copy link
Member Author

iherman commented Oct 10, 2020

  • The fact that the DTD itself presents external entities (a manner of modularizing DTDs) does not break the EPUB XML conformance constraint IMO.

This statement is, however, a bit ambiguous:

External identifiers MUST NOT appear in the document type declaration [XML].

Two ways of reading this is that the

  1. The SVG (ie, XML) Document Type Declaration MUST NOT contain external entities. I guess this is the current interpretation of epubcheck, yielding error messages. Or
  2. The SVG (ie, XML) source MUST NOT contain external entities, like your example above.

At the minimum, this ambiguity must be clarified in the spec if indeed only the second item is meant. That should then probably be followed up in the EPUB Issue.

In conclusion, I don't see the point of this issue.

This is the issue on epubcheck and not on the spec. There is a practical problem, as described in the original description of this issue. This issue is very much relevant until epubcheck is not changed to accept bona fide SVG files as valid.

If your interpretation is correct, then maybe the EPUB Issue can be closed although I think that, at the minimum, a clarification in the spec is necessary.


(We should stick to one issue to discuss all this, it is not helpful to switch between two different repositories...)

@murata2makoto
Copy link
Contributor

@iherman

You are not the only person confused by the definition of DTDs.

A DTD is a pair of an external DTD subset and an internal DTD subset. Laurent's example document contains a DTD, which has an internal DTD subset only.

No external entities can occur in XML documents unless they are defined by DTDs.

@iherman
Copy link
Member Author

iherman commented Oct 10, 2020

@iherman

You are not the only person confused by the definition of DTDs.

A DTD is a pair of an external DTD subset and an internal DTD subset. Laurent's example document contains a DTD, which has an internal DTD subset only.

No external entities can occur in XML documents unless they are defined by DTDs.

@murata2makoto I certainly yield to your XML knowledge. But then

  1. I believe a note or explanation in the EPUB spec is warranted to avoid such misunderstandings because implementations may go wrong (this is something for @mattgarrish I presume)
  2. Epubcheck must be updated to avoid this problem

@murata2makoto
Copy link
Contributor

@iherman

I agree with your first point. Do not understand your second.

@iherman
Copy link
Member Author

iherman commented Oct 10, 2020

@iherman

I agree with your first point. Do not understand your second.

If your interpretation of the EPUB Spec text is correct, then there is a bug in epubckeck. It should not flag an SVG DTD as an invalid content; it does today.

@llemeurfr
Copy link

We could solve the issue by looking at it from another point of view: XML external entities can either be used by an XML instance (as in my previous sample) or by the DTD itself, as in the modularized SVG 1.1 DTD.

The first case is harmful, the second is not (the reason being that reading systems don't validate XML instances before display).

We solve the issue by re-phrasing the constraint "External identifiers MUST NOT appear in the document type declaration" as (the context being the Publication Resource) "It MUST NOT make use of XML external entities defined in a document type declaration".

EPUBCheck will have to be modified to accept the presence of external entities in DTD and block only if such entites are used in XML instances.

@iherman
Copy link
Member Author

iherman commented Oct 12, 2020

(Admin comment)

@llemeurfr linking this to the ongoing discussion on w3c/epub-specs#1338, in particular w3c/epub-specs#1338 (comment)...

We should really continue this discussion on one thread only...

@rdeltour rdeltour added status: blocked Blocked by another issue or situation and removed status: in discussion The issue is being discussed by the development team labels Oct 14, 2020
@rdeltour
Copy link
Member

Just a quick clarification, as I read the discussion of the past few days:

The EPUB 3.2 XML Conformance states that for any XML publication resource:

External identifiers MUST NOT appear in the document type declaration.

This unambiguously says that the SVG doctype declaration given by @iherman as an example is disallowed, which EPUBCheck reports (correctly, for now).

I don't see what's to be "interpreted" here, or what's the bug in EPUBCheck. Am I missing something?

Note that I'm not saying the spec is good there. The spec can be made less restrictive (I personally believe it should), but then again, that's for the WG to decide. I'm marking that issue as blocked, as it depends on updating the spec.

@iherman
Copy link
Member Author

iherman commented Oct 14, 2020

This unambiguously says that the SVG doctype declaration given by @iherman as an example is disallowed, which EPUBCheck reports (correctly, for now).

Well, looking at #1114 (comment), #1114 (comment), #1114 (comment) it seems that things are not that clear.

@rdeltour
Copy link
Member

rdeltour commented Oct 14, 2020

@iherman

it seems that things are not that clear.

OK, so let's put this differently 😊:

  • I assume we all agree that <!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd"> is a document type declaration, correct?
  • in this declaration, PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd" is an external identifier , composed of a public identifier ("-//W3C//DTD SVG 1.1//EN") and a system identifier ("http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd"). Together, the public identifier and system identifier point to the external subset of the DTD.
  • then, the doctype is invalid in EPUB, as an external identifier literally appears in the document type declaration [edited for clarification].

Is it the second bullet above that you find ambiguous? 🤔

All I'm saying is that the EPUB conformance criteria doesn’t even goes into looking at the DTD content. It only says, "the doctype must not have a public identifier nor a system identifier".

Regarding the comments you reference:

Well, looking at #1114 (comment), #1114 (comment), #1114 (comment) it seems that things are not that clear.

  • @llemeurfr’s comment is about the spec intention ("I understand (…) as way to ensure that the content document is self-contained"), it doesn’t really say the spec as it is today is confusing.
  • @murata2makoto’ comment says that DTDs are generally confusing, again not the EPUB spec statement itself.
  • finally, your comment is where the confusion is, if I understand correctly.

You say (slightly reorganizing your comment):

Two ways of reading "External identifiers MUST NOT appear in the document type declaration" are

  • the Document Type Declaration MUST NOT contain external entities
  • The XML content MUST NOT contain external entities

The first one is true, as a corollary to the original statement. But really, the original statement suffice in itself to reject the SVG 1.1 doctype, with no need to a special "way of reading" or "interpretation" 😊

@iherman
Copy link
Member Author

iherman commented Oct 15, 2020

@rdeltour o.k., I agree that the issue here can be closed. I got a bit messed up by some of the reactions on this issue, to be honest. My real issue is w3c/epub-specs#1338, i.e., that the current restrictions in the EPUB spec is (in my view) unreasonably stringent.

@iherman
Copy link
Member Author

iherman commented Oct 24, 2020

This issue was discussed in a meeting of the EPUB WG.

  • RESOLVED: remove bullet “External identifiers MUST NOT appear in the document type declaration [XML].” Add reading system conformance saying that reading systems SHOULD NOT validate against DTDs.
View the transcript Problem with XML content requirements and external entities: make MathML and SVG1.1 invalid
Dave Cramer: w3c/epub-specs#1338
Dave Cramer: #1114
Dave Cramer: External entities.
Dave Cramer: https://www.w3.org/publishing/epub32/epub-spec.html#confreq-xml-extmarkupdecl
Dave Cramer: Lots of XML statements in specs:
External identifiers MUST NOT appear in the document type declaration [XML].
Dave Cramer: This is a problem.
… Breaks normal things — like inline SVG with DOCTYPE.
… I think that’s bad.
… Forcing folks to edit machine created content seems like a waste of effort.
Brady Duga: Agree with the summary and history
Dave Cramer: EPUB RS’s are not validating SVG’s with DOCTYPE of against that.
… Should not expect an RS to run an validator.
… Hope RS’s don’t need to be validating XML parsers.
… Maybe just say “ignore DOCTYPE’s” — don’t use content that depending on processing DTD features
Brady Duga: Stunned if anybody other than epubcheck is doing said validation.
Wendy Reid: +1 we do this too
Dave Cramer: Assuming WV’s doing to such validation.
Brady Duga: What is this XML of which you speak?
… RS’s likely display HTML, not XML, content.
… Natively not XHTML in RS’s.
Wendy Reid: +1
Dave Cramer: Principle in HTML that we should try to match reality — maybe we should do the same in here?
Zheng Xu (徐征): +1 to not restrict
Dave Cramer: Resolve tonight restricting DTDs in XML documents that are part of an epub.
Garth Conboy: I think we should try it!
Wendy Reid: In line with what Dave has say it — try it for now. It will come done to testing too.
… Can that bridge when we come to it.
Proposed resolution: remove bullet “External identifiers MUST NOT appear in the document type declaration [XML].” Add reading system conformance saying that reading systems SHOULD NOT validate against DTDs. (Dave Cramer)
Garth Conboy: +1
Wendy Reid: +1
Dave Cramer: RS’s have no validation requirement against DTDs that may be found.
Shinya Takami (高見真也): +1
Brady Duga: +1
Yu-Wei Chang (Yanni): +1
Ben Schroeter: +1
Wendy Reid: If you use inline SVG’s out of ID (and other commercial tools) you won’t get unexpected errors.
Matthew Chan: 0
Wendy Reid: Found W3C listing DTDs that should be used; seems strange to prohibit.
Laura Brady: +1
Wendy Reid: No objection; relax restriction on external identifiers in DTDs.
Resolution #2: remove bullet “External identifiers MUST NOT appear in the document type declaration [XML].” Add reading system conformance saying that reading systems SHOULD NOT validate against DTDs.

@rdeltour rdeltour added status: ready for implem The issue is ready to be implemented and removed status: blocked Blocked by another issue or situation labels Nov 13, 2021
@rdeltour rdeltour added this to the v5.0.0-beta-1 milestone Nov 13, 2021
@rdeltour rdeltour added status: in progress The issue is being implemented by the development team and removed status: ready for implem The issue is ready to be implemented labels Jan 23, 2022
rdeltour added a commit that referenced this issue Jul 8, 2022
EPUB 3.3. now allows a reserved set of external identifiers in doctype
declarations of documents with select media types.

See: https://www.w3.org/TR/epub-33/#app-identifiers-allowed

This commit:
- adds those as special cases to the XML parser code
- totally removes entity fetching for EPUB 3.3
- keeps forbidding external entities in the internal subset

Fix #1192, Fix #1114
@rdeltour rdeltour added status: completed Work completed, can be closed and removed status: in progress The issue is being implemented by the development team labels Sep 22, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
spec: EPUB 3.x Impacting the support of EPUB 3.x specifications status: completed Work completed, can be closed type: spec The issue is related to a Specification update
Projects
None yet
Development

No branches or pull requests

6 participants