Davidc/refactor schema #269
Thank you. Some minor comments.
src/parsing-1.html
Outdated
<h3 id="parsing_wellformed">Validating MathML</h3>

<p>Presented here are Relax NG schema for MathML.
A Relax NG Schema is most naturally associated with the XML serialization
Can you give a definition for naturally associated, I was unable to find one?
@physikerwelt Relax NG is only defined for XML, so "naturally associated with XML" is perhaps a euphemism for "should only be used with XML". However, I know that validator.nu-based HTML validators have used Relax NG schema internally, and any document format may be used so long as the parser provides a more or less XML-compatible view of the parse tree; many HTML parsers can provide an XML-compatible DOM. But I didn't want to say too much. I'll try to re-word, but perhaps not tonight.
OK. While this information is helpful for developers of HTML validation services, I think it is more confusing for anybody else. Maybe we can try to improve the new W3C validator and link to it. Thus, one could write:
The Relax NG schema may be used to check the XML serialization of MathML and serves as a foundation for other standards that embed MathML, such as HTML. Validators can be found at https://w3c.github.io/developers/tools/.
That wording reads well, I'll try to adjust ...
src/parsing-1.html
Outdated
allows a fixed attribute, <code>data-extra</code>,so input should be
normalized to remove data attributes before validating, or the schema
should be extended to support th attributes used in a particular
application.</p>
I think this should be rewritten in imperative style. Maybe:
Before validating MathML with the Relax NG schema, the following transformations have to be applied.
1.)
2.)...
An example program is available from Appendix x
@physikerwelt I can try to make this clearer but I was trying to avoid giving a specific imperative normalisation.
The only thing that you could specify that "always worked" would be to delete data-foo attributes before validation. However, my main use of Relax NG is to drive Emacs context-sensitive editing, so working on the live real document, and pre-processing by removing attributes really isn't an option. In practice, if I was (as in fact I am) working on a document format making use of data-long and data-short attributes, I would not pre-process to normalise the input; I'd use a modified local schema that allowed (just) these two data- attributes. The case with onclick= and friends is similar: I could specify a normalisation to lowercase so they validated, but that isn't useful in editors doing live validation of the edited file. Again, I'll leave this comment open and see if I (or anyone :-) can suggest better wording.
Incidentally, I think this issue means that we should not make the schema normative as it was in MathML3. I don't see how a schema that you have to use "with care" in this way, either extending the schema or modifying the document, can really be normative.
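For readers who do validate in a batch pipeline rather than a live editor, the normalisation mentioned above (dropping data- attributes and lowercasing onclick-style attribute names) might look roughly like the following. This is only a sketch using Python's standard library on an XML serialization; the function name `normalize_for_validation` is hypothetical and nothing here is prescribed by the specification or the schema.

```python
import xml.etree.ElementTree as ET

MATHML_NS = "http://www.w3.org/1998/Math/MathML"


def normalize_for_validation(xml_text: str) -> str:
    """Return a copy of xml_text with data-* attributes removed and
    event-handler attribute names (on...) lowercased.

    Hypothetical helper: an illustration of one possible pre-processing
    step, not a normalisation defined by the MathML specification.
    """
    # Serialize MathML elements with a default namespace instead of ns0: prefixes.
    ET.register_namespace("", MATHML_NS)
    root = ET.fromstring(xml_text)
    for element in root.iter():
        # Iterate over a copy of the names because the dict is modified below.
        for name in list(element.attrib):
            if name.startswith("data-"):
                del element.attrib[name]
            elif name.lower().startswith("on") and name != name.lower():
                element.attrib[name.lower()] = element.attrib.pop(name)
    return ET.tostring(root, encoding="unicode")
```

The result would then be passed to whatever Relax NG validator is in use. As noted in the comment, this approach does not help when the document being validated is the live edit buffer.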
You can still delete it if you copy the stream and do the bookkeeping for the positions of the copied stream without MathML elements and the rest. In the same way, one could validate the CSS part of the style attribute without confusing CSS validators with MathML. We used this method extensively in plagiarism detection. However, I think it is generalizable, as discussed in https://www.gipp.com/wp-content/papercite-data/pdf/beck2021.pdf
However, I think the primary focus of the document is to clearly define what is valid MathML. A discussion of how it can be used might be better placed at a website, a forum or another resource that can be changed without the effort of creating a new revision of the standard.
I am uncertain whether the schema should be normative. Overall, the "code is law" paradigm has some attraction, but maybe it creates more problems than it resolves.
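The copy-and-bookkeeping idea can be illustrated with a small sketch: remove data- attributes from a copy of the text while recording, for every character of the copy, its position in the original. The regular expression and the name `strip_with_offsets` are assumptions made for illustration; a real tool would work from a parser's token positions rather than a naive pattern, which can also match inside text content.

```python
import re

# Hypothetical pattern: a whitespace-prefixed data-* attribute with a quoted value.
DATA_ATTR = re.compile(r'\s+data-[\w.-]+\s*=\s*("[^"]*"|\'[^\']*\')')


def strip_with_offsets(text):
    """Return (cleaned, offsets), where offsets[i] is the index in text
    of the character cleaned[i]."""
    pieces = []
    offsets = []
    pos = 0
    for match in DATA_ATTR.finditer(text):
        # Keep everything between the previous match and this one.
        pieces.append(text[pos:match.start()])
        offsets.extend(range(pos, match.start()))
        pos = match.end()
    # Keep the tail after the last match.
    pieces.append(text[pos:])
    offsets.extend(range(pos, len(text)))
    return "".join(pieces), offsets
```

A validator error reported at index `i` of the cleaned copy can then be mapped back to `offsets[i]` in the original document.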
@physikerwelt Yes, I agree with what you write here. There are certainly ways to use this in most pipelines, but the spec isn't really the place to describe the processing. In practice, though, I am not going to go into the editor code and change its processing pipeline to remove data-foo attributes while bookkeeping the visible edit buffer. That would be a lot of work and a maintenance nightmare, compared to adding 2 lines to a local copy of the schema to add the specific data-foo attributes that I want. If I am not prepared to use the schema as written and would use a locally modified version, then I'm wary of specifying that that is what people should do.
Previous versions of the spec could live in an XML world and just assert that the schema was normative and any non-valid document was by definition not MathML and out of scope for the specification. But in an HTML world that isn't really an option. For the HTML parser (and MathML Core), being valid or not is just a state that is not visible to most users, and the processing is defined for all possible input.
Most MathML will (probably) live in HTML+MathML documents and there is no longer (as far as I know) an official HTML schema, just various schemas that approximate the HTML parsing algorithm to various extents. So (a) most MathML in the wild may end up being invalid, and (b) any validation tools will need to use some custom code to validate HTML+MathML that may or may not make use of the schema presented here. But it's still not clear to me how much one can or should say in the spec. I have no particular attachment to the current wording though; I'll try again later to re-word and address some of your points.
Mh, I get the feeling that we are trying to solve the same problem that for example SVG has as well. In the abstract, the SVG spec says
SVG content is stylable, scalable to different display resolutions, and can be viewed stand-alone, mixed with HTML content, or embedded using XML namespaces within other XML languages.
However, I could not find a RelaxNG schema there. From a first glance, I get the impression that they use IDL instead.
@davidcarlisle Are there major drawbacks in making the Relax NG schema non-normative and replacing it with IDL?
I don't think IDL is feasible at all. It's essentially a programming API, so for MathML Core in the specific context of a web browser that makes sense, but we don't have a DOM API for MathML Full at all. We tried to specify one (and had IDL for it) in MathML2 (see https://www.w3.org/TR/MathML2/appendixd.html#dom.interfaces), but MathML Full isn't a purely web platform language and not all (or even most) implementations are DOM-based, so it was dropped. Adding one now is certainly out of scope for this iteration of the spec. @physikerwelt
I'm guessing, but I think the W3C Validator, and/or epubcheck need a schema to validate, and indeed use one of the RelaxNG schemas when it comes to SVG. Random links: nu html checker here, epubcheck here.
They seem to validate against SVG 1.1.
The most recent schema version I could find was the RNG for SVG 1.2 here.
LaTeXML's schema is also maintained in RelaxNG, for what that's worth.
Okay. The MathML 2 example does not look bad. Anyhow, I see that one cannot be translated to the other fully automatically. I wonder if we could specify a validation rule for the intent attribute (with either language).
I was wondering about intent. A Relax NG (or XSD) schema gives you a regex pattern, so you could do a light validation of intent but not actually check the full grammar.
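To make the distinction concrete, here is a rough sketch of what "light" pattern validation can and cannot catch. The character set in `INTENT_LIGHT` and the sample strings in the note below are hypothetical and are not taken from the MathML 4 intent grammar; the point is only that a regex can restrict the characters used, while nesting has to be checked by separate code.

```python
import re

# Hypothetical character-level pattern; NOT the MathML 4 intent grammar.
INTENT_LIGHT = re.compile(r"[A-Za-z0-9_.$:,()\s-]+")


def passes_pattern(intent: str) -> bool:
    """Light check of the kind a schema pattern facet could express."""
    return INTENT_LIGHT.fullmatch(intent) is not None


def parens_balanced(intent: str) -> bool:
    """Structural check that a plain regex pattern cannot express."""
    depth = 0
    for ch in intent:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0
```

With these definitions, a string such as `power($base,$exp` (missing its closing parenthesis) passes the pattern check but fails the balance check, which is the gap between schema-level validation and checking the full grammar.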
@@ -1,66 +1,24 @@

<section>
<h3 id="parsing_usingdtd">Using the MathML DTD</h3>
I would remove the paragraph. Maybe it would be good enough to link to a program that can be used to generate a DTD from RelaxNG
I cut it down a lot from how it was in MathML3; I'll cut it down a bit more in the following check-in.
@AndreG-P and I were trying to use DTD based MathML validation in one project, but it was a dead-end road. We restarted with stylesheet-based validation (https:/ag-gipp/MathMLTools), which worked, but was too slow for use on large datasets. Eventually, I gave up the idea of validating everything and tried to deal with less structured material.
I think it would be better to have one good way to fully validate MathML, rather than several approaches that do only some validation.
@physikerwelt @dginev Hi, there was some duplicated text in the original version of this PR, with some sections I had intended to move appearing twice (in slightly modified form). I have deleted those now, so the text is much reduced. I think the wording will need to change later, especially as the group looks at specifying MathML conformance. However, as I mentioned on the Group call just now, I'd like to merge this if there are no objections, so that the actual referenced schema are at least closer to MathML4 plans and not simply a copy of MathML3. I don't want to merge, though, while there are open comments here, so I am asking if you would click approve in review to authorise a merge. If you don't feel you want to do that yet and would rather keep this open for a bit longer, that is OK too, but I thought I'd ask...
I just left a comment on a minor detail, did not really intend to do a full review myself. So don't consider me blocking anything - thanks for starting the ball rolling!
This Pull Request restructures appendix A (Parsing), removing most of the "helpful advice on how to use schema" and restricting it to a basic presentation of the schema.
The Schema themselves are directly extracted from the github source (which is currently under mathml-refresh) at
https:/mathml-refresh/mathml-schema
Only the Relax NG schema is updated so far (and it is the only one shown in this MathML Specification). The other formats can be mechanically generated when the Relax NG schema is finalised.
The schema are lightly syntax highlighted using a custom JavaScript highlighter specified inline in the ReSpec configuration (the default highlight.js does not have a Relax NG module).
The Relax NG schema has been completely refactored to be based on a schema that (should) match MathML Core.
The history of changes to the schema is:
https:/mathml-refresh/mathml-schema/commits/main
Some movement towards removing long-deprecated forms has been made in this update, but more could be done.
Also, while the schema are being obtained from the mathml-schema GitHub repository, the documentation still states that they are available at the original Math area at W3C, e.g. http://www.w3.org/Math/RelaxNG/mathml4/mathml4.rnc (the files have not yet been copied there). The Group will need to decide whether to copy such auxiliary files there or to document the GitHub source repository (and whether that repository should be under w3c or mathml-refresh). However, such decisions can be taken later; they are not critical to this PR.