Davidc/refactor schema #269

Merged
merged 6 commits into from
Feb 28, 2022

Conversation

davidcarlisle
Collaborator

This Pull Request restructures appendix A (Parsing), removing most of the "helpful advice on how to use schema" and restricting it to a basic presentation of the schema.

The schemas themselves are extracted directly from the GitHub source (currently under mathml-refresh) at

https:/mathml-refresh/mathml-schema

Only the Relax NG schema is updated so far (and it is the only one shown in this MathML Specification). The other formats can be mechanically generated when the Relax NG schema is finalised.

The schemas are lightly syntax highlighted using a custom JavaScript highlighter specified inline in the ReSpec configuration (the default highlight.js does not have a Relax NG module).

The Relax NG schema has been completely refactored to be based on a schema that (should) match MathML Core.

The history of changes to the schema is:

https:/mathml-refresh/mathml-schema/commits/main

Some movement towards removing long-deprecated forms has been made in this update, but more could be done.

Also, while the schemas are being obtained from the mathml-schema GitHub repository, the documentation still states they are available at the original Math area at W3C, e.g. http://www.w3.org/Math/RelaxNG/mathml4/mathml4.rnc. The files have not been copied there; the Group will need to decide whether to copy such auxiliary files there or to document the GitHub source repository (and whether that repository should be under w3c or mathml-refresh). However, such decisions can be taken later; they are not critical to this PR.

Member

@physikerwelt physikerwelt left a comment

Thank you. Some minor comments.

index.html (outdated)
<h3 id="parsing_wellformed">Validating MathML</h3>

<p>Presented here are Relax NG schema for MathML.
A Relax NG Schema is most naturally associated with the XML serialization
Member

Can you give a definition for "naturally associated"? I was unable to find one.

Collaborator Author

@physikerwelt Relax NG is only defined for XML, so "naturally associated with XML" is perhaps a euphemism for "should only be used with XML". However, I know that validator.nu-based HTML validators have used Relax NG schemas internally, and any document format may be used so long as the parser provides a more or less XML-compatible view of the parse tree; many HTML parsers can provide an XML-compatible DOM. But I didn't want to say too much. I'll try to re-word, but perhaps not tonight.

Member

OK. While this information is helpful for developers of HTML validation services, I think it is more confusing for anybody else. Maybe we can try to improve the new W3C validator and link to it. Thus, one could write:
The Relax NG schema may be used to check the XML serialization of MathML and serves as a foundation for other standards that embed MathML, such as HTML. Validators can be found at https://w3c.github.io/developers/tools/.

Collaborator Author

That wording reads well, I'll try to adjust ...

src/parsing-1.html (outdated)
allows a fixed attribute, <code>data-extra</code>, so input should be
normalized to remove data attributes before validating, or the schema
should be extended to support the attributes used in a particular
application.</p>
Member

I think this should be rewritten in imperative style. Maybe:
Before validating MathML with the Relax NG schema, the following transformations have to be applied.
1.)
2.)...
An example program is available from Appendix x

Collaborator Author

@physikerwelt I can try to make this clearer but I was trying to avoid giving a specific imperative normalisation.

The only thing that you could specify that "always worked" would be to delete data-foo attributes before validation. However, my main use of Relax NG is to drive Emacs context-sensitive editing, so working on the live, real document; pre-processing by removing attributes really isn't an option. In practice, if I was (as in fact I am) working on a document format making use of data-long and data-short attributes, I would not pre-process to normalise the input; I'd use a modified local schema that allowed (just) these two data- attributes. The case with onclick= and friends is similar: I could specify a normalisation to lowercase so they validated, but that isn't useful in editors doing live validation of the edited file. Again, I'll leave this comment open and see if I (or anyone :-) can suggest better wording.
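For readers who do have a batch pipeline rather than a live editor, the "delete data-foo attributes before validation" option could look roughly like the sketch below. It is only an illustration, not part of the PR: the function name `strip_data_attributes` and the sample markup are hypothetical, and the validation step itself is left out because the Python standard library has no Relax NG validator.

```python
import xml.etree.ElementTree as ET


def strip_data_attributes(root):
    """Remove every data-* attribute from root and its descendants, in place.

    Returns the number of attributes removed.
    """
    removed = 0
    for element in root.iter():
        # Collect the names first so the dict is not mutated while iterating.
        for name in [n for n in element.attrib if n.startswith("data-")]:
            del element.attrib[name]
            removed += 1
    return removed


# Hypothetical input using the data-long / data-short attributes mentioned above.
source = (
    '<math xmlns="http://www.w3.org/1998/Math/MathML" data-long="x squared">'
    '<msup data-short="x2"><mi>x</mi><mn>2</mn></msup>'
    "</math>"
)
root = ET.fromstring(source)
removed_count = strip_data_attributes(root)
```

The normalised tree would then be handed to whatever Relax NG validator the pipeline uses. This changes the document being validated, which is exactly why it does not suit live editing.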

Incidentally I think this issue means that we should not make the schema normative as it was in mathml3, I don't see how a schema that you have to use "with care" in this way, either extending the schema or modifying the document can really be normative.

Member

You can still delete it, if you copy the stream and do the bookkeeping for the positions of the copied stream without MathML elements and the rest. In the same way, one could do validation for the CSS part of the style attribute without confusing CSS validators with MathML. We used this method extensively in plagiarism detection. However, I think it is generalizable, as discussed in https://www.gipp.com/wp-content/papercite-data/pdf/beck2021.pdf
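The copy-and-bookkeeping idea can be sketched as follows. This is an assumed, minimal illustration rather than the method used in the cited work: the pattern `DATA_ATTR` and the function `strip_with_offsets` are hypothetical names, and a regular expression over raw markup is a simplification (it would also match attribute-like text inside comments or character data).

```python
import re

# Hypothetical pattern: a data-* attribute with a double-quoted value,
# together with the whitespace that precedes it.
DATA_ATTR = re.compile(r'\s+data-[\w-]+="[^"]*"')


def strip_with_offsets(text):
    """Return (stripped, offsets), where offsets[i] is the index in `text`
    of the character that ended up at stripped[i]."""
    pieces = []
    offsets = []
    position = 0
    for match in DATA_ATTR.finditer(text):
        # Keep everything up to the attribute and remember where it came from.
        pieces.append(text[position:match.start()])
        offsets.extend(range(position, match.start()))
        position = match.end()
    pieces.append(text[position:])
    offsets.extend(range(position, len(text)))
    return "".join(pieces), offsets


original = '<mi data-long="ex">x</mi>'
stripped, offsets = strip_with_offsets(original)
```

A validator error reported at index `i` of the stripped copy can then be shown to the user at `offsets[i]` in the original text.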

However, I think the primary focus of the document is to clearly define what is valid MathML. A discussion of how it can be used might be better placed at a website, a forum or another resource that can be changed without the effort of creating a new revision of the standard.

I am uncertain whether the schema should be normative. Overall, the "code is law" paradigm has some attraction, but maybe it creates more problems than it resolves.

Collaborator Author

@physikerwelt Yes, I agree with what you write here: there are certainly ways to use this in most pipelines, but the spec isn't really the place to describe the processing. In practice, though, I am not going to go into the editor code and change its processing pipeline to remove data-foo attributes while bookkeeping the visible edit buffer. That would be a lot of work and a maintenance nightmare, compared to adding 2 lines to a local copy of the schema to add the specific data-foo attributes that I want. If I am not prepared to use the schema as written and would use a locally modified version, then I'm wary of specifying that that is what people should do.

Previous versions of the spec could live in an XML world and just assert that the schema was normative and any non-valid document was by definition not MathML and out of scope for the specification. But in an HTML world that isn't really an option. For the HTML parser (and MathML Core), being valid or not is just a state that is not visible to most users, and the processing is defined for all possible input.

Most MathML will (probably) live in HTML+MathML documents, and there is no longer (as far as I know) an official HTML schema, just various schemas that approximate the HTML parsing algorithm to various extents. So (a) most MathML in the wild may end up being invalid, and (b) any validation tools will need to use some custom code to validate HTML+MathML that may or may not make use of the schema presented here. But it's still not clear to me how much one can or should say in the spec. I have no particular attachment to the current wording though; I'll try again later to re-word and address some of your points.

Member

Mh, I get the feeling that we are trying to solve the same problem that for example SVG has as well. In the abstract, the SVG spec says

SVG content is stylable, scalable to different display resolutions, and can be viewed stand-alone, mixed with HTML content, or embedded using XML namespaces within other XML languages.

However, I could not find a RelaxNG schema there. From a first glance, I get the impression that they use IDL instead.

Member

@davidcarlisle are there major drawbacks in making Relax NG non-normative and replacing it with IDL?

Collaborator Author

I don't think IDL is feasible at all. It's essentially a programming API, so for MathML Core in the specific context of a web browser that makes sense, but we don't have a DOM API for MathML Full at all. We tried to specify one (and had IDL for it) in MathML2, see https://www.w3.org/TR/MathML2/appendixd.html#dom.interfaces. But MathML Full isn't a purely web platform language, and not all (or even most) implementations are DOM based, so it was dropped. Adding one now is certainly out of scope for this iteration of the spec. @physikerwelt

Contributor

I'm guessing, but I think the W3C Validator, and/or epubcheck need a schema to validate, and indeed use one of the RelaxNG schemas when it comes to SVG. Random links: nu html checker here, epubcheck here.

They seem to validate against SVG 1.1.
The most recent schema version I could find was the RNG for SVG 1.2 here.

LaTeXML's schema is also maintained in RelaxNG, for what that's worth.

Member

Okay. The MathML 2 example does not look bad. Anyhow, I see that it cannot be translated from one to the other fully automatically. I wonder if we could specify a validation rule for the intent attribute (with either language).

Collaborator Author

I was wondering about intent, relax (and xsd) schema gives you a regex pattern so you could do a light validation of intent but not actually check the full grammar.
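To make the limits of a pattern-only check concrete, here is a small sketch. The pattern below is a hypothetical character-level filter chosen for illustration; it is not the intent grammar and is not taken from the schema. It shows that a regex can reject obviously bad characters but still accepts a value with unbalanced parentheses, which a real grammar check would catch.

```python
import re

# Hypothetical "light" pattern, NOT the actual intent grammar:
# letters, digits and a handful of punctuation characters.
INTENT_LIGHT = re.compile(r"[A-Za-z0-9_$().,:@\- ]+")


def passes_light_check(value):
    """True if every character of value is allowed by the light pattern."""
    return INTENT_LIGHT.fullmatch(value) is not None


def parentheses_balanced(value):
    """A structural check that a single regular expression of this kind cannot express."""
    depth = 0
    for ch in value:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


# Illustrative values only.
well_formed = "power($base,$exp)"
unbalanced = "power($base,$exp"
bad_characters = "power<$base>"
```

Here `passes_light_check(unbalanced)` is true even though `parentheses_balanced(unbalanced)` is false, which is the gap between a schema-level pattern and the full grammar.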

@@ -1,66 +1,24 @@

<section>
<h3 id="parsing_usingdtd">Using the MathML DTD</h3>
Member

I would remove the paragraph. Maybe it would be good enough to link to a program that can be used to generate a DTD from Relax NG.

Collaborator Author

I cut it down a lot from how it was in MathML3; I'll cut it down a bit more in the following check-in.

Member

@AndreG-P and I were trying to use DTD-based MathML validation in one project, but it was a dead end. We restarted with stylesheet-based validation (https:/ag-gipp/MathMLTools), which worked, but was too slow for use on large datasets. Eventually, I gave up the idea of validating everything and tried to deal with less structured material.

I think it would be better to have one good way to fully validate MathML, rather than several approaches that do only some validation.

@davidcarlisle
Collaborator Author

@physikerwelt @dginev Hi, there was some duplicated text in the original version of this PR, with some sections I had intended to move appearing twice (in slightly modified form). I have deleted those now, so the text is much reduced.

I think the wording will need to change later, especially as the group looks at specifying MathML conformance. However, as I mentioned on the Group call just now, I'd like to merge this if there are no objections, so that the actual referenced schemas are at least closer to MathML4 plans and not simply a copy of MathML3. I don't want to merge, though, while there are open comments here. So I'm asking if you would click approve in review to authorise a merge. If you don't feel you want to do that yet and would rather keep this open for a bit longer, that is OK too, but I thought I'd ask...

@dginev dginev self-requested a review February 28, 2022 20:26
Contributor

@dginev dginev left a comment

I just left a comment on a minor detail, did not really intend to do a full review myself. So don't consider me blocking anything - thanks for starting the ball rolling!

@davidcarlisle davidcarlisle merged commit 66b3b9c into gh-pages Feb 28, 2022
@davidcarlisle davidcarlisle deleted the davidc/refactor-schema branch February 28, 2022 22:17