Davidc/refactor schema #269

davidcarlisle · 2022-02-27T15:34:23Z

This Pull Request restructures appendix A (Parsing), removing most of the "helpful advice on how to use schema" and restricting it to a basic presentation of the schema.

The Schema themselves are directly extracted from the github source (which is currently under mathml-refresh) at

https:/mathml-refresh/mathml-schema

Only the Relax NG schema is updated so far (and is the only one shown in this MathML Specification). The other formats can be mechanically generated when the Relax NG schema is finalised.

The schema are lightly syntax highlighted using a custom JavaScript highlighter specified inline in the ReSpec configuration (the default highlight.js does not have a Relax NG module).

The Relax NG schema has been completely refactored to be based on a schema that (should) match MathML Core.

The history of changes to the schema is:

https:/mathml-refresh/mathml-schema/commits/main

Some movement towards removing long deprecated forms has been taken in this update but more could be done.

Also, while the schema are being obtained from the mathml-schema GitHub repository, the documentation still states that they are available at the original Math area at W3C, e.g. http://www.w3.org/Math/RelaxNG/mathml4/mathml4.rnc. They have not been copied there; the Group will need to decide whether to copy such auxiliary files there or to document the GitHub source repository (and whether that repository should be under w3c or mathml-refresh). However, such decisions can be taken later; they are not critical to this PR.


physikerwelt

Thank you. Some minor comments.

index.html

physikerwelt · 2022-02-27T18:54:47Z

src/parsing-1.html

+ <h3 id="parsing_wellformed">Validating MathML</h3>
+
+ <p>Presented here are Relax NG schema for MathML.
+ A Relax NG Schema is most naturally associated with the XML serialization


Can you give a definition for naturally associated, I was unable to find one?

@physikerwelt Relax NG is only defined for XML, so "naturally associated with XML" is perhaps a euphemism for "should only be used with XML". However, I know that validator.nu based HTML validators have used Relax NG schema internally, and any document format may be used so long as the parser provides a more or less XML-compatible view of the parse tree; many HTML parsers can provide an XML-compatible DOM. But I didn't want to say too much. I'll try to re-word, but perhaps not tonight.

Ok. While this information is helpful for developers of HTML validation services, I think it is more confusing for anybody else. Maybe, we can try to improve the new W3C validator and link to it. Thus, one could write:
The Relax NG schema may be used to check the XML serialization of MathML and serves as a foundation for other standards that embed MathML, such as HTML. Validators can be found at https://w3c.github.io/developers/tools/.

That wording reads well, I'll try to adjust ...

src/parsing-1.html

physikerwelt · 2022-02-27T19:01:17Z

src/parsing-1.html

+ allows a fixed attribute, <code>data-extra</code>,so input should be
+ normalized to remove data attributes before validating, or the schema
+ should be extended to support th attributes used in a particular
+ application.</p>


I think this should be rewritten in imperative style. Maybe:
Before validating MathML with the Relax NG schema, the following transformations have to be applied.
1.)
2.)...
An example program is available from Appendix x

@physikerwelt I can try to make this clearer but I was trying to avoid giving a specific imperative normalisation.

The only thing that you could specify that "always worked" would be to delete data-foo attributes before validation. However, my main use of Relax NG is to drive Emacs context-sensitive editing, so working on the live real document, and pre-processing by removing attributes really isn't an option. In practice, if I was (as in fact I am) working on a document format making use of data-long and data-short attributes, I would not pre-process to normalise the input; I'd use a modified local schema that allowed (just) these two data- attributes. The case with onclick= and friends is similar: I could specify a normalisation to lowercase so they validated, but that isn't useful in editors doing live validation of the edited file. Again, I'll leave this comment open and see if I (or anyone :-) can suggest better wording.
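For readers who do want the batch pre-processing route mentioned above, a minimal sketch of "delete data-foo attributes before validation" might look like the following. It uses only Python's standard library; the function name `strip_data_attributes` and its `keep` parameter are hypothetical, introduced for illustration, and nothing here is part of the schema or the spec.

```python
import xml.etree.ElementTree as ET


def strip_data_attributes(root, keep=()):
    """Remove data-* attributes in place, except names listed in `keep`.

    Hypothetical helper: `keep` stands in for whatever data- attributes
    a locally extended schema has been modified to allow.
    """
    for element in root.iter():
        # Copy the attribute names first, because we delete while looping.
        for name in list(element.attrib):
            if name.startswith("data-") and name not in keep:
                del element.attrib[name]
    return root


source = (
    '<math xmlns="http://www.w3.org/1998/Math/MathML" data-long="x squared">'
    '<msup data-short="x2"><mi>x</mi><mn>2</mn></msup>'
    "</math>"
)

root = strip_data_attributes(ET.fromstring(source))
# Serialized copy that could be handed to a Relax NG validator.
normalized = ET.tostring(root, encoding="unicode")
```

This only suits batch pipelines: it works on a copy of the document, which is exactly why it does not help an editor validating the live buffer.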

Incidentally I think this issue means that we should not make the schema normative as it was in mathml3, I don't see how a schema that you have to use "with care" in this way, either extending the schema or modifying the document can really be normative.

You can still delete it if you copy the stream and do the bookkeeping for the positions of the copied stream without MathML elements and the rest. In the same way, one could do validation for the CSS part of the style attribute without confusing CSS validators with MathML. We used this method extensively in plagiarism detection. However, I think it is generalizable, as discussed in https://www.gipp.com/wp-content/papercite-data/pdf/beck2021.pdf

However, I think the primary focus of the document is to clearly define what is valid MathML. A discussion of how it can be used might be better placed at a website, a forum or another resource that can be changed without the effort of creating a new revision of the standard.

I am uncertain whether the schema should be normative. Overall, the "code is law" paradigm has some attraction, but maybe it creates more problems than it solves.

@physikerwelt Yes, I agree with what you write here: there are certainly ways to use this in most pipelines, but the spec isn't really the place to describe the processing. In practice, though, I am not going to go into the editor code and change its processing pipeline to remove data-foo attributes while bookkeeping the visible edit buffer. That would be a lot of work and a maintenance nightmare compared to adding 2 lines to a local copy of the schema to add the specific data-foo attributes that I want. If I am not prepared to use the schema as written and would use a locally modified version, then I'm wary of specifying that that is what people should do.

Previous versions of the spec could live in an XML world and just assert that the schema was normative and that any non-valid document was by definition not MathML and out of scope for the specification. But in an HTML world that isn't really an option. For the HTML parser (and mathml-core), being valid or not is just a state that is not visible to most users, and the processing is defined for all possible input.

Most MathML will (probably) live in HTML+MathML documents, and there is no longer (as far as I know) an official HTML schema, just various schemas that approximate the HTML parsing algorithm to various extents. So (a) most MathML in the wild may end up being invalid, and (b) any validation tools will need to use some custom code to validate HTML+MathML that may or may not make use of the schema presented here. But it's still not clear to me how much one can or should say in the spec. I have no particular attachment to the current wording though; I'll try again later to re-word and address some of your points.

Mh, I get the feeling that we are trying to solve the same problem that for example SVG has as well. In the abstract, the SVG spec says

SVG content is stylable, scalable to different display resolutions, and can be viewed stand-alone, mixed with HTML content, or embedded using XML namespaces within other XML languages.

However, I could not find a RelaxNG schema there. From a first glance, I get the impression that they use IDL instead.

@davidcarlisle are there major drawbacks in making Relax NG non-normative and replacing it with IDL?

I don't think IDL is feasible at all. It's essentially a programming API, so for MathML Core in the specific context of a web browser that makes sense, but we don't have a DOM API for MathML Full at all. We tried to specify one (and had IDL for it) in MathML2, see https://www.w3.org/TR/MathML2/appendixd.html#dom.interfaces. But MathML Full isn't a purely web platform language, and not all (or even most) implementations are DOM based, so it was dropped. Adding one now is certainly out of scope for this iteration of the spec. @physikerwelt

I'm guessing, but I think the W3C Validator, and/or epubcheck need a schema to validate, and indeed use one of the RelaxNG schemas when it comes to SVG. Random links: nu html checker here, epubcheck here.

They seem to validate against SVG 1.1.
The most recent schema version I could find was the RNG for SVG 1.2 here.

LaTeXML's schema is also maintained in RelaxNG, for what that's worth.

Okay. The MathML 2 example does not look bad. Anyhow, I see that it cannot be translated from one to the other fully automatically. I wonder if we could specify a validation rule for the intent attribute (with either language).

I was wondering about intent, relax (and xsd) schema gives you a regex pattern so you could do a light validation of intent but not actually check the full grammar.
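To illustrate what "light validation" by pattern can and cannot catch, here is a small Python sketch. The pattern is purely hypothetical: it is a character-level filter invented for this example, not the MathML intent grammar, and `looks_like_intent` is an illustrative name. `fullmatch` is used because schema pattern facets are matched against the whole attribute value.

```python
import re

# Hypothetical character-level filter (an assumption for illustration,
# NOT the real intent grammar): letters, digits, "_", "$", ".", ",",
# parentheses, hyphens and whitespace.
LIGHT_INTENT = re.compile(r"[-A-Za-z0-9_$.,()\s]+")


def looks_like_intent(value):
    """Return True if every character of `value` is in the allowed set."""
    return LIGHT_INTENT.fullmatch(value) is not None


well_formed = looks_like_intent("plus($a,$b)")   # accepted
unbalanced = looks_like_intent("plus($a,$b")     # also accepted
bad_chars = looks_like_intent("plus<$a>")        # rejected
```

The unbalanced value passes because a pattern of this kind filters characters but cannot track nesting, which is the gap between a regex check and checking the full grammar.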

physikerwelt · 2022-02-27T19:04:30Z

src/parsing-2.html

@@ -1,66 +1,24 @@
- 
+
 <section>
 <h3 id="parsing_usingdtd">Using the MathML DTD</h3>


I would remove the paragraph. Maybe it would be good enough to link to a program that can be used to generate a DTD from RelaxNG

I cut it down a lot from how it was in mathml3; I'll cut it down a bit more in the following check-in.

@AndreG-P and I were trying to use DTD based MathML validation in one project, but it was a dead-end road. We restarted with stylesheet-based validation (https:/ag-gipp/MathMLTools), which worked, but was too slow for use on large datasets. Eventually, I gave up the idea of validating everything and tried to deal with less structured material.

I think it would be better to have one good way to fully validate MathML, rather than several approaches that do only some validation.


davidcarlisle · 2022-02-28T20:23:20Z

@physikerwelt @dginev Hi, there was some duplicated text in the original version of this PR: some sections I had intended to move appeared twice (in slightly modified form). I have deleted those now, so the text is much reduced.

I think the wording will need to change later, especially as the group looks at specifying MathML conformance. However, as I mentioned on the Group call just now, I'd like to merge this if there are no objections, so that the actual referenced schema are at least closer to MathML4 plans and not simply a copy of MathML3. I don't want to merge, though, while there are open comments here. So I am asking if you would click approve in review to authorise a merge. If you don't feel you want to do that yet and would rather keep this open for a bit longer, that is OK too, but I thought I'd ask...

dginev

I just left a comment on a minor detail, did not really intend to do a full review myself. So don't consider me blocking anything - thanks for starting the ball rolling!

davidcarlisle added 2 commits February 27, 2022 00:11

add custom highligher to respec schema inclusion

dbdb85b

update and simplify text describing the schema, format schema directl…

fe8b55b

…y from github

physikerwelt reviewed Feb 27, 2022

View reviewed changes

davidcarlisle added 4 commits February 27, 2022 21:27

edits after review in PR #269

4bb74ce

delete duplicate/obsolete sections

38f5a10

delete mathml3 -mathml3 2nd ed changelog (references now deleted sect…

ca6c4a0

…ions)

remove 'naturally associated with' wording as suggested in PR #269

c6899b8

dginev self-requested a review February 28, 2022 20:26

dginev approved these changes Feb 28, 2022

View reviewed changes

physikerwelt approved these changes Feb 28, 2022

View reviewed changes

davidcarlisle merged commit 66b3b9c into gh-pages Feb 28, 2022

davidcarlisle deleted the davidc/refactor-schema branch February 28, 2022 22:17
