Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Continuation of the ASCII reader PR #136

Merged
merged 26 commits into from
Jun 30, 2024

Conversation

steven-joruk
Copy link
Contributor

@steven-joruk steven-joruk commented Mar 21, 2024

This continues from #44

Some comments brought over from there:

  1. The serde tests highlighted that documents that don't begin with exactly <?xml, with no preceding whitespace, will be treated as ascii, which might not be desirable.
  2. The master branch is broken due to denying warnings and a deprecation, I switched to using swap_remove.
  3. I had to allow escaping \ (\\) because the test that parses netnewswire.pbxproj fails without it.

The fuzzer quickly found an infinite loop in handling block comments. I let it run for another 10 hours, it tried 510 million inputs without finding anything else.

The related issue (#42) contains a suggestion that it should be renamed to OpenStepReader or similar. I don't know the full history of the format (wikipedia discusses it here). If I'm understanding it correctly then NextStep read integers as strings, OpenStep supported integers and real numbers, GNUStep supported NSValue and NSDate. This is missing support for floats.

@ebarnard
Copy link
Owner

Item 1 Is the biggest issue - a well-formed UTF8 XML document can start with a BOM which we must support, and ideally we would also support XML plists that have whitespace before the leading < character.

Can the first character of a reasonable ASCII plist file be a <?

@steven-joruk
Copy link
Contributor Author

Item 1 Is the biggest issue - a well-formed UTF8 XML document can start with a BOM which we must support, and ideally we would also support XML plists that have whitespace before the leading < character.

I agree, I've pushed a fix. If there's any unicode byte order mark or if the first non-whitespace string is "<?xml" then it's considered XML.

@steven-joruk steven-joruk force-pushed the ascii-reader branch 2 times, most recently from 795be0c to b13654e Compare March 24, 2024 12:59
Copy link
Owner

@ebarnard ebarnard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good. The only thing to sort out is some testing around what gets picked up as XML vs ASCII as this is the only thing that could break existing code.

}

fn is_xml(reader: &mut R) -> Result<bool, Error> {
const UTF32_BE_BOM: [u8; 4] = [0, 0, 0xfe, 0xff];
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think everything in this crate makes the assumption that the document is ASCII/UTF8. Can these other BOMs appear in valid UTF8 documents?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Certainly nul bytes would be an unusual thing to see in a UTF8 document.

Copy link
Contributor Author

@steven-joruk steven-joruk Mar 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is_xml would be accurately be described as should_use_xml_reader.

I've included all BOMs, even those the XML reader doesn't support, because the ASCII reader doesn't support documents with BOMs and the XML reader might gain support for other encodings, I'm deferring to its BOM handling.

We could add support for GNUstep plists at a later date, which supports UTF-8, without an API break.

src/stream/mod.rs Outdated Show resolved Hide resolved
src/stream/ascii_reader.rs Outdated Show resolved Hide resolved
src/stream/mod.rs Outdated Show resolved Hide resolved
@@ -263,12 +341,20 @@ impl<R: Read + Seek> Iterator for Reader<R> {
let mut reader = match self.0 {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This auto-detection code really needs some tests around what gets detected as XML vs ASCII, just so its codified somewhere.

Copy link
Contributor Author

@steven-joruk steven-joruk Mar 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added tests in xml_detection and some updated detection logic to allow for the variety of acceptable ways to start an XML plist.

fstephany and others added 23 commits June 30, 2024 20:51
An unquoted string literal can now start by any character that is not a reserved one.
We now keep the peeked character when we advance in the reader.
Use read_exact() on the Reader instead of read(). 
The setup code in `new()` has been replaced by a check in `advance()`.
i.e., be aware of \" in a string.
The document began with a newline, which caused it to be identified as
an ascii plist.
Also fix an infinite loop discovered with the fuzzer.
@ebarnard ebarnard merged commit 45d9a58 into ebarnard:master Jun 30, 2024
7 checks passed
@ebarnard ebarnard mentioned this pull request Jun 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants