-
Notifications
You must be signed in to change notification settings - Fork 42
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Continuation of the ASCII reader PR #136
Conversation
Item 1 Is the biggest issue - a well-formed UTF8 XML document can start with a BOM which we must support, and ideally we would also support XML plists that have whitespace before the leading Can the first character of a reasonable ASCII plist file be a |
c690a2e
to
d50c6c8
Compare
I agree, I've pushed a fix. If there's any unicode byte order mark or if the first non-whitespace string is "<?xml" then it's considered XML. |
795be0c
to
b13654e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good. The only thing to sort out is some testing around what gets picked up as XML vs ASCII as this is the only thing that could break existing code.
src/stream/mod.rs
Outdated
} | ||
|
||
fn is_xml(reader: &mut R) -> Result<bool, Error> { | ||
const UTF32_BE_BOM: [u8; 4] = [0, 0, 0xfe, 0xff]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think everything in this crate makes the assumption that the document is ASCII/UTF8. Can these other BOMs appear in valid UTF8 documents?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Certainly nul
bytes would be an unusual thing to see in a UTF8 document.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is_xml
would be accurately be described as should_use_xml_reader
.
I've included all BOMs, even those the XML reader doesn't support, because the ASCII reader doesn't support documents with BOMs and the XML reader might gain support for other encodings, I'm deferring to its BOM handling.
We could add support for GNUstep plists at a later date, which supports UTF-8, without an API break.
src/stream/mod.rs
Outdated
@@ -263,12 +341,20 @@ impl<R: Read + Seek> Iterator for Reader<R> { | |||
let mut reader = match self.0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This auto-detection code really needs some tests around what gets detected as XML vs ASCII, just so its codified somewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added tests in xml_detection
and some updated detection logic to allow for the variety of acceptable ways to start an XML plist.
e866ab3
to
7ffe433
Compare
An unquoted string literal can now start by any character that is not a reserved one.
We now keep the peeked character when we advance in the reader.
Use read_exact() on the Reader instead of read(). The setup code in `new()` has been replaced by a check in `advance()`.
i.e., be aware of \" in a string.
The document began with a newline, which caused it to be identified as an ascii plist.
Also fix an infinite loop discovered with the fuzzer.
This continues from #44
Some comments brought over from there:
The serde tests highlighted that documents that don't begin with exactly<?xml
, with no preceding whitespace, will be treated as ascii, which might not be desirable.The master branch is broken due to denying warnings and a deprecation, I switched to usingswap_remove
.\
(\\
) because the test that parses netnewswire.pbxproj fails without it.The fuzzer quickly found an infinite loop in handling block comments. I let it run for another 10 hours, it tried 510 million inputs without finding anything else.
The related issue (#42) contains a suggestion that it should be renamed to
OpenStepReader
or similar. I don't know the full history of the format (wikipedia discusses it here). If I'm understanding it correctly then NextStep read integers as strings, OpenStep supported integers and real numbers, GNUStep supported NSValue and NSDate. This is missing support for floats.