Continuation of the ASCII reader PR #136

steven-joruk · 2024-03-21T02:27:00Z

This continues from #44

Some comments brought over from there:

~~The serde tests highlighted that documents that don't begin with exactly <?xml, with no preceding whitespace, will be treated as ascii, which might not be desirable.~~
~~The master branch is broken due to denying warnings and a deprecation, I switched to using swap_remove.~~
I had to allow escaping \ (\\) because the test that parses netnewswire.pbxproj fails without it.
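A minimal sketch of what allowing `\\` inside a quoted string involves. The function name `unescape` and the set of escapes it handles are assumptions for illustration, not the code in this PR:

```rust
/// Hypothetical helper: resolves backslash escapes in the body of a quoted
/// ASCII plist string (the text between the surrounding quotes).
/// Only `\\`, `\"`, `\n` and `\t` are handled in this sketch.
fn unescape(input: &str) -> Result<String, String> {
    let mut out = String::with_capacity(input.len());
    let mut chars = input.chars();
    while let Some(c) = chars.next() {
        if c != '\\' {
            out.push(c);
            continue;
        }
        // A backslash must be followed by a character we know how to map.
        match chars.next() {
            Some('\\') => out.push('\\'),
            Some('"') => out.push('"'),
            Some('n') => out.push('\n'),
            Some('t') => out.push('\t'),
            Some(other) => return Err(format!("unsupported escape: \\{other}")),
            None => return Err("dangling backslash at end of string".to_string()),
        }
    }
    Ok(out)
}
```

Without the `\\` arm, a literal backslash in a file such as a `.pbxproj` would be rejected as an unsupported escape.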

The fuzzer quickly found an infinite loop in the handling of block comments. I let it run for another 10 hours; it tried 510 million inputs without finding anything else.
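The thread does not say what the faulty loop looked like. As a hedged illustration only, a block-comment skipper that cannot loop forever might look like this; `skip_block_comment` is a hypothetical name and it works on a byte slice rather than the PR's reader type:

```rust
/// Hypothetical sketch: `start` is assumed to point just past the opening `/*`.
/// Returns the index just past the closing `*/`, or an error if the input
/// ends before the comment is closed.
fn skip_block_comment(bytes: &[u8], start: usize) -> Result<usize, &'static str> {
    let mut i = start;
    while i < bytes.len() {
        if bytes[i] == b'*' && bytes.get(i + 1) == Some(&b'/') {
            return Ok(i + 2);
        }
        // The index advances on every iteration, so the loop always terminates.
        i += 1;
    }
    // Reaching the end of input is an error instead of a reason to keep scanning.
    Err("unterminated block comment")
}
```

The point is that end of input is treated as a hard error, which is the kind of case a fuzzer tends to hit first.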

The related issue (#42) contains a suggestion that it should be renamed to OpenStepReader or similar. I don't know the full history of the format (Wikipedia discusses it). If I'm understanding it correctly, NextStep read integers as strings, OpenStep supported integers and real numbers, and GNUstep supported NSValue and NSDate. This reader is missing support for floats.

ebarnard · 2024-03-23T18:48:26Z

Item 1 is the biggest issue: a well-formed UTF-8 XML document can start with a BOM, which we must support, and ideally we would also support XML plists that have whitespace before the leading `<` character.

Can the first character of a reasonable ASCII plist file be a `<`?

steven-joruk · 2024-03-24T12:51:58Z

> Item 1 is the biggest issue: a well-formed UTF-8 XML document can start with a BOM, which we must support, and ideally we would also support XML plists that have whitespace before the leading `<` character.

I agree, and I've pushed a fix. If there's any Unicode byte order mark, or if the first non-whitespace string is `<?xml`, then the document is considered XML.
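The rule described in this comment can be sketched as follows. This is an illustration over an in-memory prefix of the document, with a hypothetical function name `looks_like_xml`. It is not the `is_xml` function from the diff, and the commit list shows the merged code later switched to detecting XML by trying to read it as XML:

```rust
/// Byte order marks that make this sketch defer to the XML reader.
const BOMS: [&[u8]; 5] = [
    &[0x00, 0x00, 0xFE, 0xFF], // UTF-32 BE
    &[0xFF, 0xFE, 0x00, 0x00], // UTF-32 LE
    &[0xEF, 0xBB, 0xBF],       // UTF-8
    &[0xFE, 0xFF],             // UTF-16 BE
    &[0xFF, 0xFE],             // UTF-16 LE
];

/// Hypothetical helper: `head` is assumed to hold the first bytes of the document.
fn looks_like_xml(head: &[u8]) -> bool {
    // Any BOM means "not an ASCII plist", so hand the document to the XML reader.
    if BOMS.iter().any(|&bom| head.starts_with(bom)) {
        return true;
    }
    // Otherwise skip leading ASCII whitespace and look for the XML declaration.
    let first = head
        .iter()
        .position(|b| !b.is_ascii_whitespace())
        .unwrap_or(head.len());
    head[first..].starts_with(b"<?xml")
}
```

Under this rule an ASCII plist whose first value is a data literal such as `<0fbd77>` is still treated as ASCII, because `<` alone is not enough to select the XML reader.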

ebarnard

This looks good. The only thing to sort out is some testing around what gets picked up as XML vs ASCII as this is the only thing that could break existing code.

ebarnard · 2024-03-25T17:47:46Z

src/stream/mod.rs

+ }
+
+ fn is_xml(reader: &mut R) -> Result<bool, Error> {
+ const UTF32_BE_BOM: [u8; 4] = [0, 0, 0xfe, 0xff];


I think everything in this crate makes the assumption that the document is ASCII/UTF8. Can these other BOMs appear in valid UTF8 documents?

Certainly nul bytes would be an unusual thing to see in a UTF8 document.

`is_xml` would more accurately be described as `should_use_xml_reader`.

I've included all BOMs, even those the XML reader doesn't support. The ASCII reader doesn't support documents with BOMs, and the XML reader might gain support for other encodings, so I'm deferring to its BOM handling.

We could add support for GNUstep plists, which support UTF-8, at a later date without an API break.

ebarnard · 2024-03-28T18:00:13Z

src/stream/mod.rs

@@ -263,12 +341,20 @@ impl<R: Read + Seek> Iterator for Reader<R> {
 let mut reader = match self.0 {


This auto-detection code really needs some tests around what gets detected as XML vs ASCII, just so it's codified somewhere.

Added tests in `xml_detection` and updated the detection logic to allow for the variety of acceptable ways to start an XML plist.

steven-joruk force-pushed the ascii-reader branch from c690a2e to d50c6c8 Compare March 24, 2024 12:51

steven-joruk force-pushed the ascii-reader branch 2 times, most recently from 795be0c to b13654e Compare March 24, 2024 12:59

ebarnard reviewed Mar 28, 2024

steven-joruk force-pushed the ascii-reader branch from e866ab3 to 7ffe433 Compare March 29, 2024 16:17

fstephany and others added 23 commits June 30, 2024 20:51

First step towards ASCII parser

1b5bcac

Cont'd

aa6dd27

Get rid of unecessray stuff and start to nail the pull nature of the scanner

0d03b93

Functional parser, first time the example test passes

dc8cfb7

Start to implement comments

8c1ef56

Handle comments

4e55247

Handling non-ASCII inputs

8144452

Track current position in asciireader

ae6da0a

Add real pbxproj file for testing

50d4007

An unquoted string literal can now start by any character that is not a reserved one.

Add Integer as first class type

2bfcee7

Remove dependency on Seek for AsciiReader

19cceff

We now keep the peeked character when we advance in the reader.

Bubble up io::Error when reading a char from the Reader

0ee3817

Use read_exact() on the Reader instead of read(). The setup code in `new()` has been replaced by a check in `advance()`.

Avoid double allocation when parsing unquoted string litteral

b570029

Handle escaped quote in quoted strings

15a2f00

i.e., be aware of \" in a string.

Update AsciiReader now that Event::String contains Cow<'_, str>

b315cd7

Update AsciiReader to use OwnedEvent

9a75f23

Add missing Path import for xml_reader tests

add86e3

Fix the dictionary_deserialize_dictionary_in_struct test

44e36f6

The document began with a newline, which caused it to be identified as an ascii plist.

Add support for escape sequences in ASCII plists

fe4e320

Address existing code review comments

fa527fa

Fuzz AsciiReader

919967a

Also fix an infinite loop discovered with the fuzzer.

Detect XML even if it has leading whitespace and/or a unicode byte order mark

e6d3bf6

Address code review

2488036

ebarnard added 3 commits June 30, 2024 21:59

Detect XML plists by trying to read as XML

23839c2

Add tests for plist type autodetection code

b5846b0

Add functions for reading a non-seekable ASCII plist

cdf0aa0

ebarnard force-pushed the ascii-reader branch from 7ffe433 to cdf0aa0 Compare June 30, 2024 21:26

ebarnard merged commit 45d9a58 into ebarnard:master Jun 30, 2024
7 checks passed

ebarnard mentioned this pull request Jun 30, 2024

Add Ascii Reader #44

Closed
