Remove byte-order mark from JSON stream and first CSV chunk #53

mint-thompson · 2024-08-11T15:28:11Z

Problem

When validating JSON with a readable stream, a byte-order mark (BOM) at the beginning of the stream causes a parser error.

When validating CSV, an initial BOM becomes part of the first column name. If that column name is also quoted, the parser does not recognize it as quoted, so the first column fails to match any of the expected column names.
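Both failure modes can be reproduced in isolation. The sketch below is only an illustration, not the validator's code: it uses the built-in `JSON.parse` as a stand-in for the streaming JSON parser, and `looksQuoted` is a hypothetical, simplified quoted-field check. The `hospital_name` value is an example column name chosen for the sketch.

```typescript
const BOM = "\uFEFF";

// JSON: U+FEFF is not JSON whitespace, so a leading BOM makes parsing fail.
// JSON.parse stands in here for the streaming parser used by the validator.
function parsesAsJson(text: string): boolean {
  try {
    JSON.parse(text);
    return true;
  } catch {
    return false;
  }
}

// CSV: hypothetical quoted-field check that expects the opening quote
// to be the very first character of the field.
function looksQuoted(field: string): boolean {
  return field.length >= 2 && field.startsWith('"') && field.endsWith('"');
}

const jsonWithBom = BOM + '{"hospital_name":"Example"}';
const csvHeaderWithBom = BOM + '"hospital_name"';

console.log(parsesAsJson(jsonWithBom)); // BOM still attached
console.log(parsesAsJson(jsonWithBom.slice(1))); // BOM removed
console.log(looksQuoted(csvHeaderWithBom)); // BOM hides the opening quote
console.log(looksQuoted(csvHeaderWithBom.slice(1))); // BOM removed
```

With the BOM attached, the first character of the header is U+FEFF rather than `"`, which is why a check of this kind never sees the opening quote.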

Solution

When reading the first chunk of a JSON stream, check for the presence of a byte-order mark. If it is present, remove it.

When reading the first chunk of CSV, check for the presence of a byte-order mark. If it is present, remove it.
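A minimal sketch of the first-chunk check, assuming chunks arrive either as decoded strings or as raw UTF-8 bytes. The helper names (`stripBomFromString`, `stripBomFromBytes`, `makeFirstChunkBomStripper`) are hypothetical and are not taken from this PR's diff.

```typescript
const BOM_CHAR = "\uFEFF";
const UTF8_BOM_BYTES = [0xef, 0xbb, 0xbf];

// Remove a leading BOM from an already-decoded string chunk.
function stripBomFromString(chunk: string): string {
  return chunk.startsWith(BOM_CHAR) ? chunk.slice(1) : chunk;
}

// Remove a leading UTF-8 BOM (EF BB BF) from a raw byte chunk.
// subarray returns a view over the same buffer, so nothing is copied.
function stripBomFromBytes(chunk: Uint8Array): Uint8Array {
  const hasBom =
    chunk.length >= UTF8_BOM_BYTES.length &&
    UTF8_BOM_BYTES.every((byte, i) => chunk[i] === byte);
  return hasBom ? chunk.subarray(UTF8_BOM_BYTES.length) : chunk;
}

// Build a per-stream handler that checks only the first chunk it sees
// and passes every later chunk through untouched.
function makeFirstChunkBomStripper(): (chunk: string) => string {
  let isFirstChunk = true;
  return (chunk: string): string => {
    if (!isFirstChunk) return chunk;
    isFirstChunk = false;
    return stripBomFromString(chunk);
  };
}

// Usage: one stripper per stream, applied to each chunk in order.
const strip = makeFirstChunkBomStripper();
const cleaned = ['\uFEFF{"a":', "1}"].map(strip).join("");
console.log(JSON.parse(cleaned));
```

Limiting the check to the first chunk keeps the cost constant per stream and avoids touching a U+FEFF character that appears later in the data.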

Test Plan

A test case that uses a JSON file encoded as UTF-8 with a BOM is added to test this change.

A test case that uses a CSV file encoded as UTF-8 with a BOM is added to test this change. The first column name in the file is quoted.

EDIT: Updated 2024/08/16 with information about the CSV changes (984cb6d)

When parsing a CSV from a source that includes a byte-order mark (BOM), the BOM is present at the time the parser attempts to determine if the first column name is quoted. When the BOM is present, the parser does not recognize that the column is quoted, resulting in a failure to match an expected column name. Remove the BOM (if present) from the first chunk so that a quoted column name will be recognized and parsed as a quoted value.

shaselton-usds

Thanks! Looks good!

mint-thompson added 2 commits August 11, 2024 11:22

Remove BOM from JSON stream

629886a

When JSON input is a readable stream, a byte-order mark may be present at the beginning of the stream. When reading the first chunk of the stream, check for the presence of the byte-order mark. Remove it if it is present.

mint-thompson changed the title ~~Remove byte-order mark from JSON stream~~ Remove byte-order mark from JSON stream and first CSV chunk Aug 16, 2024

shaselton added 2 commits August 16, 2024 12:33

minor refactor to DRY up BOM

4a531a9

always forgetting to pretty fix

f2ff8b3

shaselton-usds approved these changes Aug 16, 2024

eoverly approved these changes Aug 19, 2024

shaselton-usds merged commit d20f463 into main Aug 20, 2024
4 checks passed

daniel-eckel mentioned this pull request Sep 5, 2024

Results between web validator and CLI validator do not match CMSgov/hpt-validator-cli#16

Closed
