JSON parsing fails with "lone leading surrogate in hex escape" while the standard json.loads doesn't #120

Open
lindycoder opened this issue Jul 5, 2024 · 2 comments

@lindycoder

Hello,

During our migration to pydantic 2, we found a JSON document that pydantic 1 could load but pydantic 2 can't. It fails with this error:

Invalid JSON: lone leading surrogate in hex escape at line...

Here's a minimal reproduction:

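import json

from pydantic_core import from_json

# Raw bytes literal: the payload contains the literal JSON escapes \udce2 etc.,
# and the r prefix avoids Python's "invalid escape sequence" warning.
data = rb'{"test": "text\udce2\udc80\udc99text"}'

print(json.loads(data))  # stdlib parser accepts the lone surrogates
print(from_json(data))   # pydantic_core raises ValueError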

The first print, using Python's json module, works:

{'test': 'text\udce2\udc80\udc99text'}

The second one, using pydantic_core (which pydantic 2 uses), raises:

Traceback (most recent call last):
  File "check.py", line 7, in <module>
    print(from_json(data))
          ^^^^^^^^^^^^^^^
ValueError: lone leading surrogate in hex escape at line 1 column 20

Here are the versions involved:

Python 3.12.2
pydantic 2.8.2
pydantic-core 2.20.1

Thank you!

@samuelcolvin
Member

Moving this to jiter.

We need to check what serde-json does.

@samuelcolvin samuelcolvin transferred this issue from pydantic/pydantic-core Jul 5, 2024
@davidhewitt
Collaborator

Serde fails with the same error message:

https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=97bd7df54428c3e668c287b59565cd67

Part of the problem is that a Python str is allowed to contain invalid Unicode sequences (see e.g. PEP 383 and the 'surrogateescape' handler), which lets it carry (encoded) arbitrary byte payloads. Encoding these strings to UTF-8 (and any other UTF-8 operation on them) will fail.

Rust String data, on the other hand, strictly requires valid UTF-8.
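
To illustrate the PEP 383 point, here is a small stdlib-only sketch. It assumes the string in the report came from bytes decoded with the 'surrogateescape' handler, which the thread does not confirm; the variable names are just for illustration.

```python
# The value json.loads returns in the report: three lone surrogate code points.
s = "text\udce2\udc80\udc99text"

# Strict UTF-8 encoding rejects lone surrogates, which is why a parser that
# must produce valid UTF-8 cannot represent this value.
try:
    s.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# Under PEP 383, each code point U+DC80..U+DCFF stands for one raw byte
# 0x80..0xFF, so the 'surrogateescape' handler can restore the original bytes.
raw = s.encode("utf-8", errors="surrogateescape")

# ASSUMPTION: the restored bytes are meant to be UTF-8 text.
recovered = raw.decode("utf-8")

print(strict_ok)   # False
print(raw)         # b'text\xe2\x80\x99text'
print(recovered)   # text’text
```

If that assumption holds, the three escapes are the UTF-8 bytes E2 80 99 (U+2019, a right single quotation mark) smuggled through a str, and the document may have been produced by a step that decoded bytes with 'surrogateescape' and then serialized the result as JSON.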
