Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hard-code UTF-8 encoding for the input file #56

Merged
merged 1 commit into from
Jun 21, 2022

Conversation

olivierphi
Copy link

@olivierphi olivierphi commented Jun 14, 2022

This is a first draft to fix the encoding errors described in #55 - i.e. we're on Windows and the input file is encoded in UTF-8.

The role of PYTHONIOENCODING, as described in the issue

The bug report mentions that even when the PYTHONIOENCODING env var is set, the preferred encoding determined by Python on Windows is still a Windows-specific encoding.
However, it seems that this is actually pretty much the expected behaviour, as the behaviour of this env var seems to only affect stdin/stdout/stderr?
PYTHONIOENCODING-not-changing-preferred-encoding

Potential breaking changes brought by this PR

Of course, assuming that the input file is always encoded in UTF-8, like we do with this PR, could break some existing usages of Rich-CLI. Especially on Windows, where UTF-8 is still not the default encoding if I'm not wrong?
Not sure what would be the safest way to handle that issue? 🤔

Potential ways to handler that better

  • Maybe we could try to use the file using the system's default encoding first, and then, only if that failed, fall back to UTF-8?
    e.g. something like this: (pseudo-code)

    for encoding in (None, "utf8"):
       try:
         with open(path, "rt", encoding=encoding) as resource_file:
       except EncodingError:
         continue
  • As pointed out by @darrenburns , there is now the possibility to use a PYTHONUTF8 env var, which seems to work:
    using-PYTHONUTF8

  • Add a flag and/or an env var specific to Rich-CLI to let the user tell Rich-CLI which encoding we should use to open the input file?
    Maybe that would be the most flexible option, combined to a "before raising an exception, fall back to UTF-8 if the default encoding didn't work" strategy? What do you think @willmcgugan @darrenburns ? 🙂

Before / After

Before this fix:

before-utf8-fix

After this fix:

after-utf8-fix cmd

fixes #55

@olivierphi olivierphi force-pushed the force-utf8-encoding-for-input-file branch from 1c03cbc to ae1b084 Compare June 14, 2022 11:20
README.md Show resolved Hide resolved
@@ -102,7 +102,7 @@ def read_resource(path: str, lexer: Optional[str]) -> Tuple[str, Optional[str]]:
if path == "-":
return (sys.stdin.read(), None)

with open(path, "rt") as resource_file:
with open(path, "rt", encoding="utf8") as resource_file:
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It fixes the described issue, but... Is doing this really a good idea? I'm not sure, to be honest 😅

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The annoying truth is that you can't be sure what the encoding will be. Can you also add errors="replace" in the off change there are utf-8 encoding errors.

At some point we might use chardet to give us extra confidence.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIL: 🎓 the Python open function can indeed accept a errors arg that specifies how encoding and decoding errors are to be handled - which is really handy! 🙂
Done in 53cb0e9

@olivierphi olivierphi marked this pull request as ready for review June 14, 2022 11:24
Copy link
Collaborator

@willmcgugan willmcgugan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One small request

@olivierphi olivierphi force-pushed the force-utf8-encoding-for-input-file branch from ae1b084 to 53cb0e9 Compare June 21, 2022 10:30
@willmcgugan willmcgugan merged commit 0721cb1 into main Jun 21, 2022
@willmcgugan willmcgugan deleted the force-utf8-encoding-for-input-file branch June 21, 2022 10:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants