[BUG] Encoding still being overridden even after fix to #371. #377

Closed
2 tasks done
xuxoramos opened this issue Mar 7, 2024 · 5 comments · Fixed by #378

Comments

@xuxoramos

Summary

I am still having encoding issues even after the fix for #371. Passing the "latin-1", "cp1252" and "ISO-8859-1" encoding options to all three of Java, tabula-py and pandas still raises an error saying the UTF-8 codec cannot decode a byte.

Did you read the FAQ?

  • I have read the FAQ

Did you search GitHub Discussions?

  • I have searched the discussions

(Optional) PDF URL

https://edd.ca.gov/siteassets/files/jobs_and_training/pubs/wsd19-06att1.pdf

About your environment

Python version:
 3.11.7 | Packaged by Anaconda, Inc. | (main, Dec. 15 2023 18:05:47) [MSC v.1916 64 bit (AMD)]
Java version
 openjdk version "21.0.2" 2024-01-16
OpenJDK Runtime Environment (build 21.0.2+13-58)
OpenJDK 64-Bit Server VM (build 21.0.2+13-58, mixed mode, sharing)
tabula-py version: 2.9.0
platform: Windows-10-10.0.17763-SP0

What did you do when you faced the problem?

I looked at the source code and confirmed that the fix for #371 is present. I passed the "ISO-8859-1", "latin-1", "cp1252" and "windows-1252" encoding options to all three of Java, pandas and tabula-py, both separately and all together, as follows:

  • tabula.read_pdf("../path/to.pdf", java_options="-Dfile.encoding=ISO-8859-1", pandas_options={"encoding":"ISO-8859-1"}, encoding="ISO-8859-1")

Code

tabula.read_pdf("../path/to.pdf", java_options="-Dfile.encoding=ISO-8859-1", pandas_options={"encoding":"ISO-8859-1"}, encoding="ISO-8859-1")

Expected behavior

Obtain all tables in the PDF

Actual behavior

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position N: invalid start byte

Related issues

#371

@chezou
Owner

chezou commented Mar 7, 2024

@xuxoramos Thanks for reporting it.

Can you paste the actual code and full error message without trimming? I can't reproduce your error on my end.

Also, can you tell me how you installed tabula-py? Please share your pip freeze output. I'm wondering whether you installed it with the jpype option. #371 is a patch for jpype, and jpype is not installed by default. Note that with jpype you cannot change the encoding within a single Python process. To change it, you need to restart the Python process.
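One quick way to answer the jpype question on a machine without network access is to ask Python whether the module can be found. This is a minimal sketch using only the standard library; the helper name `jpype_available` is hypothetical and not part of tabula-py, and the check only tells you whether the `jpype` module is importable, not which backend tabula-py ends up selecting.

```python
import importlib.util


def jpype_available() -> bool:
    """Return True if a top-level module named 'jpype' can be located."""
    # find_spec returns None for a missing top-level module instead of raising.
    return importlib.util.find_spec("jpype") is not None


if __name__ == "__main__":
    if jpype_available():
        print("jpype is importable in this environment")
    else:
        print("jpype is not importable in this environment")
```

If this reports that jpype is not importable, the traceback should go through the subprocess code path, which is consistent with the "Fallback to subprocess" message tabula-py logs at import time.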

Here is my result: I tried to parse the PDF you provided. No error happens.

>>> import tabula
>>> tabula.read_pdf("tmp.pdf", java_options="-Dfile.encoding=ISO-8859-1", pandas_options={"encoding":"ISO-8859-1"}, encoding="ISO-8859-1", pages="all")
[                                             Activity
0              Activity Code Name and Definition Code
1   002 Self-Service AJCC Employment and Workforce...
2                                                 NaN
3   This activity is system generated when an indi...
4   workforce information available in CalJOBS. Wo...
5   as: local performance, availability of support...
6   compensation, and performance and program cost...
7                                                 NaN
...snip...

@xuxoramos
Author

xuxoramos commented Mar 8, 2024

This is the entire error output:

java_options is ignored until rebooting the Python process.
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
Cell In[14], line 1
----> 1 dfs = tb.read_pdf("../sourcedata/caljobs_activity_codes_dictionary.pdf", 
      2                   pages="all", 
      3                   encoding="windows-1252", 
      4                   pandas_options={"encoding":"windows-1252"},
      5                   java_options=["-Dfile.encoding=windows-1252"]
      6                  )

File C:\ProgramData\anaconda3\Lib\site-packages\tabula\io.py:395, in read_pdf(input_path, output_format, encoding, java_options, pandas_options, multiple_tables, user_agent, use_raw_url, pages, guess, area, relative_area, lattice, stream, password, silent, columns, relative_columns, format, batch, output_path, force_subprocess, options)
    392     raise ValueError(f"{path} is empty. Check the file, or download it manually.")
    394 try:
--> 395     output = _run(
    396         tabula_options,
    397         java_options,
    398         path,
    399         encoding=encoding,
    400         force_subprocess=force_subprocess,
    401     )
    402 finally:
    403     if temporary:

File C:\ProgramData\anaconda3\Lib\site-packages\tabula\io.py:82, in _run(options, java_options, path, encoding, force_subprocess)
     79 elif set(java_options) - IGNORED_JAVA_OPTIONS:
     80     logger.warning("java_options is ignored until rebooting the Python process.")
---> 82 return _tabula_vm.call_tabula_java(options, path)

File C:\ProgramData\anaconda3\Lib\site-packages\tabula\backend.py:117, in SubprocessTabula.call_tabula_java(self, options, path)
    115     if result.stderr:
    116         logger.warning(f"Got stderr: {result.stderr.decode(self.encoding)}")
--> 117     return result.stdout.decode(self.encoding)
    118 except FileNotFoundError:
    119     raise JavaNotFoundError(JAVA_NOT_FOUND_ERROR)

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 3978: invalid start byte

I installed tabula-py from the wheel file into an Anaconda🐍 setup (hence no requirements.txt, only a ton of dependencies) on a Windows 10 machine that cannot access the internet (for security reasons), and hence without jpype. The best way to emulate this environment would be to completely shut down your network interface and attempt to replicate the error, I guess.

@chezou
Owner

chezou commented Mar 8, 2024

Hmm, that sounds weird. I see that the latest version on conda-forge is still v2.7.0. https://anaconda.org/conda-forge/tabula-py

Anyway, your log shows that you are using the subprocess backend, not jpype. Hence, #371 is unrelated because it is a jpype-related issue.

Also, I tried Jupyter and ipython on my Windows machine, but I can't reproduce the issue.

In [1]: import tabula
   ...:
   ...: tabula.read_pdf("tmp.pdf", pages="all", encoding="windows-1252", pandas_options={"encoding":"windows-1252"},
   ...:                 java_options=["-Dfile.encoding=windows-1252"])
Error importing jpype dependencies. Fallback to subprocess.
No module named 'jpype'
Out[1]:
[                                             Activity
 0              Activity Code Name and Definition Code
 1   002 Self-Service AJCC Employment and Workforce...
 2                                                 NaN
...snip...

Does it happen just after launching jupyter/ipython? I guess you changed the encoding in the same Python process since the error shows as:

File C:\ProgramData\anaconda3\Lib\site-packages\tabula\backend.py:117, in SubprocessTabula.call_tabula_java(self, options, path)
    115     if result.stderr:
    116         logger.warning(f"Got stderr: {result.stderr.decode(self.encoding)}")
--> 117     return result.stdout.decode(self.encoding)
    118 except FileNotFoundError:
    119     raise JavaNotFoundError(JAVA_NOT_FOUND_ERROR)

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 3978: invalid start byte

This suggests that result.stdout.decode(self.encoding) raises the error, i.e., it is trying to decode with utf-8. Your log shows the cell number is In[14], so I suspect you initially used utf-8 and then changed to windows-1252.
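To illustrate why decoding with utf-8 fails on this output, here is a small standard-library sketch. The sample bytes and the helper name `try_decode` are made up for illustration; the only detail taken from the traceback is the offending byte 0x92, which is a right single quotation mark in windows-1252 but cannot start a UTF-8 sequence.

```python
# Hypothetical sample: "It's" with a windows-1252 curly apostrophe (byte 0x92).
raw = b"It\x92s"


def try_decode(data: bytes, encoding: str) -> str:
    """Decode data, returning the decoder's failure reason instead of raising."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return f"failed: {exc.reason}"


if __name__ == "__main__":
    for name in ("utf-8", "windows-1252"):
        print(name, "->", repr(try_decode(raw, name)))
```

The same bytes decode cleanly once the encoding matches what Java actually wrote, which is why it matters which value `self.encoding` holds at the time of the call.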

Since jpype support was added to tabula-py, tabula does not allow changing the encoding argument after the first read_xxx call. If you want to change it, you can pass the force_subprocess=True option, which recreates the SubprocessTabula instance.
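A minimal sketch of that workaround follows. `force_subprocess` is a real `read_pdf` parameter (it appears in the signature in the traceback above); the helper names `build_read_options` and `read_tables` are hypothetical, and the actual call assumes tabula-py and Java are installed, so the import is deferred until it is needed.

```python
def build_read_options(encoding: str) -> dict:
    """Keyword arguments for tabula.read_pdf that force a fresh subprocess backend."""
    return {
        "pages": "all",
        "encoding": encoding,
        # Recreates the SubprocessTabula instance so the new encoding takes effect.
        "force_subprocess": True,
    }


def read_tables(path: str, encoding: str = "windows-1252"):
    """Read all tables from a PDF; requires tabula-py and a Java runtime."""
    import tabula  # imported lazily so this module loads without tabula-py

    return tabula.read_pdf(path, **build_read_options(encoding))
```

Usage would look like `dfs = read_tables("tmp.pdf")`. Restarting the kernel before the first call should have the same effect, assuming nothing earlier in the session already created the backend with utf-8.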

@chezou
Owner

chezou commented Mar 10, 2024

I made a potential mitigation in #378. Please try the code on the master branch and let me know if you have any feedback.

@chezou
Owner

chezou commented May 14, 2024
