[BUG] Encoding still being overridden even after fix to #371. #377

xuxoramos · 2024-03-07T08:00:29Z

Summary

Still having issues with encoding even after the fix to #371. Passing "latin-1", "cp1252" and "ISO-8859-1" encoding options to all three of Java, tabula-py and pandas still raises an error saying the UTF-8 codec is unable to decode a byte.

Did you read the FAQ?

I have read the FAQ

Did you search GitHub Discussions?

I have searched the discussions

(Optional) PDF URL

https://edd.ca.gov/siteassets/files/jobs_and_training/pubs/wsd19-06att1.pdf

About your environment

Python version:
 3.11.7 | Packaged by Anaconda, Inc. | (main, Dec. 15 2023 18:05:47) [MSC v.1916 64 bit (AMD)]
Java version
 openjdk version "21.0.2" 2024-01-16
OpenJDK Runtime Environment (build 21.0.2+13-58)
OpenJDK 64-Bit Server VM (build 21.0.2+13-58, mixed mode, sharing)
tabula-py version: 2.9.0
platform: Windows-10-10.0.17763-SP0

What did you do when you faced the problem?

Looked at the source code, and the fix to #371 is still there. Passed "ISO-8859-1", "latin-1", "cp1252" and "windows-1252" encoding options to all three of Java, pandas and tabula-py, both separately and all together, as follows:

tabula.read_pdf("../path/to.pdf", java_options="-Dfile.encoding=ISO-8859-1", pandas_options={"encoding":"ISO-8859-1"}, encoding="ISO-8859-1")

Code

tabula.read_pdf("../path/to.pdf", java_options="-Dfile.encoding=ISO-8859-1", pandas_options={"encoding":"ISO-8859-1"}, encoding="ISO-8859-1")

Expected behavior

Obtain all tables in the PDF

Actual behavior

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position N: invalid start byte

Related issues

#371

chezou · 2024-03-07T21:24:50Z

@xuxoramos Thanks for reporting it.

Can you paste the actual code and full error message without trimming? I can't reproduce your error on my end.

Also, can you tell me how you installed tabula-py? Please share your pip freeze output. I'm wondering whether you installed it with the jpype option or not. #371 is a patch for jpype, and jpype is not installed by default. Note that jpype doesn't allow changing the encoding within a single Python process. To change it, you need to restart the Python process.
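For anyone unsure which backend they are on, the following is a minimal sketch of a check that works offline. It only uses the standard library; the function name `describe_backend` is hypothetical and not part of tabula-py. It assumes that tabula-py falls back to the subprocess backend when the `jpype` module cannot be imported.

```python
import importlib.util
from importlib import metadata


def describe_backend() -> dict:
    """Report whether jpype is importable and which tabula-py version is installed.

    Hypothetical helper for diagnostics; not part of tabula-py itself.
    """
    # find_spec returns None when a top-level module cannot be found.
    jpype_available = importlib.util.find_spec("jpype") is not None
    try:
        tabula_version = metadata.version("tabula-py")
    except metadata.PackageNotFoundError:
        tabula_version = None  # tabula-py is not installed in this environment
    return {
        "jpype_available": jpype_available,
        "tabula_py_version": tabula_version,
    }


print(describe_backend())
```

If `jpype_available` is False, the jpype patch from #371 is not in play for that environment.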

Here is my result: I tried to parse the PDF you provided. No error happens.

>>> import tabula
>>> tabula.read_pdf("tmp.pdf", java_options="-Dfile.encoding=ISO-8859-1", pandas_options={"encoding":"ISO-8859-1"}, encoding="ISO-8859-1", pages="all")
[                                             Activity
0              Activity Code Name and Definition Code
1   002 Self-Service AJCC Employment and Workforce...
2                                                 NaN
3   This activity is system generated when an indi...
4   workforce information available in CalJOBS. Wo...
5   as: local performance, availability of support...
6   compensation, and performance and program cost...
7                                                 NaN
...snip...

xuxoramos · 2024-03-08T03:54:00Z

This is the entire error output:

java_options is ignored until rebooting the Python process.
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
Cell In[14], line 1
----> 1 dfs = tb.read_pdf("../sourcedata/caljobs_activity_codes_dictionary.pdf", 
      2                   pages="all", 
      3                   encoding="windows-1252", 
      4                   pandas_options={"encoding":"windows-1252"},
      5                   java_options=["-Dfile.encoding=windows-1252"]
      6                  )

File C:\ProgramData\anaconda3\Lib\site-packages\tabula\io.py:395, in read_pdf(input_path, output_format, encoding, java_options, pandas_options, multiple_tables, user_agent, use_raw_url, pages, guess, area, relative_area, lattice, stream, password, silent, columns, relative_columns, format, batch, output_path, force_subprocess, options)
    392     raise ValueError(f"{path} is empty. Check the file, or download it manually.")
    394 try:
--> 395     output = _run(
    396         tabula_options,
    397         java_options,
    398         path,
    399         encoding=encoding,
    400         force_subprocess=force_subprocess,
    401     )
    402 finally:
    403     if temporary:

File C:\ProgramData\anaconda3\Lib\site-packages\tabula\io.py:82, in _run(options, java_options, path, encoding, force_subprocess)
     79 elif set(java_options) - IGNORED_JAVA_OPTIONS:
     80     logger.warning("java_options is ignored until rebooting the Python process.")
---> 82 return _tabula_vm.call_tabula_java(options, path)

File C:\ProgramData\anaconda3\Lib\site-packages\tabula\backend.py:117, in SubprocessTabula.call_tabula_java(self, options, path)
    115     if result.stderr:
    116         logger.warning(f"Got stderr: {result.stderr.decode(self.encoding)}")
--> 117     return result.stdout.decode(self.encoding)
    118 except FileNotFoundError:
    119     raise JavaNotFoundError(JAVA_NOT_FOUND_ERROR)

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 3978: invalid start byte

I installed tabula-py from the wheel file into an Anaconda🐍 setup (hence no requirements.txt, only a ton of dependencies), on a Windows 10 machine that cannot access the internet (for security reasons), and hence without jpype. So the best way to emulate this environment would be to completely shut down your network interface and attempt to replicate the error, I guess.

chezou · 2024-03-08T04:28:28Z

Hmm, that sounds weird. I see that conda-forge's latest version is still v2.7.0. https://anaconda.org/conda-forge/tabula-py

Anyway, your log shows that you are using the subprocess backend, not jpype. Hence, #371 is unrelated because it is a jpype-related issue.

Also, I tried Jupyter and ipython on my Windows machine, but I can't reproduce the issue.

In [1]: import tabula
   ...:
   ...: tabula.read_pdf("tmp.pdf", pages="all", encoding="windows-1252", pandas_options={"encoding":"windows-1252"}, java_options=["-Dfile.encoding=windows-1252"])
Error importing jpype dependencies. Fallback to subprocess.
No module named 'jpype'
Out[1]:
[                                             Activity
 0              Activity Code Name and Definition Code
 1   002 Self-Service AJCC Employment and Workforce...
 2                                                 NaN
...snip...

Does it happen just after launching jupyter/ipython? I guess you changed the encoding in the same Python process since the error shows as:

File C:\ProgramData\anaconda3\Lib\site-packages\tabula\backend.py:117, in SubprocessTabula.call_tabula_java(self, options, path)
    115     if result.stderr:
    116         logger.warning(f"Got stderr: {result.stderr.decode(self.encoding)}")
--> 117     return result.stdout.decode(self.encoding)
    118 except FileNotFoundError:
    119     raise JavaNotFoundError(JAVA_NOT_FOUND_ERROR)

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 3978: invalid start byte

This suggests that result.stdout.decode(self.encoding) raises the error, i.e., it is trying to decode with utf-8. Your log shows the cell number is In[14], so I suspect you set utf-8 initially and later changed it to windows-1252.
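To illustrate why that decode step fails, here is a minimal sketch using a made-up byte string. The sample text is an assumption for illustration and is not taken from the PDF. In windows-1252, byte 0x92 is a right single quotation mark, while in UTF-8 it is a continuation byte that cannot start a character.

```python
# Hypothetical sample: bytes as a windows-1252 encoded stdout might contain them.
raw = b"Worker\x92s compensation"

try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    # 'utf-8' codec can't decode byte 0x92 in position 6: invalid start byte
    print(exc)

# Decoding with the encoding the bytes were actually written in succeeds.
decoded = raw.decode("windows-1252")
print(decoded)
```

The same bytes decode cleanly once the decoder's encoding matches the one used to produce them, which is why a stale utf-8 setting on the Python side produces this traceback even when Java was told to emit windows-1252.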

Since tabula-py added jpype support, it doesn't allow changing the encoding argument after the first read_xxx call. If you want to change it, you can pass the force_subprocess=True option, which recreates the SubprocessTabula instance.
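A minimal sketch of that workaround follows. The helper names `build_read_kwargs` and `read_tables` are hypothetical, and the PDF path is a placeholder; only `read_pdf` and its `encoding`, `java_options`, `pandas_options`, `pages` and `force_subprocess` parameters come from the traceback above. Calling `read_tables` assumes tabula-py and Java are installed.

```python
def build_read_kwargs(encoding: str) -> dict:
    """Collect read_pdf keyword arguments so every layer uses the same encoding."""
    return {
        "pages": "all",
        "encoding": encoding,
        "java_options": [f"-Dfile.encoding={encoding}"],
        "pandas_options": {"encoding": encoding},
        # Recreates the SubprocessTabula instance instead of reusing the old one.
        "force_subprocess": True,
    }


def read_tables(path: str, encoding: str = "windows-1252"):
    """Read all tables from a PDF, forcing a fresh subprocess backend."""
    import tabula  # imported lazily so the helper above works without tabula-py

    return tabula.read_pdf(path, **build_read_kwargs(encoding))


print(build_read_kwargs("windows-1252"))
```

Usage would look like `dfs = read_tables("path/to.pdf")`. Restarting the kernel before the first call is the other option if the encoding was already set earlier in the same session.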

chezou · 2024-03-10T20:39:01Z

Made a potential mitigation in #378. Please try the master branch code and give me feedback, if any.

chezou · 2024-05-14T17:05:13Z

Released https://pypi.org/manage/project/tabula-py/release/2.9.1/

xuxoramos added bug triage labels Mar 7, 2024

chezou added can't reproduce and removed bug labels Mar 7, 2024

chezou added the help wanted label Mar 7, 2024

chezou mentioned this issue Mar 10, 2024: Update encoding everytime when SubprocessTabule is initialized #378 (merged)

chezou closed this as completed in #378 Mar 10, 2024
