Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use JPype instead of subprocess #352

Closed
mara004 opened this issue Jul 23, 2023 · 11 comments · Fixed by #356
Closed

Use JPype instead of subprocess #352

mara004 opened this issue Jul 23, 2023 · 11 comments · Fixed by #356

Comments

@mara004
Copy link

mara004 commented Jul 23, 2023

Is your feature request related to a problem? Please describe.
Repeatedly calling the Jar as java subprocess causes overhead (start java, load modules, etc.)

Describe the solution you'd like
Switch to JPype to launch the VM only once and call into the APIs directly.
For a start, the CLI main functions could be used to simplify the transition, maybe switching to the actual API level eventually.

Describe alternatives you've considered
Sticking with subprocess is possible but has the drawbacks described above.

Additional context
The python-pdfbox project had a similar transition from subprocess to JPype.

@chezou
Copy link
Owner

chezou commented Jul 26, 2023

Thanks for suggestion. Initially I thought to use py4j, but at that time, tabula-java development was active, so I pended the idea for maintainability.

I'd think tweaking io._run function and TabulaOption could help to implement it.

tabula-py/tabula/io.py

Lines 58 to 102 in b24e3bd

def _run(
java_options: List[str],
options: TabulaOption,
path: Optional[str] = None,
encoding: str = "utf-8",
) -> bytes:
"""Call tabula-java with the given lists of Java options and tabula-py
options, as well as an optional path to pass to tabula-java as a regular
argument and an optional encoding to use for any required output sent to
stderr.
tabula-py options are translated into tabula-java options, see
:func:`build_options` for more information.
"""
# Workaround to enforce the silent option. See:
# https:/tabulapdf/tabula-java/issues/231#issuecomment-397281157
if options.silent:
java_options.extend(
(
"-Dorg.slf4j.simpleLogger.defaultLogLevel=off",
"-Dorg.apache.commons.logging.Log"
"=org.apache.commons.logging.impl.NoOpLog",
)
)
args = ["java"] + java_options + ["-jar", _jar_path()] + options.build_option_list()
if path:
args.append(path)
try:
result = subprocess.run(
args,
stdout=subprocess.PIPE,
stderr=subprocess.PIPE,
stdin=subprocess.DEVNULL,
check=True,
)
if result.stderr:
logger.warning(f"Got stderr: {result.stderr.decode(encoding)}")
return result.stdout
except FileNotFoundError:
raise JavaNotFoundError(JAVA_NOT_FOUND_ERROR)
except subprocess.CalledProcessError as e:
logger.error(f"Error from tabula-java:\n{e.stderr.decode(encoding)}\n")
raise

Or, we may be able to do something similar as tabula-rs tabulapdf/tabula-java#444

Anyway, I'm looking forward to contribution for the change.

@mara004
Copy link
Author

mara004 commented Jul 27, 2023

Thanks for the considerations!

Initially I thought to use py4j

I think py4j uses a remote tunnel to operate the JVM, which results in a data transfer penalty, while JPype has a native interface with shared memory approach.

but at that time, tabula-java development was active, so I pended the idea for maintainability.

If you call into the CLI rather than using the actual API, you get the same interface as with subprocess (i.e. no maintainability impact), only in a more performant way.
tabulapdf/tabula-java#444 looks like a nice enhancement that would help skip an unnecessary layer of string serialization.

Anyway, I'm looking forward to contribution for the change.

Disclaimer: Unfortunately, I'm fairly busy with some other projects (pypdfium2 & co.) and thus don't have an intent to work on this myself.

@chezou
Copy link
Owner

chezou commented Aug 27, 2023

I tried jpype on #355

However, I encountered when I run pytest with multiple files, it always fails.
https:/chezou/tabula-py/actions/runs/5987819184

This is a huge blocker for introducing jpype, and I'm about to give up.

@mara004
Copy link
Author

mara004 commented Aug 27, 2023

I'm still a bit tired, but all the test suite failures seem to be caused by TypeError: _run() takes from 2 to 3 positional arguments but 4 were given.
That's because the encoding argument was removed. Assuming that is correct, just adapt the calling code accordingly. Please don't give up just yet and at least leave the PR open as draft so it is more visible for others to take a look.

@chezou
Copy link
Owner

chezou commented Aug 27, 2023

I guess you're trying the old PR. Can you check the new one? #356

@mara004
Copy link
Author

mara004 commented Aug 27, 2023

I don't currently have the env to test this locally, I can only try to help with concrete questions.
The benchmark you shared looks promising. Any problems left?

@chezou
Copy link
Owner

chezou commented Aug 27, 2023

Ah, sorry for the confusion. I solved the issue with separating test files and processes. #356 (comment)

It's okay to merge into master for now.

@chezou
Copy link
Owner

chezou commented Sep 22, 2023

It presumably is a specific environment issue with jpype.
https://community.databricks.com/t5/data-engineering/what-is-the-best-way-to-read-in-a-ms-access-accdb-database-into/td-p/3217

I will implement a workaround to enable subprocess as an option.

@mara004
Copy link
Author

mara004 commented Sep 22, 2023

Sorry for the problems, I'm afraid I don't know the cause either.
I'd suggest you ask about this upstream on jpype's discussion page, maybe they have seen this already and know how to fix it.

@chezou
Copy link
Owner

chezou commented Sep 22, 2023

@mara004 Thanks for your suggestion. However, I can't reproduce issue by my side and I'm concerned to raise on behalf of them since I don't know what is the trigger.

I released v2.8.2 to automatically fallback to subprocess if there's any import error on jpype.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants