Multiple PyTorch engines across threads appear to be sharing native instance #2825

Closed
tklinchik opened this issue Oct 30, 2023 · 7 comments
Labels
bug Something isn't working


@tklinchik

Description

I'm running on a very large server with many cores, using the PyTorch engine on CPU. I'm trying to parallelize fully independent jobs across multiple PtEngine/NDManager instances, allocated one per thread.
I assumed the engines were independent of one another, so I set the system property "ai.djl.pytorch.num_interop_threads" to limit the number of inter-op threads to 1. I then got the error below when creating subsequent NDManager instances.
It appears that the underlying PtEngine, created through PyTorchLibrary, is shared, because each subsequent NDManager creation throws the exception below.
I couldn't find any documentation on exactly how resources are shared across threads in the same JVM/ClassLoader and would appreciate some guidance.

Expected Behavior

Each PyTorch engine instance should be completely independent of the others.

Error Message

ai.djl.engine.EngineException: Error: cannot set number of interop threads after parallel work has started or set_num_interop_threads called
at ai.djl.pytorch.jni.PyTorchLibrary.torchSetNumInteropThreads(Native Method)
at ai.djl.pytorch.jni.JniUtils.setNumInteropThreads(JniUtils.java:102)
at ai.djl.pytorch.engine.PtEngine.newInstance(PtEngine.java:56)
at ai.djl.pytorch.engine.PtEngineProvider.getEngine(PtEngineProvider.java:40)
at ai.djl.engine.Engine.getEngine(Engine.java:190)
at ai.djl.engine.Engine.getInstance(Engine.java:145)
at ai.djl.ndarray.NDManager.newBaseManager(NDManager.java:120)

How to Reproduce?

Set the following property and create an NDManager per thread:

System.setProperty("ai.djl.pytorch.num_interop_threads", "1");
@tklinchik added the bug label Oct 30, 2023
@frankfliu
Contributor

PyTorch inter-op and intra-op thread counts are global settings. You should not change them at runtime.

Our recommendation is to set both of them to 1 at startup, before the engine is initialized.
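
Since these settings are process-wide, one way to follow this advice is to set both properties once, on the main thread, before anything touches the engine. The following is only a minimal sketch: the inter-op property name is the one quoted in this issue, while `ai.djl.pytorch.num_threads` as the intra-op property name and the `ThreadSettings` helper class are assumptions that should be checked against the DJL documentation for your version.

```java
// Sketch: set the PyTorch thread properties once, before the engine is first loaded.
final class ThreadSettings {
    // Property name quoted in this issue.
    static final String INTEROP = "ai.djl.pytorch.num_interop_threads";
    // Assumed name of the intra-op property; verify it for your DJL version.
    static final String INTRAOP = "ai.djl.pytorch.num_threads";

    private ThreadSettings() {}

    /** Sets both properties to the same value; intended to be called once at startup. */
    static void apply(int threads) {
        if (threads < 1) {
            throw new IllegalArgumentException("threads must be >= 1");
        }
        String value = Integer.toString(threads);
        System.setProperty(INTEROP, value);
        System.setProperty(INTRAOP, value);
    }

    public static void main(String[] args) {
        apply(1);
        // Only after this point would the engine be touched for the first time,
        // for example through NDManager.newBaseManager() as in the report above.
        System.out.println(INTEROP + "=" + System.getProperty(INTEROP));
        System.out.println(INTRAOP + "=" + System.getProperty(INTRAOP));
    }
}
```

Keeping the property names in one place makes it harder for a worker thread to set them later by accident.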

@frankfliu
Contributor

PtEngine is a singleton; it is initialized only once. Are you loading it in different ClassLoaders?

@tklinchik
Author

No, everything is in the same class loader. I'm calling NDManager.newBaseManager() per thread, which ends up invoking PtEngineProvider.getEngine(), etc.

@frankfliu
Contributor

Can you initialize PtEngine before you start the threads? It looks like there is a bug in the getEngine() call.
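
The workaround pattern can be sketched without DJL on the classpath. Everything below is hypothetical: `StubEngine` stands in for an engine singleton whose constructor applies a set-once global setting, and its `getInstance()` is deliberately unsynchronized to mimic a check-then-create race. Whether that matches the actual cause inside DJL is an assumption. With real DJL, the eager step would presumably be a call such as `Engine.getInstance()` on the main thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

final class EagerInitDemo {
    /** Counts calls to the hypothetical set-once global setting. */
    static final AtomicInteger nativeSetCalls = new AtomicInteger();

    /** Hypothetical stand-in for an engine singleton; this is not DJL code. */
    static final class StubEngine {
        private static volatile StubEngine instance;

        private StubEngine() {
            // Mimics a native call that fails when invoked a second time.
            if (nativeSetCalls.incrementAndGet() > 1) {
                throw new IllegalStateException("global setting already applied");
            }
        }

        // Deliberately unsynchronized: concurrent first calls could construct twice.
        static StubEngine getInstance() {
            if (instance == null) {
                instance = new StubEngine();
            }
            return instance;
        }
    }

    /** Initializes the engine on the calling thread, then runs the workers. */
    static int runWorkers(int workers) {
        StubEngine.getInstance(); // eager initialization before any worker starts
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            Callable<StubEngine> task = StubEngine::getInstance;
            List<Future<StubEngine>> results = new ArrayList<>();
            for (int i = 0; i < workers; i++) {
                results.add(pool.submit(task));
            }
            for (Future<StubEngine> result : results) {
                result.get(); // rethrows any worker failure
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
        return nativeSetCalls.get();
    }

    public static void main(String[] args) {
        System.out.println("native set calls: " + runWorkers(8));
    }
}
```

Because the instance is published before the pool is created, every worker sees the already-initialized singleton and the set-once call is never repeated.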

@frankfliu
Contributor

I created a PR to address your issue: #2826

@tklinchik
Author

> Can you initialize PtEngine before you start the threads? It looks like there is a bug in the getEngine() call.

That seems to have worked without issues.
I see you already have a fix. Appreciate your help fixing this bug.

@tklinchik
Author

After upgrading to the latest version and removing the previously suggested workaround, I'm getting a different error when instantiating an NDManager in each thread:

Caused by: java.lang.IllegalStateException: The engine PyTorch was not able to initialize
	at ai.djl.engine.Engine.getEngine(Engine.java:218)
	at ai.djl.engine.Engine.getInstance(Engine.java:149)
	at ai.djl.ndarray.NDManager.newBaseManager(NDManager.java:120)

2 participants