SegmentMIF fails and causes program to hang #299

Open
surya-narayanan opened this issue Feb 10, 2022 · 4 comments
Labels
bug Something isn't working

Comments

@surya-narayanan
Contributor

I think this is related to #211, but the error message is different on GPU. There are two types of errors, both shown below:

Error type 1:

distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting                                                                                                                                                                             
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting                                                                                                                                                                             
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting                                                                                                                                                                             
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting                                                                                                                                                                             
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting                                                                                                                                                                             
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting                                                                                                                                                                             
distributed.nanny - WARNING - Restarting worker                                                                                                                                                                                                         
distributed.nanny - WARNING - Restarting worker                                                                                                                                                                                                         
distributed.nanny - WARNING - Restarting worker                                                                                                                                                                                                         
distributed.nanny - WARNING - Restarting worker                                                                                                                                                                                                         
distributed.nanny - WARNING - Restarting worker                                                                                                                                                                                                         
distributed.nanny - WARNING - Restarting worker                                                                                                                                                                                                         
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting                                                                                                                                                                             
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting                                                                                                                                                                             
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting                                                                                                                                                                             
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting                                                                                                                                                                             
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting                                                                                                                                                                             
distributed.nanny - WARNING - Restarting worker                                                                                                                                                                                                         
distributed.nanny - WARNING - Restarting worker                                                                                                                                                                                                         
distributed.nanny - WARNING - Restarting worker                                                                                                                                                                                                         
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting                                                                                                                                                                             
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Worker process still alive after 3.999999237060547 seconds, killing
distributed.nanny - WARNING - Worker process still alive after 3.9999994277954105 seconds, killing
distributed.nanny - WARNING - Worker process still alive after 3.9999994277954105 seconds, killing
distributed.nanny - WARNING - Worker process still alive after 3.9999996185302735 seconds, killing
distributed.nanny - WARNING - Worker process still alive after 3.9999994277954105 seconds, killing
distributed.nanny - WARNING - Worker process still alive after 3.999999237060547 seconds, killing
distributed.nanny - WARNING - Worker process still alive after 3.9999994277954105 seconds, killing
distributed.nanny - WARNING - Worker process still alive after 3.9999994277954105 seconds, killing
distributed.nanny - WARNING - Worker process still alive after 3.999999237060547 seconds, killing

Traceback (most recent call last):
  File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/core.py", line 520, in handle_comm
    result = await result
  File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/scheduler.py", line 5832, in scatter
    raise TimeoutError("No valid workers found")
asyncio.exceptions.TimeoutError: No valid workers found
Traceback (most recent call last):
  File "mif-slidedataset-to-tiledataset-to-dataloader-test-via-dataset.py", line 60, in <module>
    slide_dataset.run(pipeline = pipeline, client = client, write_dir = write_dir, distributed = True, tile_size = 512)
  File "/opt/conda/envs/pathml/lib/python3.8/site-packages/pathml/core/slide_dataset.py", line 57, in run
    slide.run(
  File "/opt/conda/envs/pathml/lib/python3.8/site-packages/pathml/core/slide_data.py", line 320, in run
    big_future = client.scatter(tile)
  File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/client.py", line 2354, in scatter
    return self.sync(
  File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/utils.py", line 309, in sync
    return sync(
  File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/utils.py", line 363, in sync
    raise exc.with_traceback(tb)
  File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/utils.py", line 348, in f
    result[0] = yield future
  File "/opt/conda/envs/pathml/lib/python3.8/site-packages/tornado/gen.py", line 762, in run
    value = future.result()
  File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/client.py", line 2239, in _scatter
    await self.scheduler.scatter(
  File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/core.py", line 900, in send_recv_from_rpc
    return await send_recv(comm=comm, op=key, **kwargs)
  File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/core.py", line 693, in send_recv
    raise exc.with_traceback(tb)
  File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/core.py", line 520, in handle_comm
    result = await result
  File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/scheduler.py", line 5832, in scatter
    raise TimeoutError("No valid workers found")
asyncio.exceptions.TimeoutError: No valid workers found

Error type 2:


distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 691.31 MiB -- Worker memory limit: 738.85 MiB
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 696.19 MiB -- Worker memory limit: 738.85 MiB
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 686.67 MiB -- Worker memory limit: 738.85 MiB
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 695.79 MiB -- Worker memory limit: 738.85 MiB
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 691.31 MiB -- Worker memory limit: 738.85 MiB
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 696.19 MiB -- Worker memory limit: 738.85 MiB
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 693.42 MiB -- Worker memory limit: 738.85 MiB
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 695.79 MiB -- Worker memory limit: 738.85 MiB
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 691.31 MiB -- Worker memory limit: 738.85 MiB
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 696.19 MiB -- Worker memory limit: 738.85 MiB
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 695.44 MiB -- Worker memory limit: 738.85 MiB
Traceback (most recent call last):                                                                                                                                                                                                                      
  File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/core.py", line 1067, in connect                                                                                                                                                  
    comm = await fut                                                                                                                                                                                                                                    
asyncio.exceptions.CancelledError                                                                                                                                                                                                                       
                                                                                                                                                                                                                                                        
The above exception was the direct cause of the following exception:                                                                                                                                                                                    
                                                                                                                                                                                                                                                        
Traceback (most recent call last):                                                                                                                                                                                                                      
  File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/worker.py", line 3011, in gather_dep                                                                                                                                             
    response = await get_data_from_worker(                                                                                                                                                                                                              
  File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/worker.py", line 4305, in get_data_from_worker                                                                                                                                   
    return await retry_operation(_get_data, operation="get_data_from_worker")                                                                                                                                                                           
  File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation                                                                                                                                     
    return await retry(                                                                                                                                                                                                                                 
  File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry                                                                                                                                               
    return await coro()                                                                                                                                                                                                                                 
  File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/worker.py", line 4282, in _get_data                                                                                                                                              
    comm = await rpc.connect(worker)                                                                                                                                                                                                                    
  File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/core.py", line 1078, in connect                                                                                                                                                  
    raise CommClosedError(                                                                                                                                                                                                                              
distributed.comm.core.CommClosedError: ConnectionPool not running. Status: Status.closed                                                                                                                                                                
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x7fce64a99c70>>, <Task finished name='Task-14' coro=<Worker.gather_dep() done, defined at /opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/worker.py:2955> exception=CommClosedError('Comm <TCP (closed) Worker->Scheduler local=tcp://127.0.0.1:46876 remote=tcp://127.0.0.1:35715> already closed.')>)
Traceback (most recent call last):                                                                                                                                                                                                                      
  File "/opt/conda/envs/pathml/lib/python3.8/site-packages/tornado/ioloop.py", line 741, in _run_callback                                                                                                                                               
    ret = callback()                                                                                                                                                                                                                                    
  File "/opt/conda/envs/pathml/lib/python3.8/site-packages/tornado/ioloop.py", line 765, in _discard_future_result                                                                                                                                      
    future.result()                                                                                                                                                                                                                                     
  File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/worker.py", line 3077, in gather_dep                                                                                                                                             
    self.batched_stream.send(                                                                                                                                                                                                                           
  File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/batched.py", line 137, in send                                                                                                                                                   
    raise CommClosedError(f"Comm {self.comm!r} already closed.")                                                                                                                                                                                        
distributed.comm.core.CommClosedError: Comm <TCP (closed) Worker->Scheduler local=tcp://127.0.0.1:46876 remote=tcp://127.0.0.1:35715> already closed.                                                                                                   
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x7fce64a99c70>>, <Task finished name='Task-11' coro=<Worker.handle_scheduler() done, defined at /opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/worker.py:1318> exception=CommClosedError('ConnectionPool not running. Status: Status.closed')>)
Traceback (most recent call last):
  File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/core.py", line 1067, in connect
    comm = await fut
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/envs/pathml/lib/python3.8/site-packages/tornado/ioloop.py", line 741, in _run_callback
    ret = callback()
  File "/opt/conda/envs/pathml/lib/python3.8/site-packages/tornado/ioloop.py", line 765, in _discard_future_result
    future.result()
  File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/worker.py", line 1331, in handle_scheduler
    await self.close(report=False)
  File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/worker.py", line 1561, in close
    await r.close_gracefully()
  File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/core.py", line 897, in send_recv_from_rpc
    comm = await self.pool.connect(self.addr)
  File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/core.py", line 1078, in connect
    raise CommClosedError(
distributed.comm.core.CommClosedError: ConnectionPool not running. Status: Status.closed
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x7f3015274c70>>, <Task finished name='Task-17' coro=<Worker.close() done, defined at /opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/worker.py:1537> exception=CommClosedError('ConnectionPool not running. Status: Status.closed')>)
Traceback (most recent call last):
  File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/core.py", line 1067, in connect
    comm = await fut
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/envs/pathml/lib/python3.8/site-packages/tornado/ioloop.py", line 741, in _run_callback
    ret = callback()
  File "/opt/conda/envs/pathml/lib/python3.8/site-packages/tornado/ioloop.py", line 765, in _discard_future_result
    future.result()
  File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/worker.py", line 1561, in close
    await r.close_gracefully()
  File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/core.py", line 897, in send_recv_from_rpc
    comm = await self.pool.connect(self.addr)
  File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/core.py", line 1078, in connect
    raise CommClosedError(
distributed.comm.core.CommClosedError: ConnectionPool not running. Status: Status.closed
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 660.48 MiB -- Worker memory limit: 738.85 MiB
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 661.51 MiB -- Worker memory limit: 738.85 MiB
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 662.43 MiB -- Worker memory limit: 738.85 MiB
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 664.92 MiB -- Worker memory limit: 738.85 MiB
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 668.09 MiB -- Worker memory limit: 738.85 MiB
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 672.64 MiB -- Worker memory limit: 738.85 MiB
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 682.02 MiB -- Worker memory limit: 738.85 MiB
distributed.worker - WARNING - Heartbeat to scheduler failed
Traceback (most recent call last):
  File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/worker.py", line 1273, in heartbeat
    response = await retry_operation(
  File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation
    return await retry(
  File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry
    return await coro()
  File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/core.py", line 897, in send_recv_from_rpc
    comm = await self.pool.connect(self.addr)
  File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/core.py", line 1054, in connect
    raise CommClosedError(
distributed.comm.core.CommClosedError: ConnectionPool not running. Status: Status.closed
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 688.68 MiB -- Worker memory limit: 738.85 MiB
distributed.worker - ERROR - failed during get data with tcp://127.0.0.1:38615 -> tcp://127.0.0.1:34199
Traceback (most recent call last):
  File "/opt/conda/envs/pathml/lib/python3.8/site-packages/tornado/iostream.py", line 867, in _read_to_buffer
    bytes_read = self.read_from_fd(buf)
  File "/opt/conda/envs/pathml/lib/python3.8/site-packages/tornado/iostream.py", line 1140, in read_from_fd
    return self.socket.recv_into(buf, len(buf))
ConnectionResetError: [Errno 104] Connection reset by peer

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/worker.py", line 1752, in get_data
    response = await comm.read(deserializers=serializers)
  File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/comm/tcp.py", line 220, in read
    convert_stream_closed_error(self, e)
  File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/comm/tcp.py", line 126, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed)  local=tcp://127.0.0.1:38615 remote=tcp://127.0.0.1:39154>: ConnectionResetError: [Errno 104] Connection reset by peer
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 692.82 MiB -- Worker memory limit: 738.85 MiB

33 MiB -- Worker memory limit: 738.85 MiB
distributed.worker - WARNING - Heartbeat to scheduler failed
Traceback (most recent call last):
  File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/worker.py", line 1273, in heartbeat
    response = await retry_operation(
  File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation
    return await retry(
  File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry
    return await coro()
  File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/core.py", line 897, in send_recv_from_rpc
    comm = await self.pool.connect(self.addr)
  File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/core.py", line 1054, in connect
    raise CommClosedError(
distributed.comm.core.CommClosedError: ConnectionPool not running. Status: Status.closed
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 693.33 MiB -- Worker memory limit: 738.85 MiB
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 693.33 MiB -- Worker memory limit: 738.85 MiB
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 693.33 MiB -- Worker memory limit: 738.85 MiB
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 693.33 MiB -- Worker memory limit: 738.85 MiB
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 693.33 MiB -- Worker memory limit: 738.85 MiB
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 693.33 MiB -- Worker memory limit: 738.85 MiB
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 693.33 MiB -- Worker memory limit: 738.85 MiB
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 693.33 MiB -- Worker memory limit: 738.85 MiB
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 693.33 MiB -- Worker memory limit: 738.85 MiB
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 693.57 MiB -- Worker memory limit: 738.85 MiB
distributed.nanny - WARNING - Worker process still alive after 3.999999237060547 seconds, killing
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker process still alive after 3.9999994277954105 seconds, killing
2022-02-10 20:04:05.835765: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 535.11 MiB -- Worker memory limit: 738.85 MiB
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 542.93 MiB -- Worker memory limit: 738.85 MiB
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 558.33 MiB -- Worker memory limit: 738.85 MiB
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 571.01 MiB -- Worker memory limit: 738.85 MiB
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 577.54 MiB -- Worker memory limit: 738.85 MiB
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 589.02 MiB -- Worker memory limit: 738.85 MiB
distributed.worker - WARNING - Worker is at 80% memory usage. Pausing worker.  Process memory: 594.77 MiB -- Worker memory limit: 738.85 MiB
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 594.77 MiB -- Worker memory limit: 738.85 MiB
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 596.30 MiB -- Worker memory limit: 738.85 MiB
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 605.31 MiB -- Worker memory limit: 738.85 MiB
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 622.14 MiB -- Worker memory limit: 738.85 MiB
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 676.49 MiB -- Worker memory limit: 738.85 MiB
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
Traceback (most recent call last):
  File "/opt/conda/envs/pathml/lib/python3.8/multiprocessing/queues.py", line 245, in _feed
    send_bytes(obj)
  File "/opt/conda/envs/pathml/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/opt/conda/envs/pathml/lib/python3.8/multiprocessing/connection.py", line 411, in _send_bytes
    self._send(header + buf)
  File "/opt/conda/envs/pathml/lib/python3.8/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe


distributed.core - ERROR - Exception while handling op scatter
Traceback (most recent call last):
  File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/core.py", line 520, in handle_comm
    result = await result
  File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/scheduler.py", line 5832, in scatter
    raise TimeoutError("No valid workers found")
asyncio.exceptions.TimeoutError: No valid workers found
Traceback (most recent call last):
  File "mif-slidedataset-to-tiledataset-to-dataloader-test-via-dataset.py", line 60, in <module>
    slide_dataset.run(pipeline = pipeline, client = client, write_dir = write_dir, distributed = True, tile_size = 512)
  File "/opt/conda/envs/pathml/lib/python3.8/site-packages/pathml/core/slide_dataset.py", line 57, in run
    slide.run(
  File "/opt/conda/envs/pathml/lib/python3.8/site-packages/pathml/core/slide_data.py", line 320, in run
    big_future = client.scatter(tile)
  File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/client.py", line 2354, in scatter
    return self.sync(
  File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/utils.py", line 309, in sync
    return sync(
  File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/utils.py", line 363, in sync
    raise exc.with_traceback(tb)
  File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/utils.py", line 348, in f
    result[0] = yield future
  File "/opt/conda/envs/pathml/lib/python3.8/site-packages/tornado/gen.py", line 762, in run
    value = future.result()
  File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/client.py", line 2239, in _scatter
    await self.scheduler.scatter(
  File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/core.py", line 900, in send_recv_from_rpc
    return await send_recv(comm=comm, op=key, **kwargs)
  File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/core.py", line 693, in send_recv
    raise exc.with_traceback(tb)
  File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/core.py", line 520, in handle_comm
    result = await result
  File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/scheduler.py", line 5832, in scatter
    raise TimeoutError("No valid workers found")
asyncio.exceptions.TimeoutError: No valid workers found

I think this is because we are using Dask. I'm not sure how to proceed. I can try using pathos to see whether it works.
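
One thing that might be worth trying before swapping Dask out entirely: the logs show each worker capped at roughly 739 MiB, which is what a default LocalCluster gives when it splits host RAM across many worker processes; a 512x512 multichannel tile plus the Mesmer model could plausibly blow past that. A minimal sketch of creating the client with fewer, larger workers; the n_workers and memory_limit values below are illustrative placeholders, not tested recommendations:

# Illustrative only: a LocalCluster with fewer, larger workers so each one
# has more memory headroom. Adjust the numbers to the host machine.
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(
    n_workers=2,            # fewer worker processes -> more RAM per worker
    threads_per_worker=1,
    memory_limit="8GB",     # per-worker budget (the logs above show ~739 MiB per worker)
)
client = Client(cluster)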

@surya-narayanan added the bug label on Feb 10, 2022
@jacob-rosenthal
Collaborator

Thanks for posting this, Surya. Something is clearly going wrong with the Dask cluster.
We'll need to dig into what is causing this. I'm guessing that part of it comes from the PathML side not making optimal use of Dask, and part from the details of how that specific Dask cluster is configured.

Are you able to proceed with distributed=False?
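
For reference, the non-distributed fallback is the same run() call from the traceback with the flag flipped; it is assumed here that the client argument can simply be omitted when distributed=False:

# Sketch of the serial fallback, reusing the run() call shown in the traceback.
slide_dataset.run(
    pipeline=pipeline,
    write_dir=write_dir,
    distributed=False,   # process tiles serially in the local process, no Dask cluster
    tile_size=512,
)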

@surya-narayanan
Contributor Author

surya-narayanan commented Feb 11, 2022

I'm not using the GPU yet, since I need cuDNN 8.1.x and am still sorting that out. My machine has CUDA 11.0, and Mesmer needs cuDNN 8.1.x, but I have 8.0.5 installed. I find the cuDNN installation instructions rather complicated. Do you have any experience installing cuDNN 8.1.x?

@surya-narayanan
Contributor Author

By the way, with distributed=False I am able to get it to work, but it can take over 4.5 hours per slide. Do you have any ideas on how to leverage the fact that the Mesmer model can handle batches of size greater than 1? A rough sketch of what that could look like is below.
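
The sketch below bypasses the current tile-by-tile flow and assumes DeepCell's Mesmer application, which accepts a 4D (batch, H, W, 2) array; it also assumes each tile has already been reduced to its nuclear and cytoplasm channels. It is only an illustration of the batching idea, not PathML's current implementation.

# Rough sketch of batched Mesmer inference over a list of tiles.
# Assumes deepcell-tf is installed and each tile is an (H, W, 2) array
# with (nuclear, cytoplasm) channels already selected.
import numpy as np
from deepcell.applications import Mesmer

app = Mesmer()

def segment_tile_batch(tiles, image_mpp=0.5):
    # Stack tiles into a single (N, H, W, 2) batch and segment them in one call.
    batch = np.stack(tiles, axis=0)
    masks = app.predict(batch, image_mpp=image_mpp)  # (N, H, W, 1) labeled cell masks
    return list(masks)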

@jacob-rosenthal
Collaborator

Yeah, that's a good point. Batching tiles would probably make it more efficient.
