RuntimeError: Cannot store message argument of type <enum 'BatchJobExitReason'>: BatchJobExitReason.MEMLIMIT #4321

Closed
muffato opened this issue Jan 20, 2023 · 4 comments · Fixed by #4322

@muffato
Contributor

muffato commented Jan 20, 2023

Hello,

I'm running Cactus on an LSF farm, and a bunch of jobs consumed more memory than they had requested and got killed by LSF ("MEMLIMIT"):

[2023-01-20T12:09:50+0000] [Thread-2  ] [I] [toil.batchSystems.lsf] [job ID 9986865, Command singularity exec /software/treeoflife/shpc/0.1.16/container_base/quay.io/comparative-genomics-toolkit/cactus/v2.4.0/quay.io-comparative-genomics-toolkit-cactus-v2.4.0-sha256:8c677dfccd0dfd4b6e645775a5d64dfffe86345753edaeba74a31821cebca74c.sif _toil_worker run_lastz file:/nfs/users/nfs_m/mm49/nfs/scratch123/cactus/runtime/lsf_js kind-run_lastz/instance-q9shyj5b --context gASVzgAAAAAAAACMIXRvaWwuYmF0Y2hTeXN0ZW1zLmNsZWFudXBfc3VwcG9ydJSMFFdvcmtlckNsZWFudXBDb250ZXh0lJOUKYGUfZSMEXdvcmtlckNsZWFudXBJbmZvlIwldG9pbC5iYXRjaFN5c3RlbXMuYWJzdHJhY3RCYXRjaFN5c3RlbZSMEVdvcmtlckNsZWFudXBJbmZvlJOUKE5OjCRhNmY4ZTU2Yi01NzU5LTQ5ODUtODE4YS1iMDM5MjZlMThjNTSUjAZhbHdheXOUdJSBlHNiLg==] Max Memory Used: 309 Mbytes
[2023-01-20T12:09:50+0000] [Thread-2  ] [E] [toil.batchSystems.lsf] bjobs detected job failed with:
exit code: 130
exit reason: TERM_MEMLIMIT: job killed after reaching LSF memory usage limit
for job: 9986865

This resulted in Cactus/toil failing with this error:

Traceback (most recent call last):
  File "/home/cactus/cactus_env/bin/cactus", line 8, in <module>
    sys.exit(main())
  File "/home/cactus/cactus_env/lib/python3.10/site-packages/cactus/progressive/cactus_progressive.py", line 400, in main
    hal_id = toil.start(Job.wrapJobFn(progressive_workflow, options, config_node, mc_tree, og_map, input_seq_id_map))
  File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/common.py", line 1017, in start
    return self._runMainLoop(rootJobDescription)
  File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/common.py", line 1461, in _runMainLoop
    jobCache=self._jobCache).run()
  File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/leader.py", line 292, in run
    self.innerLoop()
  File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/leader.py", line 789, in innerLoop
    self._gatherUpdatedJobs(updatedJobTuple)
  File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/leader.py", line 746, in _gatherUpdatedJobs
    self._messages.publish(JobCompletedMessage(updatedJob.get_job_kind(), updatedJob.jobStoreID, exitStatus))
  File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/bus.py", line 610, in publish
    self._bus.publish(message)
  File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/bus.py", line 296, in publish
    self._deliver(message)
  File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/bus.py", line 329, in _deliver
    self._pubsub.sendMessage(topic, message=message)
  File "/home/cactus/cactus_env/lib/python3.10/site-packages/pubsub/core/publisher.py", line 216, in sendMessage
    topicObj.publish(**msgData)
  File "/home/cactus/cactus_env/lib/python3.10/site-packages/pubsub/core/topicobj.py", line 452, in publish
    self.__sendMessage(msgData, topicObj, msgDataSubset)
  File "/home/cactus/cactus_env/lib/python3.10/site-packages/pubsub/core/topicobj.py", line 482, in __sendMessage
    listener(data, self, allData)
  File "/home/cactus/cactus_env/lib/python3.10/site-packages/pubsub/core/listener.py", line 237, in __call__
    cb(**kwargs)
  File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/bus.py", line 404, in handler
    stream.write(message_to_bytes(message))
  File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/bus.py", line 213, in message_to_bytes
    raise RuntimeError(f"Cannot store message argument of type {item_type}: {item}")
RuntimeError: Cannot store message argument of type <enum 'BatchJobExitReason'>: BatchJobExitReason.MEMLIMIT
Command exited with non-zero status 1

Job ID 9986865 was not the only one to fail with error 130 at the same time, but it's the last one in the log before the stack-trace. Here is its report:

Job <9986865>, Job Name <toil_job_255>, User <mm49>, Project <default>, Status
                     <EXIT>, Queue <normal>, Command <singularity exec /softwar
                     e/treeoflife/shpc/0.1.16/container_base/quay.io/comparativ
                     e-genomics-toolkit/cactus/v2.4.0/quay.io-comparative-genom
                     ics-toolkit-cactus-v2.4.0-sha256:8c677dfccd0dfd4b6e645775a
                     5d64dfffe86345753edaeba74a31821cebca74c.sif _toil_worker r
                     un_lastz file:/nfs/users/nfs_m/mm49/nfs/scratch123/cactus/
                     runtime/lsf_js kind-run_lastz/instance-q9shyj5b --context
                     gASVzgAAAAAAAACMIXRvaWwuYmF0Y2hTeXN0ZW1zLmNsZWFudXBfc3VwcG
                     9ydJSMFFdvcmtlckNsZWFudXBDb250ZXh0lJOUKYGUfZSMEXdvcmtlckNs
                     ZWFudXBJbmZvlIwldG9pbC5iYXRjaFN5c3RlbXMuYWJzdHJhY3RCYXRjaF
                     N5c3RlbZSMEVdvcmtlckNsZWFudXBJbmZvlJOUKE5OjCRhNmY4ZTU2Yi01
                     NzU5LTQ5ODUtODE4YS1iMDM5MjZlMThjNTSUjAZhbHdheXOUdJSBlHNiLg
                     ==>, Share group charged </WTSI/tol/tol-ops/team328-grp/mm
                     49>
Fri Jan 20 12:08:40: Submitted from host <farm5-os0000001>, CWD </lustre/scratc
                     h123/tol/teams/tolit/users/mm49/cactus/runtime>, Specified
                      CWD </lustre/scratch123/tol/teams/tolit/users/mm49/cactus
                     /runtime/.>, Output File </tmp/toil_a6f8e56b-5759-4985-818
                     a-b03926e18c54.255.%J.out.log>, Error File </tmp/toil_a6f8
                     e56b-5759-4985-818a-b03926e18c54.255.%J.err.log>;
Fri Jan 20 12:08:41: Dispatched 1 Task(s) on Host(s) <node-14-12>, Allocated 1
                     Slot(s) on Host(s) <node-14-12>, Effective RES_REQ <select
                     [(mem>279.00) && (type == any )] order[r15s:pg] rusage[mem
                     =279.00] span[hosts=1] affinity[thread(1)*1] >;
Fri Jan 20 12:09:14: Completed <exit>; TERM_MEMLIMIT: job killed after reaching
                      LSF memory usage limit.

 EXCEPTION STATUS:  underrun

Accounting information about this job:
     Share group charged </WTSI/tol/tol-ops/team328-grp/mm49>
     CPU_T     WAIT     TURNAROUND   STATUS     HOG_FACTOR    MEM    SWAP
     14.46        1             34     exit         0.4253   309M      0M
     CPU_PEAK     CPU_EFFICIENCY      MEM_EFFICIENCY
      0.44                43.75%             110.75%

I'm using the official Docker image of Cactus v2.4.0, converted to Singularity. This is toil 5.8.0, according to their changelog.

┆Issue is synchronized with this Jira Story
┆Issue Number: TOIL-1264

@adamnovak
Member

I think the problem is here:

toil/src/toil/bus.py

Lines 206 to 207 in 332eeed

item_type = type(item)
if item_type in [int, float, bool] or item is None:

BatchJobExitReason.MEMLIMIT is an int, but type() on it is a type that is a subclass of int. So we need to match subclasses and not just on the final type.
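To illustrate the difference between matching the exact type and matching subclasses, here is a minimal sketch. It is not the actual Toil code: `ExampleExitReason`, its numeric value, and the two helper functions are hypothetical names made up for this example, and only the `type(item) in [int, float, bool]` check is taken from the snippet above.

```python
from enum import Enum, IntEnum


class ExampleExitReason(IntEnum):
    # Hypothetical stand-in for an int-backed exit reason enum; the value is arbitrary.
    MEMLIMIT = 6


class PlainExitReason(Enum):
    # Hypothetical plain Enum: its members are NOT instances of int.
    MEMLIMIT = 6


def accepted_by_exact_type(item):
    # Mirrors the check quoted above: only the exact final type is matched.
    item_type = type(item)
    return item_type in [int, float, bool] or item is None


def accepted_by_isinstance(item):
    # Alternative check that also accepts subclasses of the base types.
    return isinstance(item, (int, float, bool)) or item is None


if __name__ == "__main__":
    reason = ExampleExitReason.MEMLIMIT
    print(type(reason) is int)                 # False: the type is the enum class
    print(accepted_by_exact_type(reason))      # False: the exact-type check rejects it
    print(accepted_by_isinstance(reason))      # True: IntEnum members are ints
    print(accepted_by_isinstance(PlainExitReason.MEMLIMIT))  # False: plain Enum is not an int
```

Note that the `isinstance` check only helps when the enum really derives from `int` (for example via `IntEnum`); a plain `Enum` member would still be rejected by both checks.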

adamnovak added a commit that referenced this issue Jan 20, 2023
This should fix #4321.

It could probably use a test.
adamnovak added a commit that referenced this issue Jan 24, 2023
* Allow for subclasses of base types in messages

This should fix #4321.

It could probably use a test.

* Add a test for enums in bus log files

* Actually start tuple

* Fix test to actually write to file

* Fix enum to be IntEnum so it isinstance int
@muffato
Contributor Author

muffato commented Jan 25, 2023

Thank you @adamnovak! As far as you know, should Cactus be compatible with the master branch of toil?

@adamnovak
Member

I believe it should; we have a CI test in Toil that does an integration test against Cactus, so we know when we make a breaking change, and I don't recall any since the last release.

@muffato
Contributor Author

muffato commented Jan 25, 2023

Super, thanks, I'll give it a try
