fullscan thread failed after running for 2 minutes #8287

Open · 2 tasks
yarongilor opened this issue Aug 7, 2024 · 7 comments · May be fixed by #8705

@yarongilor
Contributor

Packages

Scylla version: 6.1.0~dev-20240625.c80dc5715668 with build-id bf0032dbaafe5e4d3e01ece0dcb7785d2ec7a098

Kernel Version: 5.15.0-1063-aws

Issue description

  • This issue is a regression.
  • It is unknown if this issue is a regression.

Describe your issue in detail and the steps it took to produce it.

A (side-effect) follow-up of scylladb/scylladb#19573 (comment).

The run_fullscan is configured to run every minute:

run_fullscan: ['{"mode": "table_and_aggregate", "ks_cf": "keyspace1.standard1", "interval": 60}']

So the thread probably failed for some unexpected reason.

Impact

Describe the impact this issue causes to the user.

How frequently does it reproduce?

Describe the frequency with which this issue can be reproduced.

Installation details

Cluster size: 6 nodes (i4i.4xlarge)

Scylla Nodes used in this run:

  • longevity-tls-50gb-3d-master-db-node-04447069-7 (52.49.241.60 | 10.4.22.189) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-04447069-6 (108.128.184.168 | 10.4.22.96) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-04447069-5 (54.220.255.133 | 10.4.22.193) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-04447069-4 (34.241.17.128 | 10.4.22.222) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-04447069-3 (52.210.187.188 | 10.4.23.139) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-04447069-2 (52.16.185.73 | 10.4.23.195) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-04447069-1 (54.155.186.240 | 10.4.21.83) (shards: 14)

OS / Image: ami-09006ca344092e50b (aws: undefined_region)

Test: longevity-50gb-3days-test
Test id: 04447069-14c9-47f9-8cf8-fe71a0ec1276
Test name: scylla-master/tier1/longevity-50gb-3days-test

Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor 04447069-14c9-47f9-8cf8-fe71a0ec1276
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 04447069-14c9-47f9-8cf8-fe71a0ec1276

Logs:

Jenkins job URL
Argus

@yarongilor yarongilor self-assigned this Aug 7, 2024
@fruch
Contributor

fruch commented Aug 7, 2024

@yarongilor, as the one who wrote that feature, I would expect to have a bit more insight into why this has failed, and why the checks aren't continuing.

i.e. I don't expect to ever see a ThreadFailedEvent; if one pops up, it means the code didn't handle some exception and that the thread stopped executing.

@yarongilor
Contributor Author

Probably failed for the same root cause as #8292.

@fruch
Contributor

fruch commented Sep 12, 2024

> Probably failed for the same root cause as #8292.

@yarongilor, #8292 is now closed; does that mean this can be closed?

@yarongilor
Contributor Author

> Probably failed for the same root cause as #8292.
>
> @yarongilor, #8292 is now closed; does that mean this can be closed?

No, sorry, that issue was mistakenly linked and is not related.
As for what happened here, it seems that during a refactor by @temichus , in this commit:
64ded4b#diff-75fbcbc6f5d81770abbb95177a9f3b80c7a7f809503152dcc17d64a779637a52

the result.fetch_next_page() call was deleted from def fetch_result_pages(self, result, read_pages),
so it should probably be put back in its place.
The thread failed quietly, like this:

< t:2024-06-29 03:30:13,846 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 2024-06-29 03:30:13.844: (FullScanEvent Severity.NORMAL) period_type=end event_id=9df0e7db-f66b-487d-8b67-316809a0aa05 duration=0s node=longevity-tls-50gb-3d-master-db-node-04447069-6 select_from=keyspace1.standard1 message=FullScanOperation operation ended successfully
< t:2024-06-29 03:31:13,899 f:operations_thread.py l:147  c:ScanOperationThread  p:DEBUG > Going to run operation FullScanOperation
< t:2024-06-29 03:31:14,221 f:scan_operation_thread.py l:123  c:FullScanOperation    p:DEBUG > Will run command SELECT * from keyspace1.standard1 BYPASS CACHE USING TIMEOUT 300s
< t:2024-06-29 03:31:14,400 f:scan_operation_thread.py l:188  c:FullScanOperation    p:DEBUG > Will fetch up to 1000 result pages..
< t:2024-06-29 03:31:14,428 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 2024-06-29 03:31:14.426: (FullScanEvent Severity.NORMAL) period_type=end event_id=bdd695d8-55ab-4795-8730-d3a42f60b4ef duration=0s node=longevity-tls-50gb-3d-master-db-node-04447069-6 select_from=keyspace1.standard1 message=FullScanOperation operation ended successfully
< t:2024-06-29 03:32:14,467 f:operations_thread.py l:147  c:ScanOperationThread  p:DEBUG > Going to run operation FullScanOperation
< t:2024-06-29 03:32:14,811 f:scan_operation_thread.py l:123  c:FullScanOperation    p:DEBUG > Will run command SELECT * from keyspace1.standard1 BYPASS CACHE USING TIMEOUT 300s
< t:2024-06-29 03:32:15,044 f:scan_operation_thread.py l:188  c:FullScanOperation    p:DEBUG > Will fetch up to 0 result pages..
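
For illustration, a minimal sketch (not the exact SCT code) of how fetch_result_pages presumably needs to look once the deleted call is restored; result is assumed to be a ResultSet from the Python cassandra-driver, which exposes has_more_pages and fetch_next_page():

    # Hedged sketch: the paging loop with the result.fetch_next_page() call put back.
    # "result" is assumed to be a cassandra-driver ResultSet.
    def fetch_result_pages(self, result, read_pages):
        pages = 0
        while result.has_more_pages and pages <= read_pages:
            # Restored call: fetches the next page so has_more_pages can eventually become False.
            result.fetch_next_page()
            if read_pages > 0:
                pages += 1

Without that call the loop body never advances the result, which matches the silent hang seen in the logs above.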

@temichus
Contributor

temichus commented Sep 15, 2024

2 minutes?

I see the 1st result report:

< t:2024-06-29 03:22:52,894 f:operations_thread.py l:156  c:ScanOperationThread  p:DEBUG > +------------------------+--------------------+------------+------------------+----------------+---------+---------------------------------------------------------------------------+
< t:2024-06-29 03:22:52,894 f:operations_thread.py l:156  c:ScanOperationThread  p:DEBUG > |        op_type         |      duration      | exceptions | nemesis_at_start | nemesis_at_end | success |                                    cmd                                    |
< t:2024-06-29 03:22:52,894 f:operations_thread.py l:156  c:ScanOperationThread  p:DEBUG > +------------------------+--------------------+------------+------------------+----------------+---------+---------------------------------------------------------------------------+
< t:2024-06-29 03:22:52,894 f:operations_thread.py l:156  c:ScanOperationThread  p:DEBUG > | FullScanAggregateEvent | 24.877952337265015 |            |       None       |      None      |   True  | SELECT count(*) FROM keyspace1.standard1 BYPASS CACHE USING TIMEOUT 1800s |
< t:2024-06-29 03:22:52,894 f:operations_thread.py l:156  c:ScanOperationThread  p:DEBUG > +------------------------+--------------------+------------+------------------+----------------+---------+---------------------------------------------------------------------------+


and the last:

< t:2024-06-29 03:31:14,429 f:operations_thread.py l:156  c:ScanOperationThread  p:DEBUG > Thread stats:
< t:2024-06-29 03:31:14,429 f:operations_thread.py l:156  c:ScanOperationThread  p:DEBUG > +------------------------+---------------------+------------+------------------+----------------+---------+---------------------------------------------------------------------------+
< t:2024-06-29 03:31:14,429 f:operations_thread.py l:156  c:ScanOperationThread  p:DEBUG > |        op_type         |       duration      | exceptions | nemesis_at_start | nemesis_at_end | success |                                    cmd                                    |
< t:2024-06-29 03:31:14,429 f:operations_thread.py l:156  c:ScanOperationThread  p:DEBUG > +------------------------+---------------------+------------+------------------+----------------+---------+---------------------------------------------------------------------------+
< t:2024-06-29 03:31:14,429 f:operations_thread.py l:156  c:ScanOperationThread  p:DEBUG > | FullScanAggregateEvent |  24.877952337265015 |            |       None       |      None      |   True  | SELECT count(*) FROM keyspace1.standard1 BYPASS CACHE USING TIMEOUT 1800s |
< t:2024-06-29 03:31:14,429 f:operations_thread.py l:156  c:ScanOperationThread  p:DEBUG > |     FullScanEvent      |  0.3228580951690674 |            |       None       |      None      |   True  |     SELECT * from keyspace1.standard1 BYPASS CACHE USING TIMEOUT 300s     |
< t:2024-06-29 03:31:14,429 f:operations_thread.py l:156  c:ScanOperationThread  p:DEBUG > | FullScanAggregateEvent |  28.062369346618652 |            |       None       |      None      |   True  | SELECT count(*) FROM keyspace1.standard1 BYPASS CACHE USING TIMEOUT 1800s |
< t:2024-06-29 03:31:14,429 f:operations_thread.py l:156  c:ScanOperationThread  p:DEBUG > | FullScanAggregateEvent |  21.24626350402832  |            |       None       |      None      |   True  | SELECT count(*) FROM keyspace1.standard1 BYPASS CACHE USING TIMEOUT 1800s |
< t:2024-06-29 03:31:14,429 f:operations_thread.py l:156  c:ScanOperationThread  p:DEBUG > | FullScanAggregateEvent |  28.70813512802124  |            |       None       |      None      |   True  | SELECT count(*) FROM keyspace1.standard1 BYPASS CACHE USING TIMEOUT 1800s |
< t:2024-06-29 03:31:14,429 f:operations_thread.py l:156  c:ScanOperationThread  p:DEBUG > |     FullScanEvent      | 0.17992639541625977 |            |       None       |      None      |   True  |     SELECT * from keyspace1.standard1 BYPASS CACHE USING TIMEOUT 300s     |
< t:2024-06-29 03:31:14,429 f:operations_thread.py l:156  c:ScanOperationThread  p:DEBUG > |     FullScanEvent      | 0.12150192260742188 |            |       None       |      None      |   True  |     SELECT * from keyspace1.standard1 BYPASS CACHE USING TIMEOUT 300s     |
< t:2024-06-29 03:31:14,429 f:operations_thread.py l:156  c:ScanOperationThread  p:DEBUG > |     FullScanEvent      | 0.17914700508117676 |            |       None       |      None      |   True  |     SELECT * from keyspace1.standard1 BYPASS CACHE USING TIMEOUT 300s     |
< t:2024-06-29 03:31:14,429 f:operations_thread.py l:156  c:ScanOperationThread  p:DEBUG > +------------------------+---------------------+------------+------------------+----------------+---------+---------------------------------------------------------------------------+

The last execution attempt that I see:

< t:2024-06-29 03:32:14,467 f:operations_thread.py l:140  c:ScanOperationThread  p:DEBUG > Operation instance map: {'table_and_aggregate': <function ScanOperationThread.__init__.<locals>.<lambda> at 0x7f5e66e96dd0>, 'random': <function ScanOperationThread.__init__.<locals>.<lambda> at 0x7f5e66e97910>, 'table': <function ScanOperationThread.__init__.<locals>.<lambda> at 0x7f5e66e975b0>, 'partition': <function ScanOperationThread.__init__.<locals>.<lambda> at 0x7f5e66e96ef0>, 'aggregate': <function ScanOperationThread.__init__.<locals>.<lambda> at 0x7f5e66e979a0>}
< t:2024-06-29 03:32:14,467 f:operations_thread.py l:147  c:ScanOperationThread  p:DEBUG > Going to run operation FullScanOperation
< t:2024-06-29 03:32:14,469 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 2024-06-29 03:32:14.467: (FullScanEvent Severity.NORMAL) period_type=begin event_id=709bd269-6371-49a2-9fb0-5bd0e4a5f8e0 node=longevity-tls-50gb-3d-master-db-node-04447069-6 select_from=keyspace1.standard1 message=Will run command SELECT * from keyspace1.standard1 BYPASS CACHE USING TIMEOUT 300s

After that, silence; it is not clear why.

But the thread is alive at the end of the test:

2024-06-30 04:17:59,475 f:threads_and_processes_alive.py l:66   c:sdcm.utils.threads_and_processes_alive p:ERROR > There are some threads left alive from following modules: sdcm.tester,multiprocessing.queues,socketserver,cassandra.io.libevreactor,sdcm.utils.operations_thread,sdcm.nemesis,cassandra.cluster,cassandra.connection,concurrent.futures.thread,sdcm.remote.libssh2_client

@temichus
Contributor

> No, sorry, that issue was mistakenly linked and is not related. As for what happened here, it seems that during a refactor by @temichus , in this commit: 64ded4b#diff-75fbcbc6f5d81770abbb95177a9f3b80c7a7f809503152dcc17d64a779637a52, the result.fetch_next_page() call was deleted from def fetch_result_pages(self, result, read_pages), so it should probably be put back in its place. The thread failed quietly.

This commit was merged 1.5 years ago,

and before this log line the thread was working:

< t:2024-06-29 03:31:14,400 f:scan_operation_thread.py l:188  c:FullScanOperation    p:DEBUG > Will fetch up to 1000 result pages..

@yarongilor
Contributor Author

> This commit was merged 1.5 years ago, and before this log line the thread was working:
>
> < t:2024-06-29 03:31:14,400 f:scan_operation_thread.py l:188  c:FullScanOperation    p:DEBUG > Will fetch up to 1000 result pages..

@temichus , it only hits the problem when read_pages = 0.
It is left running in an endless loop of:

        while result.has_more_pages and pages <= read_pages:
            if read_pages > 0:
                pages += 1
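
For illustration only, a hypothetical stand-in (illustrative names, not SCT or driver code) that reproduces the hang: with read_pages = 0 the pages counter never increments, and without result.fetch_next_page() has_more_pages never changes, so the loop condition stays true forever.

    # Hypothetical minimal reproduction. FakeResult mimics a paged result whose
    # has_more_pages only turns False after fetch_next_page() has drained every page.
    class FakeResult:
        def __init__(self, total_pages):
            self._pages_left = total_pages

        @property
        def has_more_pages(self):
            return self._pages_left > 0

        def fetch_next_page(self):
            self._pages_left -= 1


    def fetch_result_pages(result, read_pages):
        pages = 0
        while result.has_more_pages and pages <= read_pages:
            result.fetch_next_page()  # remove this call and the loop never makes progress
            if read_pages > 0:
                pages += 1


    # Terminates only because fetch_next_page() is called; with read_pages=0 and the
    # call missing, the loop spins forever, as described above.
    fetch_result_pages(FakeResult(total_pages=5), read_pages=0)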

Sending a fix now.

yarongilor pushed a commit to yarongilor/scylla-cluster-tests that referenced this issue Sep 15, 2024
…pages

fetch_result_pages had a "while" loop where a call to result.fetch_next_page()
is missing. Adding it back.
Fixes: scylladb#8287