storage engine initialization takes too long #2215

Closed
532910 opened this issue Jan 26, 2023 · 11 comments
Labels: bug (Something isn't working), I4 (No visible changes), neofs-storage (Storage node application issues), S4 (Routine), U3 (Regular)

Comments

532910 commented Jan 26, 2023

8:16:39  neofs-node/main.go:76  initializing storage engine service...
8:23:01  writecache/init.go:14  filling flush marks for objects in FSTree        {"shard_id": "YWVQLsdMz451ShJzUUKqD2"}
8:23:01  writecache/init.go:25  filling flush marks for objects in database        {"shard_id": "YWVQLsdMz451ShJzUUKqD2"}
8:23:01  writecache/init.go:60  finished updating flush marks        {"shard_id": "YWVQLsdMz451ShJzUUKqD2"}
8:23:55  writecache/init.go:14  filling flush marks for objects in FSTree        {"shard_id": "XidSLGUWrwRTFALB3tpD6p"}
8:23:55  writecache/init.go:25  filling flush marks for objects in database        {"shard_id": "XidSLGUWrwRTFALB3tpD6p"}
8:23:55  writecache/init.go:60  finished updating flush marks        {"shard_id": "XidSLGUWrwRTFALB3tpD6p"}
8:23:55  neofs-node/main.go:78  storage engine service has been successfully initialized
8:23:55  neofs-node/main.go:76  initializing gRPC service...
roman-khimov (Member) commented:

How much data does this node have? Number of shards, sizes, number of files in them?

fyrchik (Contributor) commented Jan 27, 2023

It is a known problem with the write-cache:

  1. After a hard reset, some objects may be present in the cache but not in the main storage (actually, this is also true for a regular shutdown).
  2. To know which of them can be removed, we must check individually whether each of them was flushed.
  3. This check takes time proportional to the number of objects in the write-cache.

I thought there was a GH issue for this, but I cannot find it now. One approach to fixing it is to initialize the cache asynchronously (start the check in the background -> return ErrInitializing for modifying operations while it runs -> finish the check).
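
A minimal sketch of that approach, assuming a hypothetical ErrInitializing sentinel and method names (this is not the actual neofs-node API, just an illustration of the pattern):

    package writecache

    import (
        "errors"
        "sync/atomic"
    )

    // ErrInitializing is returned for modifying operations while the
    // background flush-mark check is still running (hypothetical name).
    var ErrInitializing = errors.New("write-cache is initializing")

    type cache struct {
        initDone atomic.Bool // set once the background check finishes
    }

    // Init starts the slow flush-mark check in the background instead of
    // blocking storage engine initialization on it.
    func (c *cache) Init() error {
        go func() {
            c.fillFlushMarks() // the per-object "was it flushed?" check
            c.initDone.Store(true)
        }()
        return nil
    }

    // Put is an example of a modifying operation: it is rejected until
    // the background check has completed.
    func (c *cache) Put(obj []byte) error {
        if !c.initDone.Load() {
            return ErrInitializing
        }
        // ... actual write path ...
        return nil
    }

    func (c *cache) fillFlushMarks() {
        // Walk FSTree and the database, marking objects that were already
        // flushed to the main storage (omitted here).
    }

Read-only operations could presumably still be served during the check, since only modifying operations risk touching objects whose flush state is not yet known.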

fyrchik (Contributor) commented Jan 27, 2023

To give you some numbers, it can take ~40 minutes on a dedicated server with ~10 shards and ~100GB cache, so it is worth fixing IMO.

532910 (Author) commented Jan 27, 2023

This issue is about the first start, when no write-cache exists at all.

How much data does this node have? Number of shards, sizes, number of files in them?

It has no data, 2 shards.

roman-khimov (Member) commented:

@fyrchik, I don't see any issue tracking this, do we have one? Because @532910's problem is a different one.

fyrchik (Contributor) commented Feb 27, 2023

@532910, has the initialization finished at some point? If not, this could be related to a recent deadlock in the morph client (a.k.a. subscribe without channel readers).
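
The deadlock pattern referred to here, a subscription channel that nobody reads, looks roughly like this; an illustrative Go sketch only, not the actual morph client code:

    package main

    func main() {
        notifications := make(chan string) // unbuffered subscription channel

        // The dispatcher tries to deliver an event to the subscriber.
        go func() {
            notifications <- "new block" // blocks forever: nobody ever receives
        }()

        // Meanwhile, initialization waits for something that can only happen
        // after the dispatcher makes progress, and never reads notifications.
        done := make(chan struct{})
        <-done // every goroutine is now asleep -> the runtime reports a deadlock
    }

With a reader draining the notifications channel in a separate goroutine, both sides would make progress and initialization would complete.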

532910 (Author) commented Feb 27, 2023

Yes, it works fine; only the initialization takes too long.

vvarg229 (Collaborator) commented:

This is reproduced on the dev-env of the current version:
https://github.com/nspcc-dev/neofs-dev-env/blob/2d67bc26f76ed072a5a093d03661b9a072f35360/.env
Logs

On the dev-env, the issue is solved by changing the storage healthcheck timeouts in the https://github.com/nspcc-dev/neofs-dev-env/blob/2d67bc26f76ed072a5a093d03661b9a072f35360/services/storage/docker-compose.yml file to larger values, for example:

    healthcheck:
      test: ["CMD-SHELL", "/healthcheck.sh"]
      interval: 5s
      timeout: 5s
      retries: 5
      start_period: 20s
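
Presumably the larger start_period and timeout give the node enough time to finish storage engine initialization before Docker starts counting failed healthchecks against it (start_period is the grace period during which probe failures do not count toward retries).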

It is important to note that this was only reproduced on a Debian 11 machine with kernel version 5.10.0-21-amd64.
On a more recent Ubuntu 22.10 with kernel 5.19.0-38-generic, the issue does not reproduce.

vvarg229 added a commit to vvarg229/neofs-dev-env that referenced this issue Apr 18, 2023
Temporarily adjust healthcheck settings in docker-compose file to mitigate
issue #2215, until a root cause is found and resolved.
Changes include:
- Increase interval from 2s to 5s
- Increase timeout from 1s to 5s
- Increase start_period from 10s to 20s

Refs: nspcc-dev/neofs-node#2215

Signed-off-by: Oleg Kulachenko <[email protected]>
roman-khimov (Member) commented:

Is it still an issue? Shouldn't be.

roman-khimov added the bug (Something isn't working), U3 (Regular), S4 (Routine), I4 (No visible changes) and neofs-storage (Storage node application issues) labels on Dec 21, 2023
vvarg229 (Collaborator) commented:

Is it still an issue? Shouldn't be.

Looks like it was fixed in nspcc-dev/neofs-dev-env#251.

roman-khimov (Member) commented:

The init part seems to be obsolete now. The write-cache part is tracked in #2337.

roman-khimov closed this as not planned on Dec 22, 2023