storage engine initialization takes too long #2215

Closed
532910 opened this issue Jan 26, 2023 · 11 comments
Labels: bug (Something isn't working), I4 (No visible changes), neofs-storage (Storage node application issues), S4 (Routine), U3 (Regular)

Comments

532910 commented Jan 26, 2023

8:16:39  neofs-node/main.go:76  initializing storage engine service...
8:23:01  writecache/init.go:14  filling flush marks for objects in FSTree        {"shard_id": "YWVQLsdMz451ShJzUUKqD2"}
8:23:01  writecache/init.go:25  filling flush marks for objects in database        {"shard_id": "YWVQLsdMz451ShJzUUKqD2"}
8:23:01  writecache/init.go:60  finished updating flush marks        {"shard_id": "YWVQLsdMz451ShJzUUKqD2"}
8:23:55  writecache/init.go:14  filling flush marks for objects in FSTree        {"shard_id": "XidSLGUWrwRTFALB3tpD6p"}
8:23:55  writecache/init.go:25  filling flush marks for objects in database        {"shard_id": "XidSLGUWrwRTFALB3tpD6p"}
8:23:55  writecache/init.go:60  finished updating flush marks        {"shard_id": "XidSLGUWrwRTFALB3tpD6p"}
8:23:55  neofs-node/main.go:78  storage engine service has been successfully initialized
8:23:55  neofs-node/main.go:76  initializing gRPC service...
roman-khimov (Member) commented:

How much data does this node have? Number of shards, sizes, number of files in them?

fyrchik (Contributor) commented Jan 27, 2023

It is a known problem with the write-cache:

  1. After a hard reset, some objects may be present in the cache but not in the main storage (actually, this is also true for a regular shutdown).
  2. To know which of them can be removed, we must check individually whether each of them was flushed.
  3. This check takes time proportional to the number of objects in the write-cache.

I thought there was a GH issue for this, but I cannot find it now. One approach to fixing it is to initialize the cache asynchronously (start the check in the background -> return ErrInitializing for modifying operations while it runs -> finish the check).
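
A minimal sketch of that approach, assuming a hypothetical ErrInitializing sentinel and method names (this is not the actual neofs-node API, just an illustration of the pattern):

    package writecache

    import (
        "errors"
        "sync/atomic"
    )

    // ErrInitializing is returned for modifying operations while the
    // background flush-mark check is still running (hypothetical name).
    var ErrInitializing = errors.New("write-cache is initializing")

    type cache struct {
        initDone atomic.Bool // set once the background check finishes
    }

    // Init starts the slow flush-mark check in the background instead of
    // blocking storage engine initialization on it.
    func (c *cache) Init() error {
        go func() {
            c.fillFlushMarks() // the per-object "was it flushed?" check
            c.initDone.Store(true)
        }()
        return nil
    }

    // Put is an example of a modifying operation: it is rejected until
    // the background check has completed.
    func (c *cache) Put(obj []byte) error {
        if !c.initDone.Load() {
            return ErrInitializing
        }
        // ... actual write path ...
        return nil
    }

    func (c *cache) fillFlushMarks() {
        // Walk FSTree and the database, marking objects that were already
        // flushed to the main storage (omitted here).
    }

Read-only operations could presumably still be served during the check, since only modifying operations risk touching objects whose flush state is not yet known.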

fyrchik (Contributor) commented Jan 27, 2023

To give you some numbers, it can take ~40 minutes on a dedicated server with ~10 shards and ~100GB cache, so it is worth fixing IMO.

532910 (Author) commented Jan 27, 2023

This issue is about the first start, when no write-cache exists at all.

How much data does this node have? Number of shards, sizes, number of files in them?

It has no data, 2 shards.

roman-khimov (Member) commented:

@fyrchik, I don't see any issue tracking this, do we have one? Because @532910's problem is a different one.

fyrchik (Contributor) commented Feb 27, 2023

@532910, has the initialization finished at some point? If not, this could be related to a recent deadlock in the morph client (a.k.a. subscribe without channel readers).
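
The deadlock pattern referred to here, a subscription channel that nobody reads, looks roughly like this; an illustrative Go sketch only, not the actual morph client code:

    package main

    func main() {
        notifications := make(chan string) // unbuffered subscription channel

        // The dispatcher tries to deliver an event to the subscriber.
        go func() {
            notifications <- "new block" // blocks forever: nobody ever receives
        }()

        // Meanwhile, initialization waits for something that can only happen
        // after the dispatcher makes progress, and never reads notifications.
        done := make(chan struct{})
        <-done // every goroutine is now asleep -> the runtime reports a deadlock
    }

With a reader draining the notifications channel in a separate goroutine, both sides would make progress and initialization would complete.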

532910 (Author) commented Feb 27, 2023

Yes, it works fine; only the initialization takes too long.

vvarg229 (Collaborator) commented:

This is reproduced on the dev-env of the current version:
https://github.com/nspcc-dev/neofs-dev-env/blob/2d67bc26f76ed072a5a093d03661b9a072f35360/.env
Logs

On the dev-env, the issue is solved by changing the storage healthcheck timeouts in the https://github.com/nspcc-dev/neofs-dev-env/blob/2d67bc26f76ed072a5a093d03661b9a072f35360/services/storage/docker-compose.yml file to larger values, for example:

    healthcheck:
      test: ["CMD-SHELL", "/healthcheck.sh"]
      interval: 5s
      timeout: 5s
      retries: 5
      start_period: 20s
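
Presumably the larger start_period and timeout give the node enough time to finish storage engine initialization before Docker starts counting failed healthchecks against it (start_period is the grace period during which probe failures do not count toward retries).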

It is important to note that this was only reproduced on a Debian 11 machine with kernel version 5.10.0-21-amd64.
On a more recent Ubuntu 22.10 with kernel 5.19.0-38-generic, the issue does not reproduce.

vvarg229 added a commit to vvarg229/neofs-dev-env that referenced this issue Apr 18, 2023
Temporarily adjust healthcheck settings in docker-compose file to mitigate
issue #2215, until a root cause is found and resolved.
Changes include:
- Increase interval from 2s to 5s
- Increase timeout from 1s to 5s
- Increase start_period from 10s to 20s

Refs: nspcc-dev/neofs-node#2215

Signed-off-by: Oleg Kulachenko <[email protected]>
roman-khimov (Member) commented:

Is it still an issue? Shouldn't be.

roman-khimov added the bug (Something isn't working), U3 (Regular), S4 (Routine), I4 (No visible changes) and neofs-storage (Storage node application issues) labels on Dec 21, 2023
vvarg229 (Collaborator) commented:

Is it still an issue? Shouldn't be.

Looks like it was fixed in nspcc-dev/neofs-dev-env#251.

roman-khimov (Member) commented:

The init part seems to be obsolete now. The write-cache part is tracked in #2337.

roman-khimov closed this as not planned on Dec 22, 2023