[v0.16.0] All node labels removed while informer cache failed to sync #1802
I think this happens when upgrading from NFD v0.13 or earlier. The specific reason is the NodeFeature CRD API that was introduced in v0.14 (earlier versions of NFD relied on gRPC communication between nfd-master and nfd-worker instances). What happens (I think) is that nfd-master starts quickly (and syncs its NodeFeature cache) as there are no NodeFeature objects in the cluster. Because of that, it starts removing labels from nodes (as it sees no features published for them by the nfd-worker instances). At the same time the nfd-worker instances start up and begin pushing NodeFeature objects to the apiserver; there are a ton of them and the cache throws timeout errors. Eventually things "stabilize" and the node labels come back. The way to "fix" this would be to upgrade the nfd-workers first, so that the NodeFeature objects are created before nfd-master kicks in. Maybe a migration guide from pre-v0.14?
Hi @marquiz, let me provide more context about the situation:
The logs indicate that the nfd-master pod failed to list the NodeFeature objects due to a 1-minute request timeout, although the NodeFeature objects do exist in the cluster. The large size of these objects likely contributed to the timeout.
In this situation, the nfd-master removes all the labels from nodes and does not add them back. The nfd-master invokes the node label updates here. The labels were removed because an empty NodeFeature object was created here and passed to that function. It seems the current logic assumes that a NodeFeature object missing from the sharedInformer cache is equivalent to the NodeFeature being missing from the cluster. This causes all the labels to be removed and never added back.
@lxlxok I'm still not convinced that this is the issue. The nfd-master blocks until all the caches are synced (here). It will never get to the node update part if/when the informer cache sync keeps timing out. The timeout is another issue, though. That needs to be investigated/solved. Looking at the log that @ahmetb linked, nfd-master tried syncing caches for 1.5 hours without succeeding.
@marquiz any ideas why the logs show "timeout" for informer sync many minutes after they say "node updated"? I'm trying to understand whether the logs show any indication of the cache sync ever succeeding, because the program goes into updating nodes shortly after booting.
Hi @marquiz, it doesn't block nfd-master from running the node update part. The WaitForCacheSync function returns the res map immediately here after it loops over all the informers. This also explains why updaterPool starts immediately after m.startNfdApiController here at 04:29:46. The cache of NodeFeature was populated at 04:30:01. We tested node-feature-discovery 0.16.0 in kind. Please see the verbose log below for more detail.
That DOES block until the caches have synced. However, in v0.16.0 we don't call
You won't see this kind of log with the latest version (v0.16.3). However, there are other issues you are likely to face. I submitted #1810 and #1811 to mitigate those.
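For reference, this is a minimal sketch of the standard client-go cache-sync pattern being discussed above (the clientset, informer, and messages here are generic placeholders, not NFD's actual code paths):

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
	factory.Core().V1().Nodes().Informer() // register at least one informer

	stopCh := make(chan struct{})
	defer close(stopCh)
	factory.Start(stopCh)

	// WaitForCacheSync waits per registered informer until its initial LIST
	// has completed or stopCh is closed, then reports the outcome per type.
	// If the caller ignores the returned map (or stopCh closes early), it can
	// continue with caches that never actually synced.
	res := factory.WaitForCacheSync(stopCh)
	for typ, synced := range res {
		if !synced {
			fmt.Printf("cache for %v never synced; node updates should not start\n", typ)
			return
		}
	}
	fmt.Println("all caches synced; safe to reconcile node labels")
}
```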
@marquiz thanks for the responses.
Would the lack of
I see v0.16.3 actually does have a
By the way, node-feature-discovery/pkg/nfd-gc/nfd-gc.go line 211 in bd8d74d
Yes, exactly that.
It should've. But I suspect that in a BIG cluster with v0.16.3 you'll hit the "timeout problem" (that #1811 addresses) and
Thanks for pointing that out. It's not critical there, but I think it's good to be consistent and start from a known/synced state. I'll submit a PR to fix that one.
What happened:
While migrating from 0.10 to 0.16.0:
kubectl get nodefeatures -n node-feature-discovery
was unresponsive at the time (likely because the cluster size is 4000 nodes and the NodeFeature CR objects are 130 kB each by default).

What you expected to happen:
How to reproduce it (as minimally and precisely as possible): Run nfd chart by default on a 4000 node cluster.
Anything else we need to know?:
There were extensive informer sync errors in nfd-master logs (seemingly timing out after 60s). This is likely because LIST NodeFeatures is a very expensive call (each object is very large and there are a lot of Nodes in the cluster).

Attaching logs: nfd-master.log
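As a rough back-of-envelope check (assuming the figures above are accurate): 4000 NodeFeature objects at ~130 kB each is on the order of 4000 × 130 kB ≈ 520 MB per full LIST, which makes a 60-second request timeout plausible.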
My suspicion is that the nfd-master somehow does not wait for the informer cache to sync (as the first informer sync error occurs exactly 60s after the process starts), and it treats the lack of a response as an "empty set of labels" and clears the labels. (But I'm not familiar with the inner workings of the codebase, it's just a theory.)

💡 We don't see the issue on much smaller clusters.
💡 We have not yet tried v0.16.2 (the release notes mention it fixes a node removal issue, but it's not clear what the root cause was there).
Environment:
- Kubernetes version (kubectl version): v1.23.17
- OS (cat /etc/os-release): Not applicable
- Kernel (uname -a): Not applicable