
TargetDown alert #190

Open
rbo opened this issue Jul 12, 2024 · 4 comments
Labels
cluster/isar BareMetal COE Cluter

Comments


rbo commented Jul 12, 2024

100% of the alertmanager-metrics/alertmanager-metrics targets in open-cluster-management-observability namespace have been unreachable for more than 15 minutes. This may be a symptom of network connectivity issues, down nodes, or failures within these components. Assess the health of the infrastructure and nodes running these targets and then contact support.
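The alert means Prometheus has received no response from any of these scrape endpoints for 15 minutes. One way to probe an endpoint by hand (hedged sketch: the pod name and port 9096 come from the service description below; even a 401 from kube-rbac-proxy would show the port itself is reachable):

```shell
# Forward the metrics port of one alertmanager pod and probe it directly.
# -k skips verification of the internal serving certificate.
oc -n open-cluster-management-observability port-forward observability-alertmanager-0 9096:9096 &
PF_PID=$!
sleep 2
# Any HTTP status (e.g. 401 from kube-rbac-proxy) means the target answers;
# a connection error means it is genuinely unreachable.
curl -sk -o /dev/null -w '%{http_code}\n' https://localhost:9096/metrics
kill "$PF_PID"
```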

/cc @DanielFroehlich

@rbo rbo added the cluster/isar BareMetal COE Cluter label Jul 12, 2024

rbo commented Jul 12, 2024

$ oc describe -n open-cluster-management-observability  svc alertmanager-metrics
Name:              alertmanager-metrics
Namespace:         open-cluster-management-observability
Labels:            app=multicluster-observability-alertmanager-metrics
Annotations:       service.alpha.openshift.io/serving-cert-signed-by: openshift-service-serving-signer@1700045229
                   service.beta.openshift.io/serving-cert-secret-name: alertmanager-tls-metrics
                   service.beta.openshift.io/serving-cert-signed-by: openshift-service-serving-signer@1700045229
Selector:          alertmanager=observability,app=multicluster-observability-alertmanager
Type:              ClusterIP
IP Family Policy:  SingleStack
IP Families:       IPv4
IP:                172.30.136.253
IPs:               172.30.136.253
Port:              metrics  9096/TCP
TargetPort:        metrics/TCP
Endpoints:         10.128.11.126:9096,10.130.12.5:9096,10.131.15.7:9096
Session Affinity:  None
Events:            <none>
$ oc get pods -l alertmanager=observability,app=multicluster-observability-alertmanager
NAME                           READY   STATUS    RESTARTS   AGE
observability-alertmanager-0   4/4     Running   0          16d
observability-alertmanager-1   4/4     Running   0          16d
observability-alertmanager-2   4/4     Running   0          16d
$ oc logs observability-alertmanager-0
Defaulted container "alertmanager" out of: alertmanager, config-reloader, alertmanager-proxy, kube-rbac-proxy
ts=2024-06-26T08:34:51.585Z caller=main.go:240 level=info msg="Starting Alertmanager" version="(version=0.25.0, branch=non-git, revision=non-git)"
ts=2024-06-26T08:34:51.585Z caller=main.go:241 level=info build_context="(go=go1.21.9 (Red Hat 1.21.9-1.el9_4) X:strictfipsruntime, platform=linux/amd64, user=root@cd720cdc1cd3, date=20240502-09:12:20, tags=netgo)"
ts=2024-06-26T08:34:51.630Z caller=cluster.go:261 level=warn component=cluster msg="failed to join cluster" err="3 errors occurred:\n\t* Failed to resolve observability-alertmanager-0.alertmanager-operated.open-cluster-management-observability.svc:9094: lookup observability-alertmanager-0.alertmanager-operated.open-cluster-management-observability.svc on 172.30.0.10:53: no such host\n\t* Failed to resolve observability-alertmanager-1.alertmanager-operated.open-cluster-management-observability.svc:9094: lookup observability-alertmanager-1.alertmanager-operated.open-cluster-management-observability.svc on 172.30.0.10:53: no such host\n\t* Failed to resolve observability-alertmanager-2.alertmanager-operated.open-cluster-management-observability.svc:9094: lookup observability-alertmanager-2.alertmanager-operated.open-cluster-management-observability.svc on 172.30.0.10:53: no such host\n\n"
ts=2024-06-26T08:34:51.630Z caller=cluster.go:263 level=info component=cluster msg="will retry joining cluster every 10s"
ts=2024-06-26T08:34:51.630Z caller=main.go:338 level=warn msg="unable to join gossip mesh" err="3 errors occurred:\n\t* Failed to resolve observability-alertmanager-0.alertmanager-operated.open-cluster-management-observability.svc:9094: lookup observability-alertmanager-0.alertmanager-operated.open-cluster-management-observability.svc on 172.30.0.10:53: no such host\n\t* Failed to resolve observability-alertmanager-1.alertmanager-operated.open-cluster-management-observability.svc:9094: lookup observability-alertmanager-1.alertmanager-operated.open-cluster-management-observability.svc on 172.30.0.10:53: no such host\n\t* Failed to resolve observability-alertmanager-2.alertmanager-operated.open-cluster-management-observability.svc:9094: lookup observability-alertmanager-2.alertmanager-operated.open-cluster-management-observability.svc on 172.30.0.10:53: no such host\n\n"
ts=2024-06-26T08:34:51.630Z caller=cluster.go:681 level=info component=cluster msg="Waiting for gossip to settle..." interval=2s
ts=2024-06-26T08:34:51.678Z caller=coordinator.go:113 level=info component=configuration msg="Loading configuration file" file=/etc/alertmanager/config/alertmanager.yaml
ts=2024-06-26T08:34:51.678Z caller=coordinator.go:126 level=info component=configuration msg="Completed loading of configuration file" file=/etc/alertmanager/config/alertmanager.yaml
ts=2024-06-26T08:34:51.682Z caller=tls_config.go:274 level=info msg="Listening on" address=127.0.0.1:9093
ts=2024-06-26T08:34:51.682Z caller=tls_config.go:277 level=info msg="TLS is disabled." http2=false address=127.0.0.1:9093
ts=2024-06-26T08:34:53.631Z caller=cluster.go:706 level=info component=cluster msg="gossip not settled" polls=0 before=0 now=1 elapsed=2.000810406s
ts=2024-06-26T08:35:01.634Z caller=cluster.go:698 level=info component=cluster msg="gossip settled; proceeding" elapsed=10.003464146s
ts=2024-06-26T08:35:06.648Z caller=cluster.go:471 level=warn component=cluster msg=refresh result=failure addr=observability-alertmanager-0.alertmanager-operated.open-cluster-management-observability.svc:9094 err="1 error occurred:\n\t* Failed to resolve observability-alertmanager-0.alertmanager-operated.open-cluster-management-observability.svc:9094: lookup observability-alertmanager-0.alertmanager-operated.open-cluster-management-observability.svc on 172.30.0.10:53: no such host\n\n"
ts=2024-06-26T08:35:06.651Z caller=cluster.go:471 level=warn component=cluster msg=refresh result=failure addr=observability-alertmanager-1.alertmanager-operated.open-cluster-management-observability.svc:9094 err="1 error occurred:\n\t* Failed to resolve observability-alertmanager-1.alertmanager-operated.open-cluster-management-observability.svc:9094: lookup observability-alertmanager-1.alertmanager-operated.open-cluster-management-observability.svc on 172.30.0.10:53: no such host\n\n"
ts=2024-06-26T08:35:06.654Z caller=cluster.go:471 level=warn component=cluster msg=refresh result=failure addr=observability-alertmanager-2.alertmanager-operated.open-cluster-management-observability.svc:9094 err="1 error occurred:\n\t* Failed to resolve observability-alertmanager-2.alertmanager-operated.open-cluster-management-observability.svc:9094: lookup observability-alertmanager-2.alertmanager-operated.open-cluster-management-observability.svc on 172.30.0.10:53: no such host\n\n"
ts=2024-06-26T08:35:21.659Z caller=cluster.go:471 level=warn component=cluster msg=refresh result=failure addr=observability-alertmanager-0.alertmanager-operated.open-cluster-management-observability.svc:9094 err="1 error occurred:\n\t* Failed to resolve observability-alertmanager-0.alertmanager-operated.open-cluster-management-observability.svc:9094: lookup observability-alertmanager-0.alertmanager-operated.open-cluster-management-observability.svc on 172.30.0.10:53: no such host\n\n"
ts=2024-06-26T08:35:21.663Z caller=cluster.go:471 level=warn component=cluster msg=refresh result=failure addr=observability-alertmanager-1.alertmanager-operated.open-cluster-management-observability.svc:9094 err="1 error occurred:\n\t* Failed to resolve observability-alertmanager-1.alertmanager-operated.open-cluster-management-observability.svc:9094: lookup observability-alertmanager-1.alertmanager-operated.open-cluster-management-observability.svc on 172.30.0.10:53: no such host\n\n"
ts=2024-06-26T08:35:21.667Z caller=cluster.go:471 level=warn component=cluster msg=refresh result=failure addr=observability-alertmanager-2.alertmanager-operated.open-cluster-management-observability.svc:9094 err="1 error occurred:\n\t* Failed to resolve observability-alertmanager-2.alertmanager-operated.open-cluster-management-observability.svc:9094: lookup observability-alertmanager-2.alertmanager-operated.open-cluster-management-observability.svc on 172.30.0.10:53: no such host\n\n"
ts=2024-06-26T08:35:36.648Z caller=cluster.go:471 level=warn component=cluster msg=refresh result=failure addr=observability-alertmanager-1.alertmanager-operated.open-cluster-management-observability.svc:9094 err="1 error occurred:\n\t* Failed to resolve observability-alertmanager-1.alertmanager-operated.open-cluster-management-observability.svc:9094: lookup observability-alertmanager-1.alertmanager-operated.open-cluster-management-observability.svc on 172.30.0.10:53: no such host\n\n"
ts=2024-06-26T08:35:36.651Z caller=cluster.go:471 level=warn component=cluster msg=refresh result=failure addr=observability-alertmanager-2.alertmanager-operated.open-cluster-management-observability.svc:9094 err="1 error occurred:\n\t* Failed to resolve observability-alertmanager-2.alertmanager-operated.open-cluster-management-observability.svc:9094: lookup observability-alertmanager-2.alertmanager-operated.open-cluster-management-observability.svc on 172.30.0.10:53: no such host\n\n"
ts=2024-06-26T08:35:51.653Z caller=cluster.go:471 level=warn component=cluster msg=refresh result=failure addr=observability-alertmanager-1.alertmanager-operated.open-cluster-management-observability.svc:9094 err="1 error occurred:\n\t* Failed to resolve observability-alertmanager-1.alertmanager-operated.open-cluster-management-observability.svc:9094: lookup observability-alertmanager-1.alertmanager-operated.open-cluster-management-observability.svc on 172.30.0.10:53: no such host\n\n"
ts=2024-06-26T08:35:51.661Z caller=cluster.go:471 level=warn component=cluster msg=refresh result=failure addr=observability-alertmanager-2.alertmanager-operated.open-cluster-management-observability.svc:9094 err="1 error occurred:\n\t* Failed to resolve observability-alertmanager-2.alertmanager-operated.open-cluster-management-observability.svc:9094: lookup observability-alertmanager-2.alertmanager-operated.open-cluster-management-observability.svc on 172.30.0.10:53: no such host\n\n"
ts=2024-06-26T08:36:06.649Z caller=cluster.go:471 level=warn component=cluster msg=refresh result=failure addr=observability-alertmanager-2.alertmanager-operated.open-cluster-management-observability.svc:9094 err="1 error occurred:\n\t* Failed to resolve observability-alertmanager-2.alertmanager-operated.open-cluster-management-observability.svc:9094: lookup observability-alertmanager-2.alertmanager-operated.open-cluster-management-observability.svc on 172.30.0.10:53: no such host\n\n"
ts=2024-06-26T08:36:21.649Z caller=cluster.go:471 level=warn component=cluster msg=refresh result=failure addr=observability-alertmanager-2.alertmanager-operated.open-cluster-management-observability.svc:9094 err="1 error occurred:\n\t* Failed to resolve observability-alertmanager-2.alertmanager-operated.open-cluster-management-observability.svc:9094: lookup observability-alertmanager-2.alertmanager-operated.open-cluster-management-observability.svc on 172.30.0.10:53: no such host\n\n"


rbo commented Jul 12, 2024

$ oc get svc -n open-cluster-management-observability alertmanager-operated
NAME                    TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)             AGE
alertmanager-operated   ClusterIP   None         <none>        9094/TCP,9094/UDP   16d
$ oc describe svc -n open-cluster-management-observability alertmanager-operated
Name:              alertmanager-operated
Namespace:         open-cluster-management-observability
Labels:            <none>
Annotations:       <none>
Selector:          alertmanager=observability,app=multicluster-observability-alertmanager
Type:              ClusterIP
IP Family Policy:  SingleStack
IP Families:       IPv4
IP:                None
IPs:               None
Port:              tcp-mesh  9094/TCP
TargetPort:        9094/TCP
Endpoints:         10.128.11.126:9094,10.130.12.5:9094,10.131.15.7:9094
Port:              udp-mesh  9094/UDP
TargetPort:        9094/UDP
Endpoints:         10.128.11.126:9094,10.130.12.5:9094,10.131.15.7:9094
Session Affinity:  None
Events:            <none>
$ oc get pods -l alertmanager=observability,app=multicluster-observability-alertmanager
NAME                           READY   STATUS    RESTARTS   AGE
observability-alertmanager-0   4/4     Running   0          16d
observability-alertmanager-1   4/4     Running   0          16d
observability-alertmanager-2   4/4     Running   0          16d
$ 


rbo commented Jul 12, 2024

DNS looks good:

$ oc rsh observability-alertmanager-0
Defaulted container "alertmanager" out of: alertmanager, config-reloader, alertmanager-proxy, kube-rbac-proxy
sh-5.1$ getent hosts observability-alertmanager-2.alertmanager-operated.open-cluster-management-observability.svc
10.128.11.126   observability-alertmanager-2.alertmanager-operated.open-cluster-management-observability.svc.cluster.local
sh-5.1$ getent hosts observability-alertmanager-1.alertmanager-operated.open-cluster-management-observability.svc
10.131.15.7     observability-alertmanager-1.alertmanager-operated.open-cluster-management-observability.svc.cluster.local
sh-5.1$ getent hosts observability-alertmanager-0.alertmanager-operated.open-cluster-management-observability.svc
10.130.12.5     observability-alertmanager-0.alertmanager-operated.open-cluster-management-observability.svc.cluster.local
sh-5.1$ getent hosts alertmanager-operated.open-cluster-management-observability.svc
10.130.12.5     alertmanager-operated.open-cluster-management-observability.svc.cluster.local
10.131.15.7     alertmanager-operated.open-cluster-management-observability.svc.cluster.local
10.128.11.126   alertmanager-operated.open-cluster-management-observability.svc.cluster.local
sh-5.1$ 
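Worth noting: all of the failed lookups in the log above are from the first minutes after container start (ts=2024-06-26T08:34–08:36), when the headless service's per-pod DNS records may not have been published yet, whereas the getent lookups succeed now. A quick check of whether the resolve errors are still occurring (hedged sketch):

```shell
# Grep the alertmanager container's recent log for resolve failures; empty
# output would suggest DNS recovered after the initial startup race.
oc -n open-cluster-management-observability logs observability-alertmanager-0 \
  -c alertmanager --since=1h | grep 'no such host' || echo 'no recent resolve failures'
```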

@DanielFroehlich

Restarting the pod (oc delete pod observability-alertmanager-0) does not help.
Restarting the DNS pod on the node also does not help.
Restarting all the pods in the open-cluster-management-observability namespace also does not help.
Deleting the service (oc delete service alertmanager-operated) and letting it be re-created also does not help.

I get the feeling this is a bug. Looking at the service, it does not get a cluster IP assigned (compare with e.g. alertmanager-metrics, which targets the same pods). WDYT?
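One hedged observation: alertmanager-operated shows clusterIP: None, i.e. it is a headless service, which is the usual shape for a StatefulSet peer-discovery service (headless services are what make the per-pod observability-alertmanager-N.alertmanager-operated... DNS names exist at all), so the missing cluster IP may be intentional rather than a bug. The two specs can be compared directly:

```shell
# Compare the two services: alertmanager-metrics should show a real cluster IP,
# alertmanager-operated should show None (headless, by design for StatefulSets).
oc -n open-cluster-management-observability get svc \
  alertmanager-metrics alertmanager-operated \
  -o custom-columns='NAME:.metadata.name,CLUSTER-IP:.spec.clusterIP'
```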
