[ClusterLabs] [External] : Grafana with ClusterLabs HA dashboard

Pinkesh Valdria pinkesh.valdria at oracle.com
Sun Feb 7 10:10:53 EST 2021


I decided to make one more change: I added targets for both ports 9100 and 9664, and now most of the Grafana dashboard shows data, except for two panels (Node attributes, systemd units).

  - job_name: 'nfs-ha'
    scrape_interval: 5s
    static_configs:
      - targets: ['nfs-server-1.storage.nfs.oraclevcn.com:9664', 'nfs-server-2.storage.nfs.oraclevcn.com:9664', 'qdevice.storage.nfs.oraclevcn.com:9664', 'nfs-server-1.storage.nfs.oraclevcn.com:9100', 'nfs-server-2.storage.nfs.oraclevcn.com:9100', 'qdevice.storage.nfs.oraclevcn.com:9100']
        labels:
          group: 'nfs-ha'
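
For completeness, a quick way to confirm Prometheus actually picked up the new job and is scraping all six targets (a rough sketch, assuming Prometheus runs on its default port 9090 on the monitoring node):

# Validate the edited config, then reload it (the HTTP reload endpoint only
# works if Prometheus was started with --web.enable-lifecycle; otherwise
# restart the prometheus service instead).
promtool check config /etc/prometheus/prometheus.yml
curl -X POST http://localhost:9090/-/reload

# All six nfs-ha targets should report value 1 here.
curl -s 'http://localhost:9090/api/v1/query?query=up{job="nfs-ha"}'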


[screenshot attached]


Thanks,
Pinkesh Valdria
Principal Solutions Architect – HPC
Oracle Cloud Infrastructure
+65-8932-3639 (m) - Singapore
+1-425-205-7834 (m) - USA


From: Users <users-bounces at clusterlabs.org> on behalf of Pinkesh Valdria <pinkesh.valdria at oracle.com>
Reply-To: Cluster Labs - All topics related to open-source clustering welcomed <users at clusterlabs.org>
Date: Sunday, February 7, 2021 at 6:34 AM
To: "users at clusterlabs.org" <users at clusterlabs.org>
Subject: [External] : [ClusterLabs] Grafana with ClusterLabs HA dashboard

This is my first attempt to use Grafana with the ClusterLabs HA dashboard. I got Grafana, Prometheus and the Prometheus node_exporter working and I am able to see those metrics. The next step was to get the “ClusterLabs HA Cluster details” Grafana dashboard working, but I have not been able to. I would appreciate it if you can point me in the right direction.

https://grafana.com/grafana/dashboards/12229

I am running Grafana on the default port 3000, and similarly using the defaults for Prometheus. The Prometheus node_exporter is on port 9100.

I installed “ha_cluster_exporter” on all NFS-HA nodes (node1, node2 and the corosync quorum (qdevice) node). I see it uses port 9664.
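
A quick way to confirm each exporter actually answers on that port is to hit the standard /metrics endpoint directly (a sketch, run from the monitoring node, using the same hostnames as in prometheus.yml below):

for h in nfs-server-1 nfs-server-2 qdevice; do
  echo "== ${h} =="
  curl -s "http://${h}.storage.nfs.oraclevcn.com:9664/metrics" | head -n 5
done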

nfs-server-1
[root@nfs-server-1 ha_cluster_exporter]# systemctl status ha_cluster_exporter
● ha_cluster_exporter.service - Prometheus exporter for Pacemaker HA clusters metrics
   Loaded: loaded (/usr/lib/systemd/system/ha_cluster_exporter.service; enabled; vendor preset: disabled)
   Active: active (running) since Sun 2021-02-07 11:39:07 GMT; 1h 3min ago
Main PID: 18547 (ha_cluster_expo)
   Memory: 6.7M
   CGroup: /system.slice/ha_cluster_exporter.service
           └─18547 /root/go/bin/ha_cluster_exporter

Feb 07 11:39:07 nfs-server-1 systemd[1]: Started Prometheus exporter for Pacemaker HA clusters metrics.
Feb 07 11:39:07 nfs-server-1 ha_cluster_exporter[18547]: time="2021-02-07T11:39:07Z" level=warning msg="Config File \"ha_cluster_exporter\" Not Found in \"[/ /.config /etc /usr/etc]\""
Feb 07 11:39:07 nfs-server-1 ha_cluster_exporter[18547]: time="2021-02-07T11:39:07Z" level=info msg="Default config values will be used"
Feb 07 11:39:07 nfs-server-1 ha_cluster_exporter[18547]: time="2021-02-07T11:39:07Z" level=warning msg="Registration failure: could not initialize 'drbd' collector: '/sbin/drbdsetup' does not exist"
Feb 07 11:39:07 nfs-server-1 ha_cluster_exporter[18547]: time="2021-02-07T11:39:07Z" level=info msg="'pacemaker' collector registered."
Feb 07 11:39:07 nfs-server-1 ha_cluster_exporter[18547]: time="2021-02-07T11:39:07Z" level=info msg="'corosync' collector registered."
Feb 07 11:39:07 nfs-server-1 ha_cluster_exporter[18547]: time="2021-02-07T11:39:07Z" level=info msg="'sbd' collector registered."
Feb 07 11:39:07 nfs-server-1 ha_cluster_exporter[18547]: time="2021-02-07T11:39:07Z" level=info msg="Serving metrics on 0.0.0.0:9664"
[root@nfs-server-1 ha_cluster_exporter]#

Similarly on nfs-server-2 and the qdevice node:

[root@nfs-server-2 ha_cluster_exporter]# systemctl status ha_cluster_exporter
…….
…..
Feb 07 12:15:10 nfs-server-2 ha_cluster_exporter[11895]: time="2021-02-07T12:15:10Z" level=warning msg="Registration failure: could not initialize 'drbd' collector: '/sbin/drbdsetup' does not exist"
Feb 07 12:15:10 nfs-server-2 ha_cluster_exporter[11895]: time="2021-02-07T12:15:10Z" level=info msg="'pacemaker' collector registered."
Feb 07 12:15:10 nfs-server-2 ha_cluster_exporter[11895]: time="2021-02-07T12:15:10Z" level=info msg="'corosync' collector registered."
Feb 07 12:15:10 nfs-server-2 ha_cluster_exporter[11895]: time="2021-02-07T12:15:10Z" level=info msg="'sbd' collector registered."
Feb 07 12:15:10 nfs-server-2 ha_cluster_exporter[11895]: time="2021-02-07T12:15:10Z" level=info msg="Serving metrics on 0.0.0.0:9664"

I copied this file https://github.com/ClusterLabs/ha_cluster_exporter/blob/master/dashboards/provider-sleha.yaml to /etc/grafana/provisioning/dashboards/ and copied the ha-cluster-details_rev2.json file to /etc/grafana/dashboards/sleha, as mentioned in the manual steps section of this page: https://github.com/ClusterLabs/ha_cluster_exporter/tree/master/dashboards
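
In other words, roughly these manual steps (a sketch; the raw.githubusercontent.com URL is just the raw form of the GitHub link above, and grafana-server needs a restart to pick up new provisioning files):

mkdir -p /etc/grafana/dashboards/sleha
# dashboard provider definition
curl -sL -o /etc/grafana/provisioning/dashboards/provider-sleha.yaml \
  https://raw.githubusercontent.com/ClusterLabs/ha_cluster_exporter/master/dashboards/provider-sleha.yaml
# dashboard JSON, downloaded separately from grafana.com (dashboard 12229, rev2)
cp ha-cluster-details_rev2.json /etc/grafana/dashboards/sleha/
systemctl restart grafana-server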




I see these errors in the Grafana log, and the Grafana UI says “No data”.
t=2021-02-07T12:51:18+0000 lvl=eror msg="Data proxy error" logger=data-proxy-log userId=1 orgId=1 uname=admin path=/api/datasources/proxy/1/api/v1/query_range remote_addr=10.0.0.2 referer="http://localhost:3000/d/Q5YJpwtZk1/clusterlabs-ha-cluster-details?orgId=1" error="http: proxy error: context canceled"
t=2021-02-07T12:51:18+0000 lvl=info msg="Request Completed" logger=context userId=1 orgId=1 uname=admin method=GET path=/api/datasources/proxy/1/api/v1/query_range status=502 remote_addr=10.0.0.2 time_ms=13 size=0 referer="http://localhost:3000/d/Q5YJpwtZk1/clusterlabs-ha-cluster-details?orgId=1"
t=2021-02-07T12:51:48+0000 lvl=eror msg="Data proxy error" logger=data-proxy-log userId=1 orgId=1 uname=admin path=/api/datasources/proxy/1/api/v1/query remote_addr=10.0.0.2 referer="http://localhost:3000/d/Q5YJpwtZk1/clusterlabs-ha-cluster-details?orgId=1&var-DS_PROMETHEUS=Prometheus&var-cluster=nfs-ha&var-dc_instance=" error="http: proxy error: context canceled"
t=2021-02-07T12:51:48+0000 lvl=info msg="Request Completed" logger=context userId=1 orgId=1 uname=admin method=GET path=/api/datasources/proxy/1/api/v1/query status=502 remote_addr=10.0.0.2 time_ms=4 size=0 referer=http://localhost:3000/d/Q5YJpwtZk1/clusterlabs-ha-cluster-details?orgId=1&var-DS_PROMETHEUS=Prometheus&var-cluster=nfs-ha&var-dc_instance=
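
The "context canceled" / 502 proxy errors look like Grafana never gets an answer back from the Prometheus datasource, so one thing I can check from the Grafana host (again assuming Prometheus on its default port 9090) is whether Prometheus itself responds and what it thinks it is scraping:

# Does Prometheus answer queries at all?
curl -s 'http://localhost:9090/api/v1/query?query=up'
# Which targets is it scraping, and are they healthy?
curl -s 'http://localhost:9090/api/v1/targets' | grep -o '"health":"[a-z]*"' | sort | uniq -c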


[screenshot attached]


This is my monitoring node running Grafana:
[root@client-1 log]# ls -l /etc/grafana/dashboards
total 92
-rw-r--r--. 1 root root 94021 Feb  7 10:15 node-exporter.json
drwxr-xr-x. 2 root root    42 Feb  7 12:58 sleha
[root@client-1 log]# ls -l /etc/grafana/dashboards/sleha/
total 44
-rw-r--r--. 1 root root 41665 Feb  7 12:27 ha-cluster-details_rev2.json
[root@client-1 log]#



cat /etc/grafana/provisioning/dashboards/provider-sleha.yaml
apiVersion: 1

providers:
  - name: SUSE Linux Enterprise High Availability Extension
    folder: SUSE Linux Enterprise
    folderUid: 3b1e0b26-fc28-4254-88a1-2d3516b5e404
    type: file
    allowUiUpdates: true
    editable: true
    options:
      path: /etc/grafana/dashboards/sleha
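
To confirm Grafana actually provisioned the dashboard, one option is to ask its HTTP API and check the startup log (a sketch; admin:admin is only the out-of-the-box default credential, and /var/log/grafana/grafana.log is the default log location):

curl -s 'http://admin:admin@localhost:3000/api/search?query=ClusterLabs'
grep -i provision /var/log/grafana/grafana.log | tail -n 20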


Copied the file (ha-cluster-details_rev2.json) from here: https://grafana.com/grafana/dashboards/12229/revisions to /etc/grafana/dashboards/sleha



cat /etc/grafana/provisioning/dashboards/node_exporter.yaml
apiVersion: 1
providers:
  - name: 'NFS HA Dashboard'
    type: file
    updateIntervalSeconds: 10
    options:
      path: /etc/grafana/dashboards/node-exporter.json
[root@client-1 log]#



I added the job_name: 'nfs-ha' section to /etc/prometheus/prometheus.yml, since the ClusterLabs HA Grafana dashboard page says: “It is built on top of ha_cluster_exporter, but it also requires Prometheus node_exporter to be configured on the target nodes, and it also assumes that the target nodes in each cluster are grouped via the job label.”
https://grafana.com/grafana/dashboards/12229

  - job_name: 'quorum'

    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s
    static_configs:
      - targets: ['qdevice.storage.nfs.oraclevcn.com:9100']
        labels:
          group: 'quorum'

  - job_name: 'nfs_server'

    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s
    static_configs:
      - targets: ['nfs-server-1.storage.nfs.oraclevcn.com:9100', 'nfs-server-2.storage.nfs.oraclevcn.com:9100']
        labels:
          group: 'nfs_server'

  - job_name: 'nfs-ha'

    scrape_interval: 5s
    static_configs:
      - targets: ['nfs-server-1.storage.nfs.oraclevcn.com:9100', 'nfs-server-2.storage.nfs.oraclevcn.com:9100', 'qdevice.storage.nfs.oraclevcn.com:9100']
        labels:
          group: 'nfs-ha'




Thanks,
Pinkesh Valdria
Principal Solutions Architect – HPC

-------------- next part --------------
A non-text attachment was scrubbed...
Name: image001.png
Type: image/png
Size: 284218 bytes
Desc: image001.png
URL: <http://lists.clusterlabs.org/pipermail/users/attachments/20210207/a605e2c6/attachment-0002.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image002.png
Type: image/png
Size: 133406 bytes
Desc: image002.png
URL: <http://lists.clusterlabs.org/pipermail/users/attachments/20210207/a605e2c6/attachment-0003.png>
