[ClusterLabs] cron-suitable cluster status check

Mon Feb 29 15:52:38 UTC 2016

On 02/27/2016 03:56 PM, Devin Reade wrote:
> Right now in a test cluster on CentOS 7 I'm occasionally seeing
> resource monitoring failures and, just today, a failure to start
> a fencing agent.  While I need to track down those problems, the
> issue I want to discuss here is being notified when there is a
> problem with the cluster, where there is not a nagios-type monitoring
> system in place.
> 
> On an older CentOS 5 cluster I have a cron job that periodically runs
> 'crm_verify -LV'.  If the return code is non-zero, the output of
> that command (and some other info) is mailed to the operator.  That
> mechanism has been working well for years.
> 
> However on CentOS 7, when the cluster gets into this state 'crm_verify -LV'
> returns zero, and its output claims there is no problem.  However in
> 'crm_mon -f' I can see that I've got resource failures and nonzero
> failcounts.
> 
> I tried 'pcs cluster status', however when the cluster is properly
> working (no failures), that command still has a return code of '1',
> probably because I get the 'Error: no nodes found in corosync.conf'
> which is an ignorable condition per
> <https://access.redhat.com/solutions/663283>.
> 
> Is there a command that I can run from cron in the current cluster
> tools to tell me the simple answer of whether there is *anything*
> failed in the cluster, preferably based on its return code?

I'm not sure about the CentOS 5 days, but at least now, crm_verify is
intended to verify the syntax of a cluster's configuration rather than
its status.

The simplest method is "crm_mon -s", which gives a one-line
nagios-compatible output with return code 0=success and 1=problem.
However, it also returns 1 if the cluster is not running, there is no
DC, or nodes are offline.
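
As a rough illustration of wiring that into cron, here is a minimal
sketch. It assumes only what is described above (that "crm_mon -s" exits
0 when all is well and non-zero otherwise) and relies on cron mailing
whatever a job prints. The names run_check and main are made up for this
example, and the exit code 127 for a missing command is my own
convention, not something crm_mon defines.

```python
import subprocess
import sys


def run_check(cmd=("crm_mon", "-s")):
    """Run the status command and return (exit_code, combined_output).

    The default command comes from the discussion above; everything
    else here is illustrative.
    """
    try:
        proc = subprocess.run(
            list(cmd),
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
        )
    except OSError as exc:
        # Command missing or not executable: report it as a failure.
        # 127 is an arbitrary choice for this sketch.
        return 127, str(exc)
    return proc.returncode, proc.stdout.strip()


def main():
    code, output = run_check()
    if code != 0:
        # cron mails anything a job writes to stdout/stderr, so only
        # print when something looks wrong.
        print("cluster check failed (exit %d): %s" % (code, output))
    return code
```

Saved under some name and invoked from a crontab entry via a small
wrapper that calls sys.exit(main()), this stays silent while the command
exits 0 and produces a mail otherwise. Whether the "cluster not running"
and "offline nodes" cases should count as failures is a local decision.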

Back in the day, I used check_crm with nagios/icinga. It's a Perl script
that parses the output of "crm_mon -1rf" and "crm configure show". It's
trivial to use such a check outside a monitoring system, and it could be
modified to work with pcs and current crm_mon output, so maybe it could
help:

https://exchange.nagios.org/directory/Plugins/Clustering-and-High-2DAvailability/Check-CRM/details
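
For a sense of what that kind of parsing looks like outside a monitoring
system, here is a small sketch in Python rather than Perl. It is not
check_crm's logic. It assumes the text output of "crm_mon -1rf" has a
"Failed Actions" heading when actions have failed and reports failcounts
as "fail-count=N". Both are assumptions to verify against your own
crm_mon output, since the format differs between versions. The name
find_problems is invented for this example.

```python
import re

# Assumed failcount format: "fail-count=<value>", where the value may
# be a number or a word such as INFINITY.
FAILCOUNT_RE = re.compile(r"fail-count=(\S+)")


def find_problems(status_text):
    """Return the lines of crm_mon text output that suggest a failure."""
    problems = []
    for line in status_text.splitlines():
        stripped = line.strip()
        # Assumed heading of the failed-actions section; matched
        # case-insensitively because capitalization has varied.
        if stripped.lower().startswith("failed actions"):
            problems.append(stripped)
            continue
        match = FAILCOUNT_RE.search(stripped)
        # Any failcount other than a literal 0 is treated as a problem.
        if match and match.group(1) != "0":
            problems.append(stripped)
    return problems
```

A cron job could feed it the captured output of "crm_mon -1rf" and exit
non-zero (and print the returned lines) whenever the list is not empty.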

> The CentOS 7 cluster is running:
>    corosync 2.3.4
>    pacemaker 1.1.13
> 
> The CentOS 5 cluster is running:
>    corosync 1.2.7
>    pacemaker 1.0.12
> 
> The corosync.conf is included below:
> 
> --------- cut here and be careful of pointy scissors ---------
> totem {
>     version: 2
>     #secauth: off
>     cluster_name: somecluster
>     #transport: udpu
>     rrp_mode: passive
>     crypto_hash: sha256
>     clear_node_high_bit: yes
> 
>     interface {
>         ringnumber: 0
>         bindnetaddr: 192.168.1.0
>         mcastaddr: 239.192.0.5
>         mcastport: 5406
>     }
>     interface {
>         ringnumber: 1
>         bindnetaddr: 192.168.2.0
>         mcastaddr: 239.192.0.6
>         mcastport: 5408
>     }
> }
> 
> quorum {
>     provider: corosync_votequorum
>     two_node: 1
>     expected_votes: 2
> }
> 
> logging {
>     to_syslog: yes
> }
> 
> --------- cut here and be careful of pointy scissors ---------
> 
> Devin