[ClusterLabs] Pacemaker stopped monitoring the resource

Klaus Wenninger kwenning at redhat.com
Sat Sep 2 05:52:34 EDT 2017


On 09/01/2017 11:45 PM, Ken Gaillot wrote:
> On Fri, 2017-09-01 at 15:06 +0530, Abhay B wrote:
>>         Are you sure the monitor stopped? Pacemaker only logs
>>         recurring monitors
>>         when the status changes. Any successful monitors after this
>>         wouldn't be
>>         logged.  
>>  
>> Yes. There were no logs which said "RecurringOp:  Start recurring
>> monitor" on the node after it had failed, and there were no logs for
>> any actions pertaining to the resource on that node.
>> The problem was that even though the resource on one node was failing,
>> it was never moved to the other node (the node on which I suspect
>> monitoring had stopped).
>>
>>
>>         There are a lot of resource action failures, so I'm not sure
>>         where the
>>         issue is, but I'm guessing it has to do with
>>         migration-threshold=1 --
>>         once a resource has failed once on a node, it won't be allowed
>>         back on
>>         that node until the failure is cleaned up. Of course you also
>>         have
>>         failure-timeout=1s, which should clean it up immediately, so
>>         I'm not
>>         sure.
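>>         
>>         You can watch the failure count directly (the resource id is a
>>         placeholder):
>>         
>>             pcs resource failcount show <resource-id>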
>>
>>
>> migration-threshold=1
>> failure-timeout=1s
>>
>> cluster-recheck-interval=2s
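>>
>> For reference, these are set roughly like this (the resource id is
>> just a placeholder for my actual resource):
>>
>>     pcs resource meta <resource-id> migration-threshold=1 failure-timeout=1s
>>     pcs property set cluster-recheck-interval=2s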
>>
>>
>>         first, set "two_node:
>>         1" in corosync.conf and let no-quorum-policy default in
>>         pacemaker
>>
>>
>> This is already configured.
>> # cat /etc/corosync/corosync.conf
>> totem {
>>     version: 2
>>     secauth: off
>>     cluster_name: SVSDEHA
>>     transport: udpu
>>     token: 5000
>> }
>>
>>
>> nodelist {
>>     node {
>>         ring0_addr: 2.0.0.10
>>         nodeid: 1
>>     }
>>
>>
>>     node {
>>         ring0_addr: 2.0.0.11
>>         nodeid: 2
>>     }
>> }
>>
>>
>> quorum {
>>     provider: corosync_votequorum
>>     two_node: 1
>> }
>>
>>
>> logging {
>>     to_logfile: yes
>>     logfile: /var/log/cluster/corosync.log
>>     to_syslog: yes
>> }
>>
>>
>>         let no-quorum-policy default in pacemaker; then,
>>         get stonith configured, tested, and enabled
>>
>>
>> By not configuring no-quorum-policy, would quorum be ignored for a
>> 2-node cluster?
> With two_node, corosync always provides quorum to pacemaker, so
> pacemaker doesn't see any quorum loss. The only significant difference
> from ignoring quorum is that corosync won't form a cluster from a cold
> start unless both nodes can reach each other (a safety feature).
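>
> A quick way to confirm that the two_node handling is in effect (exact
> output wording varies by corosync version):
>
>     corosync-quorumtool -s
>
> The "Flags:" line should list 2Node (and WaitForAll after a cold start).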
>
>> For my use case I don't need stonith enabled. My intention is to have
>> a highly available system all the time.
> Stonith is the only way to recover from certain types of failure, such
> as the "split brain" scenario, and a resource that fails to stop.
>
> If your nodes are physical machines with hardware watchdogs, you can set
> up sbd for fencing without needing any extra equipment.

Small caveat here:
If I get it right you have a 2-node setup. In that case a watchdog-only
sbd setup would not be usable, as it relies on 'real' quorum.
In 2-node setups sbd needs at least a single shared disk.
For the sbd single-disk setup to work with two_node you need the patch
from https://github.com/ClusterLabs/sbd/pull/23 in place. (Saw you
mentioning RHEL documentation - RHEL 7.4 has had it in since GA.)
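
If you do go the disk-based route, a rough sketch (the device path is a
placeholder and exact package/agent names can differ per distribution):

    # write the sbd header to the shared disk
    sbd -d /dev/disk/by-id/<shared-disk> create
    # point sbd at the disk via SBD_DEVICE in /etc/sysconfig/sbd and
    # enable the sbd service, then add a matching fence device:
    pcs stonith create sbd-fencing fence_sbd devices=/dev/disk/by-id/<shared-disk>
    pcs property set stonith-enabled=true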

Regards,
Klaus

>
>> I will test my RA again as suggested with no-quorum-policy=default.
>>
>>
>> One more doubt. 
>> Why do we see this in 'pcs property'?
>> last-lrm-refresh: 1504090367
>>
>>
>>
>> Never seen this on a healthy cluster.
>> From RHEL documentation:
>>
>> last-lrm-refresh: Last refresh of the Local Resource Manager, given in
>> units of seconds since epoch. Used for diagnostic purposes; not
>> user-configurable.
>>
>>
>> Doesn't explain much.
> Whenever a cluster property changes, the cluster rechecks the current
> state to see if anything needs to be done. last-lrm-refresh is just a
> dummy property that the cluster uses to trigger that. It's set in
> certain rare circumstances when a resource cleanup is done. You should
> see a line in your logs like "Triggering a refresh after ... deleted ...
> from the LRM". That might give some idea of why.
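>
> Something like this should turn up that line (log path taken from your
> corosync.conf above):
>
>     grep "Triggering a refresh" /var/log/cluster/corosync.log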
>
>> Also, does avg. CPU load impact resource monitoring?
>>
>>
>> Regards,
>> Abhay
> Well, it could cause the monitor to take so long that it times out. The
> only direct effect of load on pacemaker is that the cluster might lower
> the number of agent actions that it can execute simultaneously.
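>
> If you suspect that, giving the monitor a bit more headroom is a quick
> test (resource id and values below are only examples):
>
>     pcs resource update <resource-id> op monitor interval=2s timeout=30s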
>
>
>> On Thu, 31 Aug 2017 at 20:11 Ken Gaillot <kgaillot at redhat.com> wrote:
>>
>>         On Thu, 2017-08-31 at 06:41 +0000, Abhay B wrote:
>>         > Hi,
>>         >
>>         >
>>         > I have a 2 node HA cluster configured on CentOS 7 with pcs
>>         command.
>>         >
>>         >
>>         > Below are the properties of the cluster :
>>         >
>>         >
>>         > # pcs property
>>         > Cluster Properties:
>>         >  cluster-infrastructure: corosync
>>         >  cluster-name: SVSDEHA
>>         >  cluster-recheck-interval: 2s
>>         >  dc-deadtime: 5
>>         >  dc-version: 1.1.15-11.el7_3.5-e174ec8
>>         >  have-watchdog: false
>>         >  last-lrm-refresh: 1504090367
>>         >  no-quorum-policy: ignore
>>         >  start-failure-is-fatal: false
>>         >  stonith-enabled: false
>>         >
>>         >
>>         > PFA the cib.
>>         > Also attached is the corosync.log around the time the below
>>         issue
>>         > happened.
>>         >
>>         >
>>         > After around 10 hrs and multiple failures, pacemaker stops
>>         monitoring
>>         > resource on one of the nodes in the cluster.
>>         >
>>         >
>>         > So even though the resource on other node fails, it is never
>>         migrated
>>         > to the node on which the resource is not monitored.
>>         >
>>         >
>>         > Wanted to know what could have triggered this and how to
>>         avoid getting
>>         > into such scenarios.
>>         > I am going through the logs and couldn't find why this
>>         happened.
>>         >
>>         >
>>         > After this log the monitoring stopped.
>>         >
>>         > Aug 29 11:01:44 [16500] TPC-D12-10-002.phaedrus.sandvine.com
>>         > crmd:     info: process_lrm_event:   Result of monitor
>>         operation for
>>         > SVSDEHA on TPC-D12-10-002.phaedrus.sandvine.com: 0 (ok) |
>>         call=538
>>         > key=SVSDEHA_monitor_2000 confirmed=false cib-update=50013
>>         
>>         Are you sure the monitor stopped? Pacemaker only logs
>>         recurring monitors
>>         when the status changes. Any successful monitors after this
>>         wouldn't be
>>         logged.
>>         
>>         > Below log says the resource is leaving the cluster.
>>         > Aug 29 11:01:44 [16499] TPC-D12-10-002.phaedrus.sandvine.com
>>         > pengine:     info: LogActions:  Leave   SVSDEHA:0
>>          (Slave
>>         > TPC-D12-10-002.phaedrus.sandvine.com)
>>         
>>         This means that the cluster will leave the resource where it
>>         is (i.e. it
>>         doesn't need a start, stop, move, demote, promote, etc.).
>>         
>>         > Let me know if anything more is needed.
>>         >
>>         >
>>         > Regards,
>>         > Abhay
>>         >
>>         >
>>         > PS:'pcs resource cleanup' brought the cluster back into good
>>         state.
>>         
>>         There are a lot of resource action failures, so I'm not sure
>>         where the
>>         issue is, but I'm guessing it has to do with
>>         migration-threshold=1 --
>>         once a resource has failed once on a node, it won't be allowed
>>         back on
>>         that node until the failure is cleaned up. Of course you also
>>         have
>>         failure-timeout=1s, which should clean it up immediately, so
>>         I'm not
>>         sure.
>>         
>>         My gut feeling is that you're trying to do too many things at
>>         once. I'd
>>         start over from scratch and proceed more slowly: first, set
>>         "two_node:
>>         1" in corosync.conf and let no-quorum-policy default in
>>         pacemaker; then,
>>         get stonith configured, tested, and enabled; then, test your
>>         resource
>>         agent manually on the command line to make sure it conforms to
>>         the
>>         expected return values
>>         ( http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#ap-ocf );
>>         then add your resource to the cluster without
>>         migration-threshold or failure-timeout, and work out any
>>         issues with frequent failures; then finally set
>>         migration-threshold and failure-timeout to reflect how you
>>         want recovery to proceed.
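>>         
>>         A quick way to sanity-check the agent outside the cluster
>>         (provider/agent path and parameters are placeholders):
>>         
>>             export OCF_ROOT=/usr/lib/ocf
>>             OCF_RESKEY_<param>=<value> \
>>               /usr/lib/ocf/resource.d/<provider>/<agent> monitor; echo $?
>>         
>>         Expect 0 (OCF_SUCCESS) when the service is running and 7
>>         (OCF_NOT_RUNNING) when it is cleanly stopped. The ocf-tester
>>         tool from the resource-agents package automates much of this.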
>>         --
>>         Ken Gaillot <kgaillot at redhat.com>
>>         
>>         
>>         
>>         
>>         
>>         _______________________________________________
>>         Users mailing list: Users at clusterlabs.org
>>         http://lists.clusterlabs.org/mailman/listinfo/users
>>         
>>         Project Home: http://www.clusterlabs.org
>>         Getting started:
>>         http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>         Bugs: http://bugs.clusterlabs.org




