[ClusterLabs] Occasionally IPaddr2 resource fails to start

Fri Sep 20 08:43:08 EDT 2019

Hi there!

I have a tricky case where my IPaddr2 resource fails to start for
literally no reason:
"IPSHARED_monitor_20000 on my-master-1 'not running' (7): call=11,
status=complete, exitreason='',
   last-rc-change='Wed Sep 4 06:08:07 2019', queued=0ms, exec=0ms"

The IPaddr2 resource recovered on its own and has kept working properly
since then.

Afterwards I set 'failure-timeout=900' (seconds) on my IPaddr2 resource, to
keep the resource off a node where it has failed. I also set
'migration-threshold=2', so IPaddr2 can fail only 2 times on a node and then
moves to the Slave side. Meanwhile the Master is banned for 900 seconds.

After 900 seconds the cluster tries to start IPaddr2 on the Master again,
and if that works, the fail counter is cleared.
That is my workaround for the error I mentioned above.
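
To make the intended behaviour concrete, here is roughly how I picture the
two settings interacting, written as a toy Python model. This is only a
sketch of my understanding, not Pacemaker code: the class and method names
are made up for illustration, and it ignores details such as the cluster
only re-evaluating expiry at each cluster-recheck-interval.

```python
class FailTracker:
    """Toy model of migration-threshold and failure-timeout
    for one resource on one node (hypothetical, for illustration)."""

    def __init__(self, migration_threshold=2, failure_timeout=900):
        self.migration_threshold = migration_threshold
        self.failure_timeout = failure_timeout  # seconds
        self.fail_count = 0
        self.last_failure = None  # time of the most recent failure, in seconds

    def _expire(self, now):
        # Assumption: failures are forgotten once failure_timeout
        # seconds have passed since the most recent failure.
        if (self.last_failure is not None
                and now - self.last_failure >= self.failure_timeout):
            self.fail_count = 0
            self.last_failure = None

    def record_failure(self, now):
        self._expire(now)
        self.fail_count += 1
        self.last_failure = now

    def node_allowed(self, now):
        # The node may run the resource while the fail count
        # is below migration_threshold.
        self._expire(now)
        return self.fail_count < self.migration_threshold


tracker = FailTracker()
tracker.record_failure(now=0)
print(tracker.node_allowed(now=10))   # True: one failure, threshold is 2
tracker.record_failure(now=20)
print(tracker.node_allowed(now=30))   # False: threshold reached, node is banned
print(tracker.node_allowed(now=920))  # True: 900 s have passed since the last failure
```

With the values from my configuration, the second failure bans the Master,
and the ban lifts 900 seconds after that failure.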

I have tried hard to work out why this happens, but I still have no idea.
Any clue how to find the reason?
Another question: can taking snapshots of the VMs have any impact on this?

Here is my configuration:
-------------------------------
node 000001: my-master-1
node 000002: my-master-2

primitive IPSHARED IPaddr2 \
params ip=10.10.10.5 nic=eth0 cidr_netmask=24 \
meta migration-threshold=2 failure-timeout=900 target-role=Started \
op monitor interval=20 timeout=60 on-fail=restart

location PREFER_MASTER IPSHARED 100: my-master-1

property cib-bootstrap-options: \
have-watchdog=false \
dc-version=1.1.18-2b07d5c5a9 \
cluster-infrastructure=corosync \
cluster-name=wall \
cluster-recheck-interval=5s \
start-failure-is-fatal=false \
stonith-enabled=false \
no-quorum-policy=ignore \
last-lrm-refresh=1554982967
-------------------------------

Thanks in advance!

-- 
BR, Donat Zenichev