[ClusterLabs] Re: Occasionally IPaddr2 resource fails to start
Ulrich.Windl at rz.uni-regensburg.de
Mon Oct 7 07:36:27 EDT 2019
I can't remember the exact reason, but it was probably exactly this that made us remove all monitor operations from IPaddr2 (back in 2011). No problems since ;-)
P.S.: Of course it would be nice if the real issue could be found and fixed.
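A configuration without the monitor operation could look roughly like this in crmsh syntax (a sketch based on the resource definition quoted below; note the trade-off that a genuinely lost address then goes undetected until something else triggers recovery):

```
# Sketch: the same IPaddr2 primitive, but with no recurring monitor op.
# The cluster then only starts and stops the resource and never polls it,
# so spurious 'not running' monitor results cannot occur.
primitive IPSHARED IPaddr2 \
    params ip=10.10.10.5 nic=eth0 cidr_netmask=24 \
    meta migration-threshold=2 failure-timeout=900 target-role=Started
```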
>>> Donat Zenichev <donat.zenichev at gmail.com> wrote on 20.09.2019 at 14:43 in
<CANLwQCmVjcaTzHkcJsNOXLJghtYFLvbP3fD_d4NXrNQpM_JLWw at mail.gmail.com>:
> Hi there!
> I've got a tricky case: my IPaddr2 resource occasionally fails with
> literally no apparent reason:
> "IPSHARED_monitor_20000 on my-master-1 'not running' (7): call=11,
> status=complete, exitreason='',
> last-rc-change='Wed Sep 4 06:08:07 2019', queued=0ms, exec=0ms"
> The IPaddr2 resource then recovered by itself and continued to work
> properly afterwards.
> What I did afterwards was set 'failure-timeout=900' seconds on my
> IPaddr2 resource, so it does not keep running on a node where it
> fails. I also set 'migration-threshold=2', so IPaddr2 may fail only
> twice before it moves to the slave node, while the master is banned
> for 900 seconds. After 900 seconds the cluster tries to start IPaddr2
> on the master again, and if that succeeds the fail counter is cleared.
> That's how I avoid the error mentioned above.
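For inspecting and clearing those fail counts by hand, Pacemaker and crmsh ship their own tooling; a sketch (exact subcommand names can vary slightly between versions):

```shell
# Show failure counts and recent failures cluster-wide:
crm_mon --failcounts --one-shot

# crmsh equivalent, per resource and node:
crm resource failcount IPSHARED show my-master-1

# Clear the failure history once the cause is understood,
# instead of waiting out the 900 s failure-timeout:
crm resource cleanup IPSHARED
```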
> I have tried hard to figure out why this can happen, but still have
> no idea. Any clue how to find the reason?
> And another question: can snapshotting of VM machines have any impact
> on this?
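On the first question: one way to chase an unexplained 'not running' result is to run the agent's monitor action by hand, outside the cluster, with the same parameters the cluster passes (a sketch; the agent path assumes a standard resource-agents install):

```shell
# Run IPaddr2's monitor action manually with the resource's parameters.
# Exit code 0 means the address is up; 7 (OCF_NOT_RUNNING) is the same
# code the failed monitor operation reported.
OCF_ROOT=/usr/lib/ocf \
OCF_RESKEY_ip=10.10.10.5 \
OCF_RESKEY_nic=eth0 \
OCF_RESKEY_cidr_netmask=24 \
/usr/lib/ocf/resource.d/heartbeat/IPaddr2 monitor
echo "monitor exit code: $?"

# Cross-check what the agent itself checks: is the address on the NIC?
ip -o addr show dev eth0 | grep 10.10.10.5
```

Running this in a loop around the failure time, or reading the pacemaker/corosync logs around the 'last-rc-change' timestamp, may show whether the address really disappeared briefly.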
> And here is my configuration:
> node 000001: my-master-1
> node 000002: my-master-2
> primitive IPSHARED IPaddr2 \
>     params ip=10.10.10.5 nic=eth0 cidr_netmask=24 \
>     meta migration-threshold=2 failure-timeout=900 target-role=Started \
>     op monitor interval=20 timeout=60 on-fail=restart
> location PREFER_MASTER IPSHARED 100: my-master-1
> property cib-bootstrap-options: \
>     have-watchdog=false \
>     dc-version=1.1.18-2b07d5c5a9 \
>     cluster-infrastructure=corosync \
>     cluster-name=wall \
>     cluster-recheck-interval=5s \
>     start-failure-is-fatal=false \
>     stonith-enabled=false \
>     no-quorum-policy=ignore
> Thanks in advance!
> BR, Donat Zenichev