[ClusterLabs] Antw: Occasionally IPaddr2 resource fails to start

Ken Gaillot kgaillot at redhat.com
Mon Oct 7 10:24:29 EDT 2019


On Mon, 2019-10-07 at 14:40 +0300, Donat Zenichev wrote:
> Hello and thank you for your answer!
> 
> So should I just disable the "monitor" operation entirely? In my case I'd
> rather delete the whole "op" line:
> "op monitor interval=20 timeout=60 on-fail=restart"
> 
> Am I correct?

Personally I wouldn't delete the monitor -- at most, I'd configure it
with on-fail=ignore. That way you can still see failures in the cluster
status, even if the cluster doesn't react to them.
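
For example, with the crm shell syntax used in the configuration quoted
below, the monitor operation would become something like this (just a
sketch, keeping your existing interval and timeout):

    op monitor interval=20 timeout=60 on-fail=ignore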

If this always happens when the VM is being snapshotted, you can put
the cluster in maintenance mode (or even unmanage just the IP resource)
while the snapshotting is happening. I don't know of any reason why
snapshotting would affect only an IP, though.
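
For example, with the crm shell (a sketch; pcs has equivalent commands):

    # put the whole cluster in maintenance mode while the snapshot runs
    crm configure property maintenance-mode=true
    # ... take the VM snapshot ...
    crm configure property maintenance-mode=false

    # or unmanage only the IP resource instead
    crm resource unmanage IPSHARED
    # ... take the VM snapshot ...
    crm resource manage IPSHARED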

Most resource agents send some logs to the system log. If that doesn't
give any clue, you could set OCF_TRACE_RA=1 in the pacemaker
environment to get tons more logs from resource agents.
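
For example (a sketch; the exact location of the environment file depends
on the distribution, e.g. /etc/sysconfig/pacemaker on RHEL-based systems
or /etc/default/pacemaker on Debian-based ones):

    # add to the pacemaker environment file, then restart pacemaker
    OCF_TRACE_RA=1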

> 
> On Mon, Oct 7, 2019 at 2:36 PM Ulrich Windl <
> Ulrich.Windl at rz.uni-regensburg.de> wrote:
> > Hi!
> > 
> > I can't remember the exact reason, but it was probably exactly that
> > which made us remove any monitor operation from IPaddr2 (back in
> > 2011). So far no problems doing so ;-)
> > 
> > 
> > Regards,
> > Ulrich
> > P.S.: Of course it would be nice if the real issue could be found
> > and fixed.
> > 
> > >>> Donat Zenichev <donat.zenichev at gmail.com> wrote on 20.09.2019 at
> > 14:43 in message
> > <CANLwQCmVjcaTzHkcJsNOXLJghtYFLvbP3fD_d4NXrNQpM_JLWw at mail.gmail.com>:
> > > Hi there!
> > > 
> > > I've got a tricky case where my IPaddr2 resource fails to start for
> > > literally no reason:
> > > "IPSHARED_monitor_20000 on my-master-1 'not running' (7): call=11,
> > > status=complete, exitreason='',
> > >    last-rc-change='Wed Sep 4 06:08:07 2019', queued=0ms, exec=0ms"
> > > 
> > > The IPaddr2 resource managed to recover by itself and continued to
> > > work properly after that.
> > > 
> > > What I did after that was set 'failure-timeout=900' seconds for my
> > > IPaddr2 resource, to keep the resource from running on a node where
> > > it fails. I also set 'migration-threshold=2', so IPaddr2 can fail
> > > only 2 times and then moves to the slave side. Meanwhile the master
> > > gets banned for 900 seconds.
> > > 
> > > After 900 seconds the cluster tries to start IPaddr2 on the master
> > > again, and if it's OK, the fail counter gets cleared.
> > > That's how I avoid the error I mentioned above.
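> > > 
> > > For example, to watch the fail count between those events (just a
> > > sketch with standard Pacemaker tools, using the resource and node
> > > names from the configuration below):
> > > 
> > > crm_mon --failcounts
> > > crm_failcount --query --resource IPSHARED --node my-master-1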
> > > 
> > > I tried hard to figure out why this can happen, but still have no
> > > idea. Any clue how to find the reason?
> > > And another question: can snapshotting of the VMs have any impact
> > > on this?
> > > 
> > > And my configurations:
> > > -------------------------------
> > > node 000001: my-master-1
> > > node 000002: my-master-2
> > > 
> > > primitive IPSHARED IPaddr2 \
> > > params ip=10.10.10.5 nic=eth0 cidr_netmask=24 \
> > > meta migration-threshold=2 failure-timeout=900 target-role=Started \
> > > op monitor interval=20 timeout=60 on-fail=restart
> > > 
> > > location PREFER_MASTER IPSHARED 100: my-master-1
> > > 
> > > property cib-bootstrap-options: \
> > > have-watchdog=false \
> > > dc-version=1.1.18-2b07d5c5a9 \
> > > cluster-infrastructure=corosync \
> > > cluster-name=wall \
> > > cluster-recheck-interval=5s \
> > > start-failure-is-fatal=false \
> > > stonith-enabled=false \
> > > no-quorum-policy=ignore \
> > > last-lrm-refresh=1554982967
> > > -------------------------------
> > > 
> > > Thanks in advance!
> > > 
> > > -- 
> > > BR, Donat Zenichev
-- 
Ken Gaillot <kgaillot at redhat.com>


