[ClusterLabs] query on pacemaker monitor timeout

Ken Gaillot kgaillot at redhat.com
Mon Dec 21 10:18:41 EST 2020


Hi Sathish,

Re-adding users at clusterlabs.org ...

On Fri, 2020-12-18 at 12:50 +0000, S Sathish S wrote:
> Hi Ken/Team,
>  
> Thanks for the response.
>  
> Due to system resource unavailability, the monitor operation for the
> resource below timed out after 120s and Pacemaker went to recover it,
> as we have set on-fail=restart for the monitor operation. But during
> recovery the stop operation also timed out after 120s, and Pacemaker
> put the resource into a blocked state, as the default on-fail for the
> stop operation is block. Because of this, Pacemaker didn't monitor
> the resource for a long time.

The default on-fail for stop is block only if stonith is disabled. The
default is fence if stonith is enabled.

If a stop fails, Pacemaker can't know what state the service is now in.
It could have stopped cleanly, or it could be still running, or it
could have stopped but left some unclean state behind. Therefore
Pacemaker can't safely recover the resource elsewhere. Fencing the node
is the only way to be sure that the service is no longer active and can
be safely recovered.
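
As a rough illustration of the default described above, here is a minimal
Python sketch. The function name default_stop_on_fail is hypothetical and
is not part of Pacemaker or pcs; it only encodes the rule stated in this
message.

```python
def default_stop_on_fail(stonith_enabled: bool) -> str:
    """Return the default on-fail value for a stop operation.

    Hypothetical helper: it models only the rule stated in this message,
    not Pacemaker's actual implementation.
    """
    # With fencing available, the node can be fenced, which is the only
    # way to be sure the service is no longer active.
    # Without fencing, the resource is blocked instead.
    return "fence" if stonith_enabled else "block"


if __name__ == "__main__":
    for enabled in (True, False):
        print(f"stonith enabled={enabled}: {default_stop_on_fail(enabled)}")
```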


> To avoid that, we set on-fail=restart for the stop operation. After
> that, the stop is retried when it times out, but when the second
> stop attempt succeeds, the resource is not started. Resource
> monitoring also does not happen at the configured interval.

Restart isn't allowed as the on-fail for stop actions.

A restart is a stop followed by a start. A restart can't recover from a
failed stop because a successful stop is the first step in a restart.
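
To make the ordering concrete, here is a small Python sketch of a restart
as "stop, then start". The restart function and its callables are
hypothetical stand-ins for illustration, not Pacemaker code.

```python
from typing import Callable


def restart(stop: Callable[[], bool], start: Callable[[], bool]) -> str:
    """Model a restart as a stop followed by a start.

    Hypothetical illustration: stop and start return True on success.
    """
    # The start is only attempted once the stop has succeeded, so a
    # failed stop leaves nothing for a restart to do.
    if not stop():
        return "stop failed, start not attempted"
    return "restarted" if start() else "start failed"


if __name__ == "__main__":
    print(restart(stop=lambda: False, start=lambda: True))
    print(restart(stop=lambda: True, start=lambda: True))
```

With a stop that fails, the start callable is never invoked, which is why
a restart cannot serve as the recovery for a failed stop.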

>  
> pcs resource show SERVER_dl360x2739
>  
> [root at dl360x2739 ~]# pcs resource show SERVER_dl360x2739
> Resource: SERVER_dl360x2739 (class=ocf provider=provider
> type=SERVER_RA)
>   Meta Attrs: priority=30 failure-timeout=120s migration-threshold=5
> target-role=Stopped
>   Operations: migrate_from interval=0s timeout=20 (SERVER_dl360x2739-
> migrate_from-interval-0s)
>               migrate_to interval=0s timeout=20 (SERVER_dl360x2739-
> migrate_to-interval-0s)
>               monitor interval=10s on-fail=restart timeout=120s
> (SERVER_dl360x2739-monitor-interval-10s)
>               reload interval=0s timeout=20 (SERVER_dl360x2739-
> reload-interval-0s)
>               start interval=0s on-fail=restart timeout=120s
> (SERVER_dl360x2739-start-interval-0s)
>               stop interval=0s on-fail=restart timeout=120s
> (SERVER_dl360x2739-stop-interval-0s) <-- we changed on-fail to
> restart.
> 
> Queries: 
> 
> 1) Why is on-fail=restart not working for the stop operation? Is this
> expected behavior, or is our resource operation setting invalid (refer
> to the config settings above)?
> 2) Is there any other parameter that can help to avoid this issue?
> 
> Thanks and Regards,
> S Sathish S
-- 
Ken Gaillot <kgaillot at redhat.com>


