[ClusterLabs] Our 2-Node Cluster with a Separate Qdevice Went Down Anyway?

Andrei Borzenkov arvidjaar at gmail.com
Fri Feb 26 12:26:36 EST 2021


26.02.2021 19:19, Eric Robinson wrote:
> At 5:16 am Pacific time Monday, one of our cluster nodes failed and its mysql services went down. The cluster did not automatically recover.
> 
> We're trying to figure out:
> 
> 
>   1.  Why did it fail?

Pacemaker only registered the loss of connection between the two nodes. You
need to investigate why that happened.
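As a first step (assuming the nodes run corosync 2.x/3.x under systemd,
which the log format suggests), I would look at the corosync side around
05:16 on both nodes, e.g.:

# corosync-cfgtool -s
# journalctl -u corosync --since "2021-02-22 05:10" --until "2021-02-22 05:20"

corosync-cfgtool shows the current ring/link state, and the corosync log
should tell you whether it saw token loss, a link going down, or the qdevice
connection dropping at that time.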

>   2.  Why did it not automatically recover?
> 
> The cluster did not recover until we manually executed...
> 

The *cluster* never failed in the first place. A specific resource may have.
Do not confuse things more than necessary.

> # pcs resource cleanup p_mysql_622
> 

Because this resource failed to stop, and that is fatal.
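Before the cleanup you should be able to see that blocked state in the
status output, e.g.:

# pcs status --full
# crm_mon -1rf

Both should list the timed-out stop of p_mysql_622 among the failed actions
and show the resource in a failed/blocked state until the failure record is
cleared.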

> Feb 22 05:16:30 [91682] 001db01a    pengine:   notice: LogAction:        * Stop       p_mysql_622      (                 001db01a )   due to no quorum

The remaining node lost quorum and decided to stop resources.
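That is the (default) no-quorum-policy=stop behaviour. You can check what
your cluster is actually configured with via, e.g.:

# crm_attribute --query --type crm_config --name no-quorum-policy

If the property is unset, the default of "stop" applies.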

> Feb 22 05:16:30 [91683] 001db01a       crmd:   notice: te_rsc_command:  Initiating stop operation p_mysql_622_stop_0 locally on 001db01a | action 76
...
> Feb 22 05:16:30 [91680] 001db01a       lrmd:     info: log_execute:     executing - rsc:p_mysql_622 action:stop call_id:308
...
> Feb 22 05:16:45 [91680] 001db01a       lrmd:  warning: child_timeout_callback:  p_mysql_622_stop_0 process (PID 19225) timed out
> Feb 22 05:16:45 [91680] 001db01a       lrmd:  warning: operation_finished:      p_mysql_622_stop_0:19225 - timed out after 15000ms
> Feb 22 05:16:45 [91680] 001db01a       lrmd:     info: log_finished:    finished - rsc:p_mysql_622 action:stop call_id:308 pid:19225 exit-code:1 exec-time:15002ms queue-time:0ms
> Feb 22 05:16:45 [91683] 001db01a       crmd:    error: process_lrm_event:       Result of stop operation for p_mysql_622 on 001db01a: Timed Out | call=308 key=p_mysql_622_stop_0 timeout=15000ms
...
> Feb 22 05:16:38 [112948] 001db01b    pengine:     info: LogActions:     Leave   p_mysql_622     (Started unmanaged)

At this point pacemaker stops managing this resource because its status is
unknown. The normal reaction to a stop failure is to fence the node and fail
the resource over, but apparently you also do not have working stonith.
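If those hosts have some out-of-band management (IPMI, iLO, a managed PDU,
...), configuring a fence device is what turns a timed-out stop into
recovery instead of an unmanaged resource. Purely as an illustration, with
fence_ipmilan and the addresses/credentials as placeholders for whatever
matches your hardware (parameter names vary between fence-agents versions):

# pcs stonith create fence-001db01a fence_ipmilan pcmk_host_list=001db01a \
    ip=<ipmi-address-of-001db01a> username=<user> password=<secret>
# pcs stonith create fence-001db01b fence_ipmilan pcmk_host_list=001db01b \
    ip=<ipmi-address-of-001db01b> username=<user> password=<secret>
# pcs property set stonith-enabled=true

With working stonith the node that cannot stop the resource gets fenced and
the resource is recovered on the peer instead of being left unmanaged.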

Loss of quorum may be related to a network issue, such that the nodes lost
connection both to each other and to the quorum device.
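To see how the qdevice side looks (and whether the surviving node can still
reach the qnetd server), something along these lines may help, assuming the
standard corosync-qdevice/qnetd packages:

# pcs quorum status                (on a cluster node)
# corosync-qdevice-tool -sv        (on each cluster node)
# corosync-qnetd-tool -l           (on the qdevice host)

If the node that stayed up could not reach the qnetd server either, the
qdevice could not give it the extra vote, which would match the "due to no
quorum" decision in the pengine log.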

