[ClusterLabs] Failure to configure iface-bridge resource causes cluster node fence action.

Thu Feb 2 15:28:20 EST 2017

On 02/02/2017 02:14 PM, Scott Greenlese wrote:
> Hi folks,
> 
> I'm testing iface-bridge resource support on a Linux KVM on System Z
> pacemaker cluster.
> 
> pacemaker-1.1.13-10.el7_2.ibm.1.s390x
> corosync-2.3.4-7.el7_2.ibm.1.s390x
> 
> I created an iface-bridge resource, but specified a non-existent
> bridge_slaves value, vlan1292 (i.e. vlan1292 doesn't exist).
> 
> [root at zs95kj VD]# date;pcs resource create br0_r1
> ocf:heartbeat:iface-bridge bridge_name=br0 bridge_slaves=vlan1292 op
> monitor timeout="20s" interval="10s" --disabled
> Wed Feb 1 17:49:16 EST 2017
> [root at zs95kj VD]#
> 
> [root at zs95kj VD]# pcs resource show |grep br0
> br0_r1 (ocf::heartbeat:iface-bridge): FAILED zs93kjpcs1
> [root at zs95kj VD]#
> 
> As you can see, the resource was created, but failed to start on the
> target node zs93kjpcs1.
> 
> To my surprise, the target node zs93kjpcs1 was unceremoniously fenced.
> 
> pacemaker.log shows a fence (off) action initiated against that target
> node, "because of resource failure(s)" :
> 
> Feb 01 17:55:56 [52941] zs95kj crm_resource: ( unpack.c:2719 ) debug:
> determine_op_status: br0_r1_stop_0 on zs93kjpcs1 returned 'not
> configured' (6) instead of the expected value: 'ok' (0)
> Feb 01 17:55:56 [52941] zs95kj crm_resource: ( unpack.c:2602 ) warning:
> unpack_rsc_op_failure: Processing failed op stop for br0_r1 on
> zs93kjpcs1: not configured (6)
> Feb 01 17:55:56 [52941] zs95kj crm_resource: ( unpack.c:3244 ) error:
> unpack_rsc_op: Preventing br0_r1 from re-starting anywhere: operation
> stop failed 'not configured' (6)
> Feb 01 17:55:56 [52941] zs95kj crm_resource: ( unpack.c:2719 ) debug:
> determine_op_status: br0_r1_stop_0 on zs93kjpcs1 returned 'not
> configured' (6) instead of the expected value: 'ok' (0)
> Feb 01 17:55:56 [52941] zs95kj crm_resource: ( unpack.c:2602 ) warning:
> unpack_rsc_op_failure: Processing failed op stop for br0_r1 on
> zs93kjpcs1: not configured (6)
> Feb 01 17:55:56 [52941] zs95kj crm_resource: ( unpack.c:3244 ) error:
> unpack_rsc_op: Preventing br0_r1 from re-starting anywhere: operation
> stop failed 'not configured' (6)
> Feb 01 17:55:56 [52941] zs95kj crm_resource: ( unpack.c:96 ) warning:
> pe_fence_node: Node zs93kjpcs1 will be fenced because of resource failure(s)
> 
> 
> Thankfully, I was able to successfully create an iface-bridge resource
> when I changed the bridge_slaves value to an existing vlan interface.
> 
> My main concern is, why would the response to a failed bridge config
> operation warrant a node fence (off) action? Isn't it enough to just
> fail the resource and try another cluster node, or at most, give up
> if it can't be started / configured on any node?
> 
> Is there any way to control this harsh recovery action in the cluster?
> 
> Thanks much..
> 
> 
> Scott Greenlese ... IBM KVM on System Z Solutions Test, Poughkeepsie, N.Y.
> INTERNET: swgreenl at us.ibm.com

It's actually the stop operation failure that leads to the fence.

If a resource fails to stop, fencing is the only way pacemaker can
recover the resource elsewhere. Consider a database master -- if it
doesn't stop, starting the master elsewhere could lead to severe data
inconsistency.

You can tell pacemaker not to attempt recovery by setting on-fail=block
on the stop operation, so that it doesn't need to fence. Obviously, that
prevents high availability, as manual intervention is required to do
anything further with the service.
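
As a rough sketch of what that could look like with pcs, assuming the
br0_r1 resource from the original message: the helper function below is
hypothetical (it is not part of pcs) and only prints the command unless
RUN=1 is set. The exact op syntax may differ between pcs versions, so
check "pcs resource --help" on your system first.

```shell
# Hypothetical helper (not part of pcs): print, or with RUN=1 execute,
# the pcs command that sets on-fail=block on a resource's stop operation.
set_stop_on_fail_block() {
    rsc="$1"
    # Build the command as positional parameters so it is not re-split.
    set -- pcs resource update "$rsc" op stop interval=0s on-fail=block
    if [ "${RUN:-0}" = "1" ]; then
        "$@"        # actually change the cluster configuration
    else
        echo "$*"   # dry run: only show what would be executed
    fi
}

# Dry run for the resource from this thread.
set_stop_on_fail_block br0_r1
```

With on-fail=block in place, a failed stop should leave the resource
blocked rather than trigger fencing. After fixing the underlying problem,
something like "pcs resource cleanup br0_r1" is typically needed before
the cluster will manage the resource again.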