[ClusterLabs] Failure to configure iface-bridge resource causes cluster node fence action.
Ken Gaillot
kgaillot at redhat.com
Thu Feb 2 15:28:20 EST 2017
On 02/02/2017 02:14 PM, Scott Greenlese wrote:
> Hi folks,
>
> I'm testing iface-bridge resource support on a Linux KVM on System Z
> pacemaker cluster.
>
> pacemaker-1.1.13-10.el7_2.ibm.1.s390x
> corosync-2.3.4-7.el7_2.ibm.1.s390x
>
> I created an iface-bridge resource, but specified a non-existent
> bridge_slaves value, vlan1292 (i.e. vlan1292 doesn't exist).
>
> [root@zs95kj VD]# date;pcs resource create br0_r1
> ocf:heartbeat:iface-bridge bridge_name=br0 bridge_slaves=vlan1292 op
> monitor timeout="20s" interval="10s" --disabled
> Wed Feb 1 17:49:16 EST 2017
> [root@zs95kj VD]#
>
> [root@zs95kj VD]# pcs resource show |grep br0
> br0_r1 (ocf::heartbeat:iface-bridge): FAILED zs93kjpcs1
> [root@zs95kj VD]#
>
> As you can see, the resource was created, but failed to start on the
> target node zs93kjpcs1.
>
> To my surprise, the target node zs93kjpcs1 was unceremoniously fenced.
>
> pacemaker.log shows a fence (off) action initiated against that target
> node, "because of resource failure(s)" :
>
> Feb 01 17:55:56 [52941] zs95kj crm_resource: ( unpack.c:2719 ) debug:
> determine_op_status: br0_r1_stop_0 on zs93kjpcs1 returned 'not
> configured' (6) instead of the expected value: 'ok' (0)
> Feb 01 17:55:56 [52941] zs95kj crm_resource: ( unpack.c:2602 ) warning:
> unpack_rsc_op_failure: Processing failed op stop for br0_r1 on
> zs93kjpcs1: not configured (6)
> Feb 01 17:55:56 [52941] zs95kj crm_resource: ( unpack.c:3244 ) error:
> unpack_rsc_op: Preventing br0_r1 from re-starting anywhere: operation
> stop failed 'not configured' (6)
> Feb 01 17:55:56 [52941] zs95kj crm_resource: ( unpack.c:2719 ) debug:
> determine_op_status: br0_r1_stop_0 on zs93kjpcs1 returned 'not
> configured' (6) instead of the expected value: 'ok' (0)
> Feb 01 17:55:56 [52941] zs95kj crm_resource: ( unpack.c:2602 ) warning:
> unpack_rsc_op_failure: Processing failed op stop for br0_r1 on
> zs93kjpcs1: not configured (6)
> Feb 01 17:55:56 [52941] zs95kj crm_resource: ( unpack.c:3244 ) error:
> unpack_rsc_op: Preventing br0_r1 from re-starting anywhere: operation
> stop failed 'not configured' (6)
> Feb 01 17:55:56 [52941] zs95kj crm_resource: ( unpack.c:96 ) warning:
> pe_fence_node: Node zs93kjpcs1 will be fenced because of resource failure(s)
>
>
> Thankfully, I was able to successfully create an iface-bridge resource
> when I changed the bridge_slaves value to an existing vlan interface.
>
> My main concern is, why would the response to a failed bridge config
> operation warrant a node fence (off) action? Isn't it enough to just
> fail the resource and try another cluster node,
> or at most, give up if it can't be started / configured on any node?
>
> Is there any way to control this harsh recovery action in the cluster?
>
> Thanks much..
>
>
> Scott Greenlese ... IBM KVM on System Z Solutions Test, Poughkeepsie, N.Y.
> INTERNET: swgreenl at us.ibm.com

It's actually the stop operation failure that leads to the fence. In
your log, br0_r1_stop_0 returned 'not configured' (6).

If a resource fails to stop, fencing is the only way pacemaker can
recover the resource elsewhere. Consider a database master -- if it
doesn't stop, starting the master elsewhere could lead to severe data
inconsistency.

You can tell pacemaker not to attempt recovery by setting on-fail=block
on the stop operation, so it doesn't need to fence. Obviously, that
prevents high availability, as manual intervention is required to do
anything further with the service.
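
As a rough sketch of what that could look like with pcs, assuming the
br0_r1 resource from your message (the interval and timeout values here
are illustrative, not recommendations, and the exact syntax may differ
between pcs versions):

```shell
# Assumption: resource name br0_r1 is taken from the quoted message;
# interval/timeout values are placeholders chosen for illustration.

# Set on-fail=block on the stop operation, so a failed stop leaves the
# resource blocked instead of triggering a fence of the node.
pcs resource update br0_r1 op stop interval=0s timeout=20s on-fail=block

# Review the resource definition to confirm the stop operation settings.
pcs resource show br0_r1

# Once the underlying problem has been fixed by hand, clear the failure
# history so pacemaker will manage the resource again.
pcs resource cleanup br0_r1
```

With on-fail=block in place, a failed stop leaves the resource where it
is until someone steps in, which is the trade-off described above.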