[ClusterLabs] Failure to configure iface-bridge resource causes cluster node fence action.

Scott Greenlese swgreenl at us.ibm.com
Fri Feb 3 15:22:27 EST 2017


Ken,

Thanks for the explanation.

One other thing, relating to the iface-bridge resource creation: I
specified the --disabled flag:

> [root at zs95kj VD]# date;pcs resource create br0_r1
> ocf:heartbeat:iface-bridge bridge_name=br0 bridge_slaves=vlan1292 op
> monitor timeout="20s" interval="10s" --disabled

Does the bridge device have to be successfully configured by pacemaker
before the resource can be disabled?  That seems to be the behavior, since
pacemaker failed the resource and fenced the node instead of disabling the
resource.
Just checking with you to be sure.
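
(For reference, my understanding -- and please correct me if this is wrong --
is that --disabled is just shorthand for creating the resource with the
target-role meta attribute set to Stopped, i.e. roughly equivalent to:

    # assumed equivalent of the --disabled flag used above
    pcs resource create br0_r1 ocf:heartbeat:iface-bridge \
        bridge_name=br0 bridge_slaves=vlan1292 \
        op monitor timeout="20s" interval="10s" \
        meta target-role=Stopped

so I expected the resource to be created in a stopped state.)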

Thanks again.

Scott Greenlese ... IBM KVM on System Z Solutions Test,  Poughkeepsie, N.Y.
  INTERNET:  swgreenl at us.ibm.com





From:	Ken Gaillot <kgaillot at redhat.com>
To:	users at clusterlabs.org
Date:	02/02/2017 03:29 PM
Subject:	Re: [ClusterLabs] Failure to configure iface-bridge resource causes cluster node fence action.



On 02/02/2017 02:14 PM, Scott Greenlese wrote:
> Hi folks,
>
> I'm testing iface-bridge resource support on a Linux KVM on System Z
> pacemaker cluster.
>
> pacemaker-1.1.13-10.el7_2.ibm.1.s390x
> corosync-2.3.4-7.el7_2.ibm.1.s390x
>
> I created an iface-bridge resource, but specified a non-existent
> bridge_slaves value, vlan1292.
>
> [root at zs95kj VD]# date;pcs resource create br0_r1
> ocf:heartbeat:iface-bridge bridge_name=br0 bridge_slaves=vlan1292 op
> monitor timeout="20s" interval="10s" --disabled
> Wed Feb 1 17:49:16 EST 2017
> [root at zs95kj VD]#
>
> [root at zs95kj VD]# pcs resource show |grep br0
> br0_r1 (ocf::heartbeat:iface-bridge): FAILED zs93kjpcs1
> [root at zs95kj VD]#
>
> As you can see, the resource was created, but failed to start on the
> target node zs93kjpcs1.
>
> To my surprise, the target node zs93kjpcs1 was unceremoniously fenced.
>
> pacemaker.log shows a fence (off) action initiated against that target
> node, "because of resource failure(s)" :
>
> Feb 01 17:55:56 [52941] zs95kj crm_resource: ( unpack.c:2719 ) debug:
> determine_op_status: br0_r1_stop_0 on zs93kjpcs1 returned 'not
> configured' (6) instead of the expected value: 'ok' (0)
> Feb 01 17:55:56 [52941] zs95kj crm_resource: ( unpack.c:2602 ) warning:
> unpack_rsc_op_failure: Processing failed op stop for br0_r1 on
> zs93kjpcs1: not configured (6)
> Feb 01 17:55:56 [52941] zs95kj crm_resource: ( unpack.c:3244 ) error:
> unpack_rsc_op: Preventing br0_r1 from re-starting anywhere: operation
> stop failed 'not configured' (6)
> Feb 01 17:55:56 [52941] zs95kj crm_resource: ( unpack.c:2719 ) debug:
> determine_op_status: br0_r1_stop_0 on zs93kjpcs1 returned 'not
> configured' (6) instead of the expected value: 'ok' (0)
> Feb 01 17:55:56 [52941] zs95kj crm_resource: ( unpack.c:2602 ) warning:
> unpack_rsc_op_failure: Processing failed op stop for br0_r1 on
> zs93kjpcs1: not configured (6)
> Feb 01 17:55:56 [52941] zs95kj crm_resource: ( unpack.c:3244 ) error:
> unpack_rsc_op: Preventing br0_r1 from re-starting anywhere: operation
> stop failed 'not configured' (6)
> Feb 01 17:55:56 [52941] zs95kj crm_resource: ( unpack.c:96 ) warning:
> pe_fence_node: Node zs93kjpcs1 will be fenced because of resource failure(s)
>
>
> Thankfully, I was able to successfully create an iface-bridge resource
> when I changed the bridge_slaves value to an existing vlan interface.
>
> My main concern is, why would the response to a failed bridge config
> operation warrant a node fence (off) action? Isn't it enough to just
> fail the resource and try another cluster node,
> or at most, give up if it can't be started / configured on any node?
>
> Is there any way to control this harsh recovery action in the cluster?
>
> Thanks much..
>
>
> Scott Greenlese ... IBM KVM on System Z Solutions Test, Poughkeepsie, N.Y.
> INTERNET: swgreenl at us.ibm.com

It's actually the stop operation failure that leads to the fence.

If a resource fails to stop, fencing is the only way pacemaker can
recover the resource elsewhere. Consider a database master -- if it
doesn't stop, starting the master elsewhere could lead to severe data
inconsistency.

You can tell pacemaker not to attempt recovery by setting on-fail=block
on the stop operation, so it doesn't need to fence. Obviously, that
sacrifices high availability, as manual intervention is then required to
do anything further with the service.
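
For example (a rough sketch, not tested here), adding an explicit stop
operation with on-fail=block to the existing resource would look something
like:

    # assumed pcs syntax; a failed stop then blocks the resource on that
    # node instead of triggering fencing
    pcs resource update br0_r1 op stop on-fail=block timeout="20s"

The resource then stays failed/blocked on that node until you clear the
failure manually (e.g. pcs resource cleanup br0_r1).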

_______________________________________________
Users mailing list: Users at clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


