[ClusterLabs] Recovering from node failure

Reid Wahl nwahl at redhat.com
Fri Dec 11 05:40:16 EST 2020


Hi, Gabriele. It sounds like you don't have quorum on node 1.
Resources won't start unless the node is part of a quorate cluster
partition.

You probably have "two_node: 1" configured by default in
corosync.conf. This setting automatically enables wait_for_all.
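
For reference, in a two-node deployment the quorum section of
/etc/corosync/corosync.conf usually looks something like this (values
here are illustrative; check your own file):

    quorum {
        provider: corosync_votequorum
        two_node: 1
        # wait_for_all is implied by two_node: 1; adding "wait_for_all: 0"
        # here lets a single node become quorate after a cold start
    }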

From the votequorum(5) man page:

       NOTES: enabling two_node: 1 automatically enables wait_for_all. It is
       still possible to override wait_for_all by explicitly setting it to 0.
       If more than 2 nodes join the cluster, the two_node option is
       automatically disabled.

       wait_for_all: 1

       Enables Wait For All (WFA) feature (default: 0).

       The general behaviour of votequorum is to switch a cluster from
       inquorate to quorate as soon as possible. For example, in an 8 node
       cluster, where every node has 1 vote, expected_votes is set to 8 and
       quorum is (50% + 1) 5. As soon as 5 (or more) nodes are visible to
       each other, the partition of 5 (or more) becomes quorate and can
       start operating.

       When WFA is enabled, the cluster will be quorate for the first time
       only after all nodes have been visible at least once at the same
       time.

       This feature has the advantage of avoiding some startup race
       conditions, with the cost that all nodes need to be up at the same
       time at least once before the cluster can operate.
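
On node 1 you can confirm whether two_node/WFA are in effect and whether
the partition is quorate with corosync-quorumtool, e.g.:

    # corosync-quorumtool -s

The "Flags:" line in its output lists the active features (for example
Quorate, 2Node, WaitForAll).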

You can either unblock quorum (`pcs quorum unblock` with pcs -- not
sure how to do it with crmsh) or set `wait_for_all: 0` in
corosync.conf and restart the cluster services.
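
For illustration, with pcs the two options would look roughly like this
(on a crmsh-based system you would edit corosync.conf and restart
corosync/pacemaker by hand):

    # pcs quorum unblock

or, after setting wait_for_all: 0 in the quorum section:

    # pcs cluster stop && pcs cluster start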

On Fri, Dec 11, 2020 at 2:23 AM Gabriele Bulfon <gbulfon at sonicle.com> wrote:
>
> Hi, I finally got stonith with IPMI working in my 2-node XStreamOS/illumos storage cluster.
> I have the NFS IPs and the shared storage zpool moving from one node to the other, and stonith controlling IPMI power-off when something is not clear.
>
> What happens now is that if I shut down the 2nd node, I see the OFFLINE status from node 1 and everything is up and running, and this is OK:
>
>
> Online: [ xstha1 ]
> OFFLINE: [ xstha2 ]
>
> Full list of resources:
>
>  xstha1_san0_IP      (ocf::heartbeat:IPaddr):        Started xstha1
>  xstha2_san0_IP      (ocf::heartbeat:IPaddr):        Started xstha1
>  xstha1-stonith      (stonith:external/ipmi):        Started xstha1
>  xstha2-stonith      (stonith:external/ipmi):        Started xstha1
>  zpool_data  (ocf::heartbeat:ZFS):   Started xstha1
>
> But if I also reboot the 1st node, it starts with node 2 in the UNCLEAN state and nothing is running, so I clear the state of node 2, but the resources are not started:
>
> Online: [ xstha1 ]
> OFFLINE: [ xstha2 ]
>
> Full list of resources:
>
>  xstha1_san0_IP      (ocf::heartbeat:IPaddr):        Stopped
>  xstha2_san0_IP      (ocf::heartbeat:IPaddr):        Stopped
>  xstha1-stonith      (stonith:external/ipmi):        Stopped
>  xstha2-stonith      (stonith:external/ipmi):        Stopped
>  zpool_data  (ocf::heartbeat:ZFS):   Stopped
>
> I tried restarting zpool_data or other resources:
>
> # crm resource start zpool_data
>
> but nothing happens!
> How can I recover from this state? Node2 needs to stay down, but I want node1 to work.
>
> Thanks!
> Gabriele
>
>
> Sonicle S.r.l. : http://www.sonicle.com
> Music: http://www.gabrielebulfon.com
> eXoplanets : https://gabrielebulfon.bandcamp.com/album/exoplanets
>



-- 
Regards,

Reid Wahl, RHCA
Senior Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA


