[ClusterLabs] Antw: Re: Antw: [EXT] Recovering from node failure

Ulrich Windl Ulrich.Windl at rz.uni-regensburg.de
Mon Dec 14 02:52:16 EST 2020


>>> Gabriele Bulfon <gbulfon at sonicle.com> wrote on 11.12.2020 at 15:51 in
message <1053095478.6540.1607698288628 at www>:
> I cannot "use wait_for_all: 0", cause this would move automatically a powered 
> off node from UNCLEAN to OFFLINE and mount the ZFS pool (total risk!): I want 
> to manually move from UNCLEAN to OFFLINE, when I know that 2nd node is 
> actually off!

Personally, I think that if you have to confirm manually that a node is down, you don't really need a cluster, because all actions would wait until the node is no longer unclean. I wouldn't want to be alerted in the middle of the night or at weekends just to confirm that there was some problem, when the cluster could handle it automatically while I sleep.
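For reference, the corosync options being discussed here are the votequorum two-node settings. A minimal sketch of the relevant quorum section (illustrative only, not taken from Gabriele's actual configuration):

    quorum {
        provider: corosync_votequorum
        two_node: 1
        # two_node: 1 implicitly enables wait_for_all: 1;
        # wait_for_all: 0 would restore the automatic (and, as noted above, risky) behaviour
        wait_for_all: 1
    }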

>  
> Actually, with wait_for_all at its default (1), that was the case: node1 would 
> wait for my intervention when booting while node2 is down.
> So what I think I need is some way to manually override quorum in such a 
> case (node2 down for maintenance, node1 rebooting): I would manually turn 
> node2 from UNCLEAN to OFFLINE, manually override quorum, and have the zpool 
> mounted and the NFS IP up.
>  
> Any idea?
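One way to keep wait_for_all enabled and still handle this maintenance case by hand could look like the sketch below (node names as used in this thread; this is an assumption about what could work, not a recommendation from the thread, and no-quorum-policy=ignore carries the usual split-brain risk, so it should be reverted after maintenance):

    # operator has verified that xstha2 is really powered off
    crm node clearstate xstha2            # or: stonith_admin --confirm=xstha2
    # let Pacemaker start resources even though the partition has no quorum
    crm configure property no-quorum-policy=ignore
    # ... maintenance window ...
    # once xstha2 is back and quorum is restored, revert to the default
    crm configure property no-quorum-policy=stop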
>  
>  
> Sonicle S.r.l. : http://www.sonicle.com 
> Music: http://www.gabrielebulfon.com 
> eXoplanets : https://gabrielebulfon.bandcamp.com/album/exoplanets 
>  
> 
> 
> 
> 
> ----------------------------------------------------------------------------------
> 
> From: Ulrich Windl <Ulrich.Windl at rz.uni-regensburg.de>
> To: users at clusterlabs.org 
> Date: 11 December 2020 11:35:44 CET
> Subject: [ClusterLabs] Antw: [EXT] Recovering from node failure
> 
> 
> Hi!
> 
> Did you take care of the special "two node" settings (I mean quorum)?
> When I use "crm_mon -1Arfj", I see something like
> " * Current DC: h19 (version 
> 2.0.4+20200616.2deceaa3a-3.3.1-2.0.4+20200616.2deceaa3a) - partition with 
> quorum"
> 
> What do you see?
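Besides crm_mon, corosync-quorumtool shows the quorum state directly, including whether the two-node options are active; a quick check might look like this (output wording is approximate):

    corosync-quorumtool -s
    # the "Flags:" line should show something like: 2Node Quorate WaitForAll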
> 
> Regards,
> Ulrich
> 
>>>> Gabriele Bulfon <gbulfon at sonicle.com> wrote on 11.12.2020 at 11:23 in
> message <350849824.6300.1607682209284 at www>:
>> Hi, I finally managed to set up stonith with IPMI in my 2-node XStreamOS/illumos 
>> storage cluster.
>> I have NFS IPs and the shared-storage zpool moving from one node to the other, 
>> and stonith controlling IPMI power-off when something is not clear.
>> 
>> What happens now is that if I shut down the 2nd node, I see the OFFLINE status 
>> from node1, and everything is up and running, which is OK:
>> 
>> Online: [ xstha1 ]
>> OFFLINE: [ xstha2 ]
>> Full list of resources:
>> xstha1_san0_IP (ocf::heartbeat:IPaddr): Started xstha1
>> xstha2_san0_IP (ocf::heartbeat:IPaddr): Started xstha1
>> xstha1-stonith (stonith:external/ipmi): Started xstha1
>> xstha2-stonith (stonith:external/ipmi): Started xstha1
>> zpool_data (ocf::heartbeat:ZFS): Started xstha1
>> But if I also reboot the 1st node, it comes up with node2 in the UNCLEAN state, 
>> nothing is running, and after I clear the state of node 2 the resources are still not started:
>> 
>> Online: [ xstha1 ]
>> OFFLINE: [ xstha2 ]
>> Full list of resources:
>> xstha1_san0_IP (ocf::heartbeat:IPaddr): Stopped
>> xstha2_san0_IP (ocf::heartbeat:IPaddr): Stopped
>> xstha1-stonith (stonith:external/ipmi): Stopped
>> xstha2-stonith (stonith:external/ipmi): Stopped
>> zpool_data (ocf::heartbeat:ZFS): Stopped
>> I tried restarting zpool_data and other resources:
>> # crm resource start zpool_data
>> but nothing happens!
>> How can I recover from this state? Node2 needs to stay down, but I want 
>> node1 to work.
>> Thanks!
>> Gabriele 
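For the situation described above, the sequence Gabriele ran maps to roughly the following crmsh commands (a sketch; while the rebooted node is still waiting for wait_for_all, the partition has no quorum, so the start request is accepted but nothing is scheduled):

    crm node clearstate xstha2      # moves xstha2 from UNCLEAN to OFFLINE
    crm resource start zpool_data   # no effect while quorum is missing
    crm_mon -1Arfj                  # header should read something like "partition WITHOUT quorum"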
>> 
>> 
>> Sonicle S.r.l. : http://www.sonicle.com 
>> Music: http://www.gabrielebulfon.com 
>> eXoplanets : https://gabrielebulfon.bandcamp.com/album/exoplanets 
>> 
> 
> 
> 
> 





