[ClusterLabs] Antw: [EXT] Recovering from node failure

Gabriele Bulfon gbulfon at sonicle.com
Fri Dec 11 09:51:28 EST 2020


I cannot use "wait_for_all: 0", because that would automatically move a powered-off node from UNCLEAN to OFFLINE and mount the ZFS pool (a total risk!): I want to move it from UNCLEAN to OFFLINE manually, once I know the 2nd node is actually off!
 
Actually, with wait_for_all at its default (1), that was the case: node1 would wait for my intervention when booting while node2 was down.
So what I think I need is some way to manually override quorum in such a case (node2 down for maintenance, node1 rebooting): I would manually mark node2 OFFLINE instead of UNCLEAN, manually override quorum, and have the zpool mounted and the NFS IP up.
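For reference, the behaviour described above comes from the votequorum settings in corosync.conf; a typical two-node stanza looks something like this (a sketch only, the exact file layout on XStreamOS/illumos may differ):

```
quorum {
    provider: corosync_votequorum
    two_node: 1        # two-node mode; also enables wait_for_all by default
    wait_for_all: 1    # on startup, stay inquorate until both nodes have been seen
}
```

With wait_for_all: 1 (the default once two_node is set), a booting node will not gain quorum on its own, which is exactly the safe-but-blocking behaviour described above.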
 
Any idea?
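One possible shape for that manual override, sketched with crmsh and the corosync tools (treat the command names as assumptions and verify them against the versions actually installed before running anything on the live cluster):

```shell
# 1. Only after physically confirming that node2 is powered off:
#    clear its state so pacemaker moves it from UNCLEAN to OFFLINE
crm node clearstate xstha2

# 2. Tell votequorum that the single remaining node is enough
#    (may be refused in two_node mode; see corosync-quorumtool(8))
corosync-quorumtool -e 1

# 3. Alternatively, tell pacemaker to run resources without quorum
#    (last resort; restore the default once node2 is back)
crm configure property no-quorum-policy=ignore
```

With quorum restored (or deliberately ignored), the zpool_data and IPaddr resources should then be startable on node1.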
 
 
Sonicle S.r.l. : http://www.sonicle.com
Music: http://www.gabrielebulfon.com
eXoplanets : https://gabrielebulfon.bandcamp.com/album/exoplanets
 




----------------------------------------------------------------------------------

From: Ulrich Windl <Ulrich.Windl at rz.uni-regensburg.de>
To: users at clusterlabs.org 
Date: 11 December 2020 11:35:44 CET
Subject: [ClusterLabs] Antw: [EXT] Recovering from node failure


Hi!

Did you take care of the special "two node" settings (I mean quorum)?
When I use "crm_mon -1Arfj", I see something like
" * Current DC: h19 (version 2.0.4+20200616.2deceaa3a-3.3.1-2.0.4+20200616.2deceaa3a) - partition with quorum"

What do you see?
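Besides crm_mon, the votequorum state (including the 2Node and WaitForAll flags) can be inspected directly; assuming corosync 2.x tooling is available:

```shell
# Show quorum status: expected votes, total votes, and flags
# such as "2Node Quorate WaitForAll"
corosync-quorumtool -s
```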

Regards,
Ulrich

>>> Gabriele Bulfon <gbulfon at sonicle.com> wrote on 11.12.2020 at 11:23 in
message <350849824.6300.1607682209284 at www>:
> Hi, I finally managed to get stonith working with IPMI in my 2-node 
> XStreamOS/illumos storage cluster.
> I have NFS IPs and a shared storage zpool moving from one node to the other, 
> and stonith controlling IPMI power-off when something is not clear.
> 
> What happens now is that if I shut down the 2nd node, I see the OFFLINE status 
> from node 1 and everything is up and running, and this is ok:
> 
> Online: [ xstha1 ]
> OFFLINE: [ xstha2 ]
> Full list of resources:
> xstha1_san0_IP (ocf::heartbeat:IPaddr): Started xstha1
> xstha2_san0_IP (ocf::heartbeat:IPaddr): Started xstha1
> xstha1-stonith (stonith:external/ipmi): Started xstha1
> xstha2-stonith (stonith:external/ipmi): Started xstha1
> zpool_data (ocf::heartbeat:ZFS): Started xstha1
> But if I also reboot the 1st node, it starts in the UNCLEAN state, nothing is 
> running, so I clear the state of node 2, but resources are not started:
> 
> Online: [ xstha1 ]
> OFFLINE: [ xstha2 ]
> Full list of resources:
> xstha1_san0_IP (ocf::heartbeat:IPaddr): Stopped
> xstha2_san0_IP (ocf::heartbeat:IPaddr): Stopped
> xstha1-stonith (stonith:external/ipmi): Stopped
> xstha2-stonith (stonith:external/ipmi): Stopped
> zpool_data (ocf::heartbeat:ZFS): Stopped
> I tried restarting zpool_data or other resources:
> # crm resource start zpool_data
> but nothing happens!
> How can I recover from this state? Node2 needs to stay down, but I want 
> node1 to work.
> Thanks!
> Gabriele 
> 
> 
> Sonicle S.r.l. : http://www.sonicle.com 
> Music: http://www.gabrielebulfon.com 
> eXoplanets : https://gabrielebulfon.bandcamp.com/album/exoplanets 
> 




_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

