[ClusterLabs] Antw: Re: Antw: [EXT] Recovering from node failure
Ulrich Windl
Ulrich.Windl at rz.uni-regensburg.de
Mon Dec 14 02:52:16 EST 2020
>>> Gabriele Bulfon <gbulfon at sonicle.com> wrote on 11.12.2020 at 15:51 in
message <1053095478.6540.1607698288628 at www>:
> I cannot use "wait_for_all: 0", because that would automatically move a powered-
> off node from UNCLEAN to OFFLINE and mount the ZFS pool (a total risk!): I want
> to move from UNCLEAN to OFFLINE manually, when I know that the 2nd node is
> actually off!
Personally I think that if you have to confirm manually that a node is down, you don't really need a cluster, because all actions would wait until the node is no longer unclean. I wouldn't want to be alerted in the middle of the night on a weekend just to confirm that there was some problem, when the cluster could handle it automatically while I sleep.
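A minimal sketch of the quorum settings under discussion, assuming corosync 2.x with votequorum (the values are illustrative, not the actual corosync.conf of this cluster):

    quorum {
        provider: corosync_votequorum
        two_node: 1
        # two_node implies wait_for_all: 1 by default; setting wait_for_all: 0
        # would let a single rebooted node become quorate without ever seeing
        # its peer, which is exactly the risk described above.
        wait_for_all: 0
    }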
>
> Actually with wait_for_all at its default (1) that was the case, so node1 would
> wait for my intervention when booting while node2 is down.
> So what I think I need is some way to manually override the quorum in such a
> case (node 2 down for maintenance, node 1 rebooted), so I would manually turn
> node2 from UNCLEAN to OFFLINE, manually override quorum, and have the zpool mounted and
> the NFS IP up.
>
> Any idea?
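One way such a manual override is sometimes done, assuming corosync with votequorum, is to lower the expected votes on the surviving node until the other one returns; the commands below are only an illustrative sketch, not a tested procedure for this cluster:

    # show the current votequorum state (expected votes, quorate or not)
    corosync-quorumtool -s
    # temporarily set expected votes to 1 so the single remaining node
    # becomes quorate; revert once node2 is back online
    corosync-quorumtool -e 1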
>
>
> Sonicle S.r.l. : http://www.sonicle.com
> Music: http://www.gabrielebulfon.com
> eXoplanets : https://gabrielebulfon.bandcamp.com/album/exoplanets
>
>
>
>
>
> ----------------------------------------------------------------------------------
>
> From: Ulrich Windl <Ulrich.Windl at rz.uni-regensburg.de>
> To: users at clusterlabs.org
> Date: 11 December 2020 11:35:44 CET
> Subject: [ClusterLabs] Antw: [EXT] Recovering from node failure
>
>
> Hi!
>
> Did you take care of the special "two node" settings (I mean quorum)?
> When I use "crm_mon -1Arfj", I see something like:
> " * Current DC: h19 (version 2.0.4+20200616.2deceaa3a-3.3.1-2.0.4+20200616.2deceaa3a) - partition with quorum"
>
> What do you see?
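The quorum-related settings corosync is actually running with can also be inspected directly; a sketch, assuming a corosync-based stack:

    # dump the runtime configuration keys related to quorum
    # (look for two_node and wait_for_all)
    corosync-cmapctl | grep -i quorum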
>
> Regards,
> Ulrich
>
>>>> Gabriele Bulfon <gbulfon at sonicle.com> wrote on 11.12.2020 at 11:23 in
> message <350849824.6300.1607682209284 at www>:
>> Hi, I finally got stonith with IPMI working in my 2-node XStreamOS/illumos
>
>> storage cluster.
>> I have NFS IPs and the shared storage zpool moving from one node to the other,
>> and stonith controlling IPMI power-off when something is not clear.
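For context, a stonith resource of this kind is usually defined along these lines; the parameter values below are illustrative assumptions for the external/ipmi plugin, not the poster's actual configuration:

    # illustrative only: hypothetical IPMI address and credentials
    crm configure primitive xstha2-stonith stonith:external/ipmi \
        params hostname=xstha2 ipaddr=10.0.0.12 userid=ADMIN passwd=secret interface=lanplus \
        op monitor interval=60s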
>>
>> What happens now is that if I shut down the 2nd node, I see the OFFLINE status
>> from node 1 and everything is up and running, and this is ok:
>>
>> Online: [ xstha1 ]
>> OFFLINE: [ xstha2 ]
>> Full list of resources:
>> xstha1_san0_IP (ocf::heartbeat:IPaddr): Started xstha1
>> xstha2_san0_IP (ocf::heartbeat:IPaddr): Started xstha1
>> xstha1-stonith (stonith:external/ipmi): Started xstha1
>> xstha2-stonith (stonith:external/ipmi): Started xstha1
>> zpool_data (ocf::heartbeat:ZFS): Started xstha1
>> But if I also reboot the 1st node, it comes up with node 2 in the UNCLEAN state and nothing is
>> running, so I clear the state of node 2, but the resources are not started:
>>
>> Online: [ xstha1 ]
>> OFFLINE: [ xstha2 ]
>> Full list of resources:
>> xstha1_san0_IP (ocf::heartbeat:IPaddr): Stopped
>> xstha2_san0_IP (ocf::heartbeat:IPaddr): Stopped
>> xstha1-stonith (stonith:external/ipmi): Stopped
>> xstha2-stonith (stonith:external/ipmi): Stopped
>> zpool_data (ocf::heartbeat:ZFS): Stopped
>> I tried restarting zpool_data or other resources:
>> # crm resource start zpool_data
>> but nothing happens!
>> How can I recover from this state? Node2 needs to stay down, but I want
>> node1 to work.
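When a start request seems to do nothing like this, the usual first checks are whether the partition has quorum at all and whether the failed node's state was really cleared; a sketch of that kind of check, assuming crmsh:

    # confirm node2 is OFFLINE (not UNCLEAN) and that the output says
    # "partition with quorum"
    crm_mon -1Arfj
    # clear the lost node's state if it is still shown as UNCLEAN
    crm node clearstate xstha2
    # show the current status of the resource that refuses to start
    crm resource status zpool_data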
>> Thanks!
>> Gabriele
>>
>>
>> Sonicle S.r.l. : http://www.sonicle.com
>> Music: http://www.gabrielebulfon.com
>> eXoplanets : https://gabrielebulfon.bandcamp.com/album/exoplanets
>>
>
>
>
>
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/