[ClusterLabs] How to unfence without reboot (fence_mpath)
Strahil Nikolov
hunter86_bg at yahoo.com
Mon Feb 17 09:18:57 EST 2020
On February 17, 2020 3:36:27 PM GMT+02:00, Ondrej <ondrej-clusterlabs at famera.cz> wrote:
>Hello Strahil,
>
>On 2/17/20 3:39 PM, Strahil Nikolov wrote:
>> Hello Ondrej,
>>
>> thanks for your reply. I really appreciate that.
>>
>> I have picked fence_mpath as I'm preparing for my EX436 and I
>> can't know what agent will be useful on the exam.
>> Also, according to https://access.redhat.com/solutions/3201072,
>> there could be a race condition with fence_scsi.
>
>I believe that exam is about testing knowledge in configuration, not
>testing knowledge in knowing which race condition bugs are present and
>how to handle them :)
>If you have access to learning materials for the EX436 exam I would
>recommend trying those out - they have labs and comprehensive review
>exercises that are useful in preparation for the exam.
>
>> So, I've checked the cluster during fencing, and the node immediately
>> goes offline.
>> Last messages from pacemaker are:
>> <snip>
>> Feb 17 08:21:57 node1.localdomain stonith-ng[23808]: notice: Client
>> stonith_admin.controld.23888.b57ceee7 wants to fence (reboot)
>> 'node1.localdomain' with device '(any)'
>> Feb 17 08:21:57 node1.localdomain stonith-ng[23808]: notice:
>> Requesting peer fencing (reboot) of node1.localdomain
>> Feb 17 08:21:57 node1.localdomain stonith-ng[23808]: notice:
>> FENCING can fence (reboot) node1.localdomain (aka. '1'): static-list
>> Feb 17 08:21:58 node1.localdomain stonith-ng[23808]: notice:
>> Operation reboot of node1.localdomain by node2.localdomain for
>> stonith_admin.controld.23888 at node1.localdomain.ede38ffb: OK
>- This part looks OK - meaning the fencing looks like a success.
>> Feb 17 08:21:58 node1.localdomain crmd[23812]: crit: We were
>> allegedly just fenced by node2.localdomain for node1.localdomain
>- this is also normal, as the node just announces that it was fenced by
>the other node
>
>> <snip>
>>
>> Which for me means - node1 just got fenced again. Actually fencing
>> works, as I/O is immediately blocked and the reservation is removed.
>>
>> I've used https://access.redhat.com/solutions/2766611 to set up
>> fence_mpath, but I could have messed up something.
>- note related to the exam: you will not have Internet access on the
>exam, so I would expect that you would have to configure something that
>does not require access to this (and as Dan Swartzendruber pointed out
>in another email - we cannot* even see RH links without an account)
>
>* you can get a free developer account to read them, but ideally that
>should not be needed, and it is certainly inconvenient for a wide public
>audience
>
>>
>> Cluster config is:
>> [root at node3 ~]# pcs config show
>> Cluster Name: HACLUSTER2
>> Corosync Nodes:
>>  node1.localdomain node2.localdomain node3.localdomain
>> Pacemaker Nodes:
>>  node1.localdomain node2.localdomain node3.localdomain
>>
>> Resources:
>>  Clone: dlm-clone
>>   Meta Attrs: interleave=true ordered=true
>>   Resource: dlm (class=ocf provider=pacemaker type=controld)
>>    Operations: monitor interval=30s on-fail=fence (dlm-monitor-interval-30s)
>>                start interval=0s timeout=90 (dlm-start-interval-0s)
>>                stop interval=0s timeout=100 (dlm-stop-interval-0s)
>>  Clone: clvmd-clone
>>   Meta Attrs: interleave=true ordered=true
>>   Resource: clvmd (class=ocf provider=heartbeat type=clvm)
>>    Operations: monitor interval=30s on-fail=fence (clvmd-monitor-interval-30s)
>>                start interval=0s timeout=90s (clvmd-start-interval-0s)
>>                stop interval=0s timeout=90s (clvmd-stop-interval-0s)
>>  Clone: TESTGFS2-clone
>>   Meta Attrs: interleave=true
>>   Resource: TESTGFS2 (class=ocf provider=heartbeat type=Filesystem)
>>    Attributes: device=/dev/TEST/gfs2 directory=/GFS2 fstype=gfs2 options=noatime run_fsck=no
>>    Operations: monitor interval=15s on-fail=fence OCF_CHECK_LEVEL=20 (TESTGFS2-monitor-interval-15s)
>>                notify interval=0s timeout=60s (TESTGFS2-notify-interval-0s)
>>                start interval=0s timeout=60s (TESTGFS2-start-interval-0s)
>>                stop interval=0s timeout=60s (TESTGFS2-stop-interval-0s)
>>
>> Stonith Devices:
>>  Resource: FENCING (class=stonith type=fence_mpath)
>>   Attributes: devices=/dev/mapper/36001405cb123d0000000000000000000 pcmk_host_argument=key pcmk_host_map=node1.localdomain:1;node2.localdomain:2;node3.localdomain:3 pcmk_monitor_action=metadata pcmk_reboot_action=off
>>   Meta Attrs: provides=unfencing
>>   Operations: monitor interval=60s (FENCING-monitor-interval-60s)
>> Fencing Levels:
>>
>> Location Constraints:
>> Ordering Constraints:
>>   start dlm-clone then start clvmd-clone (kind:Mandatory) (id:order-dlm-clone-clvmd-clone-mandatory)
>>   start clvmd-clone then start TESTGFS2-clone (kind:Mandatory) (id:order-clvmd-clone-TESTGFS2-clone-mandatory)
>> Colocation Constraints:
>>   clvmd-clone with dlm-clone (score:INFINITY) (id:colocation-clvmd-clone-dlm-clone-INFINITY)
>>   TESTGFS2-clone with clvmd-clone (score:INFINITY) (id:colocation-TESTGFS2-clone-clvmd-clone-INFINITY)
>> Ticket Constraints:
>>
>> Alerts:
>>  No alerts defined
>>
>> Resources Defaults:
>>  No defaults set
>>
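>- just for reference, a stonith resource like the FENCING one above is
>usually created with something along these lines (a sketch built from
>your attributes, not tested here - adjust the device path and keys):
>
>   pcs stonith create FENCING fence_mpath \
>       devices=/dev/mapper/36001405cb123d0000000000000000000 \
>       pcmk_host_map="node1.localdomain:1;node2.localdomain:2;node3.localdomain:3" \
>       pcmk_host_argument=key pcmk_monitor_action=metadata pcmk_reboot_action=off \
>       meta provides=unfencing
>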
>> [root at node3 ~]# crm_mon -r1
>> Stack: corosync
>> Current DC: node3.localdomain (version 1.1.20-5.el7_7.2-3c4c782f70) - partition with quorum
>> Last updated: Mon Feb 17 08:39:30 2020
>> Last change: Sun Feb 16 18:44:06 2020 by root via cibadmin on node1.localdomain
>>
>> 3 nodes configured
>> 10 resources configured
>>
>> Online: [ node2.localdomain node3.localdomain ]
>> OFFLINE: [ node1.localdomain ]
>>
>> Full list of resources:
>>
>> FENCING (stonith:fence_mpath): Started node2.localdomain
>> Clone Set: dlm-clone [dlm]
>>     Started: [ node2.localdomain node3.localdomain ]
>>     Stopped: [ node1.localdomain ]
>> Clone Set: clvmd-clone [clvmd]
>>     Started: [ node2.localdomain node3.localdomain ]
>>     Stopped: [ node1.localdomain ]
>> Clone Set: TESTGFS2-clone [TESTGFS2]
>>     Started: [ node2.localdomain node3.localdomain ]
>>     Stopped: [ node1.localdomain ]
>>
>> In the logs, I've noticed that the node is first unfenced and later
>> it is fenced again... For the unfencing, I believe "meta
>> provides=unfencing" is 'guilty', yet I'm not sure about the action
>> from node2.
>
>'Unfencing' is exactly the expected behavior when provides=unfencing is
>present (and it should be present with fence_scsi and fence_mpath).
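>- to the subject of the thread: with provides=unfencing the cluster runs
>the agent's 'on' action for the node when pacemaker starts there again,
>which re-registers its key, so restarting the cluster services on the
>fenced node should normally be enough (assuming the node itself is still
>healthy - with dlm/gfs2 hanging on blocked I/O it may well need a reboot
>anyway). A rough, untested sketch of both options:
>
>   # on the fenced node, once it is healthy again
>   pcs cluster start node1.localdomain
>   # or request the unfencing ('on') explicitly from a surviving node
>   stonith_admin --unfence node1.localdomain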
>
>Here the important part is "first unfenced and later it is fenced
>again". If everything is in a normal state, then the node should not
>just be fenced again. So it would make sense to me to investigate that
>'fencing' after the unfencing. I would expect that one of the nodes will
>have more verbose logs that would give an idea of why the fencing was
>ordered. (my lucky guess would be a failed 'monitor' operation on any of
>the resources, as all of them have 'on-fail=fence', but this would need
>support from the logs to be sure)
>Also, the logs from the fenced node can provide some information about
>what happened on that node - if that was the cause of the fencing.
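>- for checking, something along these lines should show the fencing
>history and any failed operations (a generic sketch, adjust the log path
>as needed for your systems):
>
>   stonith_admin --history '*' --verbose
>   crm_mon -1 --failcounts
>   grep -E 'pe_fence_node|will be fenced' /var/log/messages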
>
>> So far I have used SCSI reservations only with ServiceGuard (and SBD
>> on SUSE), and I was wondering if the setup is done correctly.
>I don't see anything particularly bad-looking from a configuration point
>of view. The best place to look for the reason now is the logs from the
>other nodes after the 'unfencing' and before the 'fencing again'.
>
>> Storage in this test setup is a highly available iSCSI cluster on top
>> of DRBD /RHEL 7 again/, and it seems that SCSI reservation support is
>> OK.
>From the logs you have provided so far, the reservation keys work, as
>the fencing is happening and reports OK.
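>- if you want to double-check the keys on the shared device directly,
>something like this should work (a sketch - run it against your actual
>multipath device; after node1 is fenced its key '1' should no longer be
>listed, and it should reappear after unfencing):
>
>   mpathpersist --in --read-keys -d /dev/mapper/36001405cb123d0000000000000000000
>   mpathpersist --in --read-reservation -d /dev/mapper/36001405cb123d0000000000000000000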
>
>> Best Regards,
>> Strahil Nikolov
>
>Example of fencing because 'monitor' operation of resource 'testtest'
>failed from logs:
>
>Feb 17 22:32:15 [1289] fastvm-centos-7-7-174 pengine: warning: pe_fence_node: Cluster node fastvm-centos-7-7-175 will be fenced: testtest failed there
>Feb 17 22:32:15 [1289] fastvm-centos-7-7-174 pengine: notice: LogNodeActions: * Fence (reboot) fastvm-centos-7-7-175 'testtest failed there'
>
>--
>Ondrej
Hey Ondrej,
Sadly, the lab in the training uses a customized fencing mechanism that cannot be reproduced outside of the Red Hat training lab.
As I don't know what the environment will be (Red Hat prevents any disclosure of that), I have to pick a fencing mechanism that will work in any environment, and 'fence_mpath' matches those criteria.
Sadly, Red Hat expects the engineer to be able to deal with bugs (a Red Hat CEO interview from several years ago confirmed that), so if I know that fence_scsi can have issues, it is better to play it safe and avoid it.
I'm sorry for quoting Red Hat's Solutions. The article mentions that each node should have a unique reservation_key (in /etc/multipath.conf), and that the stonith agent does not have the otherwise mandatory 'key' option set, since the per-node keys are supplied via pcmk_host_map instead.
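If I got it right, the per-node part should look roughly like this (an untested sketch; the reservation_key has to match the value that pcmk_host_map assigns to that node, and I'm not 100% sure whether it should be written as '1' or '0x1'):

    # /etc/multipath.conf on node1.localdomain (mapped to key '1' in pcmk_host_map)
    defaults {
            reservation_key 0x1
    }

    # pick up the new key, e.g. by restarting multipathd
    systemctl restart multipathd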
Best Regards,
Strahil Nikolov