<div>                Hello all,<br><br>I think I found the problem.<br>On the fenced node, after a restart of the cluster stack, I saw the following:<br><br>controld(dlm)[13025]:    ERROR: Uncontrolled lockspace exists, system must reboot. Executing suicide fencing<br><br>I was so focused on the DC logs that I missed it.<br><br>I guess that with HA-LVM there would be no need to reboot, but once dlm/clvmd have been interrupted, the only way out is a reboot.<br><br>Best Regards,<br>Strahil Nikolov<br><br><br>            </div>            <div class="yahoo_quoted" style="margin:10px 0px 0px 0.8ex;border-left:1px solid #ccc;padding-left:1ex;">                        <div style="font-family:'Helvetica Neue', Helvetica, Arial, sans-serif;font-size:13px;color:#26282a;">                                <div>                    On Monday, 17 February 2020, 15:36:39 GMT+2, Ondrej <ondrej-clusterlabs@famera.cz> wrote:                </div>                <div><br></div>                <div><br></div>                <div><div dir="ltr">Hello Strahil,<br clear="none"><br clear="none">On 2/17/20 3:39 PM, Strahil Nikolov wrote:<br clear="none">> Hello Ondrej,<br clear="none">> <br clear="none">> thanks for your reply. 
I really appreciate that.<br clear="none">> <br clear="none">> I picked fence_mpath because I'm preparing for my EX436 and I can't know which agent will be useful on the exam.<br clear="none">> Also, according to <a shape="rect" href="https://access.redhat.com/solutions/3201072 " target="_blank">https://access.redhat.com/solutions/3201072 </a>, there could be a race condition with fence_scsi.<br clear="none"><br clear="none">I believe the exam is about testing configuration knowledge, not <br clear="none">knowledge of which race-condition bugs are present and <br clear="none">how to handle them :)<br clear="none">If you have access to the learning materials for the EX436 exam, I would <br clear="none">recommend working through them - they have labs and comprehensive <br clear="none">review exercises that are useful in preparing for the exam.<br clear="none"><br clear="none">> So, I checked the cluster during fencing, and the node immediately goes offline.<br clear="none">> The last messages from pacemaker are:<br clear="none">> <snip><br clear="none">> Feb 17 08:21:57 node1.localdomain stonith-ng[23808]:   notice: Client stonith_admin.controld.23888.b57ceee7 wants to fence (reboot) 'node1.localdomain' with device '(any)'<br clear="none">> Feb 17 08:21:57 node1.localdomain stonith-ng[23808]:   notice: Requesting peer fencing (reboot) of node1.localdomain<br clear="none">> Feb 17 08:21:57 node1.localdomain stonith-ng[23808]:   notice: FENCING can fence (reboot) node1.localdomain (aka. 
'1'): static-list<br clear="none">> Feb 17 08:21:58 node1.localdomain stonith-ng[23808]:   notice: Operation reboot of node1.localdomain by node2.localdomain for stonith_admin.controld.23888@node1.localdomain.ede38ffb: OK<br clear="none">- This part looks OK, meaning the fencing appears to have succeeded.<br clear="none">> Feb 17 08:21:58 node1.localdomain crmd[23812]:     crit: We were allegedly just fenced by node2.localdomain for node1.localdomai<br clear="none">- This is also normal, as the node just announces that it was fenced by the other <br clear="none">node.<br clear="none"><br clear="none">> <snip><br clear="none">> <br clear="none">> Which to me means that node1 just got fenced again. Fencing actually works, as I/O is immediately blocked and the reservation is removed.<br clear="none">> <br clear="none">> I used <a shape="rect" href="https://access.redhat.com/solutions/2766611 " target="_blank">https://access.redhat.com/solutions/2766611 </a>to set up fence_mpath, but I could have messed something up.<br clear="none">- Note related to the exam: you will not have Internet access during the exam, so I would <br clear="none">expect that you would have to configure something that does not require <br clear="none">access to this (and, as Dan Swartzendruber pointed out in another email, <br clear="none">we cannot* even see RH links without an account).<br clear="none"><br clear="none">* You can get a free developer account to read them, but ideally that <br clear="none">should not be needed, and it is certainly inconvenient for a wide public audience.<br clear="none"><br clear="none">> <br clear="none">> Cluster config is:<br clear="none">> [root@node3 ~]# pcs config show<br clear="none">> Cluster Name: HACLUSTER2<br clear="none">> Corosync Nodes:<br 
clear="none">>   node1.localdomain node2.localdomain node3.localdomain<br clear="none">> Pacemaker Nodes:<br clear="none">>   node1.localdomain node2.localdomain node3.localdomain<br clear="none">> <br clear="none">> Resources:<br clear="none">>   Clone: dlm-clone<br clear="none">>    Meta Attrs: interleave=true ordered=true<br clear="none">>    Resource: dlm (class=ocf provider=pacemaker type=controld)<br clear="none">>     Operations: monitor interval=30s on-fail=fence (dlm-monitor-interval-30s)<br clear="none">>                 start interval=0s timeout=90 (dlm-start-interval-0s)<br clear="none">>                 stop interval=0s timeout=100 (dlm-stop-interval-0s)<br clear="none">>   Clone: clvmd-clone<br clear="none">>    Meta Attrs: interleave=true ordered=true<br clear="none">>    Resource: clvmd (class=ocf provider=heartbeat type=clvm)<br clear="none">>     Operations: monitor interval=30s on-fail=fence (clvmd-monitor-interval-30s)<br clear="none">>                 start interval=0s timeout=90s (clvmd-start-interval-0s)<br clear="none">>                 stop interval=0s timeout=90s (clvmd-stop-interval-0s)<br clear="none">>   Clone: TESTGFS2-clone<br clear="none">>    Meta Attrs: interleave=true<br clear="none">>    Resource: TESTGFS2 (class=ocf provider=heartbeat type=Filesystem)<br clear="none">>     Attributes: device=/dev/TEST/gfs2 directory=/GFS2 fstype=gfs2 options=noatime run_fsck=no<br clear="none">>     Operations: monitor interval=15s on-fail=fence OCF_CHECK_LEVEL=20 (TESTGFS2-monitor-interval-15s)<br clear="none">>                 notify interval=0s timeout=60s (TESTGFS2-notify-interval-0s)<br clear="none">>                 start interval=0s timeout=60s (TESTGFS2-start-interval-0s)<br clear="none">>                 stop interval=0s timeout=60s (TESTGFS2-stop-interval-0s)<br clear="none">> <br clear="none">> Stonith Devices:<br clear="none">>   Resource: FENCING (class=stonith type=fence_mpath)<br clear="none">>    Attributes: 
devices=/dev/mapper/36001405cb123d0000000000000000000 pcmk_host_argument=key pcmk_host_map=node1.localdomain:1;node2.localdomain:2;node3.localdomain:3 pcmk_monitor_action=metadata pcmk_reboot_action=off<br clear="none">>    Meta Attrs: provides=unfencing<br clear="none">>    Operations: monitor interval=60s (FENCING-monitor-interval-60s)<br clear="none">> Fencing Levels:<br clear="none">> <br clear="none">> Location Constraints:<br clear="none">> Ordering Constraints:<br clear="none">>    start dlm-clone then start clvmd-clone (kind:Mandatory) (id:order-dlm-clone-clvmd-clone-mandatory)<br clear="none">>    start clvmd-clone then start TESTGFS2-clone (kind:Mandatory) (id:order-clvmd-clone-TESTGFS2-clone-mandatory)<br clear="none">> Colocation Constraints:<br clear="none">>    clvmd-clone with dlm-clone (score:INFINITY) (id:colocation-clvmd-clone-dlm-clone-INFINITY)<br clear="none">>    TESTGFS2-clone with clvmd-clone (score:INFINITY) (id:colocation-TESTGFS2-clone-clvmd-clone-INFINITY)<br clear="none">> Ticket Constraints:<br clear="none">> <br clear="none">> Alerts:<br clear="none">>   No alerts defined<br clear="none">> <br clear="none">> Resources Defaults:<br clear="none">>   No defaults set<br clear="none">> <br clear="none">> [root@node3 ~]# crm_mon -r1<br clear="none">> Stack: corosync<br clear="none">> Current DC: node3.localdomain (version 1.1.20-5.el7_7.2-3c4c782f70) - partition with quorum<br clear="none">> Last updated: Mon Feb 17 08:39:30 2020<br clear="none">> Last change: Sun Feb 16 18:44:06 2020 by root via cibadmin on node1.localdomain<br clear="none">> <br clear="none">> 3 nodes configured<br clear="none">> 10 resources configured<br clear="none">> <br clear="none">> Online: [ node2.localdomain node3.localdomain ]<br clear="none">> OFFLINE: [ node1.localdomain ]<br clear="none">> <br clear="none">> Full list of resources:<br clear="none">> <br clear="none">>   FENCING        
(stonith:fence_mpath):  Started node2.localdomain<br clear="none">>   Clone Set: dlm-clone [dlm]<br clear="none">>       Started: [ node2.localdomain node3.localdomain ]<br clear="none">>       Stopped: [ node1.localdomain ]<br clear="none">>   Clone Set: clvmd-clone [clvmd]<br clear="none">>       Started: [ node2.localdomain node3.localdomain ]<br clear="none">>       Stopped: [ node1.localdomain ]<br clear="none">>   Clone Set: TESTGFS2-clone [TESTGFS2]<br clear="none">>       Started: [ node2.localdomain node3.localdomain ]<br clear="none">>       Stopped: [ node1.localdomain ]<br clear="none">> <br clear="none">> <br clear="none">> <br clear="none">> <br clear="none">> In the logs, I noticed that the node is first unfenced and later it is fenced again... For the unfencing, I believe "meta provides=unfencing" is 'guilty', yet I'm not sure about the action from node2.<br clear="none"><br clear="none">'Unfencing' is exactly the expected behavior when provides=unfencing is <br clear="none">present (and it should be present with fence_scsi and fence_mpath).<br clear="none"><br clear="none">The important part here is "first unfenced and later it is fenced <br clear="none">again". If everything is in a normal state, the node should not <br clear="none">simply be fenced again. So it would make sense to investigate that <br clear="none">'fencing' after the unfencing. I would expect one of the nodes to <br clear="none">have more verbose logs that give an idea of why the fencing was <br clear="none">ordered. 
(my guess would be a failed 'monitor' operation on any of <br clear="none">the resources, as all of them have 'on-fail=fence', but this would need <br clear="none">support from the logs to be sure)<br clear="none">Also, logs from the fenced node can provide some information about what happened on <br clear="none">that node, if that was the cause of the fencing.<br clear="none"><br clear="none">> So far I have used SCSI reservations only with ServiceGuard, and SBD on SUSE, so I was wondering whether the setup is done correctly.<br clear="none">I don't see anything particularly wrong from a configuration point <br clear="none">of view. The best place to look for the reason is now the logs from the other nodes, <br clear="none">after the 'unfencing' and before the 'fencing again'.<br clear="none"><br clear="none">> Storage in this test setup is a Highly Available iSCSI Cluster on top of DRBD (again on RHEL 7), and it seems that SCSI Reservations Support is OK.<br clear="none">From the logs you have provided so far, the reservation keys work, as <br clear="none">fencing is happening and reports OK.<br clear="none"><br clear="none">> Best Regards,<br clear="none">> Strahil Nikolov<br clear="none"><br clear="none">Example from the logs of fencing because the 'monitor' operation of resource 'testtest' <br clear="none">failed:<br clear="none"><br clear="none">Feb 17 22:32:15 [1289] fastvm-centos-7-7-174    pengine:  warning: pe_fence_node:       Cluster node fastvm-centos-7-7-175 will be fenced: testtest failed there<br clear="none">Feb 17 22:32:15 [1289] fastvm-centos-7-7-174    pengine:   notice: LogNodeActions:       * Fence (reboot) fastvm-centos-7-7-175 'testtest failed there'<div class="yqt1237303881" id="yqtfd91913"><br clear="none"><br clear="none">--<br clear="none">Ondrej<br clear="none"></div></div></div>            </div>                </div>