<div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="auto">Hi,</div><div dir="auto"><br></div><div dir="auto">I need some help!<br><div><br></div><div>I have a two-node DRBD cluster, and one node was switched off for a couple of days. The remaining node ran fine on its own without a hiccup. When I switched the other node back on, I ended up in a situation where all resources were stopped, with one DRBD volume Secondary and the other two Primary, seemingly because the cluster tried to move the Primary role to the node that had just been switched on. (To help with reading the logs: ha1 was live, and I switched on ha2 at 08:06.)</div><div><br></div><div><div>bash-5.1# cat /proc/drbd </div><div>version: 8.4.11 (api:1/proto:86-101)</div><div>srcversion: 60F610B702CC05315B04B50 </div><div> 0: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----</div><div> ns:109798092 nr:90528 dw:373317496 dr:353811713 al:558387 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0</div><div> 1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----</div><div> ns:415010252 nr:188601628 dw:1396698240 dr:1032339078 al:1387347 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0</div><div> 2: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----</div><div> ns:27957772 nr:21354732 dw:97210572 dr:100798651 al:5283 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0</div></div><div><br></div><div>The cluster state ended up as follows:</div><div><br></div><div><div>bash-5.1# pcs status</div><div>Cluster name: HA</div><div>Status of pacemakerd: 'Pacemaker is running' (last updated 2023-08-10 08:38:40Z)</div><div>Cluster Summary:</div><div> * Stack: corosync</div><div> * Current DC: ha2.local (version 2.1.4-5.el9_1.2-dc6eb4362e) - partition with quorum</div><div> * Last updated: Thu Aug 10 08:38:40 2023</div><div> * Last change: Mon Jul 10 06:49:08 2023 by hacluster via crmd on ha1.local</div><div> * 2 nodes configured</div><div> * 14 resource instances configured</div><div><br></div><div>Node List:</div><div> * Online: [ ha1.local ha2.local ]</div><div><br></div><div>Full List of 
Resources:</div><div> * Clone Set: LV_BLOB-clone [LV_BLOB] (promotable):</div><div> * Promoted: [ ha2.local ]</div><div> * Unpromoted: [ ha1.local ]</div><div> * Resource Group: nsdrbd:</div><div> * LV_BLOBFS<span style="white-space:pre"> </span>(ocf:heartbeat:Filesystem):<span style="white-space:pre"> </span> Started ha2.local</div><div> * LV_POSTGRESFS<span style="white-space:pre"> </span>(ocf:heartbeat:Filesystem):<span style="white-space:pre"> </span> Stopped</div><div> * LV_HOMEFS<span style="white-space:pre"> </span>(ocf:heartbeat:Filesystem):<span style="white-space:pre"> </span> Stopped</div><div> * ClusterIP<span style="white-space:pre"> </span>(ocf:heartbeat:IPaddr2):<span style="white-space:pre"> </span> Stopped</div><div> * Clone Set: LV_POSTGRES-clone [LV_POSTGRES] (promotable):</div><div> * Promoted: [ ha1.local ]</div><div> * Unpromoted: [ ha2.local ]</div><div> * postgresql<span style="white-space:pre"> </span>(systemd:postgresql):<span style="white-space:pre"> </span> Stopped</div><div> * Clone Set: LV_HOME-clone [LV_HOME] (promotable):</div><div> * Promoted: [ ha1.local ]</div><div> * Unpromoted: [ ha2.local ]</div><div> * ns_mhswdog<span style="white-space:pre"> </span>(lsb:mhswdog):<span style="white-space:pre"> </span> Stopped</div><div> * Clone Set: pingd-clone [pingd]:</div><div> * Started: [ ha1.local ha2.local ]</div><div><br></div><div>Failed Resource Actions:</div><div> * LV_POSTGRES promote on ha2.local could not be executed (Timed Out: Resource agent did not complete within 1m30s) at Thu Aug 10 08:19:27 2023 after 1m30.003s</div><div> * LV_BLOB promote on ha2.local could not be executed (Timed Out: Resource agent did not complete within 1m30s) at Thu Aug 10 08:15:38 2023 after 1m30.001s</div><div><br></div><div>Daemon Status:</div><div> corosync: active/enabled</div><div> pacemaker: active/enabled</div><div> pcsd: active/enabled</div></div><div><br></div><div>I attach the logs of the two nodes. 
I also attach the output of pcs config show.</div><div><br></div><div>My questions:</div><div>- Can anyone help me figure out what happened here?</div><div>- As a side question, if a situation like this resolves itself, is there a way to have pcs perform the resource cleanup automatically?</div><div><br></div><div>Thanks</div><div><br></div></div>
</div></div></div>