<div dir="ltr"><div>This is the last update from my side, and we can close the thread.</div><div><br></div><div>We changed our procedure to shut down nodes sequentially, and after ~30 cycles (each cycle is 3 HA groups with 4 nodes and 3 storage systems) we can say that the proposed workaround works as expected.<br></div><div>I saw <a href="https://github.com/ClusterLabs/pacemaker/pull/3177">https://github.com/ClusterLabs/pacemaker/pull/3177</a> and <a href="https://github.com/ClusterLabs/pacemaker/pull/3178">https://github.com/ClusterLabs/pacemaker/pull/3178</a>, but I didn't check how they work, so we kept the original version.</div><div><br></div><div>Thanks everybody,</div><div>Arthur Novik<br></div><div><br></div><div>> On Thu, 2023-08-03 at 12:37:18 -0500, Ken Gaillot wrote:<br>> In the other case, the problem turned out to be a timing issue that can<br>> occur when the DC and attribute writer are shutting down at the same<br>> time. Since the problem in this case also occurred after shutting down<br>> two nodes together, I'm thinking it's likely the same issue.<br>> <br>> A fix should be straightforward. A workaround in the meantime would be<br>> to shut down nodes in sequence rather than in parallel, when shutting<br>> down just some nodes. (Shutting down the entire cluster shouldn't be<br>> subject to the race condition.)<br>> <br>> On Wed, 2023-08-02 at 16:53 -0500, Ken Gaillot wrote:<br>> > Ha! I didn't realize crm_report saves blackboxes as text. Always<br>> > something new to learn with Pacemaker :)<br>> > <br>> > As of 2.1.5, the controller now gets agent metadata asynchronously,<br>> > which fixed bugs with synchronous calls blocking the controller. Once<br>> > the metadata action returns, the original action that required the<br>> > metadata is attempted.<br>> > <br>> > This led to the odd log messages. Normally, agent actions can't be<br>> > attempted once the shutdown sequence begins. 
However, in this case,<br>> > metadata actions were initiated before shutdown, but completed after<br>> > shutdown began. The controller thus attempted the original actions<br>> > after it had already disconnected from the executor, resulting in the<br>> > odd logs.<br>> > <br>> > The fix for that is simple, but addresses only the logs, not the<br>> > original problem that caused the controller to shut down. I'm still<br>> > looking into that.<br>> > <br>> > I've since heard about a similar case, and I suspect in that case, it<br>> > was related to having a node with an older version trying to join a<br>> > cluster with a newer version.<br>> > <br>> > On Fri, 2023-07-28 at 15:21 +0300, Novik Arthur wrote:<br>> > >  2023-07-21_pacemaker_debug.log.vm01.bz2<br>> > >  2023-07-21_pacemaker_debug.log.vm02.bz2<br>> > >  2023-07-21_pacemaker_debug.log.vm03.bz2<br>> > >  2023-07-21_pacemaker_debug.log.vm04.bz2<br>> > >  blackbox_txt_vm04.tar.bz2<br>> > > On Thu, Jul 27 12:06:42 EDT 2023, Ken Gaillot kgaillot at<br>> > > <a href="http://redhat.com">redhat.com</a><br>> > > wrote:<br>> > > <br>> > > > Running "qb-blackbox /var/lib/pacemaker/blackbox/pacemaker-<br>> > > controld-<br>> > > > 4257.1" (my version can't read it) will show trace logs that<br>> > > > might<br>> > > give<br>> > > > a better idea of what exactly went wrong at this time (though<br>> > > > these<br>> > > > issues are side effects, not the cause).<br>> > > <br>> > > Blackboxes were attached to crm_report and they are in txt format.<br>> > > Just in case adding them to this email.<br>> > > <br>> > > > FYI, it's not necessary to set cluster-recheck-interval as low as<br>> > > > 1<br>> > > > minute. A long time ago that could be useful, but modern<br>> > > > Pacemaker<br>> > > > doesn't need it to calculate things such as failure expiration. 
I<br>> > > > recommend leaving it at default, or at least raising it to 5<br>> > > minutes or<br>> > > > so.<br>> > > <br>> > > That's good to know, since those rules came from pacemaker-1.x and<br>> > > I'm an adept of the "don't touch if it works" rule<br>> > > <br>> > > > vm02, vm03, and vm04 all left the cluster at that time, leaving<br>> > > only<br>> > > > vm01. At this point, vm01 should have deleted the transient<br>> > > attributes<br>> > > > for all three nodes. Unfortunately, the logs for that would only<br>> > > > be<br>> > > in<br>> > > > pacemaker.log, which crm_report appears not to have grabbed, so I<br>> > > am<br>> > > > not sure whether it tried.<br>> > > Please find debug logs for "Jul 21" from DC (vm01) and crashed node<br>> > > (vm04) in an attachment.<br>> > > > Thu, Jul 27 12:06:42 EDT 2023, Ken Gaillot kgaillot at <a href="http://redhat.com">redhat.com</a><br>> > > wrote:<br>> > > > On Wed, 2023-07-26 at 13:29 -0700, Reid Wahl wrote:<br>> > > > > On Fri, Jul 21, 2023 at 9:51 AM Novik Arthur <freishutz at<br>> > > <a href="http://gmail.com">gmail.com</a>><br>> > > > > wrote:<br>> > > > > > Hello Andrew, Ken and the entire community!<br>> > > > > > <br>> > > > > > I faced a problem and I would like to ask for help.<br>> > > > > > <br>> > > > > > Preamble:<br>> > > > > > I have dual controller storage (C0, C1) with 2 VM per<br>> > > controller<br>> > > > > > (vm0[1,2] on C0, vm[3,4] on C1).<br>> > > > > > I did online controller upgrade (update the firmware on<br>> > > physical<br>> > > > > > controller) and for that purpose we have a special procedure:<br>> > > > > > <br>> > > > > > Put all vms on the controller which will be updated into the<br>> > > > > > standby mode (vm0[3,4] in logs).<br>> > > > > > Once all resources are moved to spare controller VMs, turn on<br>> > > > > > maintenance-mode (DC machine is vm01).<br>> > > > > > Shutdown vm0[3,4] and perform firmware update on C1 (OS + KVM<br>> > > > > > +<br>> > > > > > HCA/HBA + BMC 
drivers will be updated).<br>> > > > > > Reboot C1<br>> > > > > > Start vm0[3,4]<br>> > > > > > On this step I hit the problem.<br>> > > > > > Do the same steps for C0 (turn off maint, put nodes 3,4 to<br>> > > online,<br>> > > > > > put 1-2 to standby, maint and etc).<br>> > > > > > <br>> > > > > > Here is what I observed during step 5.<br>> > > > > > Machine vm03 started without problems, but vm04 caught<br>> > > > > > critical<br>> > > > > > error and HA stack died. If manually start the pacemaker one<br>> > > more<br>> > > > > > time then it starts without problems and vm04 joins the<br>> > > cluster.<br>> > > > > > Some logs from vm04:<br>> > > > > > <br>> > > > > > Jul 21 04:05:39 vm04 corosync[3061]:  [QUORUM] This node is<br>> > > within<br>> > > > > > the primary component and will provide service.<br>> > > > > > Jul 21 04:05:39 vm04 corosync[3061]:  [QUORUM] Members[4]: 1<br>> > > > > > 2<br>> > > 3 4<br>> > > > > > Jul 21 04:05:39 vm04 corosync[3061]:  [MAIN  ] Completed<br>> > > service<br>> > > > > > synchronization, ready to provide service.<br>> > > > > > Jul 21 04:05:39 vm04 corosync[3061]:  [KNET  ] rx: host: 3<br>> > > link: 1<br>> > > > > > is up<br>> > > > > > Jul 21 04:05:39 vm04 corosync[3061]:  [KNET  ] link:<br>> > > > > > Resetting<br>> > > MTU<br>> > > > > > for link 1 because host 3 joined<br>> > > > > > Jul 21 04:05:39 vm04 corosync[3061]:  [KNET  ] host: host: 3<br>> > > > > > (passive) best link: 0 (pri: 1)<br>> > > > > > Jul 21 04:05:39 vm04 pacemaker-attrd[4240]: notice: Setting<br>> > > > > > ifspeed-lnet-o2ib-o2ib[vm02]: (unset) -> 600<br>> > > > > > Jul 21 04:05:40 vm04 corosync[3061]:  [KNET  ] pmtud: PMTUD<br>> > > link<br>> > > > > > change for host: 3 link: 1 from 453 to 65413<br>> > > > > > Jul 21 04:05:40 vm04 corosync[3061]:  [KNET  ] pmtud: Global<br>> > > data<br>> > > > > > MTU changed to: 1397<br>> > > > > > Jul 21 04:05:40 vm04 pacemaker-attrd[4240]: notice: Setting<br>> > > ping-<br>> > > > > > lnet-o2ib-o2ib[vm02]: 
(unset) -> 4000<br>> > > > > > Jul 21 04:05:40 vm04 pacemaker-attrd[4240]: notice: Setting<br>> > > > > > ifspeed-lnet-o2ib-o2ib[vm01]: (unset) -> 600<br>> > > > > > Jul 21 04:05:40 vm04 pacemaker-attrd[4240]: notice: Setting<br>> > > ping-<br>> > > > > > lnet-o2ib-o2ib[vm01]: (unset) -> 4000<br>> > > > > > Jul 21 04:05:47 vm04 pacemaker-controld[4257]: notice: State<br>> > > > > > transition S_NOT_DC -> S_STOPPING<br>> > > > > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: error: Cannot<br>> > > > > > execute monitor of sfa-home-vd: No executor connection<br>> > > > > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: warning:<br>> > > > > > Cannot<br>> > > > > > calculate digests for operation sfa-home-vd_monitor_0 because<br>> > > we<br>> > > > > > have no connection to executor for vm04<br>> > > > > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: error: Result<br>> > > > > > of<br>> > > > > > probe operation for sfa-home-vd on vm04: Error (No executor<br>> > > > > > connection)<br>> > > > > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: error: Cannot<br>> > > > > > execute monitor of ifspeed-lnet-o2ib-o2ib: No executor<br>> > > connection<br>> > > > > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: warning:<br>> > > > > > Cannot<br>> > > > > > calculate digests for operation ifspeed-lnet-o2ib-<br>> > > o2ib_monitor_0<br>> > > > > > because we have no connection to executor for vm04<br>> > > > > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: error: Result<br>> > > > > > of<br>> > > > > > probe operation for ifspeed-lnet-o2ib-o2ib on vm04: Error (No<br>> > > > > > executor connection)<br>> > > > > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: error: Cannot<br>> > > > > > execute monitor of ping-lnet-o2ib-o2ib: No executor<br>> > > > > > connection<br>> > > > > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: warning:<br>> > > > > > Cannot<br>> > > > > > calculate digests for operation ping-lnet-o2ib-o2ib_monitor_0<br>> > > > > > because we have no 
connection to executor for vm04<br>> > > > > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: error: Result<br>> > > > > > of<br>> > > > > > probe operation for ping-lnet-o2ib-o2ib on vm04: Error (No<br>> > > executor<br>> > > > > > connection)<br>> > > > > > Jul 21 04:05:49 vm04 pacemakerd[4127]: notice: pacemaker-<br>> > > > > > controld[4257] is unresponsive to ipc after 1 tries<br>> > > > > > Jul 21 04:05:52 vm04 pacemakerd[4127]: warning: Shutting<br>> > > cluster<br>> > > > > > down because pacemaker-controld[4257] had fatal failure<br>> > > > > > Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Shutting down<br>> > > > > > Pacemaker<br>> > > > > > Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Stopping<br>> > > pacemaker-<br>> > > > > > schedulerd<br>> > > > > > Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Stopping<br>> > > pacemaker-<br>> > > > > > attrd<br>> > > > > > Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Stopping<br>> > > pacemaker-<br>> > > > > > execd<br>> > > > > > Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Stopping<br>> > > pacemaker-<br>> > > > > > fenced<br>> > > > > > Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Stopping<br>> > > pacemaker-<br>> > > > > > based<br>> > > > > > Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Shutdown<br>> > > complete<br>> > > > > > Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Shutting down<br>> > > and<br>> > > > > > staying down after fatal error<br>> > > > > > <br>> > > > > > Jul 21 04:05:44 vm04 root[10111]: openibd: Set node_desc for<br>> > > > > > mlx5_0: vm04 HCA-1<br>> > > > > > Jul 21 04:05:44 vm04 root[10113]: openibd: Set node_desc for<br>> > > > > > mlx5_1: vm04 HCA-2<br>> > > > > > Jul 21 04:05:47 vm04 pacemaker-controld[4257]:  error:<br>> > > > > > Shutting<br>> > > > > > down controller after unexpected shutdown request from vm01<br>> > > > > > Jul 21 04:05:47 vm04 pacemaker-controld[4257]: Problem<br>> > > > > > detected<br>> > > at<br>> > > > > > handle_shutdown_ack:954 
(controld_messages.c), please see<br>> > > > > > /var/lib/pacemaker/blackbox/pacemaker-controld-4257.1 for<br>> > > > > > additional details<br>> > > > <br>> > > > Running "qb-blackbox /var/lib/pacemaker/blackbox/pacemaker-<br>> > > controld-<br>> > > > 4257.1" (my version can't read it) will show trace logs that<br>> > > > might<br>> > > give<br>> > > > a better idea of what exactly went wrong at this time (though<br>> > > > these<br>> > > > issues are side effects, not the cause).<br>> > > > <br>> > > > FYI, it's not necessary to set cluster-recheck-interval as low as<br>> > > > 1<br>> > > > minute. A long time ago that could be useful, but modern<br>> > > > Pacemaker<br>> > > > doesn't need it to calculate things such as failure expiration. I<br>> > > > recommend leaving it at default, or at least raising it to 5<br>> > > minutes or<br>> > > > so.<br>> > > > <br>> > > > > > Jul 21 04:05:47 vm04 pacemaker-controld[4257]:  notice: State<br>> > > > > > transition S_NOT_DC -> S_STOPPING<br>> > > > > > Jul 21 04:05:47 vm04 pacemaker-controld[4257]:  notice:<br>> > > > > > Disconnected from the executor<br>> > > > > > Jul 21 04:05:47 vm04 pacemaker-controld[4257]:  notice:<br>> > > > > > Disconnected from Corosync<br>> > > > > > Jul 21 04:05:47 vm04 pacemaker-controld[4257]:  notice:<br>> > > > > > Disconnected from the CIB manager<br>> > > > > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]:  notice:<br>> > > > > > Disconnected from the CIB manager<br>> > > > > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]:  crit: GLib:<br>> > > > > > g_hash_table_lookup: assertion 'hash_table != NULL' failed<br>> > > > > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]:  error: Cannot<br>> > > > > > execute monitor of sfa-home-vd: No executor connection<br>> > > > > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]:  crit: GLib:<br>> > > > > > g_hash_table_lookup: assertion 'hash_table != NULL' failed<br>> > > > > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]:  warning:<br>> > > > 
> > Cannot<br>> > > > > > calculate digests for operation sfa-home-vd_monitor_0 because<br>> > > we<br>> > > > > > have no connection to executor for vm04<br>> > > > > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]:  warning:<br>> > > Resource<br>> > > > > > update -107 failed: (rc=-107) Transport endpoint is not<br>> > > connected<br>> > > > > The controller disconnects from the executor and deletes the<br>> > > executor<br>> > > > > state table (lrm_state_table) in the middle of the shutdown<br>> > > process:<br>> > > <a href="https://github.com/ClusterLabs/pacemaker/blob/Pacemaker-2.1.5/daemons/controld/controld_control.c#L240">https://github.com/ClusterLabs/pacemaker/blob/Pacemaker-2.1.5/daemons/controld/controld_control.c#L240</a><br>> > > > > These crit messages are happening when we try to look up<br>> > > > > executor<br>> > > > > state when lrm_state_table is NULL. That shouldn't happen. I<br>> > > guess<br>> > > > > the<br>> > > > > lookups are happening while draining the mainloop:<br>> > > > > <br>> > > <a href="https://github.com/ClusterLabs/pacemaker/blob/Pacemaker-2.1.5/daemons/controld/controld_control.c#L286-L294">https://github.com/ClusterLabs/pacemaker/blob/Pacemaker-2.1.5/daemons/controld/controld_control.c#L286-L294</a><br>> > > > <br>> > > > The blackbox should help confirm that.<br>> > > > <br>> > > > > > the log from DC vm01:<br>> > > > > > Jul 21 04:05:39 vm01 pacemaker-controld[4048]: notice:<br>> > > Transition<br>> > > > > > 16 aborted: Peer Halt<br>> > > > > > Jul 21 04:05:39 vm01 pacemaker-attrd[4017]: notice: Detected<br>> > > > > > another attribute writer (vm04), starting new election<br>> > > > > > Jul 21 04:05:39 vm01 pacemaker-attrd[4017]: notice: Setting<br>> > > #attrd-<br>> > > > > > protocol[vm04]: (unset) -> 5<br>> > > > > > Jul 21 04:05:40 vm01 pacemaker-controld[4048]: notice:<br>> > > Finalizing<br>> > > > > > join-2 for 1 node (sync'ing CIB from vm02)<br>> > > > > > Jul 21 04:05:40 vm01 pacemaker-controld[4048]: 
notice:<br>> > > Requested<br>> > > > > > CIB version   <generation_tuple crm_feature_set="3.16.2"<br>> > > validate-<br>> > > > > > with="pacemaker-3.9" epoch="567" num_updates="111"<br>> > > admin_epoch="0"<br>> > > > > > cib-last-writt<br>> > > > > > en="Fri Jul 21 03:48:43 2023" update-origin="vm01" update-<br>> > > > > > client="cibadmin" update-user="root" have-quorum="0" dc-<br>> > > uuid="1"/><br>> > > > > > Jul 21 04:05:40 vm01 pacemaker-attrd[4017]: notice: Recorded<br>> > > local<br>> > > > > > node as attribute writer (was unset)<br>> > > > > > Jul 21 04:05:40 vm01 pacemaker-attrd[4017]: notice: Setting<br>> > > > > > #feature-set[vm04]: (unset) -> 3.16.2<br>> > > > > > Jul 21 04:05:41 vm01 pacemaker-controld[4048]: notice:<br>> > > Transition<br>> > > > > > 16 aborted by deletion of lrm[@id='4']: Resource state<br>> > > > > > removal<br>> > > > > > Jul 21 04:05:47 vm01 pacemaker-schedulerd[4028]: notice: No<br>> > > fencing<br>> > > > > > will be done until there are resources to manage<br>> > > > > > Jul 21 04:05:47 vm01 pacemaker-schedulerd[4028]: notice:  *<br>> > > > > > Shutdown vm04<br>> > > > <br>> > > > This is where things start to go wrong, and it has nothing to do<br>> > > with<br>> > > > any of the messages here. It means that the shutdown node<br>> > > > attribute<br>> > > was<br>> > > > not erased when vm04 shut down the last time before this. Going<br>> > > back,<br>> > > > we see when that happened:<br>> > > > <br>> > > > Jul 21 03:49:06 vm01 pacemaker-attrd[4017]: notice: Setting<br>> > > shutdown[vm04]: (unset) -> 1689911346<br>> > > > vm02, vm03, and vm04 all left the cluster at that time, leaving<br>> > > only<br>> > > > vm01. At this point, vm01 should have deleted the transient<br>> > > attributes<br>> > > > for all three nodes. 
Unfortunately, the logs for that would only<br>> > > > be<br>> > > in<br>> > > > pacemaker.log, which crm_report appears not to have grabbed, so I<br>> > > am<br>> > > > not sure whether it tried.<br>> > > > <br>> > > > > > Jul 21 04:05:47 vm01 pacemaker-schedulerd[4028]: notice:<br>> > > Calculated<br>> > > > > > transition 17, saving inputs in<br>> > > > > > /var/lib/pacemaker/pengine/pe-<br>> > > > > > input-940.bz2<br>> > > > <br>> > > > What's interesting in this transition is that we schedule probes<br>> > > > on<br>> > > > vm04 even though we're shutting it down. That's a bug, and leads<br>> > > > to<br>> > > the<br>> > > > "No executor connection" messages we see on vm04. I've added a<br>> > > > task<br>> > > to<br>> > > > our project manager to take care of that. That's all a side<br>> > > > effect<br>> > > > though and not causing any real problems.<br>> > > > <br>> > > > > > As far as I understand, vm04 was killed by DC during the<br>> > > election<br>> > > > > > of a new attr writer?<br>> > > > > <br>> > > > > Not sure yet, maybe someone else recognizes this.<br>> > > > > <br>> > > > > I see the transition was aborted due to peer halt right after<br>> > > node<br>> > > > > vm04 joined. A new election started due to detection of node<br>> > > > > vm04<br>> > > as<br>> > > > > attribute writer. 
Node vm04's resource state was removed, which<br>> > > is a<br>> > > > > normal part of the join sequence; this caused another<br>> > > > > transition<br>> > > > > abort<br>> > > > > message for the same transition number.<br>> > > > > <br>> > > > > Jul 21 04:05:39 vm01 pacemaker-controld[4048]: notice: Node<br>> > > > > vm04<br>> > > > > state<br>> > > > > is now member<br>> > > > > ...<br>> > > > > Jul 21 04:05:39 vm01 corosync[3134]:  [KNET  ] pmtud: Global<br>> > > > > data<br>> > > MTU<br>> > > > > changed to: 1397<br>> > > > > Jul 21 04:05:39 vm01 pacemaker-controld[4048]: notice:<br>> > > > > Transition<br>> > > 16<br>> > > > > aborted: Peer Halt<br>> > > > > Jul 21 04:05:39 vm01 pacemaker-attrd[4017]: notice: Detected<br>> > > another<br>> > > > > attribute writer (vm04), starting new election<br>> > > > > Jul 21 04:05:39 vm01 pacemaker-attrd[4017]: notice: Setting<br>> > > > > #attrd-protocol[vm04]: (unset) -> 5<br>> > > > > ...<br>> > > > > Jul 21 04:05:41 vm01 pacemaker-controld[4048]: notice:<br>> > > > > Transition<br>> > > 16<br>> > > > > aborted by deletion of lrm[@id='4']: Resource state removal<br>> > > > > <br>> > > > > Looking at pe-input-939 and pe-input-940, node vm04 was marked<br>> > > > > as<br>> > > > > shut down:<br>> > > > > <br>> > > > > Jul 21 04:05:38 vm01 pacemaker-schedulerd[4028]: notice:<br>> > > Calculated<br>> > > > > transition 16, saving inputs in<br>> > > > > /var/lib/pacemaker/pengine/pe-input-939.bz2<br>> > > > > Jul 21 04:05:44 vm01 pacemaker-controld[4048]: notice:<br>> > > > > Transition<br>> > > 16<br>> > > > > (Complete=24, Pending=0, Fired=0, Skipped=34, Incomplete=34,<br>> > > > > Source=/var/lib/pacemaker/pengine/pe-input-939.bz2): Stopped<br>> > > > > Jul 21 04:05:47 vm01 pacemaker-schedulerd[4028]: notice:  *<br>> > > Shutdown<br>> > > > > vm04<br>> > > > > Jul 21 04:05:47 vm01 pacemaker-schedulerd[4028]: notice:<br>> > > Calculated<br>> > > > > transition 17, saving inputs in<br>> > > > > 
/var/lib/pacemaker/pengine/pe-input-940.bz2<br>> > > > > <br>> > > > > 939:<br>> > > > >     <node_state id="4" uname="vm04" in_ccm="false"<br>> > > > > crmd="offline"<br>> > > > > crm-debug-origin="do_state_transition" join="down"<br>> > > expected="down"><br>> > > > >       <transient_attributes id="4"><br>> > > > >         <instance_attributes id="status-4"><br>> > > > >           <nvpair id="status-4-.feature-set" name="#feature-<br>> > > > > set"<br>> > > > > value="3.16.2"/><br>> > > > >           <nvpair id="status-4-shutdown" name="shutdown"<br>> > > > > value="1689911346"/><br>> > > > >         </instance_attributes><br>> > > > >       </transient_attributes><br>> > > > > <br>> > > > > 940:<br>> > > > >     <node_state id="4" uname="vm04" in_ccm="true" crmd="online"<br>> > > > > crm-debug-origin="do_state_transition" join="member"<br>> > > > > expected="member"><br>> > > > >       <transient_attributes id="4"><br>> > > > >         <instance_attributes id="status-4"><br>> > > > >           <nvpair id="status-4-.feature-set" name="#feature-<br>> > > > > set"<br>> > > > > value="3.16.2"/><br>> > > > >           <nvpair id="status-4-shutdown" name="shutdown"<br>> > > > > value="1689911346"/><br>> > > > >         </instance_attributes><br>> > > > >       </transient_attributes><br>> > > > > <br>> > > > > I suppose that node vm04's state was not updated before the<br>> > > > > transition<br>> > > > > was aborted. 
So when the new transition (940) ran, the<br>> > > > > scheduler<br>> > > saw<br>> > > > > that node vm04 is expected to be in shutdown state, and it<br>> > > triggered<br>> > > > > a<br>> > > > > shutdown.<br>> > > > > <br>> > > > > This behavior might already be fixed upstream by the following<br>> > > > > commit:<br>> > > > > <a href="https://github.com/ClusterLabs/pacemaker/commit/5e3b3d14">https://github.com/ClusterLabs/pacemaker/commit/5e3b3d14</a><br>> > > > > <br>> > > > > That commit introduced a regression, however, and I'm working<br>> > > > > on<br>> > > > > fixing it.<br>> > > > <br>> > > > I suspect that's unrelated, because transient attributes are<br>> > > cleared<br>> > > > when a node leaves rather than when it joins.<br>> > > > <br>> > > > > <br>> > > > > > The issue is reproducible from time to time and the version<br>> > > > > > of<br>> > > > > > pacemaker is " 2.1.5-8.1.el8_8-a3f44794f94" from Rocky linux<br>> > > 8.8.<br>> > > > > > I attached crm_report with blackbox. I have debug logs, but<br>> > > they<br>> > > > > > are pretty heavy (~40MB bzip --best). Please tell me if you<br>> > > need<br>> > > > > > them.<br>> > > > > > <br>> > > > > > Thanks,<br>> > > > > > Arthur<br>> > > > > > <br>> > > > -- <br>> > > > Ken Gaillot <kgaillot at <a href="http://redhat.com">redhat.com</a>><br>> > > --<br>> > > Arthur Novik<br>> > > _______________________________________________<br>> > > Manage your subscription:<br>> > > <a href="https://lists.clusterlabs.org/mailman/listinfo/users">https://lists.clusterlabs.org/mailman/listinfo/users</a><br>> > > <br>> > > ClusterLabs home: <a href="https://www.clusterlabs.org/">https://www.clusterlabs.org/</a><br>> -- <br>> Ken Gaillot <kgaillot at <a href="http://redhat.com">redhat.com</a>><br></div></div>
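<div><br></div><div>The workaround discussed in this thread (shutting down nodes in sequence rather than in parallel when only some nodes are stopped) can be sketched roughly as follows. This is a minimal illustration, not the procedure actually used here: <code>stop_nodes_sequentially</code>, <code>stop_node</code>, <code>is_down</code> and <code>pcs_stop</code> are hypothetical names, and the use of <code>pcs cluster stop</code> assumes the cluster is managed with pcs. How to decide that a node is really down is left to the caller.</div>
<pre>
```python
import subprocess
import time
from typing import Callable, Iterable, List


def stop_nodes_sequentially(
    nodes: Iterable[str],
    stop_node: Callable[[str], None],
    is_down: Callable[[str], bool],
    poll_seconds: float = 5.0,
    timeout_seconds: float = 600.0,
) -> List[str]:
    """Stop each node in turn, waiting until it is down before the next one.

    stop_node and is_down are hypothetical hooks supplied by the caller;
    they are not Pacemaker APIs.
    """
    stopped = []
    for node in nodes:
        # Request shutdown of exactly one node at a time.
        stop_node(node)
        deadline = time.monotonic() + timeout_seconds
        # Do not touch the next node until this one is confirmed down.
        while not is_down(node):
            if time.monotonic() >= deadline:
                raise TimeoutError(f"{node} did not stop within {timeout_seconds}s")
            time.sleep(poll_seconds)
        stopped.append(node)
    return stopped


def pcs_stop(node: str) -> None:
    # Assumption: the cluster is managed with pcs and this host is
    # permitted to stop cluster services on its peers.
    subprocess.run(["pcs", "cluster", "stop", node], check=True)
```
</pre>
<div>A call such as <code>stop_nodes_sequentially(["vm03", "vm04"], pcs_stop, is_down=my_check)</code> would then stop vm03, wait for it to be confirmed down by the caller-supplied <code>my_check</code>, and only then stop vm04.</div>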