<div dir="ltr"><div class="gmail_chip gmail_drive_chip"><a href="https://drive.google.com/file/d/1zPgIjKQw0_COTax1nHr4FJdDKHwW8qMN/view?usp=drive_web" target="_blank" aria-label="2023-07-21_pacemaker_debug.log.vm01.bz2">2023-07-21_pacemaker_debug.log.vm01.bz2</a></div><div class="gmail_chip gmail_drive_chip"><a href="https://drive.google.com/file/d/1wxEh4ZjmGwmy0ockclvYPSlMbsxj6v4k/view?usp=drive_web" target="_blank" aria-label="2023-07-21_pacemaker_debug.log.vm02.bz2">2023-07-21_pacemaker_debug.log.vm02.bz2</a></div><div class="gmail_chip gmail_drive_chip"><a href="https://drive.google.com/file/d/1YA_nsbuXA_0B2I1u0DbeI_Dd3llbftoP/view?usp=drive_web" target="_blank" aria-label="2023-07-21_pacemaker_debug.log.vm03.bz2">2023-07-21_pacemaker_debug.log.vm03.bz2</a></div><div class="gmail_chip gmail_drive_chip"><a href="https://drive.google.com/file/d/1FQkoWM20PNW7VZzoduEKYZqQmb9-Ij5w/view?usp=drive_web" target="_blank" aria-label="2023-07-21_pacemaker_debug.log.vm04.bz2">2023-07-21_pacemaker_debug.log.vm04.bz2</a></div><div class="gmail_chip gmail_drive_chip"><a href="https://drive.google.com/file/d/1fZXzg4RRSBBWIHqDq-Af4hNsWRxZngQ-/view?usp=drive_web" target="_blank" aria-label="blackbox_txt_vm04.tar.bz2">blackbox_txt_vm04.tar.bz2</a></div><pre>On Thu, Jul 27 12:06:42 EDT 2023, Ken Gaillot <a href="mailto:users%40clusterlabs.org?Subject=Re:%20Re%3A%20%5BClusterLabs%5D%20Need%20a%20help%20with%20%22%28crm_glib_handler%29%20crit%3A%20GLib%3A%0A%20g_hash_table_lookup%3A%20assertion%20%27hash_table%20%21%3D%20NULL%27%20failed%22&In-Reply-To=%3C93011555dbaa91f51f9c660313807a16b6e2f676.camel%40redhat.com%3E" title="[ClusterLabs] Need a help with &quot;(crm_glib_handler) crit: GLib: g_hash_table_lookup: assertion 'hash_table != NULL' failed&quot;" target="_blank">kgaillot at redhat.com</a> wrote:

> Running "qb-blackbox /var/lib/pacemaker/blackbox/pacemaker-controld-4257.1"
> (my version can't read it) will show trace logs that might give
> a better idea of what exactly went wrong at this time (though these
> issues are side effects, not the cause).

The blackboxes were attached to the crm_report in text format. I'm adding them to this email as well, just in case.

> FYI, it's not necessary to set cluster-recheck-interval as low as 1
> minute. A long time ago that could be useful, but modern Pacemaker
> doesn't need it to calculate things such as failure expiration. I
> recommend leaving it at default, or at least raising it to 5 minutes or
> so.

That's good to know. Those settings were carried over from pacemaker-1.x, and I'm a follower of the "if it works, don't touch it" rule.

> vm02, vm03, and vm04 all left the cluster at that time, leaving only
> vm01. At this point, vm01 should have deleted the transient attributes
> for all three nodes. Unfortunately, the logs for that would only be in
> pacemaker.log, which crm_report appears not to have grabbed, so I am
> not sure whether it tried.</pre><div>Please find the debug logs for Jul 21 from the DC (vm01) and the crashed node (vm04) attached.</div>
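<div>A quick cross-check on the stale attribute discussed in the quoted thread: the shutdown value 1689911346 recorded for vm04 in pe-input-939 and pe-input-940 looks like a Unix timestamp in seconds. Assuming that is what it is, and assuming the node logs are written in UTC, the minimal Python sketch below (the helper name epoch_to_utc is made up for illustration) decodes it to 2023-07-21 03:49:06. That is the same time as the "Setting shutdown[vm04]" log line on vm01, which would be consistent with the attribute surviving from the earlier shutdown rather than being set again at 04:05.</div>
<pre>
```python
from datetime import datetime, timezone

# Value of the "shutdown" nvpair for vm04 in pe-input-939 and pe-input-940.
# Treating it as epoch seconds is an assumption based on its magnitude.
SHUTDOWN_ATTR = "1689911346"


def epoch_to_utc(value):
    """Render an epoch-seconds string as a 'YYYY-MM-DD HH:MM:SS' UTC timestamp."""
    moment = datetime.fromtimestamp(int(value), tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")


if __name__ == "__main__":
    # Compare the result with the "Setting shutdown[vm04]" line in the vm01 log.
    print(epoch_to_utc(SHUTDOWN_ATTR))
```
</pre>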

<pre>> Thu, Jul 27 12:06:42 EDT 2023, Ken Gaillot <a href="mailto:users%40clusterlabs.org?Subject=Re:%20Re%3A%20%5BClusterLabs%5D%20Need%20a%20help%20with%20%22%28crm_glib_handler%29%20crit%3A%20GLib%3A%0A%20g_hash_table_lookup%3A%20assertion%20%27hash_table%20%21%3D%20NULL%27%20failed%22&In-Reply-To=%3C93011555dbaa91f51f9c660313807a16b6e2f676.camel%40redhat.com%3E" title="[ClusterLabs] Need a help with "(crm_glib_handler) crit: GLib: g_hash_table_lookup: assertion 'hash_table != NULL' failed"" target="_blank">kgaillot at redhat.com</a> wrote:<br>> On Wed, 2023-07-26 at 13:29 -0700, Reid Wahl wrote:<br>> > On Fri, Jul 21, 2023 at 9:51 AM Novik Arthur <freishutz at <a href="http://gmail.com">gmail.com</a>><br>> > wrote:<br>> > > Hello Andrew, Ken and the entire community!<br>> > > <br>> > > I faced a problem and I would like to ask for help.<br>> > > <br>> > > Preamble:<br>> > > I have dual controller storage (C0, C1) with 2 VM per controller<br>> > > (vm0[1,2] on C0, vm[3,4] on C1).<br>> > > I did online controller upgrade (update the firmware on physical<br>> > > controller) and for that purpose we have a special procedure:<br>> > > <br>> > > Put all vms on the controller which will be updated into the<br>> > > standby mode (vm0[3,4] in logs).<br>> > > Once all resources are moved to spare controller VMs, turn on<br>> > > maintenance-mode (DC machine is vm01).<br>> > > Shutdown vm0[3,4] and perform firmware update on C1 (OS + KVM +<br>> > > HCA/HBA + BMC drivers will be updated).<br>> > > Reboot C1<br>> > > Start vm0[3,4]<br>> > > On this step I hit the problem.<br>> > > Do the same steps for C0 (turn off maint, put nodes 3,4 to online,<br>> > > put 1-2 to standby, maint and etc).<br>> > > <br>> > > Here is what I observed during step 5.<br>> > > Machine vm03 started without problems, but vm04 caught critical<br>> > > error and HA stack died. 
If manually start the pacemaker one more<br>> > > time then it starts without problems and vm04 joins the cluster.<br>> > > <br>> > > Some logs from vm04:<br>> > > <br>> > > Jul 21 04:05:39 vm04 corosync[3061]:  [QUORUM] This node is within<br>> > > the primary component and will provide service.<br>> > > Jul 21 04:05:39 vm04 corosync[3061]:  [QUORUM] Members[4]: 1 2 3 4<br>> > > Jul 21 04:05:39 vm04 corosync[3061]:  [MAIN  ] Completed service<br>> > > synchronization, ready to provide service.<br>> > > Jul 21 04:05:39 vm04 corosync[3061]:  [KNET  ] rx: host: 3 link: 1<br>> > > is up<br>> > > Jul 21 04:05:39 vm04 corosync[3061]:  [KNET  ] link: Resetting MTU<br>> > > for link 1 because host 3 joined<br>> > > Jul 21 04:05:39 vm04 corosync[3061]:  [KNET  ] host: host: 3<br>> > > (passive) best link: 0 (pri: 1)<br>> > > Jul 21 04:05:39 vm04 pacemaker-attrd[4240]: notice: Setting<br>> > > ifspeed-lnet-o2ib-o2ib[vm02]: (unset) -> 600<br>> > > Jul 21 04:05:40 vm04 corosync[3061]:  [KNET  ] pmtud: PMTUD link<br>> > > change for host: 3 link: 1 from 453 to 65413<br>> > > Jul 21 04:05:40 vm04 corosync[3061]:  [KNET  ] pmtud: Global data<br>> > > MTU changed to: 1397<br>> > > Jul 21 04:05:40 vm04 pacemaker-attrd[4240]: notice: Setting ping-<br>> > > lnet-o2ib-o2ib[vm02]: (unset) -> 4000<br>> > > Jul 21 04:05:40 vm04 pacemaker-attrd[4240]: notice: Setting<br>> > > ifspeed-lnet-o2ib-o2ib[vm01]: (unset) -> 600<br>> > > Jul 21 04:05:40 vm04 pacemaker-attrd[4240]: notice: Setting ping-<br>> > > lnet-o2ib-o2ib[vm01]: (unset) -> 4000<br>> > > Jul 21 04:05:47 vm04 pacemaker-controld[4257]: notice: State<br>> > > transition S_NOT_DC -> S_STOPPING<br>> > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: error: Cannot<br>> > > execute monitor of sfa-home-vd: No executor connection<br>> > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: warning: Cannot<br>> > > calculate digests for operation sfa-home-vd_monitor_0 because we<br>> > > have no connection to executor for vm04<br>> > > Jul 
21 04:05:48 vm04 pacemaker-controld[4257]: error: Result of<br>> > > probe operation for sfa-home-vd on vm04: Error (No executor<br>> > > connection)<br>> > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: error: Cannot<br>> > > execute monitor of ifspeed-lnet-o2ib-o2ib: No executor connection<br>> > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: warning: Cannot<br>> > > calculate digests for operation ifspeed-lnet-o2ib-o2ib_monitor_0<br>> > > because we have no connection to executor for vm04<br>> > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: error: Result of<br>> > > probe operation for ifspeed-lnet-o2ib-o2ib on vm04: Error (No<br>> > > executor connection)<br>> > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: error: Cannot<br>> > > execute monitor of ping-lnet-o2ib-o2ib: No executor connection<br>> > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: warning: Cannot<br>> > > calculate digests for operation ping-lnet-o2ib-o2ib_monitor_0<br>> > > because we have no connection to executor for vm04<br>> > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: error: Result of<br>> > > probe operation for ping-lnet-o2ib-o2ib on vm04: Error (No executor<br>> > > connection)<br>> > > Jul 21 04:05:49 vm04 pacemakerd[4127]: notice: pacemaker-<br>> > > controld[4257] is unresponsive to ipc after 1 tries<br>> > > Jul 21 04:05:52 vm04 pacemakerd[4127]: warning: Shutting cluster<br>> > > down because pacemaker-controld[4257] had fatal failure<br>> > > Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Shutting down<br>> > > Pacemaker<br>> > > Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Stopping pacemaker-<br>> > > schedulerd<br>> > > Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Stopping pacemaker-<br>> > > attrd<br>> > > Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Stopping pacemaker-<br>> > > execd<br>> > > Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Stopping pacemaker-<br>> > > fenced<br>> > > Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Stopping pacemaker-<br>> > 
> based<br>> > > Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Shutdown complete<br>> > > Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Shutting down and<br>> > > staying down after fatal error<br>> > > <br>> > > Jul 21 04:05:44 vm04 root[10111]: openibd: Set node_desc for<br>> > > mlx5_0: vm04 HCA-1<br>> > > Jul 21 04:05:44 vm04 root[10113]: openibd: Set node_desc for<br>> > > mlx5_1: vm04 HCA-2<br>> > > Jul 21 04:05:47 vm04 pacemaker-controld[4257]:  error: Shutting<br>> > > down controller after unexpected shutdown request from vm01<br>> > > Jul 21 04:05:47 vm04 pacemaker-controld[4257]: Problem detected at<br>> > > handle_shutdown_ack:954 (controld_messages.c), please see<br>> > > /var/lib/pacemaker/blackbox/pacemaker-controld-4257.1 for<br>> > > additional details<br>> <br>> Running "qb-blackbox /var/lib/pacemaker/blackbox/pacemaker-controld-<br>> 4257.1" (my version can't read it) will show trace logs that might give<br>> a better idea of what exactly went wrong at this time (though these<br>> issues are side effects, not the cause).<br>> <br>> FYI, it's not necessary to set cluster-recheck-interval as low as 1<br>> minute. A long time ago that could be useful, but modern Pacemaker<br>> doesn't need it to calculate things such as failure expiration. 
I<br>> recommend leaving it at default, or at least raising it to 5 minutes or<br>> so.<br>> <br>> > > Jul 21 04:05:47 vm04 pacemaker-controld[4257]:  notice: State<br>> > > transition S_NOT_DC -> S_STOPPING<br>> > > Jul 21 04:05:47 vm04 pacemaker-controld[4257]:  notice:<br>> > > Disconnected from the executor<br>> > > Jul 21 04:05:47 vm04 pacemaker-controld[4257]:  notice:<br>> > > Disconnected from Corosync<br>> > > Jul 21 04:05:47 vm04 pacemaker-controld[4257]:  notice:<br>> > > Disconnected from the CIB manager<br>> > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]:  notice:<br>> > > Disconnected from the CIB manager<br>> > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]:  crit: GLib:<br>> > > g_hash_table_lookup: assertion 'hash_table != NULL' failed<br>> > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]:  error: Cannot<br>> > > execute monitor of sfa-home-vd: No executor connection<br>> > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]:  crit: GLib:<br>> > > g_hash_table_lookup: assertion 'hash_table != NULL' failed<br>> > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]:  warning: Cannot<br>> > > calculate digests for operation sfa-home-vd_monitor_0 because we<br>> > > have no connection to executor for vm04<br>> > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]:  warning: Resource<br>> > > update -107 failed: (rc=-107) Transport endpoint is not connected<br>> > <br>> > The controller disconnects from the executor and deletes the executor<br>> > state table (lrm_state_table) in the middle of the shutdown process:<br>> > <a href="https://github.com/ClusterLabs/pacemaker/blob/Pacemaker-2.1.5/daemons/controld/controld_control.c#L240">https://github.com/ClusterLabs/pacemaker/blob/Pacemaker-2.1.5/daemons/controld/controld_control.c#L240</a><br>> > <br>> > These crit messages are happening when we try to look up executor<br>> > state when lrm_state_table is NULL. That shouldn't happen. 
I guess<br>> > the<br>> > lookups are happening while draining the mainloop:<br>> > <a href="https://github.com/ClusterLabs/pacemaker/blob/Pacemaker-2.1.5/daemons/controld/controld_control.c#L286-L294">https://github.com/ClusterLabs/pacemaker/blob/Pacemaker-2.1.5/daemons/controld/controld_control.c#L286-L294</a><br>> > <br>> <br>> The blackbox should help confirm that.<br>> <br>> > > the log from DC vm01:<br>> > > Jul 21 04:05:39 vm01 pacemaker-controld[4048]: notice: Transition<br>> > > 16 aborted: Peer Halt<br>> > > Jul 21 04:05:39 vm01 pacemaker-attrd[4017]: notice: Detected<br>> > > another attribute writer (vm04), starting new election<br>> > > Jul 21 04:05:39 vm01 pacemaker-attrd[4017]: notice: Setting #attrd-<br>> > > protocol[vm04]: (unset) -> 5<br>> > > Jul 21 04:05:40 vm01 pacemaker-controld[4048]: notice: Finalizing<br>> > > join-2 for 1 node (sync'ing CIB from vm02)<br>> > > Jul 21 04:05:40 vm01 pacemaker-controld[4048]: notice: Requested<br>> > > CIB version   <generation_tuple crm_feature_set="3.16.2" validate-<br>> > > with="pacemaker-3.9" epoch="567" num_updates="111" admin_epoch="0"<br>> > > cib-last-writt<br>> > > en="Fri Jul 21 03:48:43 2023" update-origin="vm01" update-<br>> > > client="cibadmin" update-user="root" have-quorum="0" dc-uuid="1"/><br>> > > Jul 21 04:05:40 vm01 pacemaker-attrd[4017]: notice: Recorded local<br>> > > node as attribute writer (was unset)<br>> > > Jul 21 04:05:40 vm01 pacemaker-attrd[4017]: notice: Setting<br>> > > #feature-set[vm04]: (unset) -> 3.16.2<br>> > > Jul 21 04:05:41 vm01 pacemaker-controld[4048]: notice: Transition<br>> > > 16 aborted by deletion of lrm[@id='4']: Resource state removal<br>> > > Jul 21 04:05:47 vm01 pacemaker-schedulerd[4028]: notice: No fencing<br>> > > will be done until there are resources to manage<br>> > > Jul 21 04:05:47 vm01 pacemaker-schedulerd[4028]: notice:  *<br>> > > Shutdown vm04<br>> <br>> This is where things start to go wrong, and it has nothing to do with<br>> any of the 
messages here. It means that the shutdown node attribute was<br>> not erased when vm04 shut down the last time before this. Going back,<br>> we see when that happened:<br>> <br>> Jul 21 03:49:06 vm01 pacemaker-attrd[4017]: notice: Setting shutdown[vm04]: (unset) -> 1689911346<br>> <br>> vm02, vm03, and vm04 all left the cluster at that time, leaving only<br>> vm01. At this point, vm01 should have deleted the transient attributes<br>> for all three nodes. Unfortunately, the logs for that would only be in<br>> pacemaker.log, which crm_report appears not to have grabbed, so I am<br>> not sure whether it tried.<br>> <br>> > > Jul 21 04:05:47 vm01 pacemaker-schedulerd[4028]: notice: Calculated<br>> > > transition 17, saving inputs in /var/lib/pacemaker/pengine/pe-<br>> > > input-940.bz2<br>> <br>> What's interesting in this transition is that we schedule probes on<br>> vm04 even though we're shutting it down. That's a bug, and leads to the<br>> "No executor connection" messages we see on vm04. I've added a task to<br>> our project manager to take care of that. That's all a side effect<br>> though and not causing any real problems.<br>> <br>> > > As far as I understand, vm04 was killed by DC during the election<br>> > > of a new attr writer?<br>> > <br>> > Not sure yet, maybe someone else recognizes this.<br>> > <br>> > I see the transition was aborted due to peer halt right after node<br>> > vm04 joined. A new election started due to detection of node vm04 as<br>> > attribute writer. 
Node vm04's resource state was removed, which is a<br>> > normal part of the join sequence; this caused another transition<br>> > abort<br>> > message for the same transition number.<br>> > <br>> > Jul 21 04:05:39 vm01 pacemaker-controld[4048]: notice: Node vm04<br>> > state<br>> > is now member<br>> > ...<br>> > Jul 21 04:05:39 vm01 corosync[3134]:  [KNET  ] pmtud: Global data MTU<br>> > changed to: 1397<br>> > Jul 21 04:05:39 vm01 pacemaker-controld[4048]: notice: Transition 16<br>> > aborted: Peer Halt<br>> > Jul 21 04:05:39 vm01 pacemaker-attrd[4017]: notice: Detected another<br>> > attribute writer (vm04), starting new election<br>> > Jul 21 04:05:39 vm01 pacemaker-attrd[4017]: notice: Setting<br>> > #attrd-protocol[vm04]: (unset) -> 5<br>> > ...<br>> > Jul 21 04:05:41 vm01 pacemaker-controld[4048]: notice: Transition 16<br>> > aborted by deletion of lrm[@id='4']: Resource state removal<br>> > <br>> > Looking at pe-input-939 and pe-input-940, node vm04 was marked as<br>> > shut down:<br>> > <br>> > Jul 21 04:05:38 vm01 pacemaker-schedulerd[4028]: notice: Calculated<br>> > transition 16, saving inputs in<br>> > /var/lib/pacemaker/pengine/pe-input-939.bz2<br>> > Jul 21 04:05:44 vm01 pacemaker-controld[4048]: notice: Transition 16<br>> > (Complete=24, Pending=0, Fired=0, Skipped=34, Incomplete=34,<br>> > Source=/var/lib/pacemaker/pengine/pe-input-939.bz2): Stopped<br>> > Jul 21 04:05:47 vm01 pacemaker-schedulerd[4028]: notice:  * Shutdown<br>> > vm04<br>> > Jul 21 04:05:47 vm01 pacemaker-schedulerd[4028]: notice: Calculated<br>> > transition 17, saving inputs in<br>> > /var/lib/pacemaker/pengine/pe-input-940.bz2<br>> > <br>> > 939:<br>> >     <node_state id="4" uname="vm04" in_ccm="false" crmd="offline"<br>> > crm-debug-origin="do_state_transition" join="down" expected="down"><br>> >       <transient_attributes id="4"><br>> >         <instance_attributes id="status-4"><br>> >           <nvpair id="status-4-.feature-set" name="#feature-set"<br>> > 
value="3.16.2"/><br>> >           <nvpair id="status-4-shutdown" name="shutdown"<br>> > value="1689911346"/><br>> >         </instance_attributes><br>> >       </transient_attributes><br>> > <br>> > 940:<br>> >     <node_state id="4" uname="vm04" in_ccm="true" crmd="online"<br>> > crm-debug-origin="do_state_transition" join="member"<br>> > expected="member"><br>> >       <transient_attributes id="4"><br>> >         <instance_attributes id="status-4"><br>> >           <nvpair id="status-4-.feature-set" name="#feature-set"<br>> > value="3.16.2"/><br>> >           <nvpair id="status-4-shutdown" name="shutdown"<br>> > value="1689911346"/><br>> >         </instance_attributes><br>> >       </transient_attributes><br>> > <br>> > I suppose that node vm04's state was not updated before the<br>> > transition<br>> > was aborted. So when the new transition (940) ran, the scheduler saw<br>> > that node vm04 is expected to be in shutdown state, and it triggered<br>> > a<br>> > shutdown.<br>> > <br>> > This behavior might already be fixed upstream by the following<br>> > commit:<br>> > <a href="https://github.com/ClusterLabs/pacemaker/commit/5e3b3d14">https://github.com/ClusterLabs/pacemaker/commit/5e3b3d14</a><br>> > <br>> > That commit introduced a regression, however, and I'm working on<br>> > fixing it.<br>> <br>> I suspect that's unrelated, because transient attributes are cleared<br>> when a node leaves rather than when it joins.<br>> <br>> > <br>> > <br>> > > The issue is reproducible from time to time and the version of<br>> > > pacemaker is " 2.1.5-8.1.el8_8-a3f44794f94" from Rocky linux 8.8.<br>> > > <br>> > > I attached crm_report with blackbox. I have debug logs, but they<br>> > > are pretty heavy (~40MB bzip --best). Please tell me if you need<br>> > > them.<br>> > > <br>> > > Thanks,<br>> > > Arthur<br>> > > <br>> -- <br>> Ken Gaillot <kgaillot at <a href="http://redhat.com">redhat.com</a>><br>--<br></pre><pre>Arthur Novik<br></pre>

</div>