[ClusterLabs] Need a help with "(crm_glib_handler) crit: GLib: g_hash_table_lookup: assertion 'hash_table != NULL' failed"
Novik Arthur
freishutz at gmail.com
Fri Jul 28 08:21:48 EDT 2023
2023-07-21_pacemaker_debug.log.vm01.bz2
<https://drive.google.com/file/d/1zPgIjKQw0_COTax1nHr4FJdDKHwW8qMN/view?usp=drive_web>
2023-07-21_pacemaker_debug.log.vm02.bz2
<https://drive.google.com/file/d/1wxEh4ZjmGwmy0ockclvYPSlMbsxj6v4k/view?usp=drive_web>
2023-07-21_pacemaker_debug.log.vm03.bz2
<https://drive.google.com/file/d/1YA_nsbuXA_0B2I1u0DbeI_Dd3llbftoP/view?usp=drive_web>
2023-07-21_pacemaker_debug.log.vm04.bz2
<https://drive.google.com/file/d/1FQkoWM20PNW7VZzoduEKYZqQmb9-Ij5w/view?usp=drive_web>
blackbox_txt_vm04.tar.bz2
<https://drive.google.com/file/d/1fZXzg4RRSBBWIHqDq-Af4hNsWRxZngQ-/view?usp=drive_web>
On Thu, Jul 27 12:06:42 EDT 2023, Ken Gaillot kgaillot at redhat.com wrote:
> Running "qb-blackbox /var/lib/pacemaker/blackbox/pacemaker-controld-
> 4257.1" (my version can't read it) will show trace logs that might give
> a better idea of what exactly went wrong at this time (though these
> issues are side effects, not the cause).
The blackboxes were attached to the crm_report and are in txt format.
Just in case, I'm also attaching them to this email.
> FYI, it's not necessary to set cluster-recheck-interval as low as 1
> minute. A long time ago that could be useful, but modern Pacemaker
> doesn't need it to calculate things such as failure expiration. I
> recommend leaving it at default, or at least raising it to 5 minutes or
> so.
That's good to know. Those settings came from pacemaker-1.x, and I'm an
adherent of the "don't touch it if it works" rule.
> vm02, vm03, and vm04 all left the cluster at that time, leaving only
> vm01. At this point, vm01 should have deleted the transient attributes
> for all three nodes. Unfortunately, the logs for that would only be in
> pacemaker.log, which crm_report appears not to have grabbed, so I am
> not sure whether it tried.
Please find the debug logs for "Jul 21" from the DC (vm01) and the crashed
node (vm04) attached.
> On Thu, Jul 27 12:06:42 EDT 2023, Ken Gaillot kgaillot at redhat.com wrote:
> On Wed, 2023-07-26 at 13:29 -0700, Reid Wahl wrote:
> > On Fri, Jul 21, 2023 at 9:51 AM Novik Arthur <freishutz at gmail.com>
> > wrote:
> > > Hello Andrew, Ken and the entire community!
> > >
> > > I faced a problem and I would like to ask for help.
> > >
> > > Preamble:
> > > I have dual-controller storage (C0, C1) with 2 VMs per controller
> > > (vm0[1,2] on C0, vm0[3,4] on C1).
> > > I did an online controller upgrade (updating the firmware on the
> > > physical controller), and for that purpose we have a special procedure:
> > >
> > > 1. Put all VMs on the controller that will be updated into standby
> > >    mode (vm0[3,4] in the logs).
> > > 2. Once all resources have moved to the spare controller's VMs, turn on
> > >    maintenance-mode (the DC machine is vm01).
> > > 3. Shut down vm0[3,4] and perform the firmware update on C1 (OS + KVM +
> > >    HCA/HBA + BMC drivers will be updated).
> > > 4. Reboot C1.
> > > 5. Start vm0[3,4].
> > >    On this step I hit the problem.
> > > 6. Do the same steps for C0 (turn off maintenance-mode, put nodes 3,4
> > >    online, put nodes 1,2 into standby, maintenance-mode, etc.).
> > >
> > > Here is what I observed during step 5: machine vm03 started without
> > > problems, but vm04 hit a critical error and its HA stack died. If I
> > > manually start Pacemaker one more time, it starts without problems and
> > > vm04 joins the cluster.
> > >
> > > Some logs from vm04:
> > >
> > > Jul 21 04:05:39 vm04 corosync[3061]: [QUORUM] This node is within
> > > the primary component and will provide service.
> > > Jul 21 04:05:39 vm04 corosync[3061]: [QUORUM] Members[4]: 1 2 3 4
> > > Jul 21 04:05:39 vm04 corosync[3061]: [MAIN ] Completed service
> > > synchronization, ready to provide service.
> > > Jul 21 04:05:39 vm04 corosync[3061]: [KNET ] rx: host: 3 link: 1
> > > is up
> > > Jul 21 04:05:39 vm04 corosync[3061]: [KNET ] link: Resetting MTU
> > > for link 1 because host 3 joined
> > > Jul 21 04:05:39 vm04 corosync[3061]: [KNET ] host: host: 3
> > > (passive) best link: 0 (pri: 1)
> > > Jul 21 04:05:39 vm04 pacemaker-attrd[4240]: notice: Setting
> > > ifspeed-lnet-o2ib-o2ib[vm02]: (unset) -> 600
> > > Jul 21 04:05:40 vm04 corosync[3061]: [KNET ] pmtud: PMTUD link
> > > change for host: 3 link: 1 from 453 to 65413
> > > Jul 21 04:05:40 vm04 corosync[3061]: [KNET ] pmtud: Global data
> > > MTU changed to: 1397
> > > Jul 21 04:05:40 vm04 pacemaker-attrd[4240]: notice: Setting ping-
> > > lnet-o2ib-o2ib[vm02]: (unset) -> 4000
> > > Jul 21 04:05:40 vm04 pacemaker-attrd[4240]: notice: Setting
> > > ifspeed-lnet-o2ib-o2ib[vm01]: (unset) -> 600
> > > Jul 21 04:05:40 vm04 pacemaker-attrd[4240]: notice: Setting ping-
> > > lnet-o2ib-o2ib[vm01]: (unset) -> 4000
> > > Jul 21 04:05:47 vm04 pacemaker-controld[4257]: notice: State
> > > transition S_NOT_DC -> S_STOPPING
> > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: error: Cannot
> > > execute monitor of sfa-home-vd: No executor connection
> > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: warning: Cannot
> > > calculate digests for operation sfa-home-vd_monitor_0 because we
> > > have no connection to executor for vm04
> > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: error: Result of
> > > probe operation for sfa-home-vd on vm04: Error (No executor
> > > connection)
> > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: error: Cannot
> > > execute monitor of ifspeed-lnet-o2ib-o2ib: No executor connection
> > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: warning: Cannot
> > > calculate digests for operation ifspeed-lnet-o2ib-o2ib_monitor_0
> > > because we have no connection to executor for vm04
> > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: error: Result of
> > > probe operation for ifspeed-lnet-o2ib-o2ib on vm04: Error (No
> > > executor connection)
> > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: error: Cannot
> > > execute monitor of ping-lnet-o2ib-o2ib: No executor connection
> > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: warning: Cannot
> > > calculate digests for operation ping-lnet-o2ib-o2ib_monitor_0
> > > because we have no connection to executor for vm04
> > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: error: Result of
> > > probe operation for ping-lnet-o2ib-o2ib on vm04: Error (No executor
> > > connection)
> > > Jul 21 04:05:49 vm04 pacemakerd[4127]: notice: pacemaker-
> > > controld[4257] is unresponsive to ipc after 1 tries
> > > Jul 21 04:05:52 vm04 pacemakerd[4127]: warning: Shutting cluster
> > > down because pacemaker-controld[4257] had fatal failure
> > > Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Shutting down
> > > Pacemaker
> > > Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Stopping pacemaker-
> > > schedulerd
> > > Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Stopping pacemaker-
> > > attrd
> > > Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Stopping pacemaker-
> > > execd
> > > Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Stopping pacemaker-
> > > fenced
> > > Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Stopping pacemaker-
> > > based
> > > Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Shutdown complete
> > > Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Shutting down and
> > > staying down after fatal error
> > >
> > > Jul 21 04:05:44 vm04 root[10111]: openibd: Set node_desc for
> > > mlx5_0: vm04 HCA-1
> > > Jul 21 04:05:44 vm04 root[10113]: openibd: Set node_desc for
> > > mlx5_1: vm04 HCA-2
> > > Jul 21 04:05:47 vm04 pacemaker-controld[4257]: error: Shutting
> > > down controller after unexpected shutdown request from vm01
> > > Jul 21 04:05:47 vm04 pacemaker-controld[4257]: Problem detected at
> > > handle_shutdown_ack:954 (controld_messages.c), please see
> > > /var/lib/pacemaker/blackbox/pacemaker-controld-4257.1 for
> > > additional details
>
> Running "qb-blackbox /var/lib/pacemaker/blackbox/pacemaker-controld-
> 4257.1" (my version can't read it) will show trace logs that might give
> a better idea of what exactly went wrong at this time (though these
> issues are side effects, not the cause).
>
> FYI, it's not necessary to set cluster-recheck-interval as low as 1
> minute. A long time ago that could be useful, but modern Pacemaker
> doesn't need it to calculate things such as failure expiration. I
> recommend leaving it at default, or at least raising it to 5 minutes or
> so.
>
> > > Jul 21 04:05:47 vm04 pacemaker-controld[4257]: notice: State
> > > transition S_NOT_DC -> S_STOPPING
> > > Jul 21 04:05:47 vm04 pacemaker-controld[4257]: notice:
> > > Disconnected from the executor
> > > Jul 21 04:05:47 vm04 pacemaker-controld[4257]: notice:
> > > Disconnected from Corosync
> > > Jul 21 04:05:47 vm04 pacemaker-controld[4257]: notice:
> > > Disconnected from the CIB manager
> > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: notice:
> > > Disconnected from the CIB manager
> > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: crit: GLib:
> > > g_hash_table_lookup: assertion 'hash_table != NULL' failed
> > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: error: Cannot
> > > execute monitor of sfa-home-vd: No executor connection
> > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: crit: GLib:
> > > g_hash_table_lookup: assertion 'hash_table != NULL' failed
> > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: warning: Cannot
> > > calculate digests for operation sfa-home-vd_monitor_0 because we
> > > have no connection to executor for vm04
> > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: warning: Resource
> > > update -107 failed: (rc=-107) Transport endpoint is not connected
> >
> > The controller disconnects from the executor and deletes the executor
> > state table (lrm_state_table) in the middle of the shutdown process:
> > https://github.com/ClusterLabs/pacemaker/blob/Pacemaker-2.1.5/daemons/controld/controld_control.c#L240
> >
> > These crit messages are happening when we try to look up executor
> > state when lrm_state_table is NULL. That shouldn't happen. I guess
> > the
> > lookups are happening while draining the mainloop:
> > https://github.com/ClusterLabs/pacemaker/blob/Pacemaker-2.1.5/daemons/controld/controld_control.c#L286-L294
> >
>
> The blackbox should help confirm that.
>
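(Side note: the GLib critical itself is easy to reproduce outside
Pacemaker. The sketch below is not Pacemaker code, just a minimal
stand-alone illustration of the failure mode Reid describes. A lookup
against a GHashTable pointer that has already been destroyed and set to
NULL prints exactly that assertion and returns NULL, while a guarded
lookup stays quiet; the file name and key here are made up for the
example.)

/* glib_null_lookup.c (hypothetical file name); build with:
 *   gcc glib_null_lookup.c $(pkg-config --cflags --libs glib-2.0)
 */
#include <glib.h>
#include <stdio.h>

int main(void)
{
    /* Stand-in for controld's lrm_state_table after it has been destroyed
     * on the shutdown path but before the mainloop has drained. */
    GHashTable *lrm_state_table = NULL;

    /* Triggers GLib's critical "g_hash_table_lookup: assertion
     * 'hash_table != NULL' failed" (which the controller logs via
     * crm_glib_handler) and returns NULL rather than crashing. */
    gpointer state = g_hash_table_lookup(lrm_state_table, "vm04");
    printf("unguarded lookup returned %p\n", state);

    /* A NULL check on the table avoids the critical entirely. */
    if (lrm_state_table != NULL) {
        state = g_hash_table_lookup(lrm_state_table, "vm04");
    }
    return 0;
}

So the crit messages themselves look harmless, which matches Ken's earlier
point that they are side effects rather than the cause.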
> > > The log from the DC (vm01):
> > > Jul 21 04:05:39 vm01 pacemaker-controld[4048]: notice: Transition
> > > 16 aborted: Peer Halt
> > > Jul 21 04:05:39 vm01 pacemaker-attrd[4017]: notice: Detected
> > > another attribute writer (vm04), starting new election
> > > Jul 21 04:05:39 vm01 pacemaker-attrd[4017]: notice: Setting #attrd-
> > > protocol[vm04]: (unset) -> 5
> > > Jul 21 04:05:40 vm01 pacemaker-controld[4048]: notice: Finalizing
> > > join-2 for 1 node (sync'ing CIB from vm02)
> > > Jul 21 04:05:40 vm01 pacemaker-controld[4048]: notice: Requested
> > > CIB version <generation_tuple crm_feature_set="3.16.2" validate-
> > > with="pacemaker-3.9" epoch="567" num_updates="111" admin_epoch="0"
> > > cib-last-writt
> > > en="Fri Jul 21 03:48:43 2023" update-origin="vm01" update-
> > > client="cibadmin" update-user="root" have-quorum="0" dc-uuid="1"/>
> > > Jul 21 04:05:40 vm01 pacemaker-attrd[4017]: notice: Recorded local
> > > node as attribute writer (was unset)
> > > Jul 21 04:05:40 vm01 pacemaker-attrd[4017]: notice: Setting
> > > #feature-set[vm04]: (unset) -> 3.16.2
> > > Jul 21 04:05:41 vm01 pacemaker-controld[4048]: notice: Transition
> > > 16 aborted by deletion of lrm[@id='4']: Resource state removal
> > > Jul 21 04:05:47 vm01 pacemaker-schedulerd[4028]: notice: No fencing
> > > will be done until there are resources to manage
> > > Jul 21 04:05:47 vm01 pacemaker-schedulerd[4028]: notice: *
> > > Shutdown vm04
>
> This is where things start to go wrong, and it has nothing to do with
> any of the messages here. It means that the shutdown node attribute was
> not erased when vm04 shut down the last time before this. Going back,
> we see when that happened:
>
> Jul 21 03:49:06 vm01 pacemaker-attrd[4017]: notice: Setting shutdown[vm04]: (unset) -> 1689911346
>
> vm02, vm03, and vm04 all left the cluster at that time, leaving only
> vm01. At this point, vm01 should have deleted the transient attributes
> for all three nodes. Unfortunately, the logs for that would only be in
> pacemaker.log, which crm_report appears not to have grabbed, so I am
> not sure whether it tried.
>
> > > Jul 21 04:05:47 vm01 pacemaker-schedulerd[4028]: notice: Calculated
> > > transition 17, saving inputs in /var/lib/pacemaker/pengine/pe-
> > > input-940.bz2
>
> What's interesting in this transition is that we schedule probes on
> vm04 even though we're shutting it down. That's a bug, and leads to the
> "No executor connection" messages we see on vm04. I've added a task to
> our project manager to take care of that. That's all a side effect
> though and not causing any real problems.
>
> > > As far as I understand, vm04 was killed by the DC during the election
> > > of a new attribute writer?
> >
> > Not sure yet, maybe someone else recognizes this.
> >
> > I see the transition was aborted due to peer halt right after node
> > vm04 joined. A new election started due to detection of node vm04 as
> > attribute writer. Node vm04's resource state was removed, which is a
> > normal part of the join sequence; this caused another transition
> > abort
> > message for the same transition number.
> >
> > Jul 21 04:05:39 vm01 pacemaker-controld[4048]: notice: Node vm04
> > state
> > is now member
> > ...
> > Jul 21 04:05:39 vm01 corosync[3134]: [KNET ] pmtud: Global data MTU
> > changed to: 1397
> > Jul 21 04:05:39 vm01 pacemaker-controld[4048]: notice: Transition 16
> > aborted: Peer Halt
> > Jul 21 04:05:39 vm01 pacemaker-attrd[4017]: notice: Detected another
> > attribute writer (vm04), starting new election
> > Jul 21 04:05:39 vm01 pacemaker-attrd[4017]: notice: Setting
> > #attrd-protocol[vm04]: (unset) -> 5
> > ...
> > Jul 21 04:05:41 vm01 pacemaker-controld[4048]: notice: Transition 16
> > aborted by deletion of lrm[@id='4']: Resource state removal
> >
> > Looking at pe-input-939 and pe-input-940, node vm04 was marked as
> > shut down:
> >
> > Jul 21 04:05:38 vm01 pacemaker-schedulerd[4028]: notice: Calculated
> > transition 16, saving inputs in
> > /var/lib/pacemaker/pengine/pe-input-939.bz2
> > Jul 21 04:05:44 vm01 pacemaker-controld[4048]: notice: Transition 16
> > (Complete=24, Pending=0, Fired=0, Skipped=34, Incomplete=34,
> > Source=/var/lib/pacemaker/pengine/pe-input-939.bz2): Stopped
> > Jul 21 04:05:47 vm01 pacemaker-schedulerd[4028]: notice: * Shutdown
> > vm04
> > Jul 21 04:05:47 vm01 pacemaker-schedulerd[4028]: notice: Calculated
> > transition 17, saving inputs in
> > /var/lib/pacemaker/pengine/pe-input-940.bz2
> >
> > 939:
> > <node_state id="4" uname="vm04" in_ccm="false" crmd="offline"
> > crm-debug-origin="do_state_transition" join="down" expected="down">
> > <transient_attributes id="4">
> > <instance_attributes id="status-4">
> > <nvpair id="status-4-.feature-set" name="#feature-set"
> > value="3.16.2"/>
> > <nvpair id="status-4-shutdown" name="shutdown"
> > value="1689911346"/>
> > </instance_attributes>
> > </transient_attributes>
> >
> > 940:
> > <node_state id="4" uname="vm04" in_ccm="true" crmd="online"
> > crm-debug-origin="do_state_transition" join="member"
> > expected="member">
> > <transient_attributes id="4">
> > <instance_attributes id="status-4">
> > <nvpair id="status-4-.feature-set" name="#feature-set"
> > value="3.16.2"/>
> > <nvpair id="status-4-shutdown" name="shutdown"
> > value="1689911346"/>
> > </instance_attributes>
> > </transient_attributes>
> >
> > I suppose that node vm04's state was not updated before the
> > transition
> > was aborted. So when the new transition (940) ran, the scheduler saw
> > that node vm04 is expected to be in shutdown state, and it triggered
> > a
> > shutdown.
> >
> > This behavior might already be fixed upstream by the following
> > commit:
> > https://github.com/ClusterLabs/pacemaker/commit/5e3b3d14
> >
> > That commit introduced a regression, however, and I'm working on
> > fixing it.
>
> I suspect that's unrelated, because transient attributes are cleared
> when a node leaves rather than when it joins.
>
> >
> >
> > > The issue is reproducible from time to time, and the Pacemaker version
> > > is "2.1.5-8.1.el8_8-a3f44794f94" from Rocky Linux 8.8.
> > >
> > > I attached a crm_report with the blackbox. I have debug logs, but they
> > > are pretty heavy (~40 MB with bzip --best). Please tell me if you need
> > > them.
> > >
> > > Thanks,
> > > Arthur
> > >
> --
> Ken Gaillot <kgaillot at redhat.com>
--
Arthur Novik