[ClusterLabs] Need a help with "(crm_glib_handler) crit: GLib: g_hash_table_lookup: assertion 'hash_table != NULL' failed"

Ken Gaillot kgaillot at redhat.com
Thu Aug 3 12:37:18 EDT 2023


In the other case, the problem turned out to be a timing issue that can
occur when the DC and attribute writer are shutting down at the same
time. Since the problem in this case also occurred after shutting down
two nodes together, I'm thinking it's likely the same issue.

A fix should be straightforward. A workaround in the meantime would be
to shut down nodes in sequence rather than in parallel, when shutting
down just some nodes. (Shutting down the entire cluster shouldn't be
subject to the race condition.)
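The sequential workaround can be sketched as a small wrapper. `pcs cluster stop <node>` is one common per-node stop command (an assumption here; substitute crm or systemctl for your stack), and the `run` callable is illustrative so the loop can be exercised without a live cluster:

```python
def stop_nodes_sequentially(nodes, run):
    """Stop cluster nodes one at a time rather than in parallel.

    `run` executes a command and blocks until it completes; in real use
    it could be functools.partial(subprocess.run, check=True).  Because
    each stop blocks, the next node is stopped only after the previous
    one has fully left, avoiding the DC/attribute-writer shutdown race.
    """
    for node in nodes:
        run(["pcs", "cluster", "stop", node])

# Dry run with a recording stub instead of the real CLI:
calls = []
stop_nodes_sequentially(["vm03", "vm04"], calls.append)
```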

On Wed, 2023-08-02 at 16:53 -0500, Ken Gaillot wrote:
> Ha! I didn't realize crm_report saves blackboxes as text. Always
> something new to learn with Pacemaker :)
> 
> As of 2.1.5, the controller now gets agent metadata asynchronously,
> which fixed bugs with synchronous calls blocking the controller. Once
> the metadata action returns, the original action that required the
> metadata is attempted.
> 
> This led to the odd log messages. Normally, agent actions can't be
> attempted once the shutdown sequence begins. However, in this case,
> metadata actions were initiated before shutdown, but completed after
> shutdown began. The controller thus attempted the original actions
> after it had already disconnected from the executor, resulting in the
> odd logs.
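The race described above can be sketched as a tiny state machine. This is a Python toy, not Pacemaker's C implementation: the guard in `on_metadata_ready` corresponds to the fix of discarding a metadata callback that completes after the executor connection is gone.

```python
class ControllerModel:
    """Toy model of the async-metadata race: a callback that fires
    after the executor connection is gone must not attempt the
    original action (illustrative only, not the real controller)."""

    def __init__(self):
        self.executor_connected = True
        self.log = []

    def disconnect_executor(self):
        self.executor_connected = False

    def on_metadata_ready(self, action):
        # The guard: a late callback finds the executor gone and
        # bails out instead of attempting the original action.
        if not self.executor_connected:
            self.log.append(f"error: Cannot execute {action}: "
                            "No executor connection")
            return
        self.log.append(f"executing {action}")

ctl = ControllerModel()
ctl.on_metadata_ready("monitor of sfa-home-vd")   # completes pre-shutdown
ctl.disconnect_executor()                         # shutdown begins
ctl.on_metadata_ready("monitor of ping-lnet")     # completes too late
```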
> 
> The fix for that is simple, but addresses only the logs, not the
> original problem that caused the controller to shut down. I'm still
> looking into that.
> 
> I've since heard about a similar case, and I suspect in that case, it
> was related to having a node with an older version trying to join a
> cluster with a newer version.
> 
> On Fri, 2023-07-28 at 15:21 +0300, Novik Arthur wrote:
> >  2023-07-21_pacemaker_debug.log.vm01.bz2
> >  2023-07-21_pacemaker_debug.log.vm02.bz2
> >  2023-07-21_pacemaker_debug.log.vm03.bz2
> >  2023-07-21_pacemaker_debug.log.vm04.bz2
> >  blackbox_txt_vm04.tar.bz2
> > On Thu, Jul 27 12:06:42 EDT 2023, Ken Gaillot kgaillot at redhat.com wrote:
> > 
> > > Running "qb-blackbox /var/lib/pacemaker/blackbox/pacemaker-controld-4257.1"
> > > (my version can't read it) will show trace logs that might give a better
> > > idea of what exactly went wrong at this time (though these issues are
> > > side effects, not the cause).
> > 
> > Blackboxes were attached to crm_report and they are in txt format.
> > Just in case, I'm also attaching them to this email.
> > 
> > > FYI, it's not necessary to set cluster-recheck-interval as low as 1
> > > minute. A long time ago that could be useful, but modern Pacemaker
> > > doesn't need it to calculate things such as failure expiration. I
> > > recommend leaving it at default, or at least raising it to 5 minutes
> > > or so.
> > 
> > That's good to know, since those rules came from pacemaker-1.x and
> > I'm a follower of the "don't touch it if it works" rule.
> > 
> > > vm02, vm03, and vm04 all left the cluster at that time, leaving only
> > > vm01. At this point, vm01 should have deleted the transient attributes
> > > for all three nodes. Unfortunately, the logs for that would only be in
> > > pacemaker.log, which crm_report appears not to have grabbed, so I am
> > > not sure whether it tried.
> > Please find debug logs for "Jul 21" from DC (vm01) and crashed node
> > (vm04) in an attachment.
> > > On Thu, Jul 27 12:06:42 EDT 2023, Ken Gaillot kgaillot at redhat.com wrote:
> > > On Wed, 2023-07-26 at 13:29 -0700, Reid Wahl wrote:
> > > > On Fri, Jul 21, 2023 at 9:51 AM Novik Arthur <freishutz at gmail.com>
> > > > wrote:
> > > > > Hello Andrew, Ken and the entire community!
> > > > > 
> > > > > I faced a problem and I would like to ask for help.
> > > > > 
> > > > > Preamble:
> > > > > I have dual controller storage (C0, C1) with 2 VMs per controller
> > > > > (vm0[1,2] on C0, vm0[3,4] on C1).
> > > > > I did an online controller upgrade (updating the firmware on the
> > > > > physical controller), and for that purpose we have a special
> > > > > procedure:
> > > > > 
> > > > > 1. Put all VMs on the controller to be updated into standby mode
> > > > >    (vm0[3,4] in logs).
> > > > > 2. Once all resources are moved to the spare controller's VMs, turn
> > > > >    on maintenance-mode (DC machine is vm01).
> > > > > 3. Shut down vm0[3,4] and perform the firmware update on C1 (OS +
> > > > >    KVM + HCA/HBA + BMC drivers will be updated).
> > > > > 4. Reboot C1.
> > > > > 5. Start vm0[3,4]. (This is the step where I hit the problem.)
> > > > > 6. Do the same steps for C0 (turn off maint, put nodes 3,4 online,
> > > > >    put 1-2 into standby, maint, etc.).
> > > > > 
> > > > > Here is what I observed during step 5.
> > > > > Machine vm03 started without problems, but vm04 caught a critical
> > > > > error and the HA stack died. If I manually start Pacemaker one more
> > > > > time, it starts without problems and vm04 joins the cluster.
> > > > > Some logs from vm04:
> > > > > 
> > > > > Jul 21 04:05:39 vm04 corosync[3061]:  [QUORUM] This node is within the primary component and will provide service.
> > > > > Jul 21 04:05:39 vm04 corosync[3061]:  [QUORUM] Members[4]: 1 2 3 4
> > > > > Jul 21 04:05:39 vm04 corosync[3061]:  [MAIN  ] Completed service synchronization, ready to provide service.
> > > > > Jul 21 04:05:39 vm04 corosync[3061]:  [KNET  ] rx: host: 3 link: 1 is up
> > > > > Jul 21 04:05:39 vm04 corosync[3061]:  [KNET  ] link: Resetting MTU for link 1 because host 3 joined
> > > > > Jul 21 04:05:39 vm04 corosync[3061]:  [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
> > > > > Jul 21 04:05:39 vm04 pacemaker-attrd[4240]: notice: Setting ifspeed-lnet-o2ib-o2ib[vm02]: (unset) -> 600
> > > > > Jul 21 04:05:40 vm04 corosync[3061]:  [KNET  ] pmtud: PMTUD link change for host: 3 link: 1 from 453 to 65413
> > > > > Jul 21 04:05:40 vm04 corosync[3061]:  [KNET  ] pmtud: Global data MTU changed to: 1397
> > > > > Jul 21 04:05:40 vm04 pacemaker-attrd[4240]: notice: Setting ping-lnet-o2ib-o2ib[vm02]: (unset) -> 4000
> > > > > Jul 21 04:05:40 vm04 pacemaker-attrd[4240]: notice: Setting ifspeed-lnet-o2ib-o2ib[vm01]: (unset) -> 600
> > > > > Jul 21 04:05:40 vm04 pacemaker-attrd[4240]: notice: Setting ping-lnet-o2ib-o2ib[vm01]: (unset) -> 4000
> > > > > Jul 21 04:05:47 vm04 pacemaker-controld[4257]: notice: State transition S_NOT_DC -> S_STOPPING
> > > > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: error: Cannot execute monitor of sfa-home-vd: No executor connection
> > > > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: warning: Cannot calculate digests for operation sfa-home-vd_monitor_0 because we have no connection to executor for vm04
> > > > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: error: Result of probe operation for sfa-home-vd on vm04: Error (No executor connection)
> > > > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: error: Cannot execute monitor of ifspeed-lnet-o2ib-o2ib: No executor connection
> > > > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: warning: Cannot calculate digests for operation ifspeed-lnet-o2ib-o2ib_monitor_0 because we have no connection to executor for vm04
> > > > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: error: Result of probe operation for ifspeed-lnet-o2ib-o2ib on vm04: Error (No executor connection)
> > > > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: error: Cannot execute monitor of ping-lnet-o2ib-o2ib: No executor connection
> > > > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: warning: Cannot calculate digests for operation ping-lnet-o2ib-o2ib_monitor_0 because we have no connection to executor for vm04
> > > > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]: error: Result of probe operation for ping-lnet-o2ib-o2ib on vm04: Error (No executor connection)
> > > > > Jul 21 04:05:49 vm04 pacemakerd[4127]: notice: pacemaker-controld[4257] is unresponsive to ipc after 1 tries
> > > > > Jul 21 04:05:52 vm04 pacemakerd[4127]: warning: Shutting cluster down because pacemaker-controld[4257] had fatal failure
> > > > > Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Shutting down Pacemaker
> > > > > Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Stopping pacemaker-schedulerd
> > > > > Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Stopping pacemaker-attrd
> > > > > Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Stopping pacemaker-execd
> > > > > Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Stopping pacemaker-fenced
> > > > > Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Stopping pacemaker-based
> > > > > Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Shutdown complete
> > > > > Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Shutting down and staying down after fatal error
> > > > > 
> > > > > Jul 21 04:05:44 vm04 root[10111]: openibd: Set node_desc for mlx5_0: vm04 HCA-1
> > > > > Jul 21 04:05:44 vm04 root[10113]: openibd: Set node_desc for mlx5_1: vm04 HCA-2
> > > > > Jul 21 04:05:47 vm04 pacemaker-controld[4257]:  error: Shutting down controller after unexpected shutdown request from vm01
> > > > > Jul 21 04:05:47 vm04 pacemaker-controld[4257]: Problem detected at handle_shutdown_ack:954 (controld_messages.c), please see /var/lib/pacemaker/blackbox/pacemaker-controld-4257.1 for additional details
> > > 
> > > > > Jul 21 04:05:47 vm04 pacemaker-controld[4257]:  notice: State transition S_NOT_DC -> S_STOPPING
> > > > > Jul 21 04:05:47 vm04 pacemaker-controld[4257]:  notice: Disconnected from the executor
> > > > > Jul 21 04:05:47 vm04 pacemaker-controld[4257]:  notice: Disconnected from Corosync
> > > > > Jul 21 04:05:47 vm04 pacemaker-controld[4257]:  notice: Disconnected from the CIB manager
> > > > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]:  notice: Disconnected from the CIB manager
> > > > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]:  crit: GLib: g_hash_table_lookup: assertion 'hash_table != NULL' failed
> > > > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]:  error: Cannot execute monitor of sfa-home-vd: No executor connection
> > > > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]:  crit: GLib: g_hash_table_lookup: assertion 'hash_table != NULL' failed
> > > > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]:  warning: Cannot calculate digests for operation sfa-home-vd_monitor_0 because we have no connection to executor for vm04
> > > > > Jul 21 04:05:48 vm04 pacemaker-controld[4257]:  warning: Resource update -107 failed: (rc=-107) Transport endpoint is not connected
> > > > The controller disconnects from the executor and deletes the
> > > > executor state table (lrm_state_table) in the middle of the
> > > > shutdown process:
> > > > https://github.com/ClusterLabs/pacemaker/blob/Pacemaker-2.1.5/daemons/controld/controld_control.c#L240
> > > > 
> > > > These crit messages are happening when we try to look up executor
> > > > state when lrm_state_table is NULL. That shouldn't happen. I guess
> > > > the lookups are happening while draining the mainloop:
> > > > https://github.com/ClusterLabs/pacemaker/blob/Pacemaker-2.1.5/daemons/controld/controld_control.c#L286-L294
> > > 
> > > The blackbox should help confirm that.
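The teardown-order bug Reid describes can be modeled in a few lines. This is a Python toy, not Pacemaker's C code: the dict stands in for lrm_state_table, the deque for queued mainloop events, and setting the table to `None` mimics freeing it before the drain.

```python
from collections import deque

class Daemon:
    """Toy reproduction of the teardown-order bug: freeing the state
    table before draining queued events means late events look things
    up in a table that no longer exists."""

    def __init__(self):
        self.state_table = {"vm04": {}}  # stand-in for lrm_state_table
        self.queue = deque()             # stand-in for pending mainloop events
        self.crit = []

    def lookup(self, node):
        if self.state_table is None:
            # Analogue of GLib's "assertion 'hash_table != NULL' failed"
            # critical message seen on vm04.
            self.crit.append("hash_table != NULL failed")
            return None
        return self.state_table.get(node)

    def shutdown(self, drain_before_free):
        if drain_before_free:            # fixed order: drain, then free
            while self.queue:
                self.queue.popleft()()
            self.state_table = None
        else:                            # buggy order: free, then drain
            self.state_table = None
            while self.queue:            # late events fire against None
                self.queue.popleft()()

buggy = Daemon()
buggy.queue.append(lambda: buggy.lookup("vm04"))
buggy.shutdown(drain_before_free=False)   # emits the crit analogue

fixed = Daemon()
fixed.queue.append(lambda: fixed.lookup("vm04"))
fixed.shutdown(drain_before_free=True)    # no crit
```

The same effect can be had by NULL-guarding each lookup instead of reordering, but draining before freeing removes the whole class of late lookups.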
> > > 
> > > > > the log from DC vm01:
> > > > > Jul 21 04:05:39 vm01 pacemaker-controld[4048]: notice: Transition 16 aborted: Peer Halt
> > > > > Jul 21 04:05:39 vm01 pacemaker-attrd[4017]: notice: Detected another attribute writer (vm04), starting new election
> > > > > Jul 21 04:05:39 vm01 pacemaker-attrd[4017]: notice: Setting #attrd-protocol[vm04]: (unset) -> 5
> > > > > Jul 21 04:05:40 vm01 pacemaker-controld[4048]: notice: Finalizing join-2 for 1 node (sync'ing CIB from vm02)
> > > > > Jul 21 04:05:40 vm01 pacemaker-controld[4048]: notice: Requested CIB version   <generation_tuple crm_feature_set="3.16.2" validate-with="pacemaker-3.9" epoch="567" num_updates="111" admin_epoch="0" cib-last-written="Fri Jul 21 03:48:43 2023" update-origin="vm01" update-client="cibadmin" update-user="root" have-quorum="0" dc-uuid="1"/>
> > > > > Jul 21 04:05:40 vm01 pacemaker-attrd[4017]: notice: Recorded local node as attribute writer (was unset)
> > > > > Jul 21 04:05:40 vm01 pacemaker-attrd[4017]: notice: Setting #feature-set[vm04]: (unset) -> 3.16.2
> > > > > Jul 21 04:05:41 vm01 pacemaker-controld[4048]: notice: Transition 16 aborted by deletion of lrm[@id='4']: Resource state removal
> > > > > Jul 21 04:05:47 vm01 pacemaker-schedulerd[4028]: notice: No fencing will be done until there are resources to manage
> > > > > Jul 21 04:05:47 vm01 pacemaker-schedulerd[4028]: notice:  * Shutdown vm04
> > > 
> > > This is where things start to go wrong, and it has nothing to do with
> > > any of the messages here. It means that the shutdown node attribute
> > > was not erased when vm04 shut down the last time before this. Going
> > > back, we see when that happened:
> > > 
> > > Jul 21 03:49:06 vm01 pacemaker-attrd[4017]: notice: Setting shutdown[vm04]: (unset) -> 1689911346
> > > 
> > > vm02, vm03, and vm04 all left the cluster at that time, leaving only
> > > vm01. At this point, vm01 should have deleted the transient attributes
> > > for all three nodes. Unfortunately, the logs for that would only be in
> > > pacemaker.log, which crm_report appears not to have grabbed, so I am
> > > not sure whether it tried.
> > > 
> > > > > Jul 21 04:05:47 vm01 pacemaker-schedulerd[4028]: notice: Calculated transition 17, saving inputs in /var/lib/pacemaker/pengine/pe-input-940.bz2
> > > 
> > > What's interesting in this transition is that we schedule probes on
> > > vm04 even though we're shutting it down. That's a bug, and leads to
> > > the "No executor connection" messages we see on vm04. I've added a
> > > task to our project manager to take care of that. That's all a side
> > > effect though and not causing any real problems.
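The probe-scheduling bug Ken mentions amounts to a missing guard. A minimal sketch of that guard, with illustrative names rather than the scheduler's real API:

```python
def schedule_probes(nodes, resources, shutting_down):
    """Schedule resource probes, skipping nodes the same transition is
    shutting down (hypothetical sketch, not Pacemaker's scheduler code).

    Probing a node whose controller is about to disconnect from its
    executor is what produced the "No executor connection" errors."""
    actions = []
    for node in nodes:
        if node in shutting_down:
            continue  # the guard: no probes on a node being shut down
        for rsc in resources:
            # Pacemaker names initial probes <rsc>_monitor_0
            actions.append((f"{rsc}_monitor_0", node))
    return actions

# vm04 is scheduled for shutdown, so only vm01 gets probed:
acts = schedule_probes(["vm01", "vm04"], ["sfa-home-vd"], {"vm04"})
```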
> > > 
> > > > > As far as I understand, vm04 was killed by the DC during the
> > > > > election of a new attribute writer?
> > > > 
> > > > Not sure yet, maybe someone else recognizes this.
> > > > 
> > > > I see the transition was aborted due to peer halt right after node
> > > > vm04 joined. A new election started due to detection of node vm04 as
> > > > attribute writer. Node vm04's resource state was removed, which is a
> > > > normal part of the join sequence; this caused another transition
> > > > abort message for the same transition number.
> > > > 
> > > > Jul 21 04:05:39 vm01 pacemaker-controld[4048]: notice: Node vm04 state is now member
> > > > ...
> > > > Jul 21 04:05:39 vm01 corosync[3134]:  [KNET  ] pmtud: Global data MTU changed to: 1397
> > > > Jul 21 04:05:39 vm01 pacemaker-controld[4048]: notice: Transition 16 aborted: Peer Halt
> > > > Jul 21 04:05:39 vm01 pacemaker-attrd[4017]: notice: Detected another attribute writer (vm04), starting new election
> > > > Jul 21 04:05:39 vm01 pacemaker-attrd[4017]: notice: Setting #attrd-protocol[vm04]: (unset) -> 5
> > > > ...
> > > > Jul 21 04:05:41 vm01 pacemaker-controld[4048]: notice: Transition 16 aborted by deletion of lrm[@id='4']: Resource state removal
> > > > 
> > > > Looking at pe-input-939 and pe-input-940, node vm04 was marked as
> > > > shut down:
> > > > 
> > > > Jul 21 04:05:38 vm01 pacemaker-schedulerd[4028]: notice: Calculated transition 16, saving inputs in /var/lib/pacemaker/pengine/pe-input-939.bz2
> > > > Jul 21 04:05:44 vm01 pacemaker-controld[4048]: notice: Transition 16 (Complete=24, Pending=0, Fired=0, Skipped=34, Incomplete=34, Source=/var/lib/pacemaker/pengine/pe-input-939.bz2): Stopped
> > > > Jul 21 04:05:47 vm01 pacemaker-schedulerd[4028]: notice:  * Shutdown vm04
> > > > Jul 21 04:05:47 vm01 pacemaker-schedulerd[4028]: notice: Calculated transition 17, saving inputs in /var/lib/pacemaker/pengine/pe-input-940.bz2
> > > > 
> > > > 939:
> > > >     <node_state id="4" uname="vm04" in_ccm="false" crmd="offline"
> > > >         crm-debug-origin="do_state_transition" join="down" expected="down">
> > > >       <transient_attributes id="4">
> > > >         <instance_attributes id="status-4">
> > > >           <nvpair id="status-4-.feature-set" name="#feature-set" value="3.16.2"/>
> > > >           <nvpair id="status-4-shutdown" name="shutdown" value="1689911346"/>
> > > >         </instance_attributes>
> > > >       </transient_attributes>
> > > > 
> > > > 940:
> > > >     <node_state id="4" uname="vm04" in_ccm="true" crmd="online"
> > > >         crm-debug-origin="do_state_transition" join="member" expected="member">
> > > >       <transient_attributes id="4">
> > > >         <instance_attributes id="status-4">
> > > >           <nvpair id="status-4-.feature-set" name="#feature-set" value="3.16.2"/>
> > > >           <nvpair id="status-4-shutdown" name="shutdown" value="1689911346"/>
> > > >         </instance_attributes>
> > > >       </transient_attributes>
> > > > 
> > > > I suppose that node vm04's state was not updated before the
> > > > transition was aborted. So when the new transition (940) ran, the
> > > > scheduler saw that node vm04 was expected to be in shutdown state,
> > > > and it triggered a shutdown.
> > > > 
> > > > This behavior might already be fixed upstream by the following
> > > > commit:
> > > > https://github.com/ClusterLabs/pacemaker/commit/5e3b3d14
> > > > 
> > > > That commit introduced a regression, however, and I'm working on
> > > > fixing it.
> > > 
> > > I suspect that's unrelated, because transient attributes are cleared
> > > when a node leaves rather than when it joins.
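The inference Reid draws from pe-input-939/940 can be condensed into one predicate. This is a sketch with made-up names, not the scheduler's real API: a node that has rejoined but still carries a stale transient "shutdown" attribute looks, to the scheduler, like a node requesting shutdown.

```python
def node_needs_shutdown(node_state, transient_attrs):
    """Sketch of the scheduler's view (illustrative names): a member
    node with a nonzero transient "shutdown" attribute is treated as
    requesting shutdown, which is what pe-input-940 shows for vm04."""
    return (node_state.get("join") == "member"
            and transient_attrs.get("shutdown", "0") != "0")

# vm04 in pe-input-940: rejoined, but shutdown=1689911346 was never erased
stale = node_needs_shutdown({"join": "member"}, {"shutdown": "1689911346"})
# Same node with transient attributes properly cleared when it left:
clean = node_needs_shutdown({"join": "member"}, {})
```

This is why clearing transient attributes on node departure, as Ken notes, is the decisive step: once the stale value survives a leave, every later rejoin re-triggers the shutdown.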
> > > 
> > > > 
> > > > > The issue is reproducible from time to time, and the Pacemaker
> > > > > version is "2.1.5-8.1.el8_8-a3f44794f94" from Rocky Linux 8.8.
> > > > > I attached crm_report with the blackbox. I have debug logs, but
> > > > > they are pretty heavy (~40MB bzip --best). Please tell me if you
> > > > > need them.
> > > > > 
> > > > > Thanks,
> > > > > Arthur
> > > > > 
> > > > > 
> > > -- 
> > > Ken Gaillot <kgaillot at redhat.com>
> > --
> > Arthur Novik
> > _______________________________________________
> > Manage your subscription:
> > https://lists.clusterlabs.org/mailman/listinfo/users
> > 
> > ClusterLabs home: https://www.clusterlabs.org/
-- 
Ken Gaillot <kgaillot at redhat.com>


