[ClusterLabs] Need a help with "(crm_glib_handler) crit: GLib: g_hash_table_lookup: assertion 'hash_table != NULL' failed"

Novik Arthur freishutz at gmail.com
Fri Jul 21 08:42:13 EDT 2023


Hello Andrew, Ken and the entire community!

I faced a problem and I would like to ask for help.

Preamble:
I have a dual-controller storage system (C0, C1) with two VMs per controller
(vm0[1,2] on C0, vm0[3,4] on C1).
I performed an online controller upgrade (updating the firmware on a physical
controller), and for that purpose we have a special procedure:

   1. Put all VMs on the controller to be updated into *standby mode*
   (vm0[3,4] in the logs).
   2. Once all resources have moved to the spare controller's VMs, turn on
   *maintenance-mode* (the DC machine is vm01).
   3. Shut down vm0[3,4] and perform the firmware update on C1 (OS + KVM +
   HCA/HBA + BMC drivers will be updated).
   4. Reboot C1.
   5. Start vm0[3,4].
   At this step I hit the problem.
   6. Repeat the same steps for C0 (turn off maintenance-mode, bring nodes
   3-4 back online, put nodes 1-2 into standby and maintenance-mode, etc.).
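For reference, steps 1-2 and their inverse look roughly like the following (a hedged sketch assuming the pcs CLI shipped with RHEL/Rocky 8; the `run` dry-run wrapper and the exact node names are illustrative, not our actual script):

```shell
#!/bin/sh
# Sketch of the pre-upgrade procedure, assuming pcs on Rocky 8 / pacemaker 2.1.
# run() only echoes the command; drop the wrapper to execute for real.
run() { echo "+ $*"; }

# Step 1: put the to-be-upgraded controller's VMs into standby
run pcs node standby vm03 vm04

# Step 2: once all resources have migrated, freeze the cluster
run pcs property set maintenance-mode=true

# Step 6 (after the upgrade), reversed before doing the other controller:
run pcs property set maintenance-mode=false
run pcs node unstandby vm03 vm04
```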

*Here is what I observed during step 5.*
Machine vm03 started without problems, but vm04 hit a critical error and its
HA stack died. If I manually start pacemaker again, it starts without
problems and vm04 joins the cluster.

Some logs from *vm04*:

Jul 21 04:05:39 vm04 corosync[3061]:  [QUORUM] This node is within the primary component and will provide service.
Jul 21 04:05:39 vm04 corosync[3061]:  [QUORUM] Members[4]: 1 2 3 4
Jul 21 04:05:39 vm04 corosync[3061]:  [MAIN  ] Completed service synchronization, ready to provide service.
Jul 21 04:05:39 vm04 corosync[3061]:  [KNET  ] rx: host: 3 link: 1 is up
Jul 21 04:05:39 vm04 corosync[3061]:  [KNET  ] link: Resetting MTU for link 1 because host 3 joined
Jul 21 04:05:39 vm04 corosync[3061]:  [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Jul 21 04:05:39 vm04 pacemaker-attrd[4240]: notice: Setting ifspeed-lnet-o2ib-o2ib[vm02]: (unset) -> 600
Jul 21 04:05:40 vm04 corosync[3061]:  [KNET  ] pmtud: PMTUD link change for host: 3 link: 1 from 453 to 65413
Jul 21 04:05:40 vm04 corosync[3061]:  [KNET  ] pmtud: Global data MTU changed to: 1397
Jul 21 04:05:40 vm04 pacemaker-attrd[4240]: notice: Setting ping-lnet-o2ib-o2ib[vm02]: (unset) -> 4000
Jul 21 04:05:40 vm04 pacemaker-attrd[4240]: notice: Setting ifspeed-lnet-o2ib-o2ib[vm01]: (unset) -> 600
Jul 21 04:05:40 vm04 pacemaker-attrd[4240]: notice: Setting ping-lnet-o2ib-o2ib[vm01]: (unset) -> 4000
Jul 21 04:05:47 vm04 pacemaker-controld[4257]: notice: State transition S_NOT_DC -> S_STOPPING
Jul 21 04:05:48 vm04 pacemaker-controld[4257]: error: Cannot execute monitor of sfa-home-vd: No executor connection
Jul 21 04:05:48 vm04 pacemaker-controld[4257]: warning: Cannot calculate digests for operation sfa-home-vd_monitor_0 because we have no connection to executor for vm04
Jul 21 04:05:48 vm04 pacemaker-controld[4257]: error: Result of probe operation for sfa-home-vd on vm04: Error (No executor connection)
Jul 21 04:05:48 vm04 pacemaker-controld[4257]: error: Cannot execute monitor of ifspeed-lnet-o2ib-o2ib: No executor connection
Jul 21 04:05:48 vm04 pacemaker-controld[4257]: warning: Cannot calculate digests for operation ifspeed-lnet-o2ib-o2ib_monitor_0 because we have no connection to executor for vm04
Jul 21 04:05:48 vm04 pacemaker-controld[4257]: error: Result of probe operation for ifspeed-lnet-o2ib-o2ib on vm04: Error (No executor connection)
Jul 21 04:05:48 vm04 pacemaker-controld[4257]: error: Cannot execute monitor of ping-lnet-o2ib-o2ib: No executor connection
Jul 21 04:05:48 vm04 pacemaker-controld[4257]: warning: Cannot calculate digests for operation ping-lnet-o2ib-o2ib_monitor_0 because we have no connection to executor for vm04
Jul 21 04:05:48 vm04 pacemaker-controld[4257]: error: Result of probe operation for ping-lnet-o2ib-o2ib on vm04: Error (No executor connection)
Jul 21 04:05:49 vm04 pacemakerd[4127]: notice: pacemaker-controld[4257] is unresponsive to ipc after 1 tries
Jul 21 04:05:52 vm04 pacemakerd[4127]: warning: Shutting cluster down because pacemaker-controld[4257] had fatal failure
Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Shutting down Pacemaker
Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Stopping pacemaker-schedulerd
Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Stopping pacemaker-attrd
Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Stopping pacemaker-execd
Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Stopping pacemaker-fenced
Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Stopping pacemaker-based
Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Shutdown complete
Jul 21 04:05:52 vm04 pacemakerd[4127]: notice: Shutting down and staying down after fatal error

Jul 21 04:05:44 vm04 root[10111]: openibd: Set node_desc for mlx5_0: vm04 HCA-1
Jul 21 04:05:44 vm04 root[10113]: openibd: Set node_desc for mlx5_1: vm04 HCA-2
Jul 21 04:05:47 vm04 pacemaker-controld[4257]:  error: Shutting down controller after unexpected shutdown request from vm01
Jul 21 04:05:47 vm04 pacemaker-controld[4257]: Problem detected at handle_shutdown_ack:954 (controld_messages.c), please see /var/lib/pacemaker/blackbox/pacemaker-controld-4257.1 for additional details
Jul 21 04:05:47 vm04 pacemaker-controld[4257]:  notice: State transition S_NOT_DC -> S_STOPPING
Jul 21 04:05:47 vm04 pacemaker-controld[4257]:  notice: Disconnected from the executor
Jul 21 04:05:47 vm04 pacemaker-controld[4257]:  notice: Disconnected from Corosync
Jul 21 04:05:47 vm04 pacemaker-controld[4257]:  notice: Disconnected from the CIB manager
Jul 21 04:05:48 vm04 pacemaker-controld[4257]:  notice: Disconnected from the CIB manager
Jul 21 04:05:48 vm04 pacemaker-controld[4257]:  crit: GLib: g_hash_table_lookup: assertion 'hash_table != NULL' failed
Jul 21 04:05:48 vm04 pacemaker-controld[4257]:  error: Cannot execute monitor of sfa-home-vd: No executor connection
Jul 21 04:05:48 vm04 pacemaker-controld[4257]:  crit: GLib: g_hash_table_lookup: assertion 'hash_table != NULL' failed
Jul 21 04:05:48 vm04 pacemaker-controld[4257]:  warning: Cannot calculate digests for operation sfa-home-vd_monitor_0 because we have no connection to executor for vm04
Jul 21 04:05:48 vm04 pacemaker-controld[4257]:  warning: Resource update -107 failed: (rc=-107) Transport endpoint is not connected


The log from the DC, vm01:
Jul 21 04:05:39 vm01 pacemaker-controld[4048]: notice: Transition 16 aborted: Peer Halt
Jul 21 04:05:39 vm01 pacemaker-attrd[4017]: notice: Detected another attribute writer (vm04), starting new election
Jul 21 04:05:39 vm01 pacemaker-attrd[4017]: notice: Setting #attrd-protocol[vm04]: (unset) -> 5
Jul 21 04:05:40 vm01 pacemaker-controld[4048]: notice: Finalizing join-2 for 1 node (sync'ing CIB from vm02)
Jul 21 04:05:40 vm01 pacemaker-controld[4048]: notice: Requested CIB version   <generation_tuple crm_feature_set="3.16.2" validate-with="pacemaker-3.9" epoch="567" num_updates="111" admin_epoch="0" cib-last-written="Fri Jul 21 03:48:43 2023" update-origin="vm01" update-client="cibadmin" update-user="root" have-quorum="0" dc-uuid="1"/>
Jul 21 04:05:40 vm01 pacemaker-attrd[4017]: notice: Recorded local node as attribute writer (was unset)
Jul 21 04:05:40 vm01 pacemaker-attrd[4017]: notice: Setting #feature-set[vm04]: (unset) -> 3.16.2
Jul 21 04:05:41 vm01 pacemaker-controld[4048]: notice: Transition 16 aborted by deletion of lrm[@id='4']: Resource state removal
Jul 21 04:05:47 vm01 pacemaker-schedulerd[4028]: notice: No fencing will be done until there are resources to manage
Jul 21 04:05:47 vm01 pacemaker-schedulerd[4028]: notice:  * Shutdown vm04
Jul 21 04:05:47 vm01 pacemaker-schedulerd[4028]: notice: Calculated transition 17, saving inputs in /var/lib/pacemaker/pengine/pe-input-940.bz2

As far as I understand, vm04 was shut down by the DC during the election of
a new attribute writer?

The issue is reproducible from time to time. The pacemaker version is
"2.1.5-8.1.el8_8-a3f44794f94" from Rocky Linux 8.8.

I attached a crm_report with the blackbox. I also have debug logs, but they
are pretty heavy (~40 MB with bzip --best); please tell me if you need them.

Thanks,

Arthur
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pcmk-Fri-21-Jul-2023.tar.bz2
Type: application/octet-stream
Size: 2699603 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/users/attachments/20230721/07606398/attachment-0001.obj>
