[ClusterLabs] Cluster does not start resources

Reid Wahl nwahl at redhat.com
Tue Aug 23 22:04:58 EDT 2022


On Tuesday, August 23, 2022, Lentes, Bernd <bernd.lentes at helmholtz-muenchen.de> wrote:
> Hi,
>
> Currently I can't start resources on our 2-node cluster.
> The cluster itself seems to be OK:
>
> Stack: corosync
> Current DC: ha-idg-1 (version 1.1.24+20210811.f5abda0ee-3.21.9-1.1.24+20210811.f5abda0ee) - partition with quorum
> Last updated: Wed Aug 24 02:56:46 2022
> Last change: Wed Aug 24 02:56:41 2022 by hacluster via crmd on ha-idg-1
>
> 2 nodes configured
> 40 resource instances configured (26 DISABLED)
>
> Node ha-idg-1: online
> Node ha-idg-2: online
>
> Inactive resources:
>
> fence_ilo_ha-idg-2      (stonith:fence_ilo2):   Stopped
> fence_ilo_ha-idg-1      (stonith:fence_ilo4):   Stopped
>  Clone Set: cl_share [gr_share]
>      Stopped: [ ha-idg-1 ha-idg-2 ]
>  Clone Set: ClusterMon-clone [ClusterMon-SMTP]
>      Stopped (disabled): [ ha-idg-1 ha-idg-2 ]
> vm-mausdb       (ocf::lentes:VirtualDomain):    Stopped (disabled)
> vm-sim  (ocf::lentes:VirtualDomain):    Stopped (disabled)
> vm-geneious     (ocf::lentes:VirtualDomain):    Stopped (disabled)
> vm-idcc-devel   (ocf::lentes:VirtualDomain):    Stopped (disabled)
> vm-genetrap     (ocf::lentes:VirtualDomain):    Stopped (disabled)
> vm-mouseidgenes (ocf::lentes:VirtualDomain):    Stopped (disabled)
> vm-greensql     (ocf::lentes:VirtualDomain):    Stopped (disabled)
> vm-severin      (ocf::lentes:VirtualDomain):    Stopped (disabled)
> ping_19216810010        (ocf::pacemaker:ping):  Stopped (disabled)
> ping_19216810020        (ocf::pacemaker:ping):  Stopped (disabled)
> vm_crispor      (ocf::heartbeat:VirtualDomain): Stopped (unmanaged)
> vm-dietrich     (ocf::lentes:VirtualDomain):    Stopped (disabled)
> vm-pathway      (ocf::lentes:VirtualDomain):    Stopped (disabled)
> vm-crispor-server       (ocf::lentes:VirtualDomain):    Stopped (disabled)
> vm-geneious-license     (ocf::lentes:VirtualDomain):    Stopped (disabled)
> vm-nc-mcd       (ocf::lentes:VirtualDomain):    Stopped (disabled, unmanaged)
> vm-amok (ocf::lentes:VirtualDomain):    Stopped (disabled)
> vm-geneious-license-mcd (ocf::lentes:VirtualDomain):    Stopped (disabled)
> vm-documents-oo (ocf::lentes:VirtualDomain):    Stopped (disabled)
> fs_test_ocfs2   (ocf::lentes:Filesystem.new):   Stopped
> vm-ssh  (ocf::lentes:VirtualDomain):    Stopped (disabled)
> vm_snipanalysis (ocf::lentes:VirtualDomain):    Stopped (disabled, unmanaged)
> vm-seneca       (ocf::lentes:VirtualDomain):    Stopped (disabled)
> vm-photoshop    (ocf::lentes:VirtualDomain):    Stopped (disabled)
> vm-check-mk     (ocf::lentes:VirtualDomain):    Stopped (disabled)
> vm-encore       (ocf::lentes:VirtualDomain):    Stopped (disabled)
>
> Migration Summary:
> * Node ha-idg-1:
> * Node ha-idg-2:
>
> Fencing History:
> * Off of ha-idg-2 successful: delegate=ha-idg-1, client=crmd.27356, origin=ha-idg-1,
>     last-successful='Wed Aug 24 01:53:49 2022'
>
> Trying to start e.g. cl_share, which is a prerequisite for the virtual domains ... nothing happens.
> I did a "crm resource cleanup" (although crm_mon shows no error), hoping this would help ... it didn't.
> My command history:
>  1471  2022-08-24 03:11:27 crm resource cleanup
>  1472  2022-08-24 03:11:52 crm resource cleanup cl_share
>  1473  2022-08-24 03:12:45 crm resource start cl_share
> (to correlate with the log)
>
> I found some weird entries in the log after the "crm resource cleanup":
>
> Aug 24 03:11:28 [27351] ha-idg-1        cib:  warning: do_local_notify: A-Sync reply to crmd failed: No message of desired type
> Aug 24 03:11:33 [27351] ha-idg-1        cib:     info: cib_process_ping:       Reporting our current digest to ha-idg-1: ed5bb7d32532ebf1ce3c45d0067c55b3 for 7.28627.70 (0x15073e0 0)
> Aug 24 03:11:52 [27353] ha-idg-1       lrmd:     info: process_lrmd_get_rsc_info:       Resource 'dlm:0' not found (0 active resources)
> Aug 24 03:11:52 [27356] ha-idg-1       crmd:   notice: do_lrm_invoke:   Not registering resource 'dlm:0' for a delete event | get-rc=-19 (No such device) transition-key=(null)
> What does "Resource not found" mean?
>
>  ...
> Aug 24 03:11:57 [27351] ha-idg-1        cib:     info: cib_process_ping:       Reporting our current digest to ha-idg-1: 0b3e9ad9ad8103ce2da3b6b8d41e6716 for 7.28628.0 (0x1352bf0 0)
> Aug 24 03:11:58 [27356] ha-idg-1       crmd:    error: do_pe_invoke_callback:   Could not retrieve the Cluster Information Base: Timer expired | rc=-62 call=222
> Aug 24 03:11:58 [27356] ha-idg-1       crmd:     info: register_fsa_error_adv:  Resetting the current action list
> Aug 24 03:11:58 [27356] ha-idg-1       crmd:    error: do_log:  Input I_ERROR received in state S_POLICY_ENGINE from do_pe_invoke_callback
> Aug 24 03:11:58 [27356] ha-idg-1       crmd:  warning: do_state_transition:     State transition S_POLICY_ENGINE -> S_RECOVERY | input=I_ERROR cause=C_FSA_INTERNAL origin=do_pe_invoke_callback
> Aug 24 03:11:58 [27356] ha-idg-1       crmd:  warning: do_recover:      Fast-tracking shutdown in response to errors
> Aug 24 03:11:58 [27356] ha-idg-1       crmd:  warning: do_election_vote:       Not voting in election, we're in state S_RECOVERY
> Aug 24 03:11:58 [27356] ha-idg-1       crmd:     info: do_dc_release:   DC role released
> Aug 24 03:11:58 [27356] ha-idg-1       crmd:     info: pe_ipc_destroy:  Connection to the Policy Engine released
> Aug 24 03:11:58 [27356] ha-idg-1       crmd:     info: do_te_control:   Transitioner is now inactive
> Aug 24 03:11:58 [27356] ha-idg-1       crmd:    error: do_log:  Input I_TERMINATE received in state S_RECOVERY from do_recover
> Aug 24 03:11:58 [27356] ha-idg-1       crmd:     info: do_state_transition:     State transition S_RECOVERY -> S_TERMINATE | input=I_TERMINATE cause=C_FSA_INTERNAL origin=do_recover
> Aug 24 03:11:58 [27356] ha-idg-1       crmd:     info: do_shutdown:     Disconnecting STONITH...
> Aug 24 03:11:58 [27356] ha-idg-1       crmd:     info: tengine_stonith_connection_destroy:      Fencing daemon disconnected
> Aug 24 03:11:58 [27356] ha-idg-1       crmd:     info: do_lrm_control:  Disconnecting from the LRM
> Aug 24 03:11:58 [27356] ha-idg-1       crmd:     info: lrmd_api_disconnect:     Disconnecting IPC LRM connection to local
> Aug 24 03:11:58 [27356] ha-idg-1       crmd:     info: lrmd_ipc_connection_destroy:     IPC connection destroyed
> Aug 24 03:11:58 [27356] ha-idg-1       crmd:     info: lrm_connection_destroy:  LRM Connection disconnected
> Aug 24 03:11:58 [27356] ha-idg-1       crmd:     info: lrmd_api_disconnect:     Disconnecting IPC LRM connection to local
> Aug 24 03:11:58 [27356] ha-idg-1       crmd:   notice: do_lrm_control:  Disconnected from the LRM
> Aug 24 03:11:58 [27356] ha-idg-1       crmd:     info: crm_cluster_disconnect:  Disconnecting from cluster infrastructure: corosync
> Aug 24 03:11:58 [27356] ha-idg-1       crmd:     info: terminate_cs_connection: Disconnecting from Corosync
> Aug 24 03:11:58 [27356] ha-idg-1       crmd:   notice: terminate_cs_connection: Disconnected from Corosync
> Aug 24 03:11:58 [27356] ha-idg-1       crmd:     info: crm_cluster_disconnect:  Disconnected from corosync
> Aug 24 03:11:58 [27356] ha-idg-1       crmd:     info: do_ha_control:   Disconnected from the cluster
> Aug 24 03:11:58 [27356] ha-idg-1       crmd:     info: do_cib_control:  Disconnecting CIB
> Aug 24 03:11:58 [27351] ha-idg-1        cib:     info: cib_process_readwrite:   We are now in R/O mode
> Aug 24 03:11:58 [27356] ha-idg-1       crmd:     info: crmd_cib_connection_destroy:     Connection to the CIB terminated...
> Aug 24 03:11:58 [27356] ha-idg-1       crmd:   notice: do_cib_control:  Disconnected from the CIB
> Aug 24 03:11:58 [27356] ha-idg-1       crmd:     info: qb_ipcs_us_withdraw:     withdrawing server sockets
> Aug 24 03:11:58 [27356] ha-idg-1       crmd:     info: do_exit: Performing A_EXIT_0 - gracefully exiting the CRMd
> Aug 24 03:11:58 [27356] ha-idg-1       crmd:     info: do_exit: [crmd] stopped (0)
> Aug 24 03:11:58 [27356] ha-idg-1       crmd:     info: crmd_exit:       Dropping I_PENDING: [ state=S_TERMINATE cause=C_FSA_INTERNAL origin=do_election_vote ]
> Aug 24 03:11:58 [27356] ha-idg-1       crmd:     info: crmd_exit:       Dropping I_RELEASE_SUCCESS: [ state=S_TERMINATE cause=C_FSA_INTERNAL origin=do_dc_release ]
> Aug 24 03:11:58 [27356] ha-idg-1       crmd:     info: crmd_exit:       Dropping I_TERMINATE: [ state=S_TERMINATE cause=C_FSA_INTERNAL origin=do_stop ]
> Aug 24 03:11:58 [27356] ha-idg-1       crmd:     info: crmd_quorum_destroy:     connection closed
> Aug 24 03:11:58 [27356] ha-idg-1       crmd:     info: crmd_cs_destroy: connection closed
> Aug 24 03:11:58 [27356] ha-idg-1       crmd:     info: crmd_init:       27356 stopped: OK (0)
> Aug 24 03:11:58 [27356] ha-idg-1       crmd:    error: crmd_fast_exit:  Could not recover from internal error
> Aug 24 03:11:58 [27356] ha-idg-1       crmd:     info: crm_xml_cleanup: Cleaning up memory from libxml2
> Aug 24 03:11:58 [27350] ha-idg-1 pacemakerd:    error: pcmk_child_exit: crmd[27356] exited with status 201 (Generic Pacemaker error)
> Aug 24 03:11:58 [27350] ha-idg-1 pacemakerd:     info: pcmk__ipc_is_authentic_process_active:   Could not connect to crmd IPC: Connection refused
> Aug 24 03:11:58 [27350] ha-idg-1 pacemakerd:   notice: pcmk_process_exit:       Respawning failed child process: crmd
> Aug 24 03:11:58 [27350] ha-idg-1 pacemakerd:     info: start_child:     Using uid=90 and group=90 for process crmd
> Aug 24 03:11:58 [27350] ha-idg-1 pacemakerd:     info: start_child:     Forked child 18222 for process crmd
> Aug 24 03:11:58 [27350] ha-idg-1 pacemakerd:     info: mcp_cpg_deliver: Ignoring process list sent by peer for local node
> Aug 24 03:11:58 [27350] ha-idg-1 pacemakerd:     info: mcp_cpg_deliver: Ignoring process list sent by peer for local node
> Aug 24 03:11:58 [18222] ha-idg-1       crmd:     info: crm_log_init:    Changed active directory to /var/lib/pacemaker/cores
> Aug 24 03:11:58 [18222] ha-idg-1       crmd:     info: main:    CRM Git Version: 1.1.24+20210811.f5abda0ee-3.21.9 (1.1.24+20210811.f5abda0ee)
> Aug 24 03:11:58 [18222] ha-idg-1       crmd:     info: get_cluster_type:       Verifying cluster type: 'corosync'
>
> I appreciate any help.
>

Can you share your CIB? I'm not sure offhand what all of this means (the
resource-not-found message, the IPC error, the crmd failure and respawn),
and pacemaker v1 logs aren't the easiest to interpret. But perhaps
something in the CIB will show itself as the issue.
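
If it helps, a plain-text dump of the live CIB is easiest to attach. A minimal sketch, assuming the standard pacemaker/crmsh command-line tools; the output paths are just examples:

    # Raw XML of the live CIB
    cibadmin --query > /tmp/cib.xml

    # Human-readable view of the configuration via the crm shell
    crm configure show > /tmp/cib-config.txt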
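
Since crm_mon reports no failures, it may also be worth asking the scheduler directly why cl_share isn't being started. A rough sketch with stock pacemaker tools (output format varies by version):

    # Run the scheduler against the live CIB and show allocation scores
    crm_simulate --live-check --show-scores

    # Check the live configuration for errors and warnings
    crm_verify --live-check -V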

> Thanks.
>
> Bernd
> --
> Bernd Lentes
> System Administrator
> Institute for Metabolism and Cell Death (MCD)
> Building 25 - office 122
> HelmholtzZentrum München
> bernd.lentes at helmholtz-muenchen.de
> phone: +49 89 3187 1241
>        +49 89 3187 49123
> fax:   +49 89 3187 2294
> http://www.helmholtz-muenchen.de/mcd
>
> Public key:
> 30 82 01 0a 02 82 01 01 00 b3 72 3e ce 2c 0a 6f 58 49 2c 92 23 c7 b9 c1
ff 6c 3a 53 be f7 9e e9 24 b7 49 fa 3c e8 de 28 85 2c d3 ed f7 70 03 3f 4d
82 fc cc 96 4f 18 27 1f df 25 b3 13 00 db 4b 1d ec 7f 1b cf f9 cd e8 5b 1f
11 b3 a7 48 f8 c8 37 ed 41 ff 18 9f d7 83 51 a9 bd 86 c2 32 b3 d6 2d 77 ff
32 83 92 67 9e ae ae 9c 99 ce 42 27 6f bf d8 c2 a1 54 fd 2b 6b 12 65 0e 8a
79 56 be 53 89 70 51 02 6a eb 76 b8 92 25 2d 88 aa 57 08 42 ef 57 fb fe 00
71 8e 90 ef b2 e3 22 f3 34 4f 7b f1 c4 b1 7c 2f 1d 6f bd c8 a6 a1 1f 25 f3
e4 4b 6a 23 d3 d2 fa 27 ae 97 80 a3 f0 5a c4 50 4a 45 e3 45 4d 82 9f 8b 87
90 d0 f9 92 2d a7 d2 67 53 e6 ae 1e 72 3e e9 e0 c9 d3 1c 23 e0 75 78 4a 45
60 94 f8 e3 03 0b 09 85 08 d0 6c f3 ff ce fa 50 25 d9 da 81 7b 2a dc 9e 28
8b 83 04 b4 0a 9f 37 b8 ac 58 f1 38 43 0e 72 af 02 03 01 00 01
>

-- 
Regards,

Reid Wahl (He/Him)
Senior Software Engineer, Red Hat
RHEL High Availability - Pacemaker