[ClusterLabs] 2-Node cluster - both nodes unclean - can't start cluster

Reid Wahl nwahl at redhat.com
Fri Mar 10 16:29:39 EST 2023


On Fri, Mar 10, 2023 at 10:49 AM Lentes, Bernd
<bernd.lentes at helmholtz-muenchen.de> wrote:
>
> Hi,
>
> I can't get my cluster running. I had problems with an OCFS2 volume, and both
> nodes were fenced.
> When I now do a "systemctl start pacemaker.service", crm_mon shows both nodes
> as UNCLEAN for a few seconds, then pacemaker stops.
> I try to confirm the fencing with "stonith_admin -C", but it doesn't work.
> Maybe the time is too short; pacemaker is only running for a few seconds.
>
> Here is the log:
>
> Mar 10 19:36:24 [31037] ha-idg-1 corosync notice  [MAIN  ] Corosync Cluster
> Engine ('2.3.6'): started and ready to provide service.
> Mar 10 19:36:24 [31037] ha-idg-1 corosync info    [MAIN  ] Corosync built-in
> features: debug testagents augeas systemd pie relro bindnow
> Mar 10 19:36:24 [31037] ha-idg-1 corosync notice  [TOTEM ] Initializing
> transport (UDP/IP Multicast).
> Mar 10 19:36:24 [31037] ha-idg-1 corosync notice  [TOTEM ] Initializing
> transmit/receive security (NSS) crypto: aes256 hash: sha1
> Mar 10 19:36:25 [31037] ha-idg-1 corosync notice  [TOTEM ] The network
> interface [192.168.100.10] is now up.
> Mar 10 19:36:25 [31037] ha-idg-1 corosync notice  [SERV  ] Service engine
> loaded: corosync configuration map access [0]
> Mar 10 19:36:25 [31037] ha-idg-1 corosync info    [QB    ] server name: cmap
> Mar 10 19:36:25 [31037] ha-idg-1 corosync notice  [SERV  ] Service engine
> loaded: corosync configuration service [1]
> Mar 10 19:36:25 [31037] ha-idg-1 corosync info    [QB    ] server name: cfg
> Mar 10 19:36:25 [31037] ha-idg-1 corosync notice  [SERV  ] Service engine
> loaded: corosync cluster closed process group service v1.01 [2]
> Mar 10 19:36:25 [31037] ha-idg-1 corosync info    [QB    ] server name: cpg
> Mar 10 19:36:25 [31037] ha-idg-1 corosync notice  [SERV  ] Service engine
> loaded: corosync profile loading service [4]
> Mar 10 19:36:25 [31037] ha-idg-1 corosync notice  [QUORUM] Using quorum
> provider corosync_votequorum
> Mar 10 19:36:25 [31037] ha-idg-1 corosync notice  [QUORUM] This node is
> within the primary component and will provide service.
> Mar 10 19:36:25 [31037] ha-idg-1 corosync notice  [QUORUM] Members[0]:
> Mar 10 19:36:25 [31037] ha-idg-1 corosync notice  [SERV  ] Service engine
> loaded: corosync vote quorum service v1.0 [5]
> Mar 10 19:36:25 [31037] ha-idg-1 corosync info    [QB    ] server name:
> votequorum
> Mar 10 19:36:25 [31037] ha-idg-1 corosync notice  [SERV  ] Service engine
> loaded: corosync cluster quorum service v0.1 [3]
> Mar 10 19:36:25 [31037] ha-idg-1 corosync info    [QB    ] server name:
> quorum
> Mar 10 19:36:25 [31037] ha-idg-1 corosync notice  [TOTEM ] A new membership
> (192.168.100.10:2340) was formed. Members joined: 1084777482

Is this really the corosync node ID of one of your nodes? If not,
what's your corosync version? Is the number the same every time the
issue happens? The number is so large and seemingly random that I
wonder if there's some kind of memory corruption.
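
If it helps, here is a rough way to check what corosync thinks the node IDs
are (sketched for corosync 2.x; adjust to your setup):

    # List the current membership and node IDs as corosync sees them
    corosync-quorumtool -l

    # Or dump the runtime object database and look at the member entries
    corosync-cmapctl | grep members

You can also pin the IDs explicitly in /etc/corosync/corosync.conf so they
stay small and predictable. A sketch only, not your actual config:

    nodelist {
        node {
            ring0_addr: 192.168.100.10
            nodeid: 1
        }
        node {
            ring0_addr: 192.168.100.20   # hypothetical address for the peer
            nodeid: 2
        }
    }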

> Mar 10 19:36:25 [31037] ha-idg-1 corosync notice  [QUORUM] Members[1]:
> 1084777482
> Mar 10 19:36:25 [31037] ha-idg-1 corosync notice  [MAIN  ] Completed service
> synchronization, ready to provide service.
> Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd:   notice: main:    Starting
> Pacemaker 1.1.24+20210811.f5abda0ee-3.27.1 | build=1.1.24+20210811.f5abda0ee
> features: generated-manpages agent-manp
> ages ncurses libqb-logging libqb-ipc lha-fencing systemd nagios
> corosync-native atomic-attrd snmp libesmtp acls cibsecrets
> Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd:     info: main:    Maximum core
> file size is: 18446744073709551615
> Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd:     info: qb_ipcs_us_publish:
> server name: pacemakerd
> Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd:     info:
> pcmk__ipc_is_authentic_process_active:   Could not connect to lrmd IPC:
> Connection refused
> Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd:     info:
> pcmk__ipc_is_authentic_process_active:   Could not connect to cib_ro IPC:
> Connection refused
> Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd:     info:
> pcmk__ipc_is_authentic_process_active:   Could not connect to crmd IPC:
> Connection refused
> Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd:     info:
> pcmk__ipc_is_authentic_process_active:   Could not connect to attrd IPC:
> Connection refused
> Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd:     info:
> pcmk__ipc_is_authentic_process_active:   Could not connect to pengine IPC:
> Connection refused
> Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd:     info:
> pcmk__ipc_is_authentic_process_active:   Could not connect to stonith-ng
> IPC: Connection refused
> Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd:     info: corosync_node_name:
> Unable to get node name for nodeid 1084777482
> Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd:   notice: get_node_name:
> Could not obtain a node name for corosync nodeid 1084777482
> Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd:     info: crm_get_peer:
> Created entry 3c2499de-58a8-44f7-bf1e-03ff1fbec774/0x1456550 for node
> (null)/1084777482 (1 total)
> Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd:     info: crm_get_peer:    Node
> 1084777482 has uuid 1084777482
> Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd:     info: crm_update_peer_proc:
> cluster_connect_cpg: Node (null)[1084777482] - corosync-cpg is now online
> Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd:   notice:
> cluster_connect_quorum:  Quorum acquired
> Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd:     info: corosync_node_name:
> Unable to get node name for nodeid 1084777482
> Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd:   notice: get_node_name:
> Defaulting to uname -n for the local corosync node name
> Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd:     info: crm_get_peer:    Node
> 1084777482 is now known as ha-idg-1
> Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd:     info: start_child:
> Using uid=90 and group=90 for process cib
> Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd:     info: start_child:
> Forked child 31045 for process cib
> Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd:     info: start_child:
> Forked child 31046 for process stonith-ng
> Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd:     info: start_child:
> Forked child 31047 for process lrmd
> Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd:     info: start_child:
> Using uid=90 and group=90 for process attrd
> Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd:     info: start_child:
> Forked child 31048 for process attrd
> Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd:     info: start_child:
> Using uid=90 and group=90 for process pengine
> Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd:     info: start_child:
> Forked child 31049 for process pengine
> Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd:     info: start_child:
> Using uid=90 and group=90 for process crmd
> Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd:     info: start_child:
> Forked child 31050 for process crmd
> Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd:     info: main:    Starting
> mainloop
> Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd:     info:
> pcmk_quorum_notification:        Quorum retained | membership=2340 members=1
> Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd:   notice:
> crm_update_peer_state_iter:      Node ha-idg-1 state is now member |
> nodeid=1084777482 previous=unknown source=pcmk_quorum_notification
> Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd:     info: pcmk_cpg_membership:
> Group pacemakerd event 0: node 1084777482 pid 31044 joined via cpg_join
> Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd:     info: pcmk_cpg_membership:
> Group pacemakerd event 0: ha-idg-1 (node 1084777482 pid 31044) is member
> Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd:     info: mcp_cpg_deliver:
> Ignoring process list sent by peer for local node
> Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd:     info: mcp_cpg_deliver:
> Ignoring process list sent by peer for local node
> Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd:     info: mcp_cpg_deliver:
> Ignoring process list sent by peer for local node
> Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd:     info: mcp_cpg_deliver:
> Ignoring process list sent by peer for local node
> Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd:     info: mcp_cpg_deliver:
> Ignoring process list sent by peer for local node
> Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd:     info: mcp_cpg_deliver:
> Ignoring process list sent by peer for local node
> Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd:     info: mcp_cpg_deliver:
> Ignoring process list sent by peer for local node
> Mar 10 19:36:25 [31045] ha-idg-1        cib:     info: crm_log_init:
> Changed active directory to /var/lib/pacemaker/cores
> Mar 10 19:36:25 [31049] ha-idg-1    pengine:     info: crm_log_init:
> Changed active directory to /var/lib/pacemaker/cores
> Mar 10 19:36:25 [31049] ha-idg-1    pengine:     info: qb_ipcs_us_publish:
> server name: pengine
> Mar 10 19:36:25 [31045] ha-idg-1        cib:     info: get_cluster_type:
> Verifying cluster type: 'corosync'
> Mar 10 19:36:25 [31048] ha-idg-1      attrd:     info: crm_log_init:
> Changed active directory to /var/lib/pacemaker/cores
> Mar 10 19:36:25 [31045] ha-idg-1        cib:     info: get_cluster_type:
> Assuming an active 'corosync' cluster
> Mar 10 19:36:25 [31049] ha-idg-1    pengine:     info: main:    Starting
> pengine
> Mar 10 19:36:25 [31048] ha-idg-1      attrd:     info: main:    Starting up
> Mar 10 19:36:25 [31045] ha-idg-1        cib:     info: retrieveCib:
> Reading cluster configuration file /var/lib/pacemaker/cib/cib.xml (digest:
> /var/lib/pacemaker/cib/cib.xml.sig)
> Mar 10 19:36:25 [31048] ha-idg-1      attrd:     info: get_cluster_type:
> Verifying cluster type: 'corosync'
> Mar 10 19:36:25 [31048] ha-idg-1      attrd:     info: get_cluster_type:
> Assuming an active 'corosync' cluster
> Mar 10 19:36:25 [31048] ha-idg-1      attrd:   notice: crm_cluster_connect:
> Connecting to cluster infrastructure: corosync
> Mar 10 19:36:25 [31046] ha-idg-1 stonith-ng:     info: crm_log_init:
> Changed active directory to /var/lib/pacemaker/cores
> Mar 10 19:36:25 [31046] ha-idg-1 stonith-ng:     info: get_cluster_type:
> Verifying cluster type: 'corosync'
> Mar 10 19:36:25 [31046] ha-idg-1 stonith-ng:     info: get_cluster_type:
> Assuming an active 'corosync' cluster
> Mar 10 19:36:25 [31046] ha-idg-1 stonith-ng:   notice: crm_cluster_connect:
> Connecting to cluster infrastructure: corosync
> Mar 10 19:36:25 [31047] ha-idg-1       lrmd:     info: crm_log_init:
> Changed active directory to /var/lib/pacemaker/cores
> Mar 10 19:36:25 [31047] ha-idg-1       lrmd:     info: qb_ipcs_us_publish:
> server name: lrmd
> Mar 10 19:36:25 [31047] ha-idg-1       lrmd:     info: main:    Starting
> Mar 10 19:36:25 [31050] ha-idg-1       crmd:     info: crm_log_init:
> Changed active directory to /var/lib/pacemaker/cores
> Mar 10 19:36:25 [31050] ha-idg-1       crmd:     info: main:    CRM Git
> Version: 1.1.24+20210811.f5abda0ee-3.27.1 (1.1.24+20210811.f5abda0ee)
> Mar 10 19:36:25 [31050] ha-idg-1       crmd:     info: get_cluster_type:
> Verifying cluster type: 'corosync'
> Mar 10 19:36:25 [31050] ha-idg-1       crmd:     info: get_cluster_type:
> Assuming an active 'corosync' cluster
> Mar 10 19:36:25 [31050] ha-idg-1       crmd:  warning:
> log_deprecation_warnings:        Compile-time support for crm_mon SNMP
> options is deprecated and will be removed in a future release (configure
> alerts instead)
> Mar 10 19:36:25 [31050] ha-idg-1       crmd:  warning:
> log_deprecation_warnings:        Compile-time support for crm_mon SMTP
> options is deprecated and will be removed in a future release (configure
> alerts instead)
> Mar 10 19:36:25 [31050] ha-idg-1       crmd:     info: do_log:  Input
> I_STARTUP received in state S_STARTING from crmd_init
> Mar 10 19:36:25 [31045] ha-idg-1        cib:     info:
> validate_with_relaxng:   Creating RNG parser context
> Mar 10 19:36:25 [31048] ha-idg-1      attrd:     info: corosync_node_name:
> Unable to get node name for nodeid 1084777482       ⇐========= this happens
> quite often
> Mar 10 19:36:25 [31048] ha-idg-1      attrd:   notice: get_node_name:
> Could not obtain a node name for corosync nodeid 1084777482
> Mar 10 19:36:25 [31048] ha-idg-1      attrd:     info: crm_get_peer:
> Created entry c1bd522c-34da-49b3-97cb-22fd4580959b/0x109e210 for node
> (null)/1084777482 (1 total)
> Mar 10 19:36:25 [31048] ha-idg-1      attrd:     info: crm_get_peer:    Node
> 1084777482 has uuid 1084777482
> Mar 10 19:36:25 [31048] ha-idg-1      attrd:     info: crm_update_peer_proc:
> cluster_connect_cpg: Node (null)[1084777482] - corosync-cpg is now online
> Mar 10 19:36:25 [31048] ha-idg-1      attrd:   notice:
> crm_update_peer_state_iter:      Node (null) state is now member |
> nodeid=1084777482 previous=unknown source=crm_update_peer_proc
> Mar 10 19:36:25 [31048] ha-idg-1      attrd:     info:
> init_cs_connection_once: Connection to 'corosync': established
> Mar 10 19:36:25 [31046] ha-idg-1 stonith-ng:     info: corosync_node_name:
> Unable to get node name for nodeid 1084777482
> Mar 10 19:36:25 [31046] ha-idg-1 stonith-ng:   notice: get_node_name:
> Could not obtain a node name for corosync nodeid 1084777482
> Mar 10 19:36:25 [31046] ha-idg-1 stonith-ng:     info: crm_get_peer:
> Created entry 1d232d33-d274-415d-be94-765dc1b4e1e4/0x9478d0 for node
> (null)/1084777482 (1 total)
> Mar 10 19:36:25 [31046] ha-idg-1 stonith-ng:     info: crm_get_peer:    Node
> 1084777482 has uuid 1084777482
> Mar 10 19:36:25 [31046] ha-idg-1 stonith-ng:     info: crm_update_peer_proc:
> cluster_connect_cpg: Node (null)[1084777482] - corosync-cpg is now online
> Mar 10 19:36:25 [31046] ha-idg-1 stonith-ng:   notice:
> crm_update_peer_state_iter:      Node (null) state is now member |
> nodeid=1084777482 previous=unknown source=crm_update_peer_proc
> Mar 10 19:36:25 [31045] ha-idg-1        cib:     info: startCib:        CIB
> Initialization completed successfully
> Mar 10 19:36:25 [31045] ha-idg-1        cib:   notice: crm_cluster_connect:
> Connecting to cluster infrastructure: corosync
> Mar 10 19:36:25 [31048] ha-idg-1      attrd:     info: corosync_node_name:
> Unable to get node name for nodeid 1084777482
> Mar 10 19:36:25 [31048] ha-idg-1      attrd:   notice: get_node_name:
> Defaulting to uname -n for the local corosync node name
> Mar 10 19:36:25 [31048] ha-idg-1      attrd:     info: crm_get_peer:    Node
> 1084777482 is now known as ha-idg-1
> Mar 10 19:36:25 [31046] ha-idg-1 stonith-ng:     info: corosync_node_name:
> Unable to get node name for nodeid 1084777482
> Mar 10 19:36:25 [31046] ha-idg-1 stonith-ng:   notice: get_node_name:
> Defaulting to uname -n for the local corosync node name
> Mar 10 19:36:25 [31046] ha-idg-1 stonith-ng:     info:
> init_cs_connection_once: Connection to 'corosync': established
> Mar 10 19:36:25 [31045] ha-idg-1        cib:     info: corosync_node_name:
> Unable to get node name for nodeid 1084777482
> Mar 10 19:36:25 [31045] ha-idg-1        cib:   notice: get_node_name:
> Could not obtain a node name for corosync nodeid 1084777482
> Mar 10 19:36:25 [31048] ha-idg-1      attrd:     info: main:    Cluster
> connection active
> Mar 10 19:36:25 [31045] ha-idg-1        cib:     info: crm_get_peer:
> Created entry 7c2b1d3d-0ab6-4fa6-887c-5d01e5927a67/0x147af10 for node
> (null)/1084777482 (1 total)
> Mar 10 19:36:25 [31045] ha-idg-1        cib:     info: crm_get_peer:    Node
> 1084777482 has uuid 1084777482
> Mar 10 19:36:25 [31045] ha-idg-1        cib:     info: crm_update_peer_proc:
> cluster_connect_cpg: Node (null)[1084777482] - corosync-cpg is now online
> Mar 10 19:36:25 [31045] ha-idg-1        cib:   notice:
> crm_update_peer_state_iter:      Node (null) state is now member |
> nodeid=1084777482 previous=unknown source=crm_update_peer_proc
> Mar 10 19:36:25 [31045] ha-idg-1        cib:     info:
> init_cs_connection_once: Connection to 'corosync': established
> Mar 10 19:36:25 [31046] ha-idg-1 stonith-ng:     info: corosync_node_name:
> Unable to get node name for nodeid 1084777482
> Mar 10 19:36:25 [31046] ha-idg-1 stonith-ng:   notice: get_node_name:
> Defaulting to uname -n for the local corosync node name
> Mar 10 19:36:25 [31046] ha-idg-1 stonith-ng:     info: crm_get_peer:    Node
> 1084777482 is now known as ha-idg-1
> Mar 10 19:36:25 [31045] ha-idg-1        cib:     info: corosync_node_name:
> Unable to get node name for nodeid 1084777482
> Mar 10 19:36:25 [31045] ha-idg-1        cib:   notice: get_node_name:
> Defaulting to uname -n for the local corosync node name
> Mar 10 19:36:25 [31045] ha-idg-1        cib:     info: crm_get_peer:    Node
> 1084777482 is now known as ha-idg-1
> Mar 10 19:36:25 [31045] ha-idg-1        cib:     info: qb_ipcs_us_publish:
> server name: cib_ro
> Mar 10 19:36:25 [31045] ha-idg-1        cib:     info: qb_ipcs_us_publish:
> server name: cib_rw
> Mar 10 19:36:25 [31045] ha-idg-1        cib:     info: qb_ipcs_us_publish:
> server name: cib_shm
> Mar 10 19:36:25 [31045] ha-idg-1        cib:     info: cib_init:
> Starting cib mainloop
> Mar 10 19:36:25 [31045] ha-idg-1        cib:     info: pcmk_cpg_membership:
> Group cib event 0: node 1084777482 pid 31045 joined via cpg_join
> Mar 10 19:36:25 [31045] ha-idg-1        cib:     info: pcmk_cpg_membership:
> Group cib event 0: ha-idg-1 (node 1084777482 pid 31045) is member
> Mar 10 19:36:25 [31045] ha-idg-1        cib:     info: cib_file_backup:
> Archived previous version as /var/lib/pacemaker/cib/cib-34.raw
> Mar 10 19:36:25 [31045] ha-idg-1        cib:     info:
> cib_file_write_with_digest:      Wrote version 7.29548.0 of the CIB to disk
> (digest: 03b4ec65319cef255d43fc1ec9d285a5)
> Mar 10 19:36:25 [31045] ha-idg-1        cib:     info:
> cib_file_write_with_digest:      Reading cluster configuration file
> /var/lib/pacemaker/cib/cib.MBy2v0 (digest:
> /var/lib/pacemaker/cib/cib.nDn0X9)
> Mar 10 19:36:26 [31050] ha-idg-1       crmd:     info: do_cib_control:  CIB
> connection established
> Mar 10 19:36:26 [31050] ha-idg-1       crmd:   notice: crm_cluster_connect:
> Connecting to cluster infrastructure: corosync
> Mar 10 19:36:26 [31050] ha-idg-1       crmd:     info: corosync_node_name:
> Unable to get node name for nodeid 1084777482
> Mar 10 19:36:26 [31050] ha-idg-1       crmd:   notice: get_node_name:
> Could not obtain a node name for corosync nodeid 1084777482
> Mar 10 19:36:26 [31050] ha-idg-1       crmd:     info: crm_get_peer:
> Created entry 873262c1-ede0-4ba7-97e6-53ead0a6d7b0/0x1613910 for node
> (null)/1084777482 (1 total)
> Mar 10 19:36:26 [31050] ha-idg-1       crmd:     info: crm_get_peer:    Node
> 1084777482 has uuid 1084777482
> Mar 10 19:36:26 [31050] ha-idg-1       crmd:     info: crm_update_peer_proc:
> cluster_connect_cpg: Node (null)[1084777482] - corosync-cpg is now online
> Mar 10 19:36:26 [31050] ha-idg-1       crmd:     info: corosync_node_name:
> Unable to get node name for nodeid 1084777482
> Mar 10 19:36:26 [31050] ha-idg-1       crmd:   notice: get_node_name:
> Defaulting to uname -n for the local corosync node name
> Mar 10 19:36:26 [31050] ha-idg-1       crmd:     info:
> init_cs_connection_once: Connection to 'corosync': established
> Mar 10 19:36:26 [31050] ha-idg-1       crmd:     info: corosync_node_name:
> Unable to get node name for nodeid 1084777482
> Mar 10 19:36:26 [31050] ha-idg-1       crmd:   notice: get_node_name:
> Defaulting to uname -n for the local corosync node name
> Mar 10 19:36:26 [31050] ha-idg-1       crmd:     info: crm_get_peer:    Node
> 1084777482 is now known as ha-idg-1
> Mar 10 19:36:26 [31050] ha-idg-1       crmd:     info: peer_update_callback:
> Cluster node ha-idg-1 is now in unknown state      ⇐===== is that the
> problem ?

Probably a normal part of the startup process, but I haven't tested it yet.

> Mar 10 19:36:26 [31048] ha-idg-1      attrd:     info: attrd_erase_attrs:
> Clearing transient attributes from CIB |
> xpath=//node_state[@uname='ha-idg-1']/transient_attributes
> Mar 10 19:36:26 [31048] ha-idg-1      attrd:     info:
> attrd_start_election_if_needed:  Starting an election to determine the
> writer
> Mar 10 19:36:26 [31045] ha-idg-1        cib:     info: cib_process_request:
> Forwarding cib_delete operation for section
> //node_state[@uname='ha-idg-1']/transient_attributes to all
> (origin=local/attrd/2)
> Mar 10 19:36:26 [31048] ha-idg-1      attrd:     info: corosync_node_name:
> Unable to get node name for nodeid 1084777482
> Mar 10 19:36:26 [31048] ha-idg-1      attrd:   notice: get_node_name:
> Defaulting to uname -n for the local corosync node name
> Mar 10 19:36:26 [31048] ha-idg-1      attrd:     info: main:    CIB
> connection active
> Mar 10 19:36:26 [31048] ha-idg-1      attrd:     info: qb_ipcs_us_publish:
> server name: attrd
> Mar 10 19:36:26 [31048] ha-idg-1      attrd:     info: main:    Accepting
> attribute updates
> Mar 10 19:36:26 [31048] ha-idg-1      attrd:     info: pcmk_cpg_membership:
> Group attrd event 0: node 1084777482 pid 31048 joined via cpg_join
> Mar 10 19:36:26 [31048] ha-idg-1      attrd:     info: pcmk_cpg_membership:
> Group attrd event 0: ha-idg-1 (node 1084777482 pid 31048) is member
> Mar 10 19:36:26 [31045] ha-idg-1        cib:     info: corosync_node_name:
> Unable to get node name for nodeid 1084777482
> Mar 10 19:36:26 [31045] ha-idg-1        cib:   notice: get_node_name:
> Defaulting to uname -n for the local corosync node name
> Mar 10 19:36:26 [31048] ha-idg-1      attrd:     info: election_check:
> election-attrd won by local node
> Mar 10 19:36:26 [31048] ha-idg-1      attrd:   notice: attrd_declare_winner:
> Recorded local node as attribute writer (was unset)
> Mar 10 19:36:26 [31048] ha-idg-1      attrd:     info: attrd_peer_update:
> Setting #attrd-protocol[ha-idg-1]: (null) -> 2 from ha-idg-1
> Mar 10 19:36:26 [31048] ha-idg-1      attrd:     info: write_attribute:
> Processed 1 private change for #attrd-protocol, id=n/a, set=n/a
> Mar 10 19:36:26 [31046] ha-idg-1 stonith-ng:     info: setup_cib:
> Watching for stonith topology changes
> Mar 10 19:36:26 [31046] ha-idg-1 stonith-ng:     info: qb_ipcs_us_publish:
> server name: stonith-ng
> Mar 10 19:36:26 [31046] ha-idg-1 stonith-ng:     info: main:    Starting
> stonith-ng mainloop
> Mar 10 19:36:26 [31046] ha-idg-1 stonith-ng:     info: pcmk_cpg_membership:
> Group stonith-ng event 0: node 1084777482 pid 31046 joined via cpg_join
> Mar 10 19:36:26 [31046] ha-idg-1 stonith-ng:     info: pcmk_cpg_membership:
> Group stonith-ng event 0: ha-idg-1 (node 1084777482 pid 31046) is member
> Mar 10 19:36:26 [31050] ha-idg-1       crmd:   notice:
> cluster_connect_quorum:  Quorum acquired
> Mar 10 19:36:26 [31046] ha-idg-1 stonith-ng:     info: init_cib_cache_cb:
> Updating device list from the cib: init
> Mar 10 19:36:26 [31046] ha-idg-1 stonith-ng:     info: cib_devices_update:
> Updating devices to version 7.29548.0
> Mar 10 19:36:26 [31046] ha-idg-1 stonith-ng:   notice: unpack_config:   On
> loss of CCM Quorum: Ignore
> Mar 10 19:36:26 [31045] ha-idg-1        cib:     info: cib_process_request:
> Completed cib_delete operation for section
> //node_state[@uname='ha-idg-1']/transient_attributes: OK (rc=0,
> origin=ha-idg-1/attrd/2, version=7.29548.0)
> Mar 10 19:36:26 [31050] ha-idg-1       crmd:     info: do_ha_control:
> Connected to the cluster
> Mar 10 19:36:26 [31045] ha-idg-1        cib:     info: cib_process_request:
> Forwarding cib_modify operation for section nodes to all
> (origin=local/crmd/3)
> Mar 10 19:36:26 [31050] ha-idg-1       crmd:     info: lrmd_ipc_connect:
> Connecting to lrmd
> Mar 10 19:36:26 [31050] ha-idg-1       crmd:     info: do_lrm_control:  LRM
> connection established
> Mar 10 19:36:26 [31050] ha-idg-1       crmd:     info: do_started:
> Delaying start, no membership data (0000000000100000)
> Mar 10 19:36:26 [31050] ha-idg-1       crmd:     info:
> pcmk_quorum_notification:        Quorum retained | membership=2340 members=1
> Mar 10 19:36:26 [31050] ha-idg-1       crmd:   notice:
> crm_update_peer_state_iter:      Node ha-idg-1 state is now member |
> nodeid=1084777482 previous=unknown source=pcmk_quorum_notification
> Mar 10 19:36:26 [31050] ha-idg-1       crmd:     info: peer_update_callback:
> Cluster node ha-idg-1 is now member (was in unknown state)
> Mar 10 19:36:26 [31050] ha-idg-1       crmd:     info: do_started:
> Delaying start, Config not read (0000000000000040)
> Mar 10 19:36:26 [31050] ha-idg-1       crmd:     info: pcmk_cpg_membership:
> Group crmd event 0: node 1084777482 pid 31050 joined via cpg_join
> Mar 10 19:36:26 [31050] ha-idg-1       crmd:     info: pcmk_cpg_membership:
> Group crmd event 0: ha-idg-1 (node 1084777482 pid 31050) is member
> Mar 10 19:36:26 [31050] ha-idg-1       crmd:     info: do_started:
> Delaying start, Config not read (0000000000000040)
> Mar 10 19:36:26 [31050] ha-idg-1       crmd:     info: do_started:
> Delaying start, Config not read (0000000000000040)
> Mar 10 19:36:26 [31045] ha-idg-1        cib:     info: cib_process_request:
> Completed cib_modify operation for section nodes: OK (rc=0,
> origin=ha-idg-1/crmd/3, version=7.29548.0)
> Mar 10 19:36:26 [31050] ha-idg-1       crmd:     info: qb_ipcs_us_publish:
> server name: crmd
> Mar 10 19:36:26 [31050] ha-idg-1       crmd:   notice: do_started:      The
> local CRM is operational    ⇐============================ looks pretty good
> Mar 10 19:36:26 [31050] ha-idg-1       crmd:     info: do_log:  Input
> I_PENDING received in state S_STARTING from do_started
> Mar 10 19:36:26 [31050] ha-idg-1       crmd:   notice: do_state_transition:
> State transition S_STARTING -> S_PENDING | input=I_PENDING
> cause=C_FSA_INTERNAL origin=do_started
> Mar 10 19:36:26 [31046] ha-idg-1 stonith-ng:     info: action_synced_wait:
> Managed fence_ilo2_metadata_1 process 31052 exited with rc=0
> Mar 10 19:36:26 [31046] ha-idg-1 stonith-ng:     info:
> stonith_device_register: Added 'fence_ilo_ha-idg-2' to the device list (1
> active devices)
> Mar 10 19:36:26 [31046] ha-idg-1 stonith-ng:     info: action_synced_wait:
> Managed fence_ilo4_metadata_1 process 31054 exited with rc=0
> Mar 10 19:36:26 [31046] ha-idg-1 stonith-ng:     info:
> stonith_device_register: Added 'fence_ilo_ha-idg-1' to the device list (2
> active devices)
> Mar 10 19:36:28 [31050] ha-idg-1       crmd:     info:
> te_trigger_stonith_history_sync: Fence history will be synchronized
> cluster-wide within 30 seconds
> Mar 10 19:36:28 [31050] ha-idg-1       crmd:   notice: te_connect_stonith:
> Fencer successfully connected
> Mar 10 19:36:34 [31046] ha-idg-1 stonith-ng:   notice: handle_request:
> Received manual confirmation that ha-idg-1 is fenced
> <===================== seems to be my "stonith_admin -C"

Yes
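
If you want to double-check what the fencer recorded, something along these
lines should show it (sketch from memory):

    # Show the fencing actions the fencer knows about, for all nodes
    stonith_admin --history '*'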

> Mar 10 19:36:34 [31046] ha-idg-1 stonith-ng:   notice:
> initiate_remote_stonith_op:      Initiating manual confirmation for
> ha-idg-1: 23926653-7baa-44b8-ade3-5ee8468f3db6
> Mar 10 19:36:34 [31046] ha-idg-1 stonith-ng:   notice: stonith_manual_ack:
> Injecting manual confirmation that ha-idg-1 is safely off/down
> Mar 10 19:36:34 [31046] ha-idg-1 stonith-ng:   notice: remote_op_done:
> Operation 'off' targeting ha-idg-1 on a human for
> stonith_admin.31555 at ha-idg-1.23926653: OK
> Mar 10 19:36:34 [31050] ha-idg-1       crmd:     info: exec_alert_list:
> Sending fencing alert via smtp_alert to informatic.idg at helmholtz-muenchen.de
> Mar 10 19:36:34 [31047] ha-idg-1       lrmd:     info:
> process_lrmd_alert_exec: Executing alert smtp_alert for
> 6bb5a831-e90c-4b0b-8783-0092a26a1e6c
> Mar 10 19:36:34 [31050] ha-idg-1       crmd:     crit:
> tengine_stonith_notify:  We were allegedly just fenced by a human for
> ha-idg-1!      <=====================  what does that mean ? I didn't fence
> it

It means you ran `stonith_admin -C`.

https://github.com/ClusterLabs/pacemaker/blob/Pacemaker-1.1.24/fencing/remote.c#L945-L961
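
Note that the target of the manual confirmation is the node you believe is
down. A rough example (ha-idg-2 here is only an illustration; use whichever
node is actually dead):

    # Run on the surviving node: tell the cluster the *other* node is safely down
    stonith_admin --confirm ha-idg-2    # long form of -C

Confirming the local node instead (i.e. running stonith_admin -C ha-idg-1 on
ha-idg-1) makes that node believe it was just fenced, so pacemakerd shuts the
local cluster stack down, which is exactly what your log shows next.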

> Mar 10 19:36:34 [31050] ha-idg-1       crmd:     info: crm_xml_cleanup:
> Cleaning up memory from libxml2
> Mar 10 19:36:34 [31044] ha-idg-1 pacemakerd:  warning: pcmk_child_exit:
> Shutting cluster down because crmd[31050] had fatal failure
> <=======================  ???

Pacemaker is shutting down on the local node because it just received
confirmation that it was fenced (because you ran `stonith_admin -C`).
This is expected behavior.

> Mar 10 19:36:34 [31044] ha-idg-1 pacemakerd:   notice: pcmk_shutdown_worker:
> Shutting down Pacemaker
> Mar 10 19:36:34 [31044] ha-idg-1 pacemakerd:   notice: stop_child:
> Stopping pengine | sent signal 15 to process 31049
> Mar 10 19:36:34 [31049] ha-idg-1    pengine:   notice: crm_signal_dispatch:
> Caught 'Terminated' signal | 15 (invoking handler)
> Mar 10 19:36:34 [31049] ha-idg-1    pengine:     info: qb_ipcs_us_withdraw:
> withdrawing server sockets
> Mar 10 19:36:34 [31049] ha-idg-1    pengine:     info: crm_xml_cleanup:
> Cleaning up memory from libxml2
> Mar 10 19:36:34 [31044] ha-idg-1 pacemakerd:     info: pcmk_child_exit:
> pengine[31049] exited with status 0 (OK)
> Mar 10 19:36:34 [31044] ha-idg-1 pacemakerd:   notice: stop_child:
> Stopping attrd | sent signal 15 to process 31048
> Mar 10 19:36:34 [31048] ha-idg-1      attrd:   notice: crm_signal_dispatch:
> Caught 'Terminated' signal | 15 (invoking handler)
> Mar 10 19:36:34 [31048] ha-idg-1      attrd:     info: main:    Shutting
> down attribute manager
> Mar 10 19:36:34 [31048] ha-idg-1      attrd:     info: qb_ipcs_us_withdraw:
> withdrawing server sockets
> Mar 10 19:36:34 [31048] ha-idg-1      attrd:     info: attrd_cib_destroy_cb:
> Connection disconnection complete
> Mar 10 19:36:34 [31048] ha-idg-1      attrd:     info: crm_xml_cleanup:
> Cleaning up memory from libxml2
> Mar 10 19:36:34 [31044] ha-idg-1 pacemakerd:     info: pcmk_child_exit:
> attrd[31048] exited with status 0 (OK)
> Mar 10 19:36:34 [31044] ha-idg-1 pacemakerd:   notice: stop_child:
> Stopping lrmd | sent signal 15 to process 31047
> Mar 10 19:36:34 [31047] ha-idg-1       lrmd:   notice: crm_signal_dispatch:
> Caught 'Terminated' signal | 15 (invoking handler)
> Mar 10 19:36:34 [31047] ha-idg-1       lrmd:     info: lrmd_exit:
> Terminating with 0 clients
> Mar 10 19:36:34 [31047] ha-idg-1       lrmd:     info: qb_ipcs_us_withdraw:
> withdrawing server sockets
> Mar 10 19:36:34 [31044] ha-idg-1 pacemakerd:     info: mcp_cpg_deliver:
> Ignoring process list sent by peer for local node
> Mar 10 19:36:34 [31044] ha-idg-1 pacemakerd:     info: mcp_cpg_deliver:
> Ignoring process list sent by peer for local node
> Mar 10 19:36:34 [31044] ha-idg-1 pacemakerd:     info: mcp_cpg_deliver:
> Ignoring process list sent by peer for local node
> Mar 10 19:36:34 [31047] ha-idg-1       lrmd:     info: crm_xml_cleanup:
> Cleaning up memory from libxml2
> Mar 10 19:36:34 [31044] ha-idg-1 pacemakerd:     info: pcmk_child_exit:
> lrmd[31047] exited with status 0 (OK)
> Mar 10 19:36:34 [31044] ha-idg-1 pacemakerd:   notice: stop_child:
> Stopping stonith-ng | sent signal 15 to process 31046
> Mar 10 19:36:34 [31046] ha-idg-1 stonith-ng:   notice: crm_signal_dispatch:
> Caught 'Terminated' signal | 15 (invoking handler)
> Mar 10 19:36:34 [31046] ha-idg-1 stonith-ng:     info: stonith_shutdown:
> Terminating with 3 clients
> Mar 10 19:36:34 [31046] ha-idg-1 stonith-ng:     info:
> cib_connection_destroy:  Connection to the CIB closed.
> Mar 10 19:36:34 [31044] ha-idg-1 pacemakerd:     info: mcp_cpg_deliver:
> Ignoring process list sent by peer for local node
> Mar 10 19:36:34 [31046] ha-idg-1 stonith-ng:     info: qb_ipcs_us_withdraw:
> withdrawing server sockets
> Mar 10 19:36:34 [31046] ha-idg-1 stonith-ng:     info: crm_xml_cleanup:
> Cleaning up memory from libxml2
> Mar 10 19:36:34 [31044] ha-idg-1 pacemakerd:     info: pcmk_child_exit:
> stonith-ng[31046] exited with status 0 (OK)
> Mar 10 19:36:34 [31044] ha-idg-1 pacemakerd:   notice: stop_child:
> Stopping cib | sent signal 15 to process 31045
> Mar 10 19:36:34 [31045] ha-idg-1        cib:   notice: crm_signal_dispatch:
> Caught 'Terminated' signal | 15 (invoking handler)
> Mar 10 19:36:34 [31045] ha-idg-1        cib:     info: cib_shutdown:
> Disconnected 0 clients
> Mar 10 19:36:34 [31045] ha-idg-1        cib:     info: cib_shutdown:    All
> clients disconnected (0)
> Mar 10 19:36:34 [31045] ha-idg-1        cib:     info: terminate_cib:
> initiate_exit: Exiting from mainloop...
> Mar 10 19:36:34 [31045] ha-idg-1        cib:     info:
> crm_cluster_disconnect:  Disconnecting from cluster infrastructure: corosync
> Mar 10 19:36:34 [31045] ha-idg-1        cib:     info:
> terminate_cs_connection: Disconnecting from Corosync
> Mar 10 19:36:34 [31045] ha-idg-1        cib:     info:
> terminate_cs_connection: No Quorum connection
> Mar 10 19:36:34 [31045] ha-idg-1        cib:   notice:
> terminate_cs_connection: Disconnected from Corosync
> Mar 10 19:36:34 [31045] ha-idg-1        cib:     info:
> crm_cluster_disconnect:  Disconnected from corosync
> Mar 10 19:36:34 [31045] ha-idg-1        cib:     info:
> crm_cluster_disconnect:  Disconnecting from cluster infrastructure: corosync
> Mar 10 19:36:34 [31045] ha-idg-1        cib:     info:
> terminate_cs_connection: Disconnecting from Corosync
> Mar 10 19:36:34 [31045] ha-idg-1        cib:     info:
> cluster_disconnect_cpg:  No CPG connection
> Mar 10 19:36:34 [31045] ha-idg-1        cib:     info:
> terminate_cs_connection: No Quorum connection
> Mar 10 19:36:34 [31045] ha-idg-1        cib:   notice:
> terminate_cs_connection: Disconnected from Corosync
> Mar 10 19:36:34 [31045] ha-idg-1        cib:     info:
> crm_cluster_disconnect:  Disconnected from corosync
> Mar 10 19:36:34 [31044] ha-idg-1 pacemakerd:     info: mcp_cpg_deliver:
> Ignoring process list sent by peer for local node
> Mar 10 19:36:34 [31045] ha-idg-1        cib:     info: qb_ipcs_us_withdraw:
> withdrawing server sockets
> Mar 10 19:36:34 [31045] ha-idg-1        cib:     info: qb_ipcs_us_withdraw:
> withdrawing server sockets
> Mar 10 19:36:34 [31045] ha-idg-1        cib:     info: qb_ipcs_us_withdraw:
> withdrawing server sockets
> Mar 10 19:36:34 [31045] ha-idg-1        cib:     info: crm_xml_cleanup:
> Cleaning up memory from libxml2
> Mar 10 19:36:34 [31044] ha-idg-1 pacemakerd:     info: pcmk_child_exit:
> cib[31045] exited with status 0 (OK)
> Mar 10 19:36:34 [31044] ha-idg-1 pacemakerd:   notice: pcmk_shutdown_worker:
> Shutdown complete
> Mar 10 19:36:34 [31044] ha-idg-1 pacemakerd:   notice: pcmk_shutdown_worker:
> Attempting to inhibit respawning after fatal error
> Mar 10 19:36:34 [31044] ha-idg-1 pacemakerd:     info:
> pcmk_exit_with_cluster:  Asking Corosync to shut down
> Mar 10 19:36:34 [31037] ha-idg-1 corosync notice  [CFG   ] Node 1084777482
> was shut down by sysadmin
> Mar 10 19:36:34 [31044] ha-idg-1 pacemakerd:     info: crm_xml_cleanup:
> Cleaning up memory from libxml2
> Mar 10 19:36:34 [31037] ha-idg-1 corosync notice  [SERV  ] Unloading all
> Corosync service engines.
> Mar 10 19:36:34 [31037] ha-idg-1 corosync info    [QB    ] withdrawing
> server sockets
> Mar 10 19:36:34 [31037] ha-idg-1 corosync notice  [SERV  ] Service engine
> unloaded: corosync vote quorum service v1.0
> Mar 10 19:36:34 [31037] ha-idg-1 corosync info    [QB    ] withdrawing
> server sockets
> Mar 10 19:36:34 [31037] ha-idg-1 corosync notice  [SERV  ] Service engine
> unloaded: corosync configuration map access
> Mar 10 19:36:34 [31037] ha-idg-1 corosync info    [QB    ] withdrawing
> server sockets
> Mar 10 19:36:34 [31037] ha-idg-1 corosync notice  [SERV  ] Service engine
> unloaded: corosync configuration service
> Mar 10 19:36:34 [31037] ha-idg-1 corosync info    [QB    ] withdrawing
> server sockets
> Mar 10 19:36:34 [31037] ha-idg-1 corosync notice  [SERV  ] Service engine
> unloaded: corosync cluster closed process group service v1.01
> Mar 10 19:36:34 [31037] ha-idg-1 corosync info    [QB    ] withdrawing
> server sockets
> Mar 10 19:36:34 [31037] ha-idg-1 corosync notice  [SERV  ] Service engine
> unloaded: corosync cluster quorum service v0.1
> Mar 10 19:36:34 [31037] ha-idg-1 corosync notice  [SERV  ] Service engine
> unloaded: corosync profile loading service
> Mar 10 19:36:34 [31037] ha-idg-1 corosync notice  [MAIN  ] Corosync Cluster
> Engine exiting normally
>
> Bernd

Can you help me understand the issue here? You started the cluster on
this node at 19:36:24. 10 seconds later, you ran `stonith_admin -C`,
and the local node shut down Pacemaker, as expected. It doesn't look
like Pacemaker stopped until you ran that command.

The dc-deadtime property is set to 20 seconds by default. You can
expect nodes to be in UNCLEAN state until then.
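
For reference, you can check (or raise) that value with crm_attribute,
roughly like this (a sketch; adjust the value to your environment):

    # Query the configured dc-deadtime (only present if it was set explicitly)
    crm_attribute --type crm_config --name dc-deadtime --query

    # Example only: raise it to 60s if the peer takes a long time to come up
    crm_attribute --type crm_config --name dc-deadtime --update 60s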

>
> --
> Bernd Lentes
> System Administrator
> Institute for Metabolism and Cell Death (MCD)
> Building 25 - office 122
> HelmholtzZentrum München
> bernd.lentes at helmholtz-muenchen.de
> phone: +49 89 3187 1241
>        +49 89 3187 49123
> fax:   +49 89 3187 2294
> https://www.helmholtz-munich.de/en/mcd
>



-- 
Regards,

Reid Wahl (He/Him)
Senior Software Engineer, Red Hat
RHEL High Availability - Pacemaker


