[ClusterLabs] Fencing on 2-node cluster

Casey & Gina caseyandgina at icloud.com
Wed Jun 20 18:04:58 EDT 2018


> On 2018-06-20, at 3:59 PM, Casey & Gina <caseyandgina at icloud.com> wrote:
> 
>> Get the cluster healthy, tail the system logs from both nodes, trigger a
>> fault and wait for things to settle. Then share the logs please.
> 
> What do you mean by "system logs"?  Do you mean the corosync.log?  Triggering a fault is powering off a node, so I can't get a tailed log file from that host.  Is there another mechanism I should try?

Sorry, I did a little more research.  I guess you mean the syslog, and I realized I could `killall -9 corosync` to trigger a failure.  Let me know if there is a better way, or if this is okay...
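
For reference, this is roughly what I did on each node (I'm assuming /var/log/syslog is the right file to watch on these Ubuntu hosts; let me know if tailing corosync.log or using journalctl would be more useful):

# On both nodes, follow the system log in a separate terminal:
tail -f /var/log/syslog

# Then, on the standby node only, kill corosync so it cannot exit cleanly:
sudo killall -9 corosync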

Here are the logs:

Node that was "master" to start with, that I did not kill corosync on:

Jun 20 21:57:55 d-gp2-dbpg64-2 crmd[6721]:   notice: Operation postgresql-10-main_notify_0: ok (node=d-gp2-dbpg64-2, call=36, rc=0, cib-update=0, confirmed=true)
Jun 20 21:57:55 d-gp2-dbpg64-2 crmd[6721]:   notice: Transition 5 (Complete=12, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-58.bz2): Complete
Jun 20 21:57:55 d-gp2-dbpg64-2 crmd[6721]:   notice: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Jun 20 21:58:10 d-gp2-dbpg64-2 pgsqlms(postgresql-10-main)[15918]: INFO: Update score of "d-gp2-dbpg64-1" from -1000 to 1000 because of a change in the replication lag (0).
Jun 20 21:58:10 d-gp2-dbpg64-2 crmd[6721]:   notice: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Jun 20 21:58:10 d-gp2-dbpg64-2 pengine[2499]:   notice: On loss of CCM Quorum: Ignore
Jun 20 21:58:10 d-gp2-dbpg64-2 crmd[6721]:   notice: Transition 6 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-59.bz2): Complete
Jun 20 21:58:10 d-gp2-dbpg64-2 crmd[6721]:   notice: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Jun 20 21:58:10 d-gp2-dbpg64-2 pengine[2499]:   notice: Calculated Transition 6: /var/lib/pacemaker/pengine/pe-input-59.bz2
Jun 20 21:58:13 d-gp2-dbpg64-2 snmpd[1468]: error on subcontainer 'ia_addr' insert (-1)
Jun 20 22:01:13 d-gp2-dbpg64-2 snmpd[1468]: message repeated 6 times: [ error on subcontainer 'ia_addr' insert (-1)]
Jun 20 22:01:42 d-gp2-dbpg64-2 corosync[6683]: notice  [TOTEM ] A processor failed, forming new configuration.
Jun 20 22:01:42 d-gp2-dbpg64-2 corosync[6683]:  [TOTEM ] A processor failed, forming new configuration.
Jun 20 22:01:43 d-gp2-dbpg64-2 snmpd[1468]: error on subcontainer 'ia_addr' insert (-1)
Jun 20 22:01:43 d-gp2-dbpg64-2 corosync[6683]: notice  [TOTEM ] A new membership (10.124.164.249:260) was formed. Members left: 1
Jun 20 22:01:43 d-gp2-dbpg64-2 corosync[6683]: notice  [TOTEM ] Failed to receive the leave message. failed: 1
Jun 20 22:01:43 d-gp2-dbpg64-2 corosync[6683]:  [TOTEM ] A new membership (10.124.164.249:260) was formed. Members left: 1
Jun 20 22:01:43 d-gp2-dbpg64-2 corosync[6683]:  [TOTEM ] Failed to receive the leave message. failed: 1
Jun 20 22:01:43 d-gp2-dbpg64-2 pacemakerd[6716]:   notice: crm_reap_unseen_nodes: Node d-gp2-dbpg64-1[1] - state is now lost (was member)
Jun 20 22:01:43 d-gp2-dbpg64-2 crmd[6721]:   notice: crm_reap_unseen_nodes: Node d-gp2-dbpg64-1[1] - state is now lost (was member)
Jun 20 22:01:43 d-gp2-dbpg64-2 attrd[6720]:   notice: crm_update_peer_proc: Node d-gp2-dbpg64-1[1] - state is now lost (was member)
Jun 20 22:01:43 d-gp2-dbpg64-2 stonith-ng[6719]:   notice: crm_update_peer_proc: Node d-gp2-dbpg64-1[1] - state is now lost (was member)
Jun 20 22:01:43 d-gp2-dbpg64-2 attrd[6720]:   notice: Removing d-gp2-dbpg64-1/1 from the membership list
Jun 20 22:01:43 d-gp2-dbpg64-2 attrd[6720]:   notice: Purged 1 peers with id=1 and/or uname=d-gp2-dbpg64-1 from the membership cache
Jun 20 22:01:43 d-gp2-dbpg64-2 stonith-ng[6719]:   notice: Removing d-gp2-dbpg64-1/1 from the membership list
Jun 20 22:01:43 d-gp2-dbpg64-2 stonith-ng[6719]:   notice: Purged 1 peers with id=1 and/or uname=d-gp2-dbpg64-1 from the membership cache
Jun 20 22:01:43 d-gp2-dbpg64-2 cib[6718]:   notice: crm_update_peer_proc: Node d-gp2-dbpg64-1[1] - state is now lost (was member)
Jun 20 22:01:43 d-gp2-dbpg64-2 cib[6718]:   notice: Removing d-gp2-dbpg64-1/1 from the membership list
Jun 20 22:01:43 d-gp2-dbpg64-2 cib[6718]:   notice: Purged 1 peers with id=1 and/or uname=d-gp2-dbpg64-1 from the membership cache
Jun 20 22:01:43 d-gp2-dbpg64-2 crmd[6721]:   notice: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Jun 20 22:01:44 d-gp2-dbpg64-2 pengine[2499]:   notice: On loss of CCM Quorum: Ignore
Jun 20 22:01:44 d-gp2-dbpg64-2 crmd[6721]:   notice: Transition 7 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-60.bz2): Complete
Jun 20 22:01:44 d-gp2-dbpg64-2 crmd[6721]:   notice: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Jun 20 22:01:44 d-gp2-dbpg64-2 pengine[2499]:   notice: Calculated Transition 7: /var/lib/pacemaker/pengine/pe-input-60.bz2
Jun 20 22:01:57 d-gp2-dbpg64-2 pgsqlms(postgresql-10-main)[17381]: INFO: Ignoring unknown application_name/node "d-gp2-dbpg64-1"

Node that was a standby, on which I killed corosync with `kill -9`:

Jun 20 21:57:52 d-gp2-dbpg64-1 stonith-ng[2035]:   notice: On loss of CCM Quorum: Ignore
Jun 20 21:57:54 d-gp2-dbpg64-1 crmd[2039]:   notice: State transition S_PENDING -> S_NOT_DC [ input=I_NOT_DC cause=C_HA_MESSAGE origin=do_cl_join_finalize_respond ]
Jun 20 21:57:54 d-gp2-dbpg64-1 stonith-ng[2035]:   notice: Versions did not change in patch 0.81.8
Jun 20 21:57:54 d-gp2-dbpg64-1 crmd[2039]:   notice: Operation postgresql-master-vip_monitor_0: not running (node=d-gp2-dbpg64-1, call=5, rc=7, cib-update=12, confirmed=true)
Jun 20 21:57:54 d-gp2-dbpg64-1 crmd[2039]:   notice: Operation postgresql-10-main_monitor_0: not running (node=d-gp2-dbpg64-1, call=10, rc=7, cib-update=13, confirmed=true)
Jun 20 21:57:54 d-gp2-dbpg64-1 crmd[2039]:   notice: d-gp2-dbpg64-1-postgresql-10-main_monitor_0:10 [ /var/run/postgresql:5432 - no response\npg_ctl: no server running\n ]
Jun 20 21:57:54 d-gp2-dbpg64-1 crmd[2039]:   notice: Operation vfencing_monitor_0: not running (node=d-gp2-dbpg64-1, call=14, rc=7, cib-update=14, confirmed=true)
Jun 20 21:57:55 d-gp2-dbpg64-1 pgsqlms(postgresql-10-main)[2155]: INFO: Instance "postgresql-10-main" started
Jun 20 21:57:55 d-gp2-dbpg64-1 crmd[2039]:   notice: Operation postgresql-10-main_start_0: ok (node=d-gp2-dbpg64-1, call=15, rc=0, cib-update=15, confirmed=true)
Jun 20 21:57:55 d-gp2-dbpg64-1 crmd[2039]:   notice: Operation postgresql-10-main_notify_0: ok (node=d-gp2-dbpg64-1, call=16, rc=0, cib-update=0, confirmed=true)
Jun 20 22:01:32 d-gp2-dbpg64-1 systemd[1]: Started Session 2 of user cshobe.
Jun 20 22:01:41 d-gp2-dbpg64-1 systemd[1]: corosync.service: Main process exited, code=killed, status=9/KILL
Jun 20 22:01:41 d-gp2-dbpg64-1 systemd[1]: corosync.service: Unit entered failed state.
Jun 20 22:01:41 d-gp2-dbpg64-1 systemd[1]: corosync.service: Failed with result 'signal'.
Jun 20 22:01:41 d-gp2-dbpg64-1 lrmd[2036]:  warning: new_event_notification (2036-2039-8): Bad file descriptor (9)
Jun 20 22:01:41 d-gp2-dbpg64-1 cib[2034]:    error: Connection to the CPG API failed: Library error (2)
Jun 20 22:01:41 d-gp2-dbpg64-1 stonith-ng[2035]:    error: Connection to the CPG API failed: Library error (2)
Jun 20 22:01:41 d-gp2-dbpg64-1 attrd[2037]:    error: Connection to the CPG API failed: Library error (2)
Jun 20 22:01:41 d-gp2-dbpg64-1 attrd[2037]:   notice: Disconnecting client 0x559e6c2c8810, pid=2039...
Jun 20 22:01:41 d-gp2-dbpg64-1 lrmd[2036]:    error: Connection to stonith-ng failed
Jun 20 22:01:41 d-gp2-dbpg64-1 lrmd[2036]:    error: Connection to stonith-ng[0x55852f94ff10] closed (I/O condition=17)
Jun 20 22:01:41 d-gp2-dbpg64-1 pacemakerd[2030]:    error: Connection to the CPG API failed: Library error (2)
Jun 20 22:01:41 d-gp2-dbpg64-1 attrd[2647]:   notice: Additional logging available in /var/log/corosync/corosync.log
Jun 20 22:01:41 d-gp2-dbpg64-1 stonith-ng[2646]:   notice: Additional logging available in /var/log/corosync/corosync.log
Jun 20 22:01:41 d-gp2-dbpg64-1 attrd[2647]:   notice: Connecting to cluster infrastructure: corosync
Jun 20 22:01:41 d-gp2-dbpg64-1 attrd[2647]:    error: Could not connect to the Cluster Process Group API: 2
Jun 20 22:01:41 d-gp2-dbpg64-1 stonith-ng[2646]:   notice: Connecting to cluster infrastructure: corosync
Jun 20 22:01:41 d-gp2-dbpg64-1 crmd[2648]:   notice: Additional logging available in /var/log/corosync/corosync.log
Jun 20 22:01:41 d-gp2-dbpg64-1 stonith-ng[2646]:    error: Could not connect to the Cluster Process Group API: 2
Jun 20 22:01:41 d-gp2-dbpg64-1 kernel: [  393.367015] show_signal_msg: 15 callbacks suppressed
Jun 20 22:01:41 d-gp2-dbpg64-1 kernel: [  393.367020] attrd[2647]: segfault at 1b8 ip 00007f8a4813a870 sp 00007ffc7a76f398 error 4 in libqb.so.0.17.2[7f8a4812d000+21000]
Jun 20 22:01:41 d-gp2-dbpg64-1 systemd[1]: pacemaker.service: Main process exited, code=exited, status=107/n/a
Jun 20 22:01:41 d-gp2-dbpg64-1 systemd[1]: pacemaker.service: Unit entered failed state.
Jun 20 22:01:41 d-gp2-dbpg64-1 systemd[1]: pacemaker.service: Failed with result 'exit-code'.



