[ClusterLabs] Fencing on 2-node cluster

Digimer lists at alteeve.ca
Wed Jun 20 22:12:53 UTC 2018


Silly question: did you actually enable stonith? Can you share your config?
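
In case it helps to check: the cluster option is `stonith-enabled`, stored as an nvpair under `crm_config` in the CIB, and Pacemaker treats it as true when it is not set at all. Below is a minimal Python sketch, assuming you have dumped the CIB to XML (for example with `cibadmin --query`). The sample XML, the nvpair ids and the function name are made up for illustration; they are not taken from your cluster.

```python
import xml.etree.ElementTree as ET

# Hypothetical CIB fragment, for illustration only. A real one would come
# from something like `cibadmin --query > cib.xml`.
SAMPLE_CIB = """\
<cib>
  <configuration>
    <crm_config>
      <cluster_property_set id="cib-bootstrap-options">
        <nvpair id="opt-stonith" name="stonith-enabled" value="false"/>
        <nvpair id="opt-quorum" name="no-quorum-policy" value="ignore"/>
      </cluster_property_set>
    </crm_config>
  </configuration>
</cib>
"""

# Spellings assumed to count as "true" for a boolean cluster option.
TRUE_VALUES = {"true", "yes", "on", "y", "1"}


def stonith_enabled(cib_xml: str) -> bool:
    """Return the effective stonith-enabled setting found in CIB XML text."""
    root = ET.fromstring(cib_xml)  # root element is <cib>
    path = "./configuration/crm_config/cluster_property_set/nvpair"
    for nvpair in root.findall(path):
        if nvpair.get("name") == "stonith-enabled":
            return nvpair.get("value", "").strip().lower() in TRUE_VALUES
    # Option not present: the default is enabled.
    return True


if __name__ == "__main__":
    print("stonith-enabled:", stonith_enabled(SAMPLE_CIB))
```

Even when this reports true, a fence device still has to be configured and working for a lost node to actually get fenced, so the full config is still the most useful thing to share.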

digimer

On 2018-06-20 06:04 PM, Casey & Gina wrote:
>> On 2018-06-20, at 3:59 PM, Casey & Gina <caseyandgina at icloud.com> wrote:
>>
>>> Get the cluster healthy, tail the system logs from both nodes, trigger a
>>> fault and wait for things to settle. Then share the logs please.
>>
>> What do you mean by "system logs"?  Do you mean the corosync.log?  Triggering a fault is powering off a node, so I can't get a tailed log file from that host.  Is there another mechanism I should try?
> 
> Sorry, I did a little more research.  I guess you mean the syslog, and realized I could `killall -9 corosync` to trigger a failure.  Let me know if there is a better way or this is okay...
> 
> Here are the logs:
> 
> Node that was "master" to start with, that I did not kill corosync on:
> 
> Jun 20 21:57:55 d-gp2-dbpg64-2 crmd[6721]:   notice: Operation postgresql-10-main_notify_0: ok (node=d-gp2-dbpg64-2, call=36, rc=0, cib-update=0, confirmed=true)
> Jun 20 21:57:55 d-gp2-dbpg64-2 crmd[6721]:   notice: Transition 5 (Complete=12, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-58.bz2): Complete
> Jun 20 21:57:55 d-gp2-dbpg64-2 crmd[6721]:   notice: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
> Jun 20 21:58:10 d-gp2-dbpg64-2 pgsqlms(postgresql-10-main)[15918]: INFO: Update score of "d-gp2-dbpg64-1" from -1000 to 1000 because of a change in the replication lag (0).
> Jun 20 21:58:10 d-gp2-dbpg64-2 crmd[6721]:   notice: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
> Jun 20 21:58:10 d-gp2-dbpg64-2 pengine[2499]:   notice: On loss of CCM Quorum: Ignore
> Jun 20 21:58:10 d-gp2-dbpg64-2 crmd[6721]:   notice: Transition 6 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-59.bz2): Complete
> Jun 20 21:58:10 d-gp2-dbpg64-2 crmd[6721]:   notice: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
> Jun 20 21:58:10 d-gp2-dbpg64-2 pengine[2499]:   notice: Calculated Transition 6: /var/lib/pacemaker/pengine/pe-input-59.bz2
> Jun 20 21:58:13 d-gp2-dbpg64-2 snmpd[1468]: error on subcontainer 'ia_addr' insert (-1)
> Jun 20 22:01:13 d-gp2-dbpg64-2 snmpd[1468]: message repeated 6 times: [ error on subcontainer 'ia_addr' insert (-1)]
> Jun 20 22:01:42 d-gp2-dbpg64-2 corosync[6683]: notice  [TOTEM ] A processor failed, forming new configuration.
> Jun 20 22:01:42 d-gp2-dbpg64-2 corosync[6683]:  [TOTEM ] A processor failed, forming new configuration.
> Jun 20 22:01:43 d-gp2-dbpg64-2 snmpd[1468]: error on subcontainer 'ia_addr' insert (-1)
> Jun 20 22:01:43 d-gp2-dbpg64-2 corosync[6683]: notice  [TOTEM ] A new membership (10.124.164.249:260) was formed. Members left: 1
> Jun 20 22:01:43 d-gp2-dbpg64-2 corosync[6683]: notice  [TOTEM ] Failed to receive the leave message. failed: 1
> Jun 20 22:01:43 d-gp2-dbpg64-2 corosync[6683]:  [TOTEM ] A new membership (10.124.164.249:260) was formed. Members left: 1
> Jun 20 22:01:43 d-gp2-dbpg64-2 corosync[6683]:  [TOTEM ] Failed to receive the leave message. failed: 1
> Jun 20 22:01:43 d-gp2-dbpg64-2 pacemakerd[6716]:   notice: crm_reap_unseen_nodes: Node d-gp2-dbpg64-1[1] - state is now lost (was member)
> Jun 20 22:01:43 d-gp2-dbpg64-2 crmd[6721]:   notice: crm_reap_unseen_nodes: Node d-gp2-dbpg64-1[1] - state is now lost (was member)
> Jun 20 22:01:43 d-gp2-dbpg64-2 attrd[6720]:   notice: crm_update_peer_proc: Node d-gp2-dbpg64-1[1] - state is now lost (was member)
> Jun 20 22:01:43 d-gp2-dbpg64-2 stonith-ng[6719]:   notice: crm_update_peer_proc: Node d-gp2-dbpg64-1[1] - state is now lost (was member)
> Jun 20 22:01:43 d-gp2-dbpg64-2 attrd[6720]:   notice: Removing d-gp2-dbpg64-1/1 from the membership list
> Jun 20 22:01:43 d-gp2-dbpg64-2 attrd[6720]:   notice: Purged 1 peers with id=1 and/or uname=d-gp2-dbpg64-1 from the membership cache
> Jun 20 22:01:43 d-gp2-dbpg64-2 stonith-ng[6719]:   notice: Removing d-gp2-dbpg64-1/1 from the membership list
> Jun 20 22:01:43 d-gp2-dbpg64-2 stonith-ng[6719]:   notice: Purged 1 peers with id=1 and/or uname=d-gp2-dbpg64-1 from the membership cache
> Jun 20 22:01:43 d-gp2-dbpg64-2 cib[6718]:   notice: crm_update_peer_proc: Node d-gp2-dbpg64-1[1] - state is now lost (was member)
> Jun 20 22:01:43 d-gp2-dbpg64-2 cib[6718]:   notice: Removing d-gp2-dbpg64-1/1 from the membership list
> Jun 20 22:01:43 d-gp2-dbpg64-2 cib[6718]:   notice: Purged 1 peers with id=1 and/or uname=d-gp2-dbpg64-1 from the membership cache
> Jun 20 22:01:43 d-gp2-dbpg64-2 crmd[6721]:   notice: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
> Jun 20 22:01:44 d-gp2-dbpg64-2 pengine[2499]:   notice: On loss of CCM Quorum: Ignore
> Jun 20 22:01:44 d-gp2-dbpg64-2 crmd[6721]:   notice: Transition 7 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-60.bz2): Complete
> Jun 20 22:01:44 d-gp2-dbpg64-2 crmd[6721]:   notice: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
> Jun 20 22:01:44 d-gp2-dbpg64-2 pengine[2499]:   notice: Calculated Transition 7: /var/lib/pacemaker/pengine/pe-input-60.bz2
> Jun 20 22:01:57 d-gp2-dbpg64-2 pgsqlms(postgresql-10-main)[17381]: INFO: Ignoring unknown application_name/node "d-gp2-dbpg64-1"
> 
> Node that was a standby, which I kill -9'd corosync on:
> 
> Jun 20 21:57:52 d-gp2-dbpg64-1 stonith-ng[2035]:   notice: On loss of CCM Quorum: Ignore
> Jun 20 21:57:54 d-gp2-dbpg64-1 crmd[2039]:   notice: State transition S_PENDING -> S_NOT_DC [ input=I_NOT_DC cause=C_HA_MESSAGE origin=do_cl_join_finalize_respond ]
> Jun 20 21:57:54 d-gp2-dbpg64-1 stonith-ng[2035]:   notice: Versions did not change in patch 0.81.8
> Jun 20 21:57:54 d-gp2-dbpg64-1 crmd[2039]:   notice: Operation postgresql-master-vip_monitor_0: not running (node=d-gp2-dbpg64-1, call=5, rc=7, cib-update=12, confirmed=true)
> Jun 20 21:57:54 d-gp2-dbpg64-1 crmd[2039]:   notice: Operation postgresql-10-main_monitor_0: not running (node=d-gp2-dbpg64-1, call=10, rc=7, cib-update=13, confirmed=true)
> Jun 20 21:57:54 d-gp2-dbpg64-1 crmd[2039]:   notice: d-gp2-dbpg64-1-postgresql-10-main_monitor_0:10 [ /var/run/postgresql:5432 - no response\npg_ctl: no server running\n ]
> Jun 20 21:57:54 d-gp2-dbpg64-1 crmd[2039]:   notice: Operation vfencing_monitor_0: not running (node=d-gp2-dbpg64-1, call=14, rc=7, cib-update=14, confirmed=true)
> Jun 20 21:57:55 d-gp2-dbpg64-1 pgsqlms(postgresql-10-main)[2155]: INFO: Instance "postgresql-10-main" started
> Jun 20 21:57:55 d-gp2-dbpg64-1 crmd[2039]:   notice: Operation postgresql-10-main_start_0: ok (node=d-gp2-dbpg64-1, call=15, rc=0, cib-update=15, confirmed=true)
> Jun 20 21:57:55 d-gp2-dbpg64-1 crmd[2039]:   notice: Operation postgresql-10-main_notify_0: ok (node=d-gp2-dbpg64-1, call=16, rc=0, cib-update=0, confirmed=true)
> Jun 20 22:01:32 d-gp2-dbpg64-1 systemd[1]: Started Session 2 of user cshobe.
> Jun 20 22:01:41 d-gp2-dbpg64-1 systemd[1]: corosync.service: Main process exited, code=killed, status=9/KILL
> Jun 20 22:01:41 d-gp2-dbpg64-1 systemd[1]: corosync.service: Unit entered failed state.
> Jun 20 22:01:41 d-gp2-dbpg64-1 systemd[1]: corosync.service: Failed with result 'signal'.
> Jun 20 22:01:41 d-gp2-dbpg64-1 lrmd[2036]:  warning: new_event_notification (2036-2039-8): Bad file descriptor (9)
> Jun 20 22:01:41 d-gp2-dbpg64-1 cib[2034]:    error: Connection to the CPG API failed: Library error (2)
> Jun 20 22:01:41 d-gp2-dbpg64-1 stonith-ng[2035]:    error: Connection to the CPG API failed: Library error (2)
> Jun 20 22:01:41 d-gp2-dbpg64-1 attrd[2037]:    error: Connection to the CPG API failed: Library error (2)
> Jun 20 22:01:41 d-gp2-dbpg64-1 attrd[2037]:   notice: Disconnecting client 0x559e6c2c8810, pid=2039...
> Jun 20 22:01:41 d-gp2-dbpg64-1 lrmd[2036]:    error: Connection to stonith-ng failed
> Jun 20 22:01:41 d-gp2-dbpg64-1 lrmd[2036]:    error: Connection to stonith-ng[0x55852f94ff10] closed (I/O condition=17)
> Jun 20 22:01:41 d-gp2-dbpg64-1 pacemakerd[2030]:    error: Connection to the CPG API failed: Library error (2)
> Jun 20 22:01:41 d-gp2-dbpg64-1 attrd[2647]:   notice: Additional logging available in /var/log/corosync/corosync.log
> Jun 20 22:01:41 d-gp2-dbpg64-1 stonith-ng[2646]:   notice: Additional logging available in /var/log/corosync/corosync.log
> Jun 20 22:01:41 d-gp2-dbpg64-1 attrd[2647]:   notice: Connecting to cluster infrastructure: corosync
> Jun 20 22:01:41 d-gp2-dbpg64-1 attrd[2647]:    error: Could not connect to the Cluster Process Group API: 2
> Jun 20 22:01:41 d-gp2-dbpg64-1 stonith-ng[2646]:   notice: Connecting to cluster infrastructure: corosync
> Jun 20 22:01:41 d-gp2-dbpg64-1 crmd[2648]:   notice: Additional logging available in /var/log/corosync/corosync.log
> Jun 20 22:01:41 d-gp2-dbpg64-1 stonith-ng[2646]:    error: Could not connect to the Cluster Process Group API: 2
> Jun 20 22:01:41 d-gp2-dbpg64-1 kernel: [  393.367015] show_signal_msg: 15 callbacks suppressed
> Jun 20 22:01:41 d-gp2-dbpg64-1 kernel: [  393.367020] attrd[2647]: segfault at 1b8 ip 00007f8a4813a870 sp 00007ffc7a76f398 error 4 in libqb.so.0.17.2[7f8a4812d000+21000]
> Jun 20 22:01:41 d-gp2-dbpg64-1 systemd[1]: pacemaker.service: Main process exited, code=exited, status=107/n/a
> Jun 20 22:01:41 d-gp2-dbpg64-1 systemd[1]: pacemaker.service: Unit entered failed state.
> Jun 20 22:01:41 d-gp2-dbpg64-1 systemd[1]: pacemaker.service: Failed with result 'exit-code'.
> 
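
One thing that stands out in the log from the surviving node: Transition 7, calculated right after d-gp2-dbpg64-1 was marked lost, finished with Complete=0, so the policy engine scheduled no actions at all. That would be consistent with fencing being disabled or not configured, though that is my reading and not something the log states outright. A rough Python sketch for pulling those two facts out of a syslog excerpt; the function name is hypothetical, and the sample lines are copied from the log quoted above.

```python
import re

# crmd's per-transition summary line, e.g. "Transition 7 (Complete=0, ..."
TRANSITION_RE = re.compile(r"Transition (?P<num>\d+) \(Complete=(?P<complete>\d+),")
# Membership loss as reported by the pacemaker daemons.
LOST_RE = re.compile(r"Node (?P<node>\S+)\[\d+\] - state is now lost")

SAMPLE_LOG = [
    "Jun 20 21:57:55 d-gp2-dbpg64-2 crmd[6721]:   notice: Transition 5 (Complete=12, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-58.bz2): Complete",
    "Jun 20 22:01:43 d-gp2-dbpg64-2 crmd[6721]:   notice: crm_reap_unseen_nodes: Node d-gp2-dbpg64-1[1] - state is now lost (was member)",
    "Jun 20 22:01:44 d-gp2-dbpg64-2 crmd[6721]:   notice: Transition 7 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-60.bz2): Complete",
]


def summarize(lines):
    """Return ({transition number: completed action count}, {lost node names})."""
    transitions = {}
    lost_nodes = set()
    for line in lines:
        match = TRANSITION_RE.search(line)
        if match:
            transitions[int(match.group("num"))] = int(match.group("complete"))
        match = LOST_RE.search(line)
        if match:
            lost_nodes.add(match.group("node"))
    return transitions, lost_nodes


if __name__ == "__main__":
    transitions, lost_nodes = summarize(SAMPLE_LOG)
    print("completed actions per transition:", transitions)
    print("nodes reported lost:", sorted(lost_nodes))
```

A transition with zero actions immediately after a node is lost is only a hint; the config is what will confirm it.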


-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould

