Hello all,<br>We run Corosync/Pacemaker on 3 nodes (vserver1, vserver2, vserver3) with 28 resources defined, quorum enabled, and stonith (via IPMI/iLO).<br>Most of the time everything works correctly, but when we shut down the servers and try to restart them, strange things may happen:<br>
<br><div>1) All servers can get stuck in an unclean state (a node reports itself as unclean online and the others as unclean offline, OR itself as OK and the others as unclean offline). The only way to resolve this is to shut down the servers and start them several times, with a long interval between each start.<br>
<br>2) During the last startup of the servers, two servers were up and quorum was true. Then, when the 3rd started to boot, one of the two issued stonith against the 3rd, then against the other one, and then against itself (the last one failed). This happened several times, resulting in downtime.<br>
In the logs below, vserver1 should stonith vserver3 (since it is down), but it shouldn't stonith vserver2, since it has quorum with it.<br><br>We are in the dark, please help if you can.<div><br><div>Best Regards,<br>
<font face="courier new, monospace">Spiros Ioannou</font><div><br></div><div>(from hb_report's events.txt)</div><div><font face="courier new, monospace"><br>May 10 20:08:58 vserver1 corosync[3965]: [MAIN ] Corosync Cluster Engine ('1.2.1'): started and ready to provide service.<br>
May 10 20:42:25 vserver1 corosync[3965]: [pcmk ] info: pcmk_peer_update: memb: vserver1 218169610<br>May 10 20:42:25 vserver1 crmd: [3999]: info: crm_update_quorum: Updating quorum status to false (call=36)<br>May 10 20:48:44 vserver1 corosync[3965]: [pcmk ] info: pcmk_peer_update: memb: vserver1 218169610<br>
May 10 20:48:44 vserver1 crmd: [3999]: notice: ais_dispatch: Membership 4804: quorum acquired<br>May 10 20:48:44 <b>vserver1</b> crmd: [3999]: info: crm_update_quorum:<b> Updating quorum status to true</b> (call=87)<br>May 10 20:48:44 vserver2 corosync[4202]: [MAIN ] Corosync Cluster Engine ('1.2.1'): started and ready to provide service.<br>
May 10 20:48:45 vserver2 crmd: [4233]: notice: ais_dispatch: Membership 4804: quorum acquired<br>May 10 20:48:46 <b>vserver1</b> crmd: [3999]: info: crm_update_quorum: Updating <b>quorum status to true</b> (call=102)<br>
May 10 20:48:53 vserver1 crmd: [3999]: info: te_fence_node: Executing poweroff fencing operation (50) on vserver3 (timeout=180000)<br>
May 10 20:48:53 vserver1 stonithd: [3994]: info: client tengine [pid: 3999] requests a STONITH operation POWEROFF on node vserver3<br>May 10 20:49:00 vserver1 stonithd: [3994]: info: Succeeded to STONITH the node vserver3: optype=POWEROFF. whodoit: vserver1<br>
May 10 20:50:59 vserver1 crmd: [3999]: info: te_fence_node: Executing poweroff fencing operation (36) on vserver2 (timeout=180000)<br>May 10 20:50:59 vserver1 stonithd: [3994]: info: client tengine [pid: 3999] requests a STONITH operation POWEROFF on node vserver2<br>
May 10 20:51:06 vserver1 stonithd: [3994]: info: Succeeded to STONITH the node <font color="#ff0000">vserver2: optype=POWEROFF. whodoit: vserver1</font><br>May 10 20:51:06 vserver1 crmd: [3999]: info: te_fence_node: Executing poweroff fencing operation (19) on vserver1 (timeout=180000)<br>
May 10 20:51:06 vserver1 stonithd: [3994]: info: client tengine [pid: 3999] requests a STONITH operation POWEROFF on node vserver1<br>May 10 20:51:10 vserver1 corosync[3965]: [pcmk ] info: pcmk_peer_update: memb: vserver1 218169610<br>
May 10 20:51:10 vserver1 corosync[3965]: [pcmk ] info: pcmk_peer_update: lost: vserver2 184615178<br>May 10 20:51:10 vserver1 crmd: [3999]: notice: ais_dispatch: Membership 4808: quorum lost<br>May 10 20:51:10 vserver1 crmd: [3999]: info: crm_update_quorum: Updating quorum status to false (call=149)<br>
May 10 20:54:06 vserver1 stonithd: [3994]: ERROR: Failed to STONITH the node vserver1: optype=POWEROFF, op_result=TIMEOUT</font><br></div></div></div></div><div><font face="courier new, monospace"><br></font></div><div><span style="font-family:'courier new',monospace">(from hb_report's analysis.txt)</span></div>
<div><br></div><div><font face="courier new, monospace"><div>May 10 20:42:00 vserver1 crmd: [3999]: ERROR: crm_timer_popped: Integration Timer (I_INTEGRATED) just popped!</div><div>May 10 20:42:56 vserver1 lrmd: [3996]: ERROR: TrackedProcTimeoutFunction: vm-websrv:monitor process (PID 4191) will not die!</div>
<div>May 10 20:42:57 vserver1 lrmd: [3996]: ERROR: TrackedProcTimeoutFunction: vm-alm:monitor process (PID 4267) will not die!</div><div>May 10 20:42:57 vserver1 lrmd: [3996]: ERROR: TrackedProcTimeoutFunction: vm-be:monitor process (PID 4268) will not die!</div>
<div>May 10 20:42:57 vserver1 lrmd: [3996]: ERROR: TrackedProcTimeoutFunction: vm-cam:monitor process (PID 4269) will not die!</div><div>May 10 20:43:35 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM operation vm-websrv_monitor_0 (20) Timed Out (timeout=20000ms)</div>
<div>May 10 20:43:35 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM operation vm-alm_monitor_0 (10) Timed Out (timeout=20000ms)</div><div>May 10 20:43:35 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM operation vm-be_monitor_0 (11) Timed Out (timeout=20000ms)</div>
<div>May 10 20:43:35 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM operation vm-cam_monitor_0 (12) Timed Out (timeout=20000ms)</div><div>May 10 20:49:20 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM operation email-vm-cam_start_0 (44) Timed Out (timeout=20000ms)</div>
<div>May 10 20:49:20 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM operation email-vm-devops_start_0 (46) Timed Out (timeout=20000ms)</div><div>May 10 20:49:26 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM operation email-vm-router_start_0 (48) Timed Out (timeout=20000ms)</div>
<div>May 10 20:49:27 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM operation email-vm-zabbix_start_0 (49) Timed Out (timeout=20000ms)</div><div>May 10 20:49:40 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM operation email-vm-ipsecgw_start_0 (51) Timed Out (timeout=20000ms)</div>
<div>May 10 20:50:01 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM operation email-vm-cam_stop_0 (56) Timed Out (timeout=20000ms)</div><div>May 10 20:50:01 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM operation email-vm-devops_stop_0 (58) Timed Out (timeout=20000ms)</div>
<div>May 10 20:50:01 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM operation email-vm-router_stop_0 (59) Timed Out (timeout=20000ms)</div><div>May 10 20:50:22 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM operation email-vm-zabbix_stop_0 (61) Timed Out (timeout=20000ms)</div>
<div>May 10 20:50:23 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM operation email-vm-ipsecgw_stop_0 (63) Timed Out (timeout=20000ms)</div><div>May 10 20:50:59 vserver1 pengine: [3998]: ERROR: custom_action_order: Invalid inputs (nil).(nil) 0x127c090.(nil)</div>
<div>May 10 20:50:59 vserver1 pengine: [3998]: ERROR: custom_action_order: Invalid inputs (nil).(nil) 0x127c880.(nil)</div><div>May 10 20:50:59 vserver1 pengine: [3998]: ERROR: custom_action_order: Invalid inputs (nil).(nil) 0x127cd30.(nil)</div>
<div>May 10 20:50:59 vserver1 pengine: [3998]: ERROR: custom_action_order: Invalid inputs (nil).(nil) 0x127d1e0.(nil)</div><div><br></div></font></div>
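<div><br></div><div>To make the sequence in the events.txt excerpt easier to follow, here is a minimal Python sketch that lists each fencing request next to the quorum status the requesting node had last logged. It is not part of hb_report. The function name <font face="courier new, monospace">fencing_timeline</font> and the two regular expressions are my own assumptions, based only on the log format shown above, and the sample lines are copied from that excerpt with the HTML tags removed.</div>

```python
import re

# Sample lines copied from the events.txt excerpt above (HTML tags removed).
SAMPLE_LOG = """\
May 10 20:48:44 vserver1 crmd: [3999]: info: crm_update_quorum: Updating quorum status to true (call=87)
May 10 20:48:53 vserver1 crmd: [3999]: info: te_fence_node: Executing poweroff fencing operation (50) on vserver3 (timeout=180000)
May 10 20:50:59 vserver1 crmd: [3999]: info: te_fence_node: Executing poweroff fencing operation (36) on vserver2 (timeout=180000)
May 10 20:51:06 vserver1 crmd: [3999]: info: te_fence_node: Executing poweroff fencing operation (19) on vserver1 (timeout=180000)
May 10 20:51:10 vserver1 crmd: [3999]: info: crm_update_quorum: Updating quorum status to false (call=149)
"""

# Assumed patterns: they match only the crmd message formats seen in the excerpt.
QUORUM_RE = re.compile(
    r"^(\w+ +\d+ [\d:]+) (\S+) crmd: .*Updating quorum status to (true|false)"
)
FENCE_RE = re.compile(
    r"^(\w+ +\d+ [\d:]+) (\S+) crmd: .*te_fence_node: "
    r"Executing (\w+) fencing operation \(\d+\) on (\S+)"
)


def fencing_timeline(log_text):
    """Return (timestamp, requesting_node, target, action, quorum) tuples.

    quorum is the last quorum status logged by the requesting node before
    the fencing request, or None if no status was logged yet.
    """
    quorum = {}  # node name -> last logged quorum status (bool)
    events = []
    for line in log_text.splitlines():
        m = QUORUM_RE.match(line)
        if m:
            quorum[m.group(2)] = (m.group(3) == "true")
            continue
        m = FENCE_RE.match(line)
        if m:
            ts, node, action, target = m.groups()
            events.append((ts, node, target, action, quorum.get(node)))
    return events


if __name__ == "__main__":
    for ts, node, target, action, q in fencing_timeline(SAMPLE_LOG):
        print(f"{ts}  {node} requests {action} of {target}  (quorum on {node}: {q})")
```

<div>On the sample lines this lists three fencing requests from vserver1 (vserver3, vserver2, vserver1), all issued while vserver1's last logged quorum status was still true, which is the behaviour we cannot explain.</div>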