[Pacemaker] pacemaker strange behaviour

Fri May 11 11:05:27 EDT 2012

Hello all,
We run Corosync/Pacemaker on 3 nodes (vserver1, vserver2, vserver3) with 28
resources defined, quorum enabled, and stonith (via IPMI/iLO).
Most of the time everything works correctly, but when we shut down the servers
and try to restart them, strange things may happen:

1) All servers can get stuck in an unclean state (each node sees itself as
unclean online and the others as unclean offline, OR itself as OK and the
others as unclean offline). The only way we have found to resolve this is to
shut down the servers and start them with a long interval between them,
several times.

2) During the last startup of the servers, two servers were up and quorum was
true. Then, when the 3rd started to boot, one of the two issued stonith
to the 3rd, then to the other one, and then to itself (the last one
failed). This happened several times, resulting in downtime.
In the logs below, vserver1 should stonith vserver3 (since it is down), but it
shouldn't stonith vserver2, since it has quorum with it.

We are in the dark, please help if you can.

Best Regards,
Spiros Ioannou

(from hb_report's events.txt)

May 10 20:08:58 vserver1 corosync[3965]:   [MAIN  ] Corosync Cluster Engine
('1.2.1'): started and ready to provide service.
May 10 20:42:25 vserver1 corosync[3965]:   [pcmk  ] info: pcmk_peer_update:
memb: vserver1 218169610
May 10 20:42:25 vserver1 crmd: [3999]: info: crm_update_quorum: Updating
quorum status to false (call=36)
May 10 20:48:44 vserver1 corosync[3965]:   [pcmk  ] info: pcmk_peer_update:
memb: vserver1 218169610
May 10 20:48:44 vserver1 crmd: [3999]: notice: ais_dispatch: Membership
4804: quorum acquired
May 10 20:48:44 vserver1 crmd: [3999]: info: crm_update_quorum: Updating
quorum status to true (call=87)
May 10 20:48:44 vserver2 corosync[4202]:   [MAIN  ] Corosync Cluster Engine
('1.2.1'): started and ready to provide service.
May 10 20:48:45 vserver2 crmd: [4233]: notice: ais_dispatch: Membership
4804: quorum acquired
May 10 20:48:46 vserver1 crmd: [3999]: info: crm_update_quorum: Updating
quorum status to true (call=102)
May 10 20:48:53 vserver1 crmd: [3999]: info: te_fence_node: Executing
poweroff fencing operation (50) on vserver3 (timeout=180000)
May 10 20:48:53 vserver1 stonithd: [3994]: info: client tengine [pid: 3999]
requests a STONITH operation POWEROFF on node vserver3
May 10 20:49:00 vserver1 stonithd: [3994]: info: Succeeded to STONITH the
node vserver3: optype=POWEROFF. whodoit: vserver1
May 10 20:50:59 vserver1 crmd: [3999]: info: te_fence_node: Executing
poweroff fencing operation (36) on vserver2 (timeout=180000)
May 10 20:50:59 vserver1 stonithd: [3994]: info: client tengine [pid: 3999]
requests a STONITH operation POWEROFF on node vserver2
May 10 20:51:06 vserver1 stonithd: [3994]: info: Succeeded to STONITH the
node vserver2: optype=POWEROFF. whodoit: vserver1
May 10 20:51:06 vserver1 crmd: [3999]: info: te_fence_node: Executing
poweroff fencing operation (19) on vserver1 (timeout=180000)
May 10 20:51:06 vserver1 stonithd: [3994]: info: client tengine [pid: 3999]
requests a STONITH operation POWEROFF on node vserver1
May 10 20:51:10 vserver1 corosync[3965]:   [pcmk  ] info: pcmk_peer_update:
memb: vserver1 218169610
May 10 20:51:10 vserver1 corosync[3965]:   [pcmk  ] info: pcmk_peer_update:
lost: vserver2 184615178
May 10 20:51:10 vserver1 crmd: [3999]: notice: ais_dispatch: Membership
4808: quorum lost
May 10 20:51:10 vserver1 crmd: [3999]: info: crm_update_quorum: Updating
quorum status to false (call=149)
May 10 20:54:06 vserver1 stonithd: [3994]: ERROR: Failed to STONITH the
node vserver1: optype=POWEROFF, op_result=TIMEOUT
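
To make the order of events easier to follow, here is a minimal sketch of a script that pulls only the quorum changes and fencing requests out of records like the ones above. It is not part of hb_report or Pacemaker; the names `join_wrapped`, `timeline`, `FENCE`, `QUORUM` and `SAMPLE` are made up for this illustration, and the patterns assume the exact message wording shown in this excerpt (other versions may log differently). The sample records are taken from the excerpt above, including the line wrapping added by the mail client.

```python
import re

# A log record starts with a syslog-style timestamp such as "May 10 20:48:53".
STAMP = re.compile(r"^[A-Z][a-z]{2} +\d+ \d{2}:\d{2}:\d{2} ")

# Patterns assume the message wording seen in this excerpt only.
FENCE = re.compile(
    r"^\w+ +\d+ (\S+) (\S+) .*te_fence_node: Executing (\w+) "
    r"fencing operation \(\d+\) on (\S+)"
)
QUORUM = re.compile(
    r"^\w+ +\d+ (\S+) (\S+) .*crm_update_quorum: "
    r"Updating quorum status to (true|false)"
)


def join_wrapped(lines):
    """Re-join records that were wrapped onto a second line."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if STAMP.match(line) or not records:
            records.append(line)          # start of a new record
        else:
            records[-1] += " " + line     # continuation of the previous one
    return records


def timeline(records):
    """Return (time, reporting node, event) for quorum and fencing records."""
    events = []
    for rec in records:
        m = FENCE.match(rec)
        if m:
            time, node, op, target = m.groups()
            events.append((time, node, f"fence {target} ({op})"))
            continue
        m = QUORUM.match(rec)
        if m:
            time, node, state = m.groups()
            events.append((time, node, f"quorum={state}"))
    return events


SAMPLE = """\
May 10 20:48:44 vserver1 crmd: [3999]: info: crm_update_quorum: Updating
quorum status to true (call=87)
May 10 20:48:53 vserver1 crmd: [3999]: info: te_fence_node: Executing
poweroff fencing operation (50) on vserver3 (timeout=180000)
May 10 20:50:59 vserver1 crmd: [3999]: info: te_fence_node: Executing
poweroff fencing operation (36) on vserver2 (timeout=180000)
May 10 20:51:06 vserver1 crmd: [3999]: info: te_fence_node: Executing
poweroff fencing operation (19) on vserver1 (timeout=180000)
May 10 20:51:10 vserver1 crmd: [3999]: info: crm_update_quorum: Updating
quorum status to false (call=149)
"""

if __name__ == "__main__":
    for time, node, event in timeline(join_wrapped(SAMPLE.splitlines())):
        print(time, node, event)
```

On the sample this lists quorum=true at 20:48:44, the three fencing requests (vserver3, vserver2, vserver1), and quorum=false at 20:51:10, which is the sequence described in point 2.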

And (from hb_report's analysis.txt)

May 10 20:42:00 vserver1 crmd: [3999]: ERROR: crm_timer_popped: Integration
Timer (I_INTEGRATED) just popped!
May 10 20:42:56 vserver1 lrmd: [3996]: ERROR: TrackedProcTimeoutFunction:
vm-websrv:monitor process (PID 4191) will not die!
May 10 20:42:57 vserver1 lrmd: [3996]: ERROR: TrackedProcTimeoutFunction:
vm-alm:monitor process (PID 4267) will not die!
May 10 20:42:57 vserver1 lrmd: [3996]: ERROR: TrackedProcTimeoutFunction:
vm-be:monitor process (PID 4268) will not die!
May 10 20:42:57 vserver1 lrmd: [3996]: ERROR: TrackedProcTimeoutFunction:
vm-cam:monitor process (PID 4269) will not die!
May 10 20:43:35 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM
operation vm-websrv_monitor_0 (20) Timed Out (timeout=20000ms)
May 10 20:43:35 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM
operation vm-alm_monitor_0 (10) Timed Out (timeout=20000ms)
May 10 20:43:35 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM
operation vm-be_monitor_0 (11) Timed Out (timeout=20000ms)
May 10 20:43:35 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM
operation vm-cam_monitor_0 (12) Timed Out (timeout=20000ms)
May 10 20:49:20 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM
operation email-vm-cam_start_0 (44) Timed Out (timeout=20000ms)
May 10 20:49:20 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM
operation email-vm-devops_start_0 (46) Timed Out (timeout=20000ms)
May 10 20:49:26 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM
operation email-vm-router_start_0 (48) Timed Out (timeout=20000ms)
May 10 20:49:27 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM
operation email-vm-zabbix_start_0 (49) Timed Out (timeout=20000ms)
May 10 20:49:40 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM
operation email-vm-ipsecgw_start_0 (51) Timed Out (timeout=20000ms)
May 10 20:50:01 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM
operation email-vm-cam_stop_0 (56) Timed Out (timeout=20000ms)
May 10 20:50:01 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM
operation email-vm-devops_stop_0 (58) Timed Out (timeout=20000ms)
May 10 20:50:01 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM
operation email-vm-router_stop_0 (59) Timed Out (timeout=20000ms)
May 10 20:50:22 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM
operation email-vm-zabbix_stop_0 (61) Timed Out (timeout=20000ms)
May 10 20:50:23 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM
operation email-vm-ipsecgw_stop_0 (63) Timed Out (timeout=20000ms)
May 10 20:50:59 vserver1 pengine: [3998]: ERROR: custom_action_order:
Invalid inputs (nil).(nil) 0x127c090.(nil)
May 10 20:50:59 vserver1 pengine: [3998]: ERROR: custom_action_order:
Invalid inputs (nil).(nil) 0x127c880.(nil)
May 10 20:50:59 vserver1 pengine: [3998]: ERROR: custom_action_order:
Invalid inputs (nil).(nil) 0x127cd30.(nil)
May 10 20:50:59 vserver1 pengine: [3998]: ERROR: custom_action_order:
Invalid inputs (nil).(nil) 0x127d1e0.(nil)
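
The timeouts above can also be summarised per operation type. The following is a rough sketch, assuming the "LRM operation ... Timed Out (timeout=...ms)" wording shown in this excerpt; `TIMED_OUT`, `count_timeouts` and `SAMPLE` are names invented for this example and are not part of any Pacemaker tool. Whitespace is matched loosely so that wrapped lines do not need to be re-joined first.

```python
import re
from collections import Counter

# \s+ also matches the line break inserted between "LRM" and "operation".
TIMED_OUT = re.compile(
    r"LRM\s+operation\s+(\S+?)_(monitor|start|stop)_\d+\s+\(\d+\)\s+"
    r"Timed\s+Out\s+\(timeout=(\d+)ms\)"
)


def count_timeouts(text):
    """Count timed-out operations by (operation type, timeout in ms)."""
    counts = Counter()
    for _resource, op, timeout_ms in TIMED_OUT.findall(text):
        counts[(op, int(timeout_ms))] += 1
    return counts


# A few records taken from the analysis.txt excerpt above.
SAMPLE = """\
May 10 20:43:35 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM
operation vm-websrv_monitor_0 (20) Timed Out (timeout=20000ms)
May 10 20:43:35 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM
operation vm-alm_monitor_0 (10) Timed Out (timeout=20000ms)
May 10 20:49:20 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM
operation email-vm-cam_start_0 (44) Timed Out (timeout=20000ms)
May 10 20:50:01 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM
operation email-vm-cam_stop_0 (56) Timed Out (timeout=20000ms)
"""

if __name__ == "__main__":
    for (op, timeout_ms), n in sorted(count_timeouts(SAMPLE).items()):
        print(f"{op}: {n} timed out at {timeout_ms} ms")
```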