[Pacemaker] pacemaker strange behaviour

Dejan Muhamedagic dejanmm at fastmail.fm
Mon May 14 11:36:55 EDT 2012


Hi,

On Fri, May 11, 2012 at 06:05:27PM +0300, Spiros Ioannou wrote:
> Hello all,
> Corosync/pacemaker with 3 nodes(vserver1, vserver2, vserver3), 28 resources
> defined, with quorum, stonith (via ipmi/ilo).
> Most of the time all these work correctly, but when we shut down servers and
> try to restart them, strange things may happen:
> 
> 1) all servers could be stuck in an unclean state (unclean online itself,
> unclean offline the others, OR ok itself, unclean offline the others). The
> only way to resolve this is to shut the servers down and start them again
> with a large time interval in between, several times.
> 
> 2) in the last startup of the servers, two servers were up, quorum was
> true, and then when the 3rd started to boot one of the two issued stonith
> to the 3rd, and then to the other one, and then to itself (the latter
> failed).

A node cannot fence itself. That's prohibited.
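
As an aside, fencing devices are normally set up so that a node never runs
the device that would fence itself. A minimal crmsh sketch of that pattern
(assuming the external/ipmi agent from cluster-glue; the resource names,
addresses and credentials below are only placeholders):

  # one IPMI fencing device per node, banned from the node it fences
  crm configure primitive st-vserver3 stonith:external/ipmi \
      params hostname=vserver3 ipaddr=192.168.1.13 userid=admin passwd=secret interface=lan
  crm configure location l-st-vserver3 st-vserver3 -inf: vserver3
  # ...and the same pattern for st-vserver1 and st-vserver2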

> This happened several times, resulting in downtime.
> In the logs below vserver1 should stonith vserver3 (since it's down), but it
> shouldn't stonith vserver2 since it has quorum with it.
> 
> We are in the dark, please help if you can.

Is this new, i.e. was this cluster working fine for a while
earlier? I suppose you did some testing before going live.

Below are some notes, but in general I cannot offer much help.

> Best Regards,
> Spiros Ioannou
> 
> (from hb_report's events.txt)
> 
> May 10 20:08:58 vserver1 corosync[3965]:   [MAIN  ] Corosync Cluster Engine
> ('1.2.1'): started and ready to provide service.
> May 10 20:42:25 vserver1 corosync[3965]:   [pcmk  ] info: pcmk_peer_update:
> memb: vserver1 218169610
> May 10 20:42:25 vserver1 crmd: [3999]: info: crm_update_quorum: Updating
> quorum status to false (call=36)
> May 10 20:48:44 vserver1 corosync[3965]:   [pcmk  ] info: pcmk_peer_update:
> memb: vserver1 218169610
> May 10 20:48:44 vserver1 crmd: [3999]: notice: ais_dispatch: Membership
> 4804: quorum acquired
> May 10 20:48:44 vserver1 crmd: [3999]: info: crm_update_quorum: Updating
> quorum status to true (call=87)
> May 10 20:48:44 vserver2 corosync[4202]:   [MAIN  ] Corosync Cluster Engine
> ('1.2.1'): started and ready to provide service.
> May 10 20:48:45 vserver2 crmd: [4233]: notice: ais_dispatch: Membership
> 4804: quorum acquired
> May 10 20:48:46 vserver1 crmd: [3999]: info: crm_update_quorum: Updating
> quorum status to true (call=102)
> May 10 20:48:53 vserver1 crmd: [3999]: info: te_fence_node: Executing
> poweroff fencing operation (50) on vserver3 (timeout=180000)
> May 10 20:48:53 vserver1 stonithd: [3994]: info: client tengine [pid: 3999]
> requests a STONITH operation POWEROFF on node vserver3
> May 10 20:49:00 vserver1 stonithd: [3994]: info: Succeeded to STONITH the
> node vserver3: optype=POWEROFF. whodoit: vserver1
> May 10 20:50:59 vserver1 crmd: [3999]: info: te_fence_node: Executing
> poweroff fencing operation (36) on vserver2 (timeout=180000)
> May 10 20:50:59 vserver1 stonithd: [3994]: info: client tengine [pid: 3999]
> requests a STONITH operation POWEROFF on node vserver2
> May 10 20:51:06 vserver1 stonithd: [3994]: info: Succeeded to STONITH the
> node vserver2: optype=POWEROFF. whodoit: vserver1
> May 10 20:51:06 vserver1 crmd: [3999]: info: te_fence_node: Executing
> poweroff fencing operation (19) on vserver1 (timeout=180000)
> May 10 20:51:06 vserver1 stonithd: [3994]: info: client tengine [pid: 3999]
> requests a STONITH operation POWEROFF on node vserver1
> May 10 20:51:10 vserver1 corosync[3965]:   [pcmk  ] info: pcmk_peer_update:
> memb: vserver1 218169610
> May 10 20:51:10 vserver1 corosync[3965]:   [pcmk  ] info: pcmk_peer_update:
> lost: vserver2 184615178
> May 10 20:51:10 vserver1 crmd: [3999]: notice: ais_dispatch: Membership
> 4808: quorum lost
> May 10 20:51:10 vserver1 crmd: [3999]: info: crm_update_quorum: Updating
> quorum status to false (call=149)
> May 10 20:54:06 vserver1 stonithd: [3994]: ERROR: Failed to STONITH the
> node vserver1: optype=POWEROFF, op_result=TIMEOUT
> 
> AND (analysis.txt)
> 
> May 10 20:42:00 vserver1 crmd: [3999]: ERROR: crm_timer_popped: Integration
> Timer (I_INTEGRATED) just popped!
> May 10 20:42:56 vserver1 lrmd: [3996]: ERROR: TrackedProcTimeoutFunction:
> vm-websrv:monitor process (PID 4191) will not die!
> May 10 20:42:57 vserver1 lrmd: [3996]: ERROR: TrackedProcTimeoutFunction:
> vm-alm:monitor process (PID 4267) will not die!
> May 10 20:42:57 vserver1 lrmd: [3996]: ERROR: TrackedProcTimeoutFunction:
> vm-be:monitor process (PID 4268) will not die!
> May 10 20:42:57 vserver1 lrmd: [3996]: ERROR: TrackedProcTimeoutFunction:
> vm-cam:monitor process (PID 4269) will not die!

This seems to point to processes stuck in the D state (uninterruptible
sleep, waiting on a device). Is your storage hanging or temporarily gone?
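
If you want to verify that, something like the following should list any
processes stuck in uninterruptible sleep, together with the kernel function
they are blocked in (the exact ps options here are just one way to do it):

  # show the header plus all processes whose state starts with 'D'
  ps -eo pid,stat,wchan:32,args | awk 'NR==1 || $2 ~ /^D/'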

> May 10 20:43:35 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM
> operation vm-websrv_monitor_0 (20) Timed Out (timeout=20000ms)
> May 10 20:43:35 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM
> operation vm-alm_monitor_0 (10) Timed Out (timeout=20000ms)
> May 10 20:43:35 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM
> operation vm-be_monitor_0 (11) Timed Out (timeout=20000ms)
> May 10 20:43:35 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM
> operation vm-cam_monitor_0 (12) Timed Out (timeout=20000ms)
> May 10 20:49:20 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM
> operation email-vm-cam_start_0 (44) Timed Out (timeout=20000ms)
> May 10 20:49:20 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM
> operation email-vm-devops_start_0 (46) Timed Out (timeout=20000ms)
> May 10 20:49:26 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM
> operation email-vm-router_start_0 (48) Timed Out (timeout=20000ms)
> May 10 20:49:27 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM
> operation email-vm-zabbix_start_0 (49) Timed Out (timeout=20000ms)
> May 10 20:49:40 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM
> operation email-vm-ipsecgw_start_0 (51) Timed Out (timeout=20000ms)
> May 10 20:50:01 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM
> operation email-vm-cam_stop_0 (56) Timed Out (timeout=20000ms)
> May 10 20:50:01 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM
> operation email-vm-devops_stop_0 (58) Timed Out (timeout=20000ms)
> May 10 20:50:01 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM
> operation email-vm-router_stop_0 (59) Timed Out (timeout=20000ms)
> May 10 20:50:22 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM
> operation email-vm-zabbix_stop_0 (61) Timed Out (timeout=20000ms)
> May 10 20:50:23 vserver1 crmd: [3999]: ERROR: process_lrm_event: LRM
> operation email-vm-ipsecgw_stop_0 (63) Timed Out (timeout=20000ms)
> May 10 20:50:59 vserver1 pengine: [3998]: ERROR: custom_action_order:
> Invalid inputs (nil).(nil) 0x127c090.(nil)
> May 10 20:50:59 vserver1 pengine: [3998]: ERROR: custom_action_order:
> Invalid inputs (nil).(nil) 0x127c880.(nil)
> May 10 20:50:59 vserver1 pengine: [3998]: ERROR: custom_action_order:
> Invalid inputs (nil).(nil) 0x127cd30.(nil)
> May 10 20:50:59 vserver1 pengine: [3998]: ERROR: custom_action_order:
> Invalid inputs (nil).(nil) 0x127d1e0.(nil)

Hmm, that's rather cryptic. I don't know what it means.
Perhaps search the ML archives?
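
Before digging through the archives it may also be worth letting the tools
check the configuration itself; this won't necessarily explain the pengine
messages, but it rules out an obviously broken CIB:

  # validate the running CIB and print any warnings/errors it finds
  crm_verify --live-check --verbose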

Thanks,

Dejan

> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org




