[ClusterLabs] Unexpected behaviour of PEngine Recheck Timer while in maintenance mode

Rolf Weber Rolf.Weber at asamnet.de
Mon Apr 20 16:24:07 EDT 2015


Hi all!

I encountered some strange behaviour of the PEngine Recheck Timer while in
maintenance mode, which ended in a reboot of both nodes.

The setup is a 2-node cluster with one resource group consisting of several
DRBD devices and filesystems that are exported via NFS and iSCSI on a
roaming IP (see the attached crm_report output for details).
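
Very roughly, the relevant part of the configuration looks like the
following sketch (heavily simplified, resource names and parameters are
placeholders, the iSCSI target resources are omitted; the real configuration
is in the attached crm_report):

	# placeholder names/values, for illustration only
	primitive p_drbd_data ocf:linbit:drbd params drbd_resource=data \
		op monitor interval=30s
	primitive p_fs_data ocf:heartbeat:Filesystem \
		params device=/dev/drbd0 directory=/srv/data fstype=ext4
	primitive p_ip ocf:heartbeat:IPaddr2 params ip=192.0.2.10 cidr_netmask=24
	primitive p_nfs lsb:nfs-kernel-server
	ms ms_drbd_data p_drbd_data meta master-max=1 clone-max=2 notify=true
	group g_storage p_fs_data p_ip p_nfs
	colocation col_storage_on_drbd inf: g_storage ms_drbd_data:Master
	order ord_drbd_before_storage inf: ms_drbd_data:promote g_storage:start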

I put the cluster in maintenance mode using:
	crm configure property maintenance-mode=true
The cluster's resources ended up unmanaged, as expected.
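
(For reference, the property and the resulting state can be double-checked
with something like the following; crm_mon then lists the resources as
unmanaged:)

	crm_attribute -t crm_config -n maintenance-mode -G
	crm_mon -1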

	109 Apr 19 23:01:50 astorage1 crmd: [4150]: info: do_te_invoke: Processing graph 473 (ref=pe_calc-dc-1429477310-1155) derived from /var/lib/pengine/pe-input-688.bz2
	...
	112 Apr 19 23:01:50 astorage1 lrmd: [4147]: info: cancel_op: operation monitor[43] on astorage2-fencing for client 4150, its parameters: hostname=[astorage2] userid=[...] CRM_meta_timeout=[20000] CRM_meta_name=[monitor] passwd=[...] crm_feature_set=[3.0.6] ipaddr=[10.10.30.22] CRM_meta_interval=[60000]  cancelled
	...
	168 Apr 19 23:01:51 astorage1 crmd: [4150]: notice: run_graph: ==== Transition 473 (Complete=31, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-688.bz2): Complete


At that point I started to do some work on the resources.

However, about 15 minutes later, when the default cluster-recheck-interval expired:
	191 Apr 19 23:16:51 astorage1 crmd: [4150]: info: crm_timer_popped: PEngine Recheck Timer (I_PE_CALC) just popped (900000ms)
	192 Apr 19 23:16:51 astorage1 crmd: [4150]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
	193 Apr 19 23:16:51 astorage1 crmd: [4150]: info: do_state_transition: Progressed to state S_POLICY_ENGINE after C_TIMER_POPPED

The PEngine Recheck Timer triggered and the resulting transition tried to
cancel the monitors again. This appears to have worked for all monitors
except this one:

	207 Apr 19 23:16:51 astorage1 crmd: [4150]: info: te_rsc_command: Initiating action 4: cancel astorage2-fencing_monitor_60000 on astorage1 (local)
	208 Apr 19 23:16:51 astorage1 crmd: [4150]: info: cancel_op: No pending op found for astorage2-fencing:43
	209 Apr 19 23:16:51 astorage1 lrmd: [4147]: info: on_msg_cancel_op: no operation with id 43
	...
	227 Apr 19 23:16:51 astorage1 crmd: [4150]: ERROR: do_lrm_invoke: Invalid resource definition

On the other node:

	108 Apr 19 23:16:52 astorage2 lrmd: [4015]: info: on_msg_cancel_op: no operation with id 43
	109 Apr 19 23:16:52 astorage2 crmd: [4018]: CRIT: lrm_connection_destroy: LRM Connection failed
	110 Apr 19 23:16:52 astorage2 crmd: [4018]: info: lrm_connection_destroy: LRM Connection disconnected
	111 Apr 19 23:16:52 astorage2 crmd: [4018]: ERROR: do_log: FSA: Input I_ERROR from lrm_connection_destroy() received in state S_NOT_DC

crmd was not too happy about that; it cleaned up and exited with rc=2.
Since I have 'crm on' in ha.cf, heartbeat then rebooted the system (both nodes).


As a workaround I have disabled the PEngine Recheck Timer for now:
	crm_attribute -n cluster-recheck-interval -v 0
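
Once this is resolved I intend to restore the default with something like:

	crm_attribute -n cluster-recheck-interval -D
	# or set it back to the 15-minute default explicitly:
	crm_attribute -n cluster-recheck-interval -v 15min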


IMHO the PEngine Recheck Timer should not take any action while the cluster
is in maintenance mode, the same way regular action timeouts are ignored there.
Additionally, I don't quite understand why crmd got so confused by the
missing op id 43 that it decided to quit (lrmd itself appears not to have crashed).

Was my assumption about that timer wrong or is this indeed unexpected
behaviour?

Please CC replies to my email-address, I am not subscribed to the list.

Greets,
Rolf Weber

-------------- next part --------------
A non-text attachment was scrubbed...
Name: reboot-while-in-maintenance-mode.tbz2
Type: application/octet-stream
Size: 67861 bytes
Desc: not available
URL: <http://lists.clusterlabs.org/pipermail/users/attachments/20150420/1c6f67d8/attachment-0002.obj>

