[ClusterLabs] Unexpected behaviour of PEngine Recheck Timer while in maintenance mode
Rolf.Weber at asamnet.de
Mon Apr 20 16:24:07 EDT 2015
I encountered some strange behaviour of the PEngine Recheck Timer while in
maintenance mode, which ended in a reboot.
The setup is a 2-node cluster with one resource group consisting of several
DRBD devices and filesystems that are exported via NFS and iSCSI through a
roaming IP (see the attached crm_report output for details).
I put the cluster in maintenance mode using:
crm configure property maintenance-mode=true
The cluster ended up in unmanaged mode, as expected.
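For reference, whether maintenance mode actually took effect can be checked with the standard Pacemaker tools (a minimal sketch; the exact output wording depends on the Pacemaker version):

```shell
# Query the cluster-wide maintenance-mode property;
# it should report value=true after the change above
crm_attribute --type crm_config --name maintenance-mode --query

# A one-shot status dump should now flag resources as unmanaged
crm_mon -1
```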
109 Apr 19 23:01:50 astorage1 crmd: : info: do_te_invoke: Processing graph 473 (ref=pe_calc-dc-1429477310-1155) derived from /var/lib/pengine/pe-input-688.bz2
112 Apr 19 23:01:50 astorage1 lrmd: : info: cancel_op: operation monitor on astorage2-fencing for client 4150, its parameters: hostname=[astorage2] userid=[...] CRM_meta_timeout= CRM_meta_name=[monitor] passwd=[...] crm_feature_set=[3.0.6] ipaddr=[10.10.30.22] CRM_meta_interval= cancelled
168 Apr 19 23:01:51 astorage1 crmd: : notice: run_graph: ==== Transition 473 (Complete=31, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-688.bz2): Complete
At that point I started to do some work on the resources.
However, about 15 mins later:
191 Apr 19 23:16:51 astorage1 crmd: : info: crm_timer_popped: PEngine Recheck Timer (I_PE_CALC) just popped (900000ms)
192 Apr 19 23:16:51 astorage1 crmd: : notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
193 Apr 19 23:16:51 astorage1 crmd: : info: do_state_transition: Progressed to state S_POLICY_ENGINE after C_TIMER_POPPED
The PEngine Recheck Timer triggered and tried to cancel the monitors again;
this appears to have worked for all monitors except:
207 Apr 19 23:16:51 astorage1 crmd: : info: te_rsc_command: Initiating action 4: cancel astorage2-fencing_monitor_60000 on astorage1 (local)
208 Apr 19 23:16:51 astorage1 crmd: : info: cancel_op: No pending op found for astorage2-fencing:43
209 Apr 19 23:16:51 astorage1 lrmd: : info: on_msg_cancel_op: no operation with id 43
227 Apr 19 23:16:51 astorage1 crmd: : ERROR: do_lrm_invoke: Invalid resource definition
108 Apr 19 23:16:52 astorage2 lrmd: : info: on_msg_cancel_op: no operation with id 43
109 Apr 19 23:16:52 astorage2 crmd: : CRIT: lrm_connection_destroy: LRM Connection failed
110 Apr 19 23:16:52 astorage2 crmd: : info: lrm_connection_destroy: LRM Connection disconnected
111 Apr 19 23:16:52 astorage2 crmd: : ERROR: do_log: FSA: Input I_ERROR from lrm_connection_destroy() received in state S_NOT_DC
crmd was not too happy about that; it cleaned up and exited with rc=2.
As I have 'crm on' in ha.cf, heartbeat rebooted the system (both nodes).
As a workaround I have (for now) disabled the PEngine Recheck Timer:
crm_attribute -n cluster-recheck-interval -v 0
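For completeness, the workaround and its later rollback might look like this (a sketch; 15 minutes is the default for cluster-recheck-interval, matching the 900000ms seen in the log above):

```shell
# Disable the PEngine recheck timer for the maintenance window
crm_attribute -n cluster-recheck-interval -v 0

# After maintenance is finished, restore the 15-minute default
crm_attribute -n cluster-recheck-interval -v 15min
```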
IMHO the PEngine Recheck Timer should not take any action while in
unmanaged mode, the same way regular action timers' timeouts are ignored.
Additionally, I don't quite understand why crmd got so confused by the
missing op id 43 that it decided to quit (lrmd appears not to have crashed).
Was my assumption about that timer wrong, or is this indeed unexpected behaviour?
Please CC replies to my email-address, I am not subscribed to the list.
-------------- next part --------------
A non-text attachment was scrubbed...
Size: 67861 bytes
Desc: not available