[ClusterLabs] Unexpected behaviour of PEngine Recheck Timer while in maintenance mode
Rolf.Weber at asamnet.de
Fri Apr 24 21:51:08 EDT 2015
Hi all again!
Can I at least get an ack that somebody has read this?
Preferably from Andrew or Ulrich?
To recap: the PEngine Recheck Timer firing in maintenance mode is unexpected,
and lrm_connection_destroy() should _not_ happen... I think.
On 22:46 Wed 22 Apr , Rolf Weber wrote:
> Nobody has an idea why this happened?
> Rolf Weber
> On 22:24 Mon 20 Apr , Rolf Weber wrote:
> > Hi all!
> > I encountered some strange behaviour of the PEngine Recheck Timer while in
> > maintenance mode ending in a reboot.
> > Setup is a 2-node cluster with one resource group consisting of several DRBD
> > devices and filesystems exported via NFS and iSCSI through a roaming IP (see
> > the attached crm_report output for details).
> > I put the cluster in maintenance mode using:
> > crm configure property maintenance-mode=true
> > Cluster ended up in unmanaged mode, as expected.
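For reference, a minimal sketch of how maintenance mode can be set, verified, and later cleared with the crmsh/Pacemaker CLI tools of that era (command availability and exact flags may vary with your installed versions, so treat this as a sketch rather than a definitive procedure):

```shell
# Put the cluster into maintenance mode: resources keep running,
# but the cluster stops starting/stopping/monitoring them (unmanaged)
crm configure property maintenance-mode=true

# Query the property to confirm it took effect (-G = get/query)
crm_attribute -G -n maintenance-mode

# When the maintenance work is done, hand control back to the cluster
crm configure property maintenance-mode=false
```

These commands must run on a live cluster node; `crm_attribute -G` is a handy sanity check before starting hands-on work on the resources.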
> > 109 Apr 19 23:01:50 astorage1 crmd: : info: do_te_invoke: Processing graph 473 (ref=pe_calc-dc-1429477310-1155) derived from /var/lib/pengine/pe-input-688.bz2
> > ...
> > 112 Apr 19 23:01:50 astorage1 lrmd: : info: cancel_op: operation monitor on astorage2-fencing for client 4150, its parameters: hostname=[astorage2] userid=[...] CRM_meta_timeout= CRM_meta_name=[monitor] passwd=[...] crm_feature_set=[3.0.6] ipaddr=[10.10.30.22] CRM_meta_interval= cancelled
> > ...
> > 168 Apr 19 23:01:51 astorage1 crmd: : notice: run_graph: ==== Transition 473 (Complete=31, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-688.bz2): Complete
> > At that point I started to do some work on the resources.
> > However, about 15 mins later:
> > 191 Apr 19 23:16:51 astorage1 crmd: : info: crm_timer_popped: PEngine Recheck Timer (I_PE_CALC) just popped (900000ms)
> > 192 Apr 19 23:16:51 astorage1 crmd: : notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
> > 193 Apr 19 23:16:51 astorage1 crmd: : info: do_state_transition: Progressed to state S_POLICY_ENGINE after C_TIMER_POPPED
> > The PEngine Recheck Timer triggered and tried to cancel the monitors again;
> > this appeared to work for all monitors except:
> > 207 Apr 19 23:16:51 astorage1 crmd: : info: te_rsc_command: Initiating action 4: cancel astorage2-fencing_monitor_60000 on astorage1 (local)
> > 208 Apr 19 23:16:51 astorage1 crmd: : info: cancel_op: No pending op found for astorage2-fencing:43
> > 209 Apr 19 23:16:51 astorage1 lrmd: : info: on_msg_cancel_op: no operation with id 43
> > ...
> > 227 Apr 19 23:16:51 astorage1 crmd: : ERROR: do_lrm_invoke: Invalid resource definition
> > other node:
> > 108 Apr 19 23:16:52 astorage2 lrmd: : info: on_msg_cancel_op: no operation with id 43
> > 109 Apr 19 23:16:52 astorage2 crmd: : CRIT: lrm_connection_destroy: LRM Connection failed
> > 110 Apr 19 23:16:52 astorage2 crmd: : info: lrm_connection_destroy: LRM Connection disconnected
> > 111 Apr 19 23:16:52 astorage2 crmd: : ERROR: do_log: FSA: Input I_ERROR from lrm_connection_destroy() received in state S_NOT_DC
> > crmd was not too happy about that, cleaned up and exited with rc=2.
> > As I have 'crm on' in ha.cfg, heartbeat rebooted the system (both nodes).
> > As a workaround I have (for now) disabled the PEngine Recheck Timer:
> > crm_attribute -n cluster-recheck-interval -v 0
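The workaround above can be inspected and later reverted roughly as follows (the stock default for cluster-recheck-interval is 15 minutes, which matches the 900000 ms timer pop in the logs; a sketch, assuming crm_attribute's standard -G/-D query and delete flags):

```shell
# Workaround: disable the periodic PEngine recheck entirely
crm_attribute -n cluster-recheck-interval -v 0

# Query the current value to confirm the override is in place
crm_attribute -G -n cluster-recheck-interval

# Later, remove the override so the default (15min) applies again
crm_attribute -n cluster-recheck-interval -D
```

Deleting the attribute rather than setting it back to "15min" avoids leaving a redundant explicit override in the CIB.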
> > IMHO the PEngine Recheck Timer should not take any action while in
> > unmanaged mode, the same way regular action timers' timeouts are ignored.
> > Additionally, I don't quite understand why crmd got so confused by the
> > missing op id 43 that it decided to quit (lrmd appears not to have crashed).
> > Was my assumption about that timer wrong or is this indeed unexpected
> > behaviour?
> > Please CC replies to my email-address, I am not subscribed to the list.
> > Greets,
> > Rolf Weber
> > _______________________________________________
> > Users mailing list: Users at clusterlabs.org
> > http://clusterlabs.org/mailman/listinfo/users
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org