[ClusterLabs] Antw: Re: Unexpected behaviour of PEngine Recheck Timer while in maintenance mode
Ulrich Windl
Ulrich.Windl at rz.uni-regensburg.de
Mon Apr 27 06:21:40 UTC 2015
>>> Rolf Weber <Rolf.Weber at asamnet.de> wrote on 25.04.2015 at 03:51 in message
<20150425015108.GG16815 at kistl.waijb.lan>:
> Hi all again!
>
> Can I at least get an ack that somebody has read this?
> Preferably from Andrew or Ulrich or so?
I read it, but as I have never seen this issue myself, I cannot really help you with it. And usually I don't write messages like "I don't have any idea about it" ;-)
>
> The PEngine Recheck Timer firing is unexpected, and
> lrm_connection_destroy() should _not_ happen... I think.
>
> cu
>
>
> On 22:46 Wed 22 Apr , Rolf Weber wrote:
>> Nobody has an idea why this happened?
>>
>> cheers,
>> rolf weber
>>
>>
>> On 22:24 Mon 20 Apr , Rolf Weber wrote:
>> > Hi all!
>> >
>> > I encountered some strange behaviour of the PEngine Recheck Timer while in
>> > maintenance mode ending in a reboot.
>> >
>> > Setup is a 2-node cluster with one resource group consisting of several DRBD
>> > devices and filesystems that are exported via NFS and iSCSI through a roaming
>> > IP (see attached output of crm_report for details).
>> >
>> > I put the cluster in maintenance mode using:
>> > crm configure property maintenance-mode=true
>> > The cluster ended up in unmanaged mode, as expected.
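For readers following along, the maintenance-mode toggle described above can be sketched with the crm shell like this (the crm_mon verification step is my addition for illustration, not from the original report):

```shell
# Put the whole cluster into maintenance mode; all resources become unmanaged
crm configure property maintenance-mode=true

# Verify: crm_mon should now flag the resources as unmanaged
crm_mon -1 | grep -i unmanaged

# When the manual work is done, hand control back to the cluster
crm configure property maintenance-mode=false
```

These commands must be run against a live Pacemaker cluster, so treat them as a sketch rather than a tested recipe.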
>> >
>> > 109 Apr 19 23:01:50 astorage1 crmd: [4150]: info: do_te_invoke: Processing
> graph 473 (ref=pe_calc-dc-1429477310-1155) derived from
> /var/lib/pengine/pe-input-688.bz2
>> > ...
>> > 112 Apr 19 23:01:50 astorage1 lrmd: [4147]: info: cancel_op: operation
> monitor[43] on astorage2-fencing for client 4150, its parameters:
> hostname=[astorage2] userid=[...] CRM_meta_timeout=[20000]
> CRM_meta_name=[monitor] passwd=[...] crm_feature_set=[3.0.6]
> ipaddr=[10.10.30.22] CRM_meta_interval=[60000] cancelled
>> > ...
>> > 168 Apr 19 23:01:51 astorage1 crmd: [4150]: notice: run_graph: ====
> Transition 473 (Complete=31, Pending=0, Fired=0, Skipped=0, Incomplete=0,
> Source=/var/lib/pengine/pe-input-688.bz2): Complete
>> >
>> >
>> > At that point I started to do some work on the resources.
>> >
>> > However, about 15 mins later:
>> > 191 Apr 19 23:16:51 astorage1 crmd: [4150]: info: crm_timer_popped: PEngine
> Recheck Timer (I_PE_CALC) just popped (900000ms)
>> > 192 Apr 19 23:16:51 astorage1 crmd: [4150]: notice: do_state_transition:
> State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC
> cause=C_TIMER_POPPED origin=crm_timer_popped ]
>> > 193 Apr 19 23:16:51 astorage1 crmd: [4150]: info: do_state_transition:
> Progressed to state S_POLICY_ENGINE after C_TIMER_POPPED
>> >
>> > The PEngine Recheck Timer triggered and tried to cancel the monitors again;
>> > this appears to have worked for all monitors except:
>> >
>> > 207 Apr 19 23:16:51 astorage1 crmd: [4150]: info: te_rsc_command:
> Initiating action 4: cancel astorage2-fencing_monitor_60000 on astorage1
> (local)
>> > 208 Apr 19 23:16:51 astorage1 crmd: [4150]: info: cancel_op: No pending op
> found for astorage2-fencing:43
>> > 209 Apr 19 23:16:51 astorage1 lrmd: [4147]: info: on_msg_cancel_op: no
> operation with id 43
>> > ...
>> > 227 Apr 19 23:16:51 astorage1 crmd: [4150]: ERROR: do_lrm_invoke: Invalid
> resource definition
>> >
>> > other node:
>> >
>> > 108 Apr 19 23:16:52 astorage2 lrmd: [4015]: info: on_msg_cancel_op: no
> operation with id 43
>> > 109 Apr 19 23:16:52 astorage2 crmd: [4018]: CRIT: lrm_connection_destroy:
> LRM Connection failed
>> > 110 Apr 19 23:16:52 astorage2 crmd: [4018]: info: lrm_connection_destroy:
> LRM Connection disconnected
>> > 111 Apr 19 23:16:52 astorage2 crmd: [4018]: ERROR: do_log: FSA: Input
> I_ERROR from lrm_connection_destroy() received in state S_NOT_DC
>> >
>> > crmd was not too happy about that, cleaned up and exited with rc=2.
>> > As I have 'crm on' in ha.cfg, heartbeat rebooted the system (both nodes).
>> >
>> >
>> > As a workaround I have (for now) disabled the PEngine Recheck Timer:
>> > crm_attribute -n cluster-recheck-interval -v 0
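The workaround above can be written out as follows (a sketch; the -G query to confirm the active value is my addition, not part of the original message):

```shell
# Disable the periodic PEngine recheck; 0 means the timer never pops
crm_attribute -n cluster-recheck-interval -v 0

# Confirm the currently active value
crm_attribute -n cluster-recheck-interval -G
```

Note that this only suppresses the periodic recheck; it does not address the underlying lrmd/crmd confusion the report describes.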
>> >
>> >
>> > IMHO the PEngine Recheck Timer should not take any action while in
>> > unmanaged mode, just as regular action timers' timeouts are ignored.
>> > Additionally, I don't quite understand why crmd got so confused by the
>> > missing op id 43 that it decided to quit (lrmd appears not to have
> crashed).
>> >
>> > Was my assumption about that timer wrong, or is this indeed unexpected
>> > behaviour?
>> >
>> > Please CC replies to my email-address, I am not subscribed to the list.
>> >
>> > Greets,
>> > Rolf Weber
>> >
>>
>>
>> > _______________________________________________
>> > Users mailing list: Users at clusterlabs.org
>> > http://clusterlabs.org/mailman/listinfo/users
>> >
>> > Project Home: http://www.clusterlabs.org
>> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> > Bugs: http://bugs.clusterlabs.org
>>
>>