[ClusterLabs] clearing failed actions

Ken Gaillot kgaillot at redhat.com
Wed Jun 21 16:41:21 UTC 2017


On 06/19/2017 04:54 PM, Attila Megyeri wrote:
> One more thing to add.
> Two almost identical clusters with identical asterisk primitives produce different crm_verify output. One cluster returns no warnings, whereas the other one complains:
> 
> On the problematic one:
> 
> crm_verify --live-check -VV
> warning: get_failcount_full:   Setting asterisk.failure_timeout=120 in asterisk-stop-0 conflicts with on-fail=block: ignoring timeout
> Warnings found during check: config may not be valid
> 
> 
> The relevant primitive is in both clusters:
> 
> primitive asterisk ocf:heartbeat:asterisk \
>         op monitor interval="10s" timeout="45s" on-fail="restart" \
>         op start interval="0" timeout="60s" on-fail="standby" \
>         op stop interval="0" timeout="60s" on-fail="block" \
>         meta migration-threshold="3" failure-timeout="2m"
> 
> Why is the same configuration valid in one, but not in the other cluster?
> Shall I simply omit the "op stop" line?
> 
> thanks :)
> Attila

Ah, that could explain it.

If a failure occurs when on-fail=block applies, the resource's failure
timeout is disabled. This is partly because the point of on-fail=block
is to allow the administrator to investigate and manually clear the
error, and partly because blocking means nothing was done to recover the
resource, so the failure likely is still present (clearing it would make
on-fail=block similar to on-fail=ignore).

The failure timeout should be ignored only if there's an actual error to
be handled by on-fail=block, which in this case means a stop failure.
That would explain why the same configuration verifies cleanly on the
other cluster, if no stop failures have been recorded there.
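
To confirm, something like the following should show any recorded failed
actions and fail counts on each cluster (a sketch; crm_mon options may
differ slightly between versions):

    crm_mon -1f

If the problematic cluster shows a failed asterisk stop action and the
other does not, that would match the above.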

Stop failures default to block when fencing is disabled, because fencing
is the only way to recover from a failed stop. Configuring fencing and
using on-fail=fence for the stop operation would avoid the issue.
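
For illustration, an untested sketch of that second option in crm shell
syntax (it also assumes a working stonith resource is defined for your
hardware, which isn't shown here):

    primitive asterisk ocf:heartbeat:asterisk \
            op monitor interval="10s" timeout="45s" on-fail="restart" \
            op start interval="0" timeout="60s" on-fail="standby" \
            op stop interval="0" timeout="60s" on-fail="fence" \
            meta migration-threshold="3" failure-timeout="2m"
    property cib-bootstrap-options: \
            stonith-enabled=true

With fencing in place, a failed stop leads to the node being fenced rather
than the resource being blocked, so the failure-timeout conflict goes away.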

A future version of Pacemaker will allow the failure timeout to be
specified separately for different operations, which would let you set
failure-timeout=0 on stop and 1m on everything else. But that work
hasn't started yet.

> 
>> -----Original Message-----
>> From: Attila Megyeri [mailto:amegyeri at minerva-soft.com]
>> Sent: Monday, June 19, 2017 9:47 PM
>> To: Cluster Labs - All topics related to open-source clustering welcomed
>> <users at clusterlabs.org>; kgaillot at redhat.com
>> Subject: Re: [ClusterLabs] clearing failed actions
>>
>> I did another experiment, even simpler.
>>
>> Created one node and one resource, using Pacemaker 1.1.14 on Ubuntu.
>>
>> Configured the failcount to 1, migration-threshold to 2, and failure-timeout
>> to 1 minute.
>>
>> crm_mon:
>>
>> Last updated: Mon Jun 19 19:43:41 2017          Last change: Mon Jun 19
>> 19:37:09 2017 by root via cibadmin on test
>> Stack: corosync
>> Current DC: test (version 1.1.14-70404b0) - partition with quorum
>> 1 node and 1 resource configured
>>
>> Online: [ test ]
>>
>> db-ip-master    (ocf::heartbeat:IPaddr2):       Started test
>>
>> Node Attributes:
>> * Node test:
>>
>> Migration Summary:
>> * Node test:
>>    db-ip-master: migration-threshold=2 fail-count=1
>>
>> crm verify:
>>
>> crm_verify --live-check -VVVV
>>     info: validate_with_relaxng:        Creating RNG parser context
>>     info: determine_online_status:      Node test is online
>>     info: get_failcount_full:   db-ip-master has failed 1 times on test
>>     info: get_failcount_full:   db-ip-master has failed 1 times on test
>>     info: get_failcount_full:   db-ip-master has failed 1 times on test
>>     info: get_failcount_full:   db-ip-master has failed 1 times on test
>>     info: native_print: db-ip-master    (ocf::heartbeat:IPaddr2):       Started test
>>     info: get_failcount_full:   db-ip-master has failed 1 times on test
>>     info: common_apply_stickiness:      db-ip-master can fail 1 more times on
>> test before being forced off
>>     info: LogActions:   Leave   db-ip-master    (Started test)
>>
>>
>> crm configure is:
>>
>> node 168362242: test \
>>         attributes standby=off
>> primitive db-ip-master IPaddr2 \
>>         params lvs_support=true ip=10.9.1.10 cidr_netmask=24
>> broadcast=10.9.1.255 \
>>         op start interval=0 timeout=20s on-fail=restart \
>>         op monitor interval=20s timeout=20s \
>>         op stop interval=0 timeout=20s on-fail=block \
>>         meta migration-threshold=2 failure-timeout=1m target-role=Started
>> location loc1 db-ip-master 0: test
>> property cib-bootstrap-options: \
>>         have-watchdog=false \
>>         dc-version=1.1.14-70404b0 \
>>         cluster-infrastructure=corosync \
>>         stonith-enabled=false \
>>         cluster-recheck-interval=30s \
>>         symmetric-cluster=false
>>
>>
>>
>>
>> Corosync log:
>>
>>
>> Jun 19 19:45:07 [331] test       crmd:   notice: do_state_transition:   State
>> transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC
>> cause=C_TIMER_POPPED origin=crm_timer_popped ]
>> Jun 19 19:45:07 [330] test    pengine:     info: process_pe_message:    Input has
>> not changed since last time, not saving to disk
>> Jun 19 19:45:07 [330] test    pengine:     info: determine_online_status:
>> Node test is online
>> Jun 19 19:45:07 [330] test    pengine:     info: get_failcount_full:    db-ip-master
>> has failed 1 times on test
>> Jun 19 19:45:07 [330] test    pengine:     info: get_failcount_full:    db-ip-master
>> has failed 1 times on test
>> Jun 19 19:45:07 [330] test    pengine:     info: get_failcount_full:    db-ip-master
>> has failed 1 times on test
>> Jun 19 19:45:07 [330] test    pengine:     info: get_failcount_full:    db-ip-master
>> has failed 1 times on test
>> Jun 19 19:45:07 [330] test    pengine:     info: native_print:  db-ip-master
>> (ocf::heartbeat:IPaddr2):       Started test
>> Jun 19 19:45:07 [330] test    pengine:     info: get_failcount_full:    db-ip-master
>> has failed 1 times on test
>> Jun 19 19:45:07 [330] test    pengine:     info: common_apply_stickiness:
>> db-ip-master can fail 1 more times on test before being forced off
>> Jun 19 19:45:07 [330] test    pengine:     info: LogActions:    Leave   db-ip-
>> master    (Started test)
>> Jun 19 19:45:07 [330] test    pengine:   notice: process_pe_message:
>> Calculated Transition 34: /var/lib/pacemaker/pengine/pe-input-6.bz2
>> Jun 19 19:45:07 [331] test       crmd:     info: do_state_transition:   State
>> transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [
>> input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
>> Jun 19 19:45:07 [331] test       crmd:   notice: run_graph:     Transition 34
>> (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0,
>> Source=/var/lib/pacemaker/pengine/pe-input-6.bz2): Complete
>> Jun 19 19:45:07 [331] test       crmd:     info: do_log:        FSA: Input
>> I_TE_SUCCESS from notify_crmd() received in state S_TRANSITION_ENGINE
>> Jun 19 19:45:07 [331] test       crmd:   notice: do_state_transition:   State
>> transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
>> cause=C_FSA_INTERNAL origin=notify_crmd ]
>>
>>
>> I hope someone can help me figure this out :)
>>
>> Thanks!
>>
>>
>>
>>> -----Original Message-----
>>> From: Attila Megyeri [mailto:amegyeri at minerva-soft.com]
>>> Sent: Monday, June 19, 2017 7:45 PM
>>> To: kgaillot at redhat.com; Cluster Labs - All topics related to open-source
>>> clustering welcomed <users at clusterlabs.org>
>>> Subject: Re: [ClusterLabs] clearing failed actions
>>>
>>> Hi Ken,
>>>
>>> /sorry for the long text/
>>>
>>> I have created a relatively simple setup to localize the issue.
>>> Three nodes, no fencing, just a master/slave MySQL with two virtual IPs.
>>> Just as a reminder, my primary issue is that on cluster recheck intervals
>>> the failcounts are not cleared.
>>>
>>> I simulated a failure with:
>>>
>>> crm_failcount -N ctdb1 -r db-ip-master -v 1
>>>
>>>
>>> crm_mon shows:
>>>
>>> Last updated: Mon Jun 19 17:34:35 2017
>>> Last change: Mon Jun 19 17:34:35 2017 via cibadmin on ctmgr
>>> Stack: corosync
>>> Current DC: ctmgr (168362243) - partition with quorum
>>> Version: 1.1.10-42f2063
>>> 3 Nodes configured
>>> 4 Resources configured
>>>
>>>
>>> Online: [ ctdb1 ctdb2 ctmgr ]
>>>
>>> db-ip-master    (ocf::heartbeat:IPaddr2):       Started ctdb1
>>> db-ip-slave     (ocf::heartbeat:IPaddr2):       Started ctdb2
>>>  Master/Slave Set: mysql [db-mysql]
>>>      Masters: [ ctdb1 ]
>>>      Slaves: [ ctdb2 ]
>>>
>>> Node Attributes:
>>> * Node ctdb1:
>>>     + master-db-mysql                   : 3601
>>>     + readable                          : 1
>>> * Node ctdb2:
>>>     + master-db-mysql                   : 3600
>>>     + readable                          : 1
>>> * Node ctmgr:
>>>
>>> Migration summary:
>>> * Node ctmgr:
>>> * Node ctdb1:
>>>    db-ip-master: migration-threshold=1000000 fail-count=1
>>> * Node ctdb2:
>>>
>>>
>>>
>>> When I check the pacemaker log on the DC, I see the following:
>>>
>>> Jun 19 17:37:06 [18998] ctmgr       crmd:     info: crm_timer_popped:
>> PEngine
>>> Recheck Timer (I_PE_CALC) just popped (30000ms)
>>> Jun 19 17:37:06 [18998] ctmgr       crmd:    debug: s_crmd_fsa:         Processing
>>> I_PE_CALC: [ state=S_IDLE cause=C_TIMER_POPPED
>>> origin=crm_timer_popped ]
>>> Jun 19 17:37:06 [18998] ctmgr       crmd:   notice: do_state_transition:
>> State
>>> transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC
>>> cause=C_TIMER_POPPED origin=crm_timer_popped ]
>>> Jun 19 17:37:06 [18998] ctmgr       crmd:     info: do_state_transition:
>>> Progressed to state S_POLICY_ENGINE after C_TIMER_POPPED
>>> Jun 19 17:37:06 [18998] ctmgr       crmd:    debug: do_state_transition:        All
>> 3
>>> cluster nodes are eligible to run resources.
>>> Jun 19 17:37:06 [18998] ctmgr       crmd:    debug: do_pe_invoke:       Query
>>> 231: Requesting the current CIB: S_POLICY_ENGINE
>>> Jun 19 17:37:06 [18994] ctmgr        cib:     info: cib_process_request:
>>> Completed cib_query operation for section 'all': OK (rc=0,
>>> origin=local/crmd/231, version=0.12.9)
>>> Jun 19 17:37:06 [18998] ctmgr       crmd:    debug: do_pe_invoke_callback:
>>> Invoking the PE: query=231, ref=pe_calc-dc-1497893826-144, seq=21884,
>>> quorate=1
>>> Jun 19 17:37:06 [18997] ctmgr    pengine:     info: process_pe_message:
>>> Input has not changed since last time, not saving to disk
>>> Jun 19 17:37:06 [18997] ctmgr    pengine:    debug: unpack_config:
>> STONITH
>>> timeout: 60000
>>> Jun 19 17:37:06 [18997] ctmgr    pengine:    debug: unpack_config:
>> STONITH
>>> of failed nodes is disabled
>>> Jun 19 17:37:06 [18997] ctmgr    pengine:    debug: unpack_config:      Stop all
>>> active resources: false
>>> Jun 19 17:37:06 [18997] ctmgr    pengine:    debug: unpack_config:      Default
>>> stickiness: 0
>>> Jun 19 17:37:06 [18997] ctmgr    pengine:    debug: unpack_config:      On loss
>>> of CCM Quorum: Stop ALL resources
>>> Jun 19 17:37:06 [18997] ctmgr    pengine:    debug: unpack_config:      Node
>>> scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
>>> Jun 19 17:37:06 [18997] ctmgr    pengine:    debug: unpack_domains:
>>> Unpacking domains
>>> Jun 19 17:37:06 [18997] ctmgr    pengine:     info: determine_online_status:
>>> Node ctmgr is online
>>> Jun 19 17:37:06 [18997] ctmgr    pengine:     info: determine_online_status:
>>> Node ctdb1 is online
>>> Jun 19 17:37:06 [18997] ctmgr    pengine:     info: determine_online_status:
>>> Node ctdb2 is online
>>> Jun 19 17:37:06 [18997] ctmgr    pengine:    debug: find_anonymous_clone:
>>> Internally renamed db-mysql on ctmgr to db-mysql:0
>>> Jun 19 17:37:06 [18997] ctmgr    pengine:    debug: find_anonymous_clone:
>>> Internally renamed db-mysql on ctdb1 to db-mysql:0
>>> Jun 19 17:37:06 [18997] ctmgr    pengine:    debug: unpack_rsc_op:      db-
>>> mysql_last_failure_0 on ctdb1 returned 8 (master) instead of the expected
>>> value: 7 (not running)
>>> Jun 19 17:37:06 [18997] ctmgr    pengine:   notice: unpack_rsc_op:
>>> Operation monitor found resource db-mysql:0 active in master mode on
>>> ctdb1
>>> Jun 19 17:37:06 [18997] ctmgr    pengine:    debug: find_anonymous_clone:
>>> Internally renamed db-mysql on ctdb2 to db-mysql:1
>>> Jun 19 17:37:06 [18997] ctmgr    pengine:    debug: unpack_rsc_op:      db-
>>> mysql_last_failure_0 on ctdb2 returned 0 (ok) instead of the expected
>> value:
>>> 7 (not running)
>>> Jun 19 17:37:06 [18997] ctmgr    pengine:     info: unpack_rsc_op:
>> Operation
>>> monitor found resource db-mysql:1 active on ctdb2
>>> Jun 19 17:37:06 [18997] ctmgr    pengine:     info: native_print:       db-ip-
>> master
>>> (ocf::heartbeat:IPaddr2):       Started ctdb1
>>> Jun 19 17:37:06 [18997] ctmgr    pengine:     info: native_print:       db-ip-slave
>>> (ocf::heartbeat:IPaddr2):       Started ctdb2
>>> Jun 19 17:37:06 [18997] ctmgr    pengine:     info: clone_print:
>>> Master/Slave Set: mysql [db-mysql]
>>> Jun 19 17:37:06 [18997] ctmgr    pengine:    debug: native_active:
>> Resource
>>> db-mysql:0 active on ctdb1
>>> Jun 19 17:37:06 [18997] ctmgr    pengine:    debug: native_active:
>> Resource
>>> db-mysql:0 active on ctdb1
>>> Jun 19 17:37:06 [18997] ctmgr    pengine:    debug: native_active:
>> Resource
>>> db-mysql:1 active on ctdb2
>>> Jun 19 17:37:06 [18997] ctmgr    pengine:    debug: native_active:
>> Resource
>>> db-mysql:1 active on ctdb2
>>> Jun 19 17:37:06 [18997] ctmgr    pengine:     info: short_print:             Masters:
>> [
>>> ctdb1 ]
>>> Jun 19 17:37:06 [18997] ctmgr    pengine:     info: short_print:             Slaves: [
>>> ctdb2 ]
>>> Jun 19 17:37:06 [18997] ctmgr    pengine:     info: get_failcount_full:         db-
>> ip-
>>> master has failed 1 times on ctdb1
>>> Jun 19 17:37:06 [18997] ctmgr    pengine:     info: common_apply_stickiness:
>>> db-ip-master can fail 999999 more times on ctdb1 before being forced off
>>> Jun 19 17:37:06 [18997] ctmgr    pengine:    debug:
>> common_apply_stickiness:
>>> Resource db-mysql:0: preferring current location (node=ctdb1, weight=1)
>>> Jun 19 17:37:06 [18997] ctmgr    pengine:    debug:
>> common_apply_stickiness:
>>> Resource db-mysql:1: preferring current location (node=ctdb2, weight=1)
>>> Jun 19 17:37:06 [18997] ctmgr    pengine:    debug: native_assign_node:
>>> Assigning ctdb1 to db-mysql:0
>>> Jun 19 17:37:06 [18997] ctmgr    pengine:    debug: native_assign_node:
>>> Assigning ctdb2 to db-mysql:1
>>> Jun 19 17:37:06 [18997] ctmgr    pengine:    debug: clone_color:        Allocated
>> 2
>>> mysql instances of a possible 2
>>> Jun 19 17:37:06 [18997] ctmgr    pengine:    debug: master_color:       db-
>>> mysql:0 master score: 3601
>>> Jun 19 17:37:06 [18997] ctmgr    pengine:     info: master_color:       Promoting
>>> db-mysql:0 (Master ctdb1)
>>> Jun 19 17:37:06 [18997] ctmgr    pengine:    debug: master_color:       db-
>>> mysql:1 master score: 3600
>>> Jun 19 17:37:06 [18997] ctmgr    pengine:     info: master_color:       mysql:
>>> Promoted 1 instances of a possible 1 to master
>>> Jun 19 17:37:06 [18997] ctmgr    pengine:    debug: native_assign_node:
>>> Assigning ctdb1 to db-ip-master
>>> Jun 19 17:37:06 [18997] ctmgr    pengine:    debug: native_assign_node:
>>> Assigning ctdb2 to db-ip-slave
>>> Jun 19 17:37:06 [18997] ctmgr    pengine:    debug: master_create_actions:
>>> Creating actions for mysql
>>> Jun 19 17:37:06 [18997] ctmgr    pengine:     info: LogActions:         Leave   db-
>> ip-
>>> master    (Started ctdb1)
>>> Jun 19 17:37:06 [18997] ctmgr    pengine:     info: LogActions:         Leave   db-
>> ip-
>>> slave     (Started ctdb2)
>>> Jun 19 17:37:06 [18997] ctmgr    pengine:     info: LogActions:         Leave   db-
>>> mysql:0      (Master ctdb1)
>>> Jun 19 17:37:06 [18997] ctmgr    pengine:     info: LogActions:         Leave   db-
>>> mysql:1      (Slave ctdb2)
>>> Jun 19 17:37:06 [18997] ctmgr    pengine:   notice: process_pe_message:
>>> Calculated Transition 38: /var/lib/pacemaker/pengine/pe-input-16.bz2
>>> Jun 19 17:37:06 [18998] ctmgr       crmd:    debug: s_crmd_fsa:         Processing
>>> I_PE_SUCCESS: [ state=S_POLICY_ENGINE cause=C_IPC_MESSAGE
>>> origin=handle_response ]
>>> Jun 19 17:37:06 [18998] ctmgr       crmd:     info: do_state_transition:
>> State
>>> transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [
>>> input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
>>> Jun 19 17:37:06 [18998] ctmgr       crmd:    debug: unpack_graph:
>> Unpacked
>>> transition 38: 0 actions in 0 synapses
>>> Jun 19 17:37:06 [18998] ctmgr       crmd:     info: do_te_invoke:       Processing
>>> graph 38 (ref=pe_calc-dc-1497893826-144) derived from
>>> /var/lib/pacemaker/pengine/pe-input-16.bz2
>>> Jun 19 17:37:06 [18998] ctmgr       crmd:    debug: print_graph:        Empty
>>> transition graph
>>> Jun 19 17:37:06 [18998] ctmgr       crmd:   notice: run_graph:  Transition 38
>>> (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0,
>>> Source=/var/lib/pacemaker/pengine/pe-input-16.bz2): Complete
>>> Jun 19 17:37:06 [18998] ctmgr       crmd:    debug: print_graph:        Empty
>>> transition graph
>>> Jun 19 17:37:06 [18998] ctmgr       crmd:    debug: te_graph_trigger:
>> Transition
>>> 38 is now complete
>>> Jun 19 17:37:06 [18998] ctmgr       crmd:    debug: notify_crmd:        Processing
>>> transition completion in state S_TRANSITION_ENGINE
>>> Jun 19 17:37:06 [18998] ctmgr       crmd:    debug: notify_crmd:        Transition
>>> 38 status: done - <null>
>>> Jun 19 17:37:06 [18998] ctmgr       crmd:    debug: s_crmd_fsa:         Processing
>>> I_TE_SUCCESS: [ state=S_TRANSITION_ENGINE cause=C_FSA_INTERNAL
>>> origin=notify_crmd ]
>>> Jun 19 17:37:06 [18998] ctmgr       crmd:     info: do_log:     FSA: Input
>>> I_TE_SUCCESS from notify_crmd() received in state
>> S_TRANSITION_ENGINE
>>> Jun 19 17:37:06 [18998] ctmgr       crmd:   notice: do_state_transition:
>> State
>>> transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
>>> cause=C_FSA_INTERNAL origin=notify_crmd ]
>>> Jun 19 17:37:06 [18998] ctmgr       crmd:    debug: do_state_transition:
>>> Starting PEngine Recheck Timer
>>> Jun 19 17:37:06 [18998] ctmgr       crmd:    debug: crm_timer_start:    Started
>>> PEngine Recheck Timer (I_PE_CALC:30000ms), src=277
>>>
>>>
>>>
>>> As you can see from the logs, pacemaker does not even try to re-monitor
>>> the resource that had a failure, or at least I'm not seeing it.
>>> The cluster recheck interval is set to 30 seconds for troubleshooting
>>> purposes.
>>>
>>> If I execute a
>>>
>>> crm resource cleanup db-ip-master
>>>
>>> The failure is removed.
>>>
>>> Am I getting something terribly wrong here?
>>> Or is this simply a bug in 1.1.10?
>>>
>>>
>>> Thanks,
>>> Attila
>>>
>>>
>>>
>>>
>>>> -----Original Message-----
>>>> From: Ken Gaillot [mailto:kgaillot at redhat.com]
>>>> Sent: Wednesday, June 7, 2017 10:14 PM
>>>> To: Attila Megyeri <amegyeri at minerva-soft.com>; Cluster Labs - All
>> topics
>>>> related to open-source clustering welcomed <users at clusterlabs.org>
>>>> Subject: Re: [ClusterLabs] clearing failed actions
>>>>
>>>> On 06/01/2017 02:44 PM, Attila Megyeri wrote:
>>>>> Ken,
>>>>>
>>>>> I noticed something strange, this might be the issue.
>>>>>
>>>>> In some cases, even the manual cleanup does not work.
>>>>>
>>>>> I have a failed action of resource "A" on node "a". DC is node "b".
>>>>>
>>>>> e.g.
>>>>>     Failed actions:
>>>>>     jboss_imssrv1_monitor_10000 (node=ctims1, call=108, rc=1,
>>>>>     status=complete, last-rc-change=Thu Jun  1 14:13:36 2017
>>>>>
>>>>>
>>>>> When I attempt to do a "crm resource cleanup A" from node "b", nothing
>>>>> happens. Basically the lrmd on "a" is not notified that it should monitor
>>>>> the resource.
>>>>>
>>>>>
>>>>> When I execute a "crm resource cleanup A" command on node "a" (where
>>>>> the operation failed), the failed action is cleared properly.
>>>>>
>>>>> Why could this be happening?
>>>>> Which component should be responsible for this? pengine, crmd, lrmd?
>>>>
>>>> The crm shell will send commands to attrd (to clear fail counts) and
>>>> crmd (to clear the resource history), which in turn will record changes
>>>> in the CIB.
>>>>
>>>> I'm not sure how crm shell implements it, but crm_resource sends
>>>> individual messages to each node when cleaning up a resource without
>>>> specifying a particular node. You could check the pacemaker log on each
>>>> node to see whether attrd and crmd are receiving those commands, and
>>>> what they do in response.
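>>>>
>>>> As a rough illustration only (not tested here; long option names per
>>>> current crm_resource, older builds may use -H/--host-uname instead of
>>>> -N/--node, and the log file location varies by distro):
>>>>
>>>>   # clean up on one specific node, to see which node reacts
>>>>   crm_resource --cleanup --resource jboss_imssrv1 --node ctims1
>>>>
>>>>   # then look for attrd/crmd activity mentioning that resource
>>>>   grep -i jboss_imssrv1 /var/log/pacemaker.log | grep -iE 'attrd|crmd'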
>>>>
>>>>
>>>>>> -----Original Message-----
>>>>>> From: Attila Megyeri [mailto:amegyeri at minerva-soft.com]
>>>>>> Sent: Thursday, June 1, 2017 6:57 PM
>>>>>> To: kgaillot at redhat.com; Cluster Labs - All topics related to open-
>> source
>>>>>> clustering welcomed <users at clusterlabs.org>
>>>>>> Subject: Re: [ClusterLabs] clearing failed actions
>>>>>>
>>>>>> thanks Ken,
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Ken Gaillot [mailto:kgaillot at redhat.com]
>>>>>>> Sent: Thursday, June 1, 2017 12:04 AM
>>>>>>> To: users at clusterlabs.org
>>>>>>> Subject: Re: [ClusterLabs] clearing failed actions
>>>>>>>
>>>>>>> On 05/31/2017 12:17 PM, Ken Gaillot wrote:
>>>>>>>> On 05/30/2017 02:50 PM, Attila Megyeri wrote:
>>>>>>>>> Hi Ken,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: Ken Gaillot [mailto:kgaillot at redhat.com]
>>>>>>>>>> Sent: Tuesday, May 30, 2017 4:32 PM
>>>>>>>>>> To: users at clusterlabs.org
>>>>>>>>>> Subject: Re: [ClusterLabs] clearing failed actions
>>>>>>>>>>
>>>>>>>>>> On 05/30/2017 09:13 AM, Attila Megyeri wrote:
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Shouldn't the
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> cluster-recheck-interval="2m"
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> property instruct pacemaker to recheck the cluster every 2
>>> minutes
>>>>>>> and
>>>>>>>>>>> clean the failcounts?
>>>>>>>>>>
>>>>>>>>>> It instructs pacemaker to recalculate whether any actions need to be
>>>>>>>>>> taken (including expiring any failcounts appropriately).
>>>>>>>>>>
>>>>>>>>>>> At the primitive level I also have a
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> migration-threshold="30" failure-timeout="2m"
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> but whenever I have a failure, it remains there forever.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> What could be causing this?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> thanks,
>>>>>>>>>>>
>>>>>>>>>>> Attila
>>>>>>>>>> Is it a single old failure, or a recurring failure? The failure timeout
>>>>>>>>>> works in a somewhat nonintuitive way. Old failures are not individually
>>>>>>>>>> expired. Instead, all failures of a resource are simultaneously cleared
>>>>>>>>>> if all of them are older than the failure-timeout. So if something keeps
>>>>>>>>>> failing repeatedly (more frequently than the failure-timeout), none of
>>>>>>>>>> the failures will be cleared.
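>>>>>>>>>>
>>>>>>>>>> For example, with failure-timeout=2m: failures at 10:00:00 and 10:01:30
>>>>>>>>>> mean that at 10:02:30 the first one is older than 2m but the second is
>>>>>>>>>> not, so nothing is cleared; only once both are older than 2m (i.e. at
>>>>>>>>>> the first recheck after 10:03:30) does the whole count expire at once.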
>>>>>>>>>>
>>>>>>>>>> If it's not a repeating failure, something odd is going on.
>>>>>>>>>
>>>>>>>>> It is not a repeating failure. Let's say that a resource fails for
>>>>>>>>> whatever action; it will remain in the failed actions (crm_mon -Af) until
>>>>>>>>> I issue a "crm resource cleanup <resource name>". Even after days or
>>>>>>>>> weeks, even though I see in the logs that the cluster is rechecked every
>>>>>>>>> 120 seconds.
>>>>>>>>>
>>>>>>>>> How could I troubleshoot this issue?
>>>>>>>>>
>>>>>>>>> thanks!
>>>>>>>>
>>>>>>>>
>>>>>>>> Ah, I see what you're saying. That's expected behavior.
>>>>>>>>
>>>>>>>> The failure-timeout applies to the failure *count* (which is used for
>>>>>>>> checking against migration-threshold), not the failure *history* (which
>>>>>>>> is used for the status display).
>>>>>>>>
>>>>>>>> The idea is to have it no longer affect the cluster behavior, but still
>>>>>>>> allow an administrator to know that it happened. That's why a manual
>>>>>>>> cleanup is required to clear the history.
>>>>>>>
>>>>>>> Hmm, I'm wrong there ... failure-timeout does expire the failure history
>>>>>>> used for status display.
>>>>>>>
>>>>>>> It works with the current versions. It's possible 1.1.10 had issues with
>>>>>>> that.
>>>>>>>
>>>>>>
>>>>>> Well, if nothing helps I will try to upgrade to a more recent version.
>>>>>>
>>>>>>
>>>>>>
>>>>>>> Check the status to see which node is DC, and look at the pacemaker log
>>>>>>> there after the failure occurred. There should be a message about the
>>>>>>> failcount expiring. You can also look at the live CIB and search for
>>>>>>> last_failure to see what is used for the display.
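>>>>>>>
>>>>>>> For example, a rough sketch (the operation id uses "last_failure" while
>>>>>>> the node attribute uses "last-failure", hence the pattern):
>>>>>>>
>>>>>>>   cibadmin --query | grep -iE 'last[_-]failure'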
>>>>>> [AM]
>>>>>>
>>>>>> In the pacemaker log I see the following lines at every recheck interval:
>>>>>>
>>>>>> Jun 01 16:54:08 [8700] ctabsws2    pengine:  warning: unpack_rsc_op:
>>>>>> Processing failed op start for jboss_admin2 on ctadmin2: unknown error (1)
>>>>>>
>>>>>> If I check the CIB for the failure I see:
>>>>>>
>>>>>> <nvpair id="status-168362322-last-failure-jboss_admin2"
>>>>>>         name="last-failure-jboss_admin2" value="1496326649"/>
>>>>>>             <lrm_rsc_op id="jboss_admin2_last_failure_0"
>>>>>>                 operation_key="jboss_admin2_start_0" operation="start"
>>>>>>                 crm-debug-origin="do_update_resource" crm_feature_set="3.0.7"
>>>>>>                 transition-key="73:4:0:0a88f6e6-4ed1-4b53-88ad-3c568ca3daa8"
>>>>>>                 transition-magic="2:1;73:4:0:0a88f6e6-4ed1-4b53-88ad-3c568ca3daa8"
>>>>>>                 call-id="114" rc-code="1" op-status="2" interval="0"
>>>>>>                 last-run="1496326469" last-rc-change="1496326469"
>>>>>>                 exec-time="180001" queue-time="0"
>>>>>>                 op-digest="8ec02bcea0bab86f4a7e9e27c23bc88b"/>
>>>>>>
>>>>>>
>>>>>> Really have no clue why this isn't cleared...



