[ClusterLabs] Pacemaker does not schedule a resource in a Docker container after Docker is restarted, but the cluster shows the resource as started!
Ken Gaillot
kgaillot at redhat.com
Mon Feb 18 09:51:57 EST 2019
On Fri, 2019-02-15 at 08:55 +0800, ma.jinfeng at zte.com.cn wrote:
> There is an issue where Pacemaker does not schedule a resource that
> runs in a Docker container after Docker is restarted, yet the cluster
> still shows the resource as started; it seems to be a bug in
> Pacemaker. I am very confused about what happens when pengine prints
> these logs (pengine: notice: check_operation_expiry: Clearing
> failure of event_agent on 120_120__fd4 because it expired |
> event_agent_clear_failcount_0). Does anyone know what they mean?
> Thank you very much!
> 1. pacemaker/corosync version: 1.1.16/2.4.3
> 2. corosync logs as follows:
> Feb 06 09:52:19 [58629] node-4 attrd: info:
> attrd_peer_update: Setting event_agent_status[120_120__fd4]: ok ->
> fail from 120_120__fd4
This is the attribute manager setting the "event_agent_status"
attribute for node "120_120__fd4" to "fail". That is a user-created
attribute; pacemaker does not do anything with it other than store it.
Most likely, the resource agent monitor action created it.
> Feb 06 09:52:19 [58629] node-4 attrd: info: write_attribute:
> Sent update 50 with 1 changes for event_agent_status, id=<n/a>,
> set=(null)
> Feb 06 09:52:19 [58629] node-4 attrd: info:
> attrd_cib_callback: Update 50 for event_agent_status: OK (0)
> Feb 06 09:52:19 [58629] node-4 attrd: info:
> attrd_cib_callback: Update 50 for
> event_agent_status[120_120__fd4]=fail: OK (0)
> Feb 06 09:52:19 [58630] node-4 pengine: notice: unpack_config:
> On loss of CCM Quorum: Ignore
> Feb 06 09:52:19 [58630] node-4 pengine: info:
> determine_online_status: Node 120_120__fd4 is online
> Feb 06 09:52:19 [58630] node-4 pengine: info:
> get_failcount_full: event_agent has failed 1 times on 120_120__fd4
> Feb 06 09:52:19 [58630] node-4 pengine: notice:
> check_operation_expiry: Clearing failure of event_agent on
> 120_120__fd4 because it expired | event_agent_clear_failcount_0
This indicates that a failure-timeout is set for the event_agent
resource, and the last failure happened longer ago than that timeout,
so the failure will be ignored (other than being displayed in status).
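For illustration only, here is a minimal sketch of what that expiry
check amounts to (hypothetical names and a standalone helper, not the
actual check_operation_expiry() code): a recorded failure is treated
as expired once it is older than the resource's failure-timeout, and
an expired failure no longer drives recovery decisions.

    #include <stdbool.h>
    #include <time.h>

    /* Sketch only: true if a recorded failure is older than the
     * resource's failure-timeout and can therefore be ignored. */
    static bool
    failure_expired(time_t last_failure, time_t now, long failure_timeout_s)
    {
        /* A failure-timeout of 0 (the default) means failures never expire */
        if (failure_timeout_s <= 0) {
            return false;
        }
        return (now - last_failure) >= failure_timeout_s;
    }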
> Feb 06 09:52:19 [58630] node-4 pengine: notice: unpack_rsc_op:
> Re-initiated expired calculated failure event_agent_monitor_60000
> (rc=1, magic=0:1;9:18:0:9d1d66d2-2cbe-4182-89f6-c90ba008e2b7) on
> 120_120__fd4
> Feb 06 09:52:19 [58630] node-4 pengine: info:
> get_failcount_full: event_agent has failed 1 times on 120_120__fd4
> Feb 06 09:52:19 [58630] node-4 pengine: notice:
> check_operation_expiry: Clearing failure of event_agent on
> 120_120__fd4 because it expired | event_agent_clear_failcount_0
> Feb 06 09:52:19 [58630] node-4 pengine: info:
> get_failcount_full: event_agent has failed 1 times on 120_120__fd4
> Feb 06 09:52:19 [58630] node-4 pengine: notice:
> check_operation_expiry: Clearing failure of event_agent on
> 120_120__fd4 because it expired | event_agent_clear_failcount_0
> Feb 06 09:52:19 [58630] node-4 pengine: info:
> unpack_node_loop: Node 4052 is already processed
> Feb 06 09:52:19 [58630] node-4 pengine: info:
> unpack_node_loop: Node 4052 is already processed
> Feb 06 09:52:19 [58630] node-4 pengine: info: common_print:
> pm_agent (ocf::heartbeat:pm_agent): Started 120_120__fd4
> Feb 06 09:52:19 [58630] node-4 pengine: info: common_print:
> event_agent (ocf::heartbeat:event_agent): Started 120_120__fd4
> Feb 06 09:52:19 [58630] node-4 pengine: info: common_print:
> nwmonitor_vip (ocf::heartbeat:IPaddr2): Started 120_120__fd4
> Feb 06 09:52:19 [58630] node-4 pengine: info: common_print:
> nwmonitor (ocf::heartbeat:nwmonitor): Started 120_120__fd4
> Feb 06 09:52:19 [58630] node-4 pengine: info: LogActions:
> Leave pm_agent (Started 120_120__fd4)
> Feb 06 09:52:19 [58630] node-4 pengine: info: LogActions:
> Leave event_agent (Started 120_120__fd4)
Because the last failure has expired, pacemaker does not need to
recover event_agent.
> Feb 06 09:52:19 [58630] node-4 pengine: info: LogActions:
> Leave nwmonitor_vip (Started 120_120__fd4)
> Feb 06 09:52:19 [58630] node-4 pengine: info: LogActions:
> Leave nwmonitor (Started 120_120__fd4)
> 3. The event_agent resource is marked fail by attrd, which triggered
> a pengine computation, but the PE actually doesn't do anything about
> event_agent afterwards. Is it related to the check_operation_expiry
> function in unpack.c? I see the following comments in this function:
> /* clearing recurring monitor operation failures automatically
>  * needs to be carefully considered */
> if (safe_str_eq(crm_element_value(xml_op, XML_LRM_ATTR_TASK), "monitor") &&
>     safe_str_neq(crm_element_value(xml_op, XML_LRM_ATTR_INTERVAL), "0")) {
>     /* TODO, in the future we should consider not clearing recurring monitor
>      * op failures unless the last action for a resource was a "stop" action.
>      * otherwise it is possible that clearing the monitor failure will result
>      * in the resource being in an undeterministic state.
Yes, this is relevant -- the event_agent monitor had previously failed,
but the failure has expired due to failure-timeout. The comment here
suggests that we may not want to expire monitor failures unless there
has been a stop since then, but that would defeat the intent of
failure-timeout, so it's not straightforward which is the better
handling.
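As a rough sketch of the stricter policy that TODO comment describes
(illustrative only, with hypothetical names, not Pacemaker's actual
logic), an expired recurring-monitor failure would only be cleared if
the resource has been stopped since the failure occurred:

    #include <stdbool.h>
    #include <time.h>

    /* Sketch only: clear an expired failure, except for recurring
     * monitor failures with no stop since the failure, because then
     * the resource's real state is uncertain. */
    static bool
    should_clear_failure(bool is_recurring_monitor, time_t last_failure,
                         time_t last_stop, time_t now, long failure_timeout_s)
    {
        if (failure_timeout_s <= 0
            || (now - last_failure) < failure_timeout_s) {
            return false;   /* failure has not expired yet */
        }
        if (is_recurring_monitor && last_stop < last_failure) {
            return false;   /* no stop since the failure: do not clear */
        }
        return true;        /* expired, and safe to clear */
    }

The downside is the one mentioned above: a resource whose failed
monitor never leads to a stop would then never have its failcount
expire, which defeats the point of failure-timeout.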
--
Ken Gaillot <kgaillot at redhat.com>