[ClusterLabs] Pacemaker does not schedule a resource in a Docker container after Docker is restarted, but the cluster shows the resource as started!
Ken Gaillot
kgaillot at redhat.com
Mon Feb 18 09:51:57 EST 2019
On Fri, 2019-02-15 at 08:55 +0800, ma.jinfeng at zte.com.cn wrote:
> There is an issue where Pacemaker does not schedule a resource that
> runs in a Docker container after Docker is restarted, yet the cluster
> still shows the resource as started; it seems to be a bug in
> Pacemaker. I am very confused about what happens when pengine prints
> these logs (pengine: notice: check_operation_expiry: Clearing
> failure of event_agent on 120_120__fd4 because it expired |
> event_agent_clear_failcount_0). Does anyone know what they mean?
> Thank you very much!
> 1. pacemaker/corosync version: 1.1.16/2.4.3
> 2. corosync logs as follows:
> Feb 06 09:52:19 [58629] node-4 attrd: info:
> attrd_peer_update: Setting event_agent_status[120_120__fd4]: ok ->
> fail from 120_120__fd4
This is the attribute manager setting the "event_agent_status"
attribute for node "120_120__fd4" to "fail". That is a user-created
attribute; pacemaker does not do anything with it other than store it.
Most likely, the resource agent monitor action created it.
> Feb 06 09:52:19 [58629] node-4 attrd: info: write_attribute:
> Sent update 50 with 1 changes for event_agent_status, id=<n/a>,
> set=(null)
> Feb 06 09:52:19 [58629] node-4 attrd: info:
> attrd_cib_callback: Update 50 for event_agent_status: OK (0)
> Feb 06 09:52:19 [58629] node-4 attrd: info:
> attrd_cib_callback: Update 50 for
> event_agent_status[120_120__fd4]=fail: OK (0)
> Feb 06 09:52:19 [58630] node-4 pengine: notice: unpack_config:
> On loss of CCM Quorum: Ignore
> Feb 06 09:52:19 [58630] node-4 pengine: info:
> determine_online_status: Node 120_120__fd4 is online
> Feb 06 09:52:19 [58630] node-4 pengine: info:
> get_failcount_full: event_agent has failed 1 times on 120_120__fd4
> Feb 06 09:52:19 [58630] node-4 pengine: notice:
> check_operation_expiry: Clearing failure of event_agent on
> 120_120__fd4 because it expired | event_agent_clear_failcount_0
This indicates that a failure-timeout is set for the event_agent
resource, and the last failure happened longer ago than that timeout,
so the failure will be ignored (other than being displayed in status).
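For illustration only, here is a minimal sketch of what that expiry
check amounts to (hypothetical names and a standalone helper, not the
actual check_operation_expiry() code): a recorded failure is treated
as expired once it is older than the resource's failure-timeout, and
an expired failure no longer drives recovery decisions.

    #include <stdbool.h>
    #include <time.h>

    /* Sketch only: true if a recorded failure is older than the
     * resource's failure-timeout and can therefore be ignored. */
    static bool
    failure_expired(time_t last_failure, time_t now, long failure_timeout_s)
    {
        /* A failure-timeout of 0 (the default) means failures never expire */
        if (failure_timeout_s <= 0) {
            return false;
        }
        return (now - last_failure) >= failure_timeout_s;
    }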
> Feb 06 09:52:19 [58630] node-4 pengine: notice: unpack_rsc_op:
> Re-initiated expired calculated failure event_agent_monitor_60000
> (rc=1, magic=0:1;9:18:0:9d1d66d2-2cbe-4182-89f6-c90ba008e2b7) on
> 120_120__fd4
> Feb 06 09:52:19 [58630] node-4 pengine: info:
> get_failcount_full: event_agent has failed 1 times on 120_120__fd4
> Feb 06 09:52:19 [58630] node-4 pengine: notice:
> check_operation_expiry: Clearing failure of event_agent on
> 120_120__fd4 because it expired | event_agent_clear_failcount_0
> Feb 06 09:52:19 [58630] node-4 pengine: info:
> get_failcount_full: event_agent has failed 1 times on 120_120__fd4
> Feb 06 09:52:19 [58630] node-4 pengine: notice:
> check_operation_expiry: Clearing failure of event_agent on
> 120_120__fd4 because it expired | event_agent_clear_failcount_0
> Feb 06 09:52:19 [58630] node-4 pengine: info:
> unpack_node_loop: Node 4052 is already processed
> Feb 06 09:52:19 [58630] node-4 pengine: info:
> unpack_node_loop: Node 4052 is already processed
> Feb 06 09:52:19 [58630] node-4 pengine: info: common_print:
> pm_agent (ocf::heartbeat:pm_agent): Started 120_120__fd4
> Feb 06 09:52:19 [58630] node-4 pengine: info: common_print:
> event_agent (ocf::heartbeat:event_agent): Started 120_120__fd4
> Feb 06 09:52:19 [58630] node-4 pengine: info: common_print:
> nwmonitor_vip (ocf::heartbeat:IPaddr2): Started 120_120__fd4
> Feb 06 09:52:19 [58630] node-4 pengine: info: common_print:
> nwmonitor (ocf::heartbeat:nwmonitor): Started 120_120__fd4
> Feb 06 09:52:19 [58630] node-4 pengine: info: LogActions:
> Leave pm_agent (Started 120_120__fd4)
> Feb 06 09:52:19 [58630] node-4 pengine: info: LogActions:
> Leave event_agent (Started 120_120__fd4)
Because the last failure has expired, pacemaker does not need to
recover event_agent.
> Feb 06 09:52:19 [58630] node-4 pengine: info: LogActions:
> Leave nwmonitor_vip (Started 120_120__fd4)
> Feb 06 09:52:19 [58630] node-4 pengine: info: LogActions:
> Leave nwmonitor (Started 120_120__fd4)
> 3. The event_agent resource is marked fail by attrd, which triggered
> a pengine computation, but the PE actually doesn't do anything about
> event_agent afterwards. Is it related to the check_operation_expiry
> function in unpack.c? I see the following comments in this function:
> /* clearing recurring monitor operation failures automatically
>  * needs to be carefully considered */
> if (safe_str_eq(crm_element_value(xml_op, XML_LRM_ATTR_TASK), "monitor") &&
>     safe_str_neq(crm_element_value(xml_op, XML_LRM_ATTR_INTERVAL), "0")) {
>     /* TODO, in the future we should consider not clearing recurring monitor
>      * op failures unless the last action for a resource was a "stop" action.
>      * otherwise it is possible that clearing the monitor failure will result
>      * in the resource being in an undeterministic state.
Yes, this is relevant -- the event_agent monitor had previously failed,
but the failure has expired due to failure-timeout. The comment here
suggests that we may not want to expire monitor failures unless there
has been a stop since then, but that would defeat the intent of
failure-timeout, so it's not straightforward which is the better
handling.
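As a rough sketch of the stricter policy that TODO comment describes
(illustrative only, with hypothetical names, not Pacemaker's actual
logic), an expired recurring-monitor failure would only be cleared if
the resource has been stopped since the failure occurred:

    #include <stdbool.h>
    #include <time.h>

    /* Sketch only: clear an expired failure, except for recurring
     * monitor failures with no stop since the failure, because then
     * the resource's real state is uncertain. */
    static bool
    should_clear_failure(bool is_recurring_monitor, time_t last_failure,
                         time_t last_stop, time_t now, long failure_timeout_s)
    {
        if (failure_timeout_s <= 0
            || (now - last_failure) < failure_timeout_s) {
            return false;   /* failure has not expired yet */
        }
        if (is_recurring_monitor && last_stop < last_failure) {
            return false;   /* no stop since the failure: do not clear */
        }
        return true;        /* expired, and safe to clear */
    }

The downside is the one mentioned above: a resource whose failed
monitor never leads to a stop would then never have its failcount
expire, which defeats the point of failure-timeout.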
--
Ken Gaillot <kgaillot at redhat.com>