[ClusterLabs] Pacemaker does not schedule a resource running in a docker container after docker is restarted, but the cluster shows the resource as started!

ma.jinfeng at zte.com.cn ma.jinfeng at zte.com.cn
Thu Feb 14 19:55:19 EST 2019


There is an issue where Pacemaker does not schedule a resource that runs in a docker container after docker is restarted, yet the cluster still shows the resource as started; it seems to be a bug in Pacemaker.

 I am very confused about what happened when pengine printed these logs (pengine:   notice: check_operation_expiry:	Clearing failure of event_agent on 120_120__fd4 because it expired | event_agent_clear_failcount_0). Does anyone know what they mean? Thank you very much!


1. pacemaker/corosync version:  1.1.16/2.4.3


2. The corosync logs are as follows:

Feb 06 09:52:19 [58629] node-4      attrd:     info: attrd_peer_update:	Setting event_agent_status[120_120__fd4]: ok -> fail from 120_120__fd4

Feb 06 09:52:19 [58629] node-4      attrd:     info: write_attribute:	Sent update 50 with 1 changes for event_agent_status, id=<n/a>, set=(null)

Feb 06 09:52:19 [58629] node-4      attrd:     info: attrd_cib_callback:	Update 50 for event_agent_status: OK (0)

Feb 06 09:52:19 [58629] node-4      attrd:     info: attrd_cib_callback:	Update 50 for event_agent_status[120_120__fd4]=fail: OK (0)

Feb 06 09:52:19 [58630] node-4    pengine:   notice: unpack_config:	On loss of CCM Quorum: Ignore

Feb 06 09:52:19 [58630] node-4    pengine:     info: determine_online_status:	Node 120_120__fd4 is online

Feb 06 09:52:19 [58630] node-4    pengine:     info: get_failcount_full:	event_agent has failed 1 times on 120_120__fd4

Feb 06 09:52:19 [58630] node-4    pengine:   notice: check_operation_expiry:	Clearing failure of event_agent on 120_120__fd4 because it expired | event_agent_clear_failcount_0

Feb 06 09:52:19 [58630] node-4    pengine:   notice: unpack_rsc_op:	Re-initiated expired calculated failure event_agent_monitor_60000 (rc=1, magic=0:1;9:18:0:9d1d66d2-2cbe-4182-89f6-c90ba008e2b7) on 120_120__fd4

Feb 06 09:52:19 [58630] node-4    pengine:     info: get_failcount_full:	event_agent has failed 1 times on 120_120__fd4

Feb 06 09:52:19 [58630] node-4    pengine:   notice: check_operation_expiry:	Clearing failure of event_agent on 120_120__fd4 because it expired | event_agent_clear_failcount_0

Feb 06 09:52:19 [58630] node-4    pengine:     info: get_failcount_full:	event_agent has failed 1 times on 120_120__fd4

Feb 06 09:52:19 [58630] node-4    pengine:   notice: check_operation_expiry:	Clearing failure of event_agent on 120_120__fd4 because it expired | event_agent_clear_failcount_0

Feb 06 09:52:19 [58630] node-4    pengine:     info: unpack_node_loop:	Node 4052 is already processed

Feb 06 09:52:19 [58630] node-4    pengine:     info: unpack_node_loop:	Node 4052 is already processed

Feb 06 09:52:19 [58630] node-4    pengine:     info: common_print:	pm_agent	(ocf::heartbeat:pm_agent):	Started 120_120__fd4

Feb 06 09:52:19 [58630] node-4    pengine:     info: common_print:	event_agent	(ocf::heartbeat:event_agent):	Started 120_120__fd4

Feb 06 09:52:19 [58630] node-4    pengine:     info: common_print:	nwmonitor_vip	(ocf::heartbeat:IPaddr2):	Started 120_120__fd4

Feb 06 09:52:19 [58630] node-4    pengine:     info: common_print:	nwmonitor	(ocf::heartbeat:nwmonitor):	Started 120_120__fd4

Feb 06 09:52:19 [58630] node-4    pengine:     info: LogActions:	Leave   pm_agent	(Started 120_120__fd4)

Feb 06 09:52:19 [58630] node-4    pengine:     info: LogActions:	Leave   event_agent	(Started 120_120__fd4)

Feb 06 09:52:19 [58630] node-4    pengine:     info: LogActions:	Leave   nwmonitor_vip	(Started 120_120__fd4)

Feb 06 09:52:19 [58630] node-4    pengine:     info: LogActions:	Leave   nwmonitor	(Started 120_120__fd4)

3. The event_agent resource is marked as failed by attrd, which triggers a pengine recalculation, but the PE actually does nothing about event_agent afterwards. Is this related to the check_operation_expiry() function in unpack.c? I see some comments in that function, as follows:

    /* clearing recurring monitor operation failures automatically
     * needs to be carefully considered */
    if (safe_str_eq(crm_element_value(xml_op, XML_LRM_ATTR_TASK), "monitor") &&
        safe_str_neq(crm_element_value(xml_op, XML_LRM_ATTR_INTERVAL), "0")) {

        /* TODO, in the future we should consider not clearing recurring monitor
         * op failures unless the last action for a resource was a "stop" action.
         * otherwise it is possible that clearing the monitor failure will result
         * in the resource being in an undeterministic state.
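
If I read the surrounding code correctly, the expiry decision boils down to comparing the time of the failed operation against the resource's failure-timeout. Below is my own simplified sketch of that idea (a hypothetical paraphrase, not the real check_operation_expiry() code; failure_expired, last_run and failure_timeout_s are names I made up for illustration):

    #include <stdbool.h>
    #include <time.h>

    /* Hypothetical paraphrase of the expiry idea: a failed operation counts as
     * expired once it completed more than failure-timeout seconds ago, and the
     * PE then queues a failcount clear such as event_agent_clear_failcount_0. */
    static bool
    failure_expired(time_t last_run, int failure_timeout_s, time_t now)
    {
        if (failure_timeout_s <= 0) {
            return false;   /* no failure-timeout configured: never expires */
        }
        return (last_run + failure_timeout_s) < now;
    }

If that is roughly right, the "Clearing failure ... because it expired" messages would simply mean that the recorded failure is older than the configured failure-timeout. But I am not sure that matches what I see here.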