[ClusterLabs] Pacemaker does not reschedule a resource running in a Docker container after Docker is restarted, but the cluster still shows the resource as started!
ma.jinfeng at zte.com.cn
Thu Feb 14 19:55:19 EST 2019
There is an issue where Pacemaker does not reschedule a resource that runs in a Docker container after Docker is restarted, yet the cluster still shows the resource as started; it looks like a Pacemaker bug.
I am very confused about what happens when pengine prints these logs (pengine: notice: check_operation_expiry: Clearing failure of event_agent on 120_120__fd4 because it expired | event_agent_clear_failcount_0). Does anyone know what they mean? Thank you very much!
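If it helps: as far as I can tell, that pengine message is printed when the resource has a failure-timeout meta attribute configured and that much time has already passed since the failed monitor, so the scheduler clears the fail count instead of recovering the resource. A minimal sketch of that expiry decision (illustrative names only, not the actual Pacemaker source):

    /* Sketch (assumption, not Pacemaker code): a recorded failure is treated
     * as expired once failure-timeout seconds have passed since the failed
     * operation completed. last_failure_time and failure_timeout_s are
     * illustrative parameter names. */
    #include <stdbool.h>
    #include <time.h>

    static bool
    failure_has_expired(time_t last_failure_time, long failure_timeout_s, time_t now)
    {
        if (failure_timeout_s <= 0) {
            return false;   /* no failure-timeout configured: failures never expire */
        }
        return (now - last_failure_time) >= failure_timeout_s;
    }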
1. pacemaker/corosync version: 1.1.16/2.4.3
2. The corosync logs are as follows:
Feb 06 09:52:19 [58629] node-4 attrd: info: attrd_peer_update: Setting event_agent_status[120_120__fd4]: ok -> fail from 120_120__fd4
Feb 06 09:52:19 [58629] node-4 attrd: info: write_attribute: Sent update 50 with 1 changes for event_agent_status, id=<n/a>, set=(null)
Feb 06 09:52:19 [58629] node-4 attrd: info: attrd_cib_callback: Update 50 for event_agent_status: OK (0)
Feb 06 09:52:19 [58629] node-4 attrd: info: attrd_cib_callback: Update 50 for event_agent_status[120_120__fd4]=fail: OK (0)
Feb 06 09:52:19 [58630] node-4 pengine: notice: unpack_config: On loss of CCM Quorum: Ignore
Feb 06 09:52:19 [58630] node-4 pengine: info: determine_online_status: Node 120_120__fd4 is online
Feb 06 09:52:19 [58630] node-4 pengine: info: get_failcount_full: event_agent has failed 1 times on 120_120__fd4
Feb 06 09:52:19 [58630] node-4 pengine: notice: check_operation_expiry: Clearing failure of event_agent on 120_120__fd4 because it expired | event_agent_clear_failcount_0
Feb 06 09:52:19 [58630] node-4 pengine: notice: unpack_rsc_op: Re-initiated expired calculated failure event_agent_monitor_60000 (rc=1, magic=0:1;9:18:0:9d1d66d2-2cbe-4182-89f6-c90ba008e2b7) on 120_120__fd4
Feb 06 09:52:19 [58630] node-4 pengine: info: get_failcount_full: event_agent has failed 1 times on 120_120__fd4
Feb 06 09:52:19 [58630] node-4 pengine: notice: check_operation_expiry: Clearing failure of event_agent on 120_120__fd4 because it expired | event_agent_clear_failcount_0
Feb 06 09:52:19 [58630] node-4 pengine: info: get_failcount_full: event_agent has failed 1 times on 120_120__fd4
Feb 06 09:52:19 [58630] node-4 pengine: notice: check_operation_expiry: Clearing failure of event_agent on 120_120__fd4 because it expired | event_agent_clear_failcount_0
Feb 06 09:52:19 [58630] node-4 pengine: info: unpack_node_loop: Node 4052 is already processed
Feb 06 09:52:19 [58630] node-4 pengine: info: unpack_node_loop: Node 4052 is already processed
Feb 06 09:52:19 [58630] node-4 pengine: info: common_print: pm_agent (ocf::heartbeat:pm_agent): Started 120_120__fd4
Feb 06 09:52:19 [58630] node-4 pengine: info: common_print: event_agent (ocf::heartbeat:event_agent): Started 120_120__fd4
Feb 06 09:52:19 [58630] node-4 pengine: info: common_print: nwmonitor_vip (ocf::heartbeat:IPaddr2): Started 120_120__fd4
Feb 06 09:52:19 [58630] node-4 pengine: info: common_print: nwmonitor (ocf::heartbeat:nwmonitor): Started 120_120__fd4
Feb 06 09:52:19 [58630] node-4 pengine: info: LogActions: Leave pm_agent (Started 120_120__fd4)
Feb 06 09:52:19 [58630] node-4 pengine: info: LogActions: Leave event_agent (Started 120_120__fd4)
Feb 06 09:52:19 [58630] node-4 pengine: info: LogActions: Leave nwmonitor_vip (Started 120_120__fd4)
Feb 06 09:52:19 [58630] node-4 pengine: info: LogActions: Leave nwmonitor (Started 120_120__fd4)
3. The event_agent resource is marked as failed by attrd, which triggers a pengine calculation, but the PE does not actually do anything about event_agent afterwards. Is this related to the check_operation_expiry function in unpack.c? I see the following comment in that function:
    /* clearing recurring monitor operation failures automatically
     * needs to be carefully considered */
    if (safe_str_eq(crm_element_value(xml_op, XML_LRM_ATTR_TASK), "monitor") &&
        safe_str_neq(crm_element_value(xml_op, XML_LRM_ATTR_INTERVAL), "0")) {

        /* TODO, in the future we should consider not clearing recurring monitor
         * op failures unless the last action for a resource was a "stop" action.
         * otherwise it is possible that clearing the monitor failure will result
         * in the resource being in an undeterministic state.
         * [...] */
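That TODO seems to describe exactly this situation: the expired monitor failure is cleared even though the last action for the resource was not a stop, so the resource is left looking "Started". A rough sketch of the guard the comment proposes (my own illustration, not code that exists in unpack.c):

    /* Sketch of the check suggested by the TODO above (assumption, not actual
     * Pacemaker code): only treat an expired recurring-monitor failure as
     * clearable when the most recent recorded action for the resource was a
     * "stop". last_task is an illustrative parameter name. */
    #include <stdbool.h>
    #include <string.h>

    static bool
    safe_to_clear_monitor_failure(const char *last_task)
    {
        return (last_task != NULL) && (strcmp(last_task, "stop") == 0);
    }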