[ClusterLabs] clear_failcount operation times out, makes it impossible to use the cluster

Mon Jan 2 06:16:25 EST 2023

Hello Clusterlabs,

I'm getting this error in the logs:

Dec 20 09:25:43 [57862] swdal1-ISCSI01       crmd:    error: 
print_synapse:     [Action    7]: In-flight crm op 
ping_resource_clear_failcount_0   on swdal1-ISCSI01 (priority: 0, 
waiting: none)

My specfications:

OS: Debian 8
Pacemaker version: 1.1.12
Kernel version: 4.19.190

I'd like to know what can cause this error to happen and how to prevent 
it in the future. I'm also currently unable to update to a newer version 
of pacemaker.

Here is some context for when it happens. It seems that the 
ping_resource resources are in 'Restart' state:

Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:   notice: 
LogActions:        Restart ping_resource:0 (Started swdal1-ISCSI01)
Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:   notice: 
LogActions:        Restart ping_resource:1 (Started swdal1-ISCSI02)

which causes pacemaker to try to clear the failcounts on those resources:

Dec 20 09:24:23 [57862] swdal1-ISCSI01       crmd:     info: 
do_state_transition:       State transition S_POLICY_ENGINE -> 
S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE 
origin=handle_response ]
Dec 20 09:24:23 [57862] swdal1-ISCSI01       crmd:     info: 
do_te_invoke:      Processing graph 11 (ref=pe_calc-dc-1671528262-59) 
derived from /var/lib/pacemaker/pengine/pe-input-518.bz2
Dec 20 09:24:23 [57862] swdal1-ISCSI01       crmd:     info: 
te_crm_command:    Executing crm-event (7): clear_failcount on 
swdal1-ISCSI01
Dec 20 09:24:23 [57862] swdal1-ISCSI01       crmd:     info: 
handle_failcount_op:       Removing failcount for ping_resource
Dec 20 09:24:23 [57841] swdal1-ISCSI01        cib:     info: 
cib_process_request:       Forwarding cib_delete operation for section 
//node_state[@uname='swdal1-ISCSI01']//lrm_resource[@id='ping_resource']/lrm_rsc_op[@id='ping_resource_last_failure_0'] 
to master (origin=local/crmd/118)
Dec 20 09:24:23 [57841] swdal1-ISCSI01        cib:     info: 
cib_process_request:       Completed cib_delete operation for section 
//node_state[@uname='swdal1-ISCSI01']//lrm_resource[@id='ping_resource']/lrm_rsc_op[@id='ping_resource_last_failure_0']: 
OK (rc=0, origin=swdal1-ISCSI01/crmd/118, version=0.60.0)
Dec 20 09:24:28 [57841] swdal1-ISCSI01        cib:     info: 
cib_process_ping:  Reporting our current digest to swdal1-ISCSI01: 
ccf71244504d3deb02d0da64fa72cedc for 0.60.0 (0x55788a83c4b0 0)
Dec 20 09:25:43 [57862] swdal1-ISCSI01       crmd:  warning: 
action_timer_callback:     Timer popped (timeout=20000, abort_level=0, 
complete=false)
Dec 20 09:25:43 [57862] swdal1-ISCSI01       crmd:    error: 
print_synapse:     [Action    7]: In-flight crm op 
ping_resource_clear_failcount_0   on swdal1-ISCSI01 (priority: 0, 
waiting: none)
Dec 20 09:25:43 [57862] swdal1-ISCSI01       crmd:   notice: 
abort_transition_graph:    Transition aborted: Action lost 
(source=action_timer_callback:772, 0)
Dec 20 09:25:43 [57862] swdal1-ISCSI01       crmd:   notice: 
run_graph:         Transition 11 (Complete=1, Pending=0, Fired=0, 
Skipped=9, Incomplete=2, 
Source=/var/lib/pacemaker/pengine/pe-input-518.bz2): Stopped

Clearing the failcount fails, so the whole transition is aborted. This 
make it impossible to do anything in the cluster, for example move 
Pool-0 resource, as it also trigger the clear_failcount operation which 
fails and aborts the transition, for example:

Dec 20 09:35:04 [57851] swdal1-ISCSI01    pengine:     info: 
RecurringOp:        Start recurring monitor (5s) for ping_resource:0 on 
swdal1-ISCSI01
Dec 20 09:35:04 [57851] swdal1-ISCSI01    pengine:     info: 
RecurringOp:        Start recurring monitor (5s) for ping_resource:1 on 
swdal1-ISCSI02
Dec 20 09:35:04 [57851] swdal1-ISCSI01    pengine:     info: 
RecurringOp:        Start recurring monitor (10s) for Pool-0 on 
swdal1-ISCSI02
Dec 20 09:35:04 [57851] swdal1-ISCSI01    pengine:   notice: 
LogActions:        Restart ping_resource:0 (Started swdal1-ISCSI01)
Dec 20 09:35:04 [57851] swdal1-ISCSI01    pengine:   notice: 
LogActions:        Restart ping_resource:1 (Started swdal1-ISCSI02)
Dec 20 09:35:04 [57851] swdal1-ISCSI01    pengine:     info: 
LogActions:        Leave   Pool-1  (Started swdal1-ISCSI01)
Dec 20 09:35:04 [57851] swdal1-ISCSI01    pengine:   notice: 
LogActions:        Move    Pool-0  (Started swdal1-ISCSI01 -> 
swdal1-ISCSI02)
Dec 20 09:35:04 [57851] swdal1-ISCSI01    pengine:   notice: 
process_pe_message:        Calculated Transition 19: 
/var/lib/pacemaker/pengine/pe-input-519.bz2
Dec 20 09:35:04 [57862] swdal1-ISCSI01       crmd:     info: 
do_state_transition:       State transition S_POLICY_ENGINE -> 
S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE 
origin=handle_response ]
Dec 20 09:35:04 [57862] swdal1-ISCSI01       crmd:     info: 
do_te_invoke:      Processing graph 19 (ref=pe_calc-dc-1671528904-75) 
derived from /var/lib/pacemaker/pengine/pe-input-519.bz2
Dec 20 09:35:04 [57862] swdal1-ISCSI01       crmd:     info: 
te_crm_command:    Executing crm-event (7): clear_failcount on 
swdal1-ISCSI01
Dec 20 09:35:04 [57862] swdal1-ISCSI01       crmd:     info: 
handle_failcount_op:       Removing failcount for ping_resource
Dec 20 09:35:04 [57841] swdal1-ISCSI01        cib:     info: 
cib_process_request:       Forwarding cib_delete operation for section 
//node_state[@uname='swdal1-ISCSI01']//lrm_resource[@id='ping_resource']/lrm_rsc_op[@id='ping_resource_last_failure_0'] 
to master (origin=local/crmd/134)
Dec 20 09:35:04 [57841] swdal1-ISCSI01        cib:     info: 
cib_process_request:       Completed cib_delete operation for section 
//node_state[@uname='swdal1-ISCSI01']//lrm_resource[@id='ping_resource']/lrm_rsc_op[@id='ping_resource_last_failure_0']: 
OK (rc=0, origin=swdal1-ISCSI01/crmd/134, version=0.61.0)
Dec 20 09:35:09 [57841] swdal1-ISCSI01        cib:     info: 
cib_process_ping:  Reporting our current digest to swdal1-ISCSI01: 
decc3ad1315820648f242167998a5880 for 0.61.0 (0x55788a8408e0 0)
Dec 20 09:36:24 [57862] swdal1-ISCSI01       crmd:  warning: 
action_timer_callback:     Timer popped (timeout=20000, abort_level=0, 
complete=false)
Dec 20 09:36:24 [57862] swdal1-ISCSI01       crmd:    error: 
print_synapse:     [Action    7]: In-flight crm op 
ping_resource_clear_failcount_0   on swdal1-ISCSI01 (priority: 0, 
waiting: none)
Dec 20 09:36:24 [57862] swdal1-ISCSI01       crmd:   notice: 
abort_transition_graph:    Transition aborted: Action lost 
(source=action_timer_callback:772, 0)
Dec 20 09:36:24 [57862] swdal1-ISCSI01       crmd:   notice: 
run_graph:         Transition 19 (Complete=1, Pending=0, Fired=0, 
Skipped=12, Incomplete=2, 
Source=/var/lib/pacemaker/pengine/pe-input-519.bz2): Stopped

As you can see the 'stop' operation for resource Pool-0 did not even 
run, as the transition was stopped by the clear_failcount error. This 
error kept happening until we restarted pacemaker. Here is some more 
context from one of the times this error has happened:

Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info: 
process_pe_message:        Input has not changed since last time, not 
saving to disk
Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:   notice: 
unpack_config:     On loss of CCM Quorum: Ignore
Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info: 
determine_online_status:   Node swdal1-ISCSI01 is online
Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info: 
determine_online_status:   Node swdal1-ISCSI02 is online
Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info: 
determine_op_status:       Operation monitor found resource Pool-0 
active on swdal1-ISCSI01
Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info: 
determine_op_status:       Operation monitor found resource Pool-0 
active on swdal1-ISCSI01
Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info: 
determine_op_status:       Operation monitor found resource Pool-1 
active on swdal1-ISCSI01
Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info: 
determine_op_status:       Operation monitor found resource Pool-1 
active on swdal1-ISCSI01
Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info: 
clone_print:        Clone Set: ping_resource-clone [ping_resource]
Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info: 
short_print:            Started: [ swdal1-ISCSI01 swdal1-ISCSI02 ]
Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info: 
native_print:      Pool-1  (ocf::oe:zfs):  Started swdal1-ISCSI01
Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info: 
native_print:      Pool-0  (ocf::oe:zfs):  Started swdal1-ISCSI01
Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info: 
get_failcount_full:        ping_resource:0 has failed 8 times on 
swdal1-ISCSI01
Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info: 
common_apply_stickiness:   ping_resource-clone can fail 999992 more 
times on swdal1-ISCSI01 before being forced off
Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info: 
get_failcount_full:        ping_resource:1 has failed 8 times on 
swdal1-ISCSI01
Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info: 
common_apply_stickiness:   ping_resource-clone can fail 999992 more 
times on swdal1-ISCSI01 before being forced off
Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info: 
check_action_definition:   params:reload   <parameters multiplier="1000" 
dampen="15s" host_list="10.151.17.50 10.151.16.50 10.151.17.60 
10.151.16.60" attempts="4" timeout="3"/>
Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info: 
check_action_definition:   Parameters to ping_resource:0_start_0 on 
swdal1-ISCSI01 changed: was 57524cd0b7204dd60c127ba66fb83cd2 vs. now 
1a37c0e0391890df8549f5fda647f4d9 (reload:3.0.9) 
0:0;14:28:0:a0f1b96e-5089-4dad-9073-8c8feac4ea3a
Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info: 
get_failcount_full:        ping_resource:0 has failed 8 times on 
swdal1-ISCSI01
Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info: 
check_action_definition:   params:reload   <parameters multiplier="1000" 
dampen="15s" host_list="10.151.17.50 10.151.16.50 10.151.17.60 
10.151.16.60" attempts="4" timeout="3" CRM_meta_timeout="15000"/>
Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info: 
check_action_definition:   Parameters to ping_resource:0_monitor_5000 on 
swdal1-ISCSI01 changed: was f3b4adf4d46692f312296263faa50a75 vs. now 
c0d10fc8996c295dd1213d4ca058c0e7 (reload:3.0.9) 
0:0;15:28:0:a0f1b96e-5089-4dad-9073-8c8feac4ea3a
Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info: 
get_failcount_full:        ping_resource:0 has failed 8 times on 
swdal1-ISCSI01
Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info: 
check_action_definition:   params:reload   <parameters multiplier="1000" 
dampen="15s" host_list="10.151.17.50 10.151.16.50 10.151.17.60 
10.151.16.60" attempts="4" timeout="3"/>
Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info: 
check_action_definition:   Parameters to ping_resource:1_start_0 on 
swdal1-ISCSI02 changed: was 57524cd0b7204dd60c127ba66fb83cd2 vs. now 
1a37c0e0391890df8549f5fda647f4d9 (reload:3.0.9) 
0:0;17:7:0:0ea53274-56ef-48f6-9de1-38d635fa2530
Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info: 
check_action_definition:   params:reload   <parameters multiplier="1000" 
dampen="15s" host_list="10.151.17.50 10.151.16.50 10.151.17.60 
10.151.16.60" attempts="4" timeout="3" CRM_meta_timeout="15000"/>
Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info: 
check_action_definition:   Parameters to ping_resource:1_monitor_5000 on 
swdal1-ISCSI02 changed: was f3b4adf4d46692f312296263faa50a75 vs. now 
c0d10fc8996c295dd1213d4ca058c0e7 (reload:3.0.9) 
0:0;18:7:0:0ea53274-56ef-48f6-9de1-38d635fa2530
Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info: 
RecurringOp:        Start recurring monitor (5s) for ping_resource:0 on 
swdal1-ISCSI01
Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info: 
RecurringOp:        Start recurring monitor (5s) for ping_resource:1 on 
swdal1-ISCSI02
Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:   notice: 
LogActions:        Restart ping_resource:0 (Started swdal1-ISCSI01)
Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:   notice: 
LogActions:        Restart ping_resource:1 (Started swdal1-ISCSI02)
Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info: 
LogActions:        Leave   Pool-1  (Started swdal1-ISCSI01)
Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info: 
LogActions:        Leave   Pool-0  (Started swdal1-ISCSI01)
Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:   notice: 
process_pe_message:        Calculated Transition 11: 
/var/lib/pacemaker/pengine/pe-input-518.bz2
Dec 20 09:24:23 [57862] swdal1-ISCSI01       crmd:     info: 
do_state_transition:       State transition S_POLICY_ENGINE -> 
S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE 
origin=handle_response ]
Dec 20 09:24:23 [57862] swdal1-ISCSI01       crmd:     info: 
do_te_invoke:      Processing graph 11 (ref=pe_calc-dc-1671528262-59) 
derived from /var/lib/pacemaker/pengine/pe-input-518.bz2
Dec 20 09:24:23 [57862] swdal1-ISCSI01       crmd:     info: 
te_crm_command:    Executing crm-event (7): clear_failcount on 
swdal1-ISCSI01
Dec 20 09:24:23 [57862] swdal1-ISCSI01       crmd:     info: 
handle_failcount_op:       Removing failcount for ping_resource
Dec 20 09:24:23 [57841] swdal1-ISCSI01        cib:     info: 
cib_process_request:       Forwarding cib_delete operation for section 
//node_state[@uname='swdal1-ISCSI01']//lrm_resource[@id='ping_resource']/lrm_rsc_op[@id='ping_resource_last_failure_0'] 
to master (origin=local/crmd/118)
Dec 20 09:24:23 [57841] swdal1-ISCSI01        cib:     info: 
cib_process_request:       Completed cib_delete operation for section 
//node_state[@uname='swdal1-ISCSI01']//lrm_resource[@id='ping_resource']/lrm_rsc_op[@id='ping_resource_last_failure_0']: 
OK (rc=0, origin=swdal1-ISCSI01/crmd/118, version=0.60.0)
Dec 20 09:24:28 [57841] swdal1-ISCSI01        cib:     info: 
cib_process_ping:  Reporting our current digest to swdal1-ISCSI01: 
ccf71244504d3deb02d0da64fa72cedc for 0.60.0 (0x55788a83c4b0 0)
Dec 20 09:25:43 [57862] swdal1-ISCSI01       crmd:  warning: 
action_timer_callback:     Timer popped (timeout=20000, abort_level=0, 
complete=false)
Dec 20 09:25:43 [57862] swdal1-ISCSI01       crmd:    error: 
print_synapse:     [Action    7]: In-flight crm op 
ping_resource_clear_failcount_0   on swdal1-ISCSI01 (priority: 0, 
waiting: none)
Dec 20 09:25:43 [57862] swdal1-ISCSI01       crmd:   notice: 
abort_transition_graph:    Transition aborted: Action lost 
(source=action_timer_callback:772, 0)
Dec 20 09:25:43 [57862] swdal1-ISCSI01       crmd:   notice: 
run_graph:         Transition 11 (Complete=1, Pending=0, Fired=0, 
Skipped=9, Incomplete=2, 
Source=/var/lib/pacemaker/pengine/pe-input-518.bz2): Stopped
Dec 20 09:25:43 [57862] swdal1-ISCSI01       crmd:     info: 
do_state_transition:       State transition S_TRANSITION_ENGINE -> 
S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=notify_crmd ]

I'd appreciate some information about this topic.