[ClusterLabs] Antw: [EXT] clear_failcount operation times out, makes it impossible to use the cluster

Mon Jan 2 07:12:06 EST 2023

Hi!

I wonder: Is this a new installation, or is it a new bug in an old installation?
For the first case I'd recommend to start with current software, and for the second case please describe what had changed or what had triggered the situation.
Also provide basic configuration data, please.

Regards,
Ulrich

>>> Krzysztof Bodora <krzysztof.bodora at open-e.com> schrieb am 02.01.2023 um 12:16
in Nachricht <37d86b8d-c59f-5fe3-cba1-41d2c84fcb5b at open-e.com>:
> Hello Clusterlabs,
> 
> I'm getting this error in the logs:
> 
> Dec 20 09:25:43 [57862] swdal1-ISCSI01       crmd:    error: 
> print_synapse:     [Action    7]: In-flight crm op 
> ping_resource_clear_failcount_0   on swdal1-ISCSI01 (priority: 0, 
> waiting: none)
> 
> My specfications:
> 
> OS: Debian 8
> Pacemaker version: 1.1.12
> Kernel version: 4.19.190
> 
> I'd like to know what can cause this error to happen and how to prevent 
> it in the future. I'm also currently unable to update to a newer version 
> of pacemaker.
> 
> Here is some context for when it happens. It seems that the 
> ping_resource resources are in 'Restart' state:
> 
> Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:   notice: 
> LogActions:        Restart ping_resource:0 (Started swdal1-ISCSI01)
> Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:   notice: 
> LogActions:        Restart ping_resource:1 (Started swdal1-ISCSI02)
> 
> which causes pacemaker to try to clear the failcounts on those resources:
> 
> Dec 20 09:24:23 [57862] swdal1-ISCSI01       crmd:     info: 
> do_state_transition:       State transition S_POLICY_ENGINE -> 
> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE 
> origin=handle_response ]
> Dec 20 09:24:23 [57862] swdal1-ISCSI01       crmd:     info: 
> do_te_invoke:      Processing graph 11 (ref=pe_calc-dc-1671528262-59) 
> derived from /var/lib/pacemaker/pengine/pe-input-518.bz2
> Dec 20 09:24:23 [57862] swdal1-ISCSI01       crmd:     info: 
> te_crm_command:    Executing crm-event (7): clear_failcount on 
> swdal1-ISCSI01
> Dec 20 09:24:23 [57862] swdal1-ISCSI01       crmd:     info: 
> handle_failcount_op:       Removing failcount for ping_resource
> Dec 20 09:24:23 [57841] swdal1-ISCSI01        cib:     info: 
> cib_process_request:       Forwarding cib_delete operation for section 
> //node_state[@uname='swdal1-ISCSI01']//lrm_resource[@id='ping_resource']/lrm
> _rsc_op[@id='ping_resource_last_failure_0'] 
> to master (origin=local/crmd/118)
> Dec 20 09:24:23 [57841] swdal1-ISCSI01        cib:     info: 
> cib_process_request:       Completed cib_delete operation for section 
> //node_state[@uname='swdal1-ISCSI01']//lrm_resource[@id='ping_resource']/lrm
> _rsc_op[@id='ping_resource_last_failure_0']: 
> OK (rc=0, origin=swdal1-ISCSI01/crmd/118, version=0.60.0)
> Dec 20 09:24:28 [57841] swdal1-ISCSI01        cib:     info: 
> cib_process_ping:  Reporting our current digest to swdal1-ISCSI01: 
> ccf71244504d3deb02d0da64fa72cedc for 0.60.0 (0x55788a83c4b0 0)
> Dec 20 09:25:43 [57862] swdal1-ISCSI01       crmd:  warning: 
> action_timer_callback:     Timer popped (timeout=20000, abort_level=0, 
> complete=false)
> Dec 20 09:25:43 [57862] swdal1-ISCSI01       crmd:    error: 
> print_synapse:     [Action    7]: In-flight crm op 
> ping_resource_clear_failcount_0   on swdal1-ISCSI01 (priority: 0, 
> waiting: none)
> Dec 20 09:25:43 [57862] swdal1-ISCSI01       crmd:   notice: 
> abort_transition_graph:    Transition aborted: Action lost 
> (source=action_timer_callback:772, 0)
> Dec 20 09:25:43 [57862] swdal1-ISCSI01       crmd:   notice: 
> run_graph:         Transition 11 (Complete=1, Pending=0, Fired=0, 
> Skipped=9, Incomplete=2, 
> Source=/var/lib/pacemaker/pengine/pe-input-518.bz2): Stopped
> 
> Clearing the failcount fails, so the whole transition is aborted. This 
> make it impossible to do anything in the cluster, for example move 
> Pool-0 resource, as it also trigger the clear_failcount operation which 
> fails and aborts the transition, for example:
> 
> Dec 20 09:35:04 [57851] swdal1-ISCSI01    pengine:     info: 
> RecurringOp:        Start recurring monitor (5s) for ping_resource:0 on 
> swdal1-ISCSI01
> Dec 20 09:35:04 [57851] swdal1-ISCSI01    pengine:     info: 
> RecurringOp:        Start recurring monitor (5s) for ping_resource:1 on 
> swdal1-ISCSI02
> Dec 20 09:35:04 [57851] swdal1-ISCSI01    pengine:     info: 
> RecurringOp:        Start recurring monitor (10s) for Pool-0 on 
> swdal1-ISCSI02
> Dec 20 09:35:04 [57851] swdal1-ISCSI01    pengine:   notice: 
> LogActions:        Restart ping_resource:0 (Started swdal1-ISCSI01)
> Dec 20 09:35:04 [57851] swdal1-ISCSI01    pengine:   notice: 
> LogActions:        Restart ping_resource:1 (Started swdal1-ISCSI02)
> Dec 20 09:35:04 [57851] swdal1-ISCSI01    pengine:     info: 
> LogActions:        Leave   Pool-1  (Started swdal1-ISCSI01)
> Dec 20 09:35:04 [57851] swdal1-ISCSI01    pengine:   notice: 
> LogActions:        Move    Pool-0  (Started swdal1-ISCSI01 -> 
> swdal1-ISCSI02)
> Dec 20 09:35:04 [57851] swdal1-ISCSI01    pengine:   notice: 
> process_pe_message:        Calculated Transition 19: 
> /var/lib/pacemaker/pengine/pe-input-519.bz2
> Dec 20 09:35:04 [57862] swdal1-ISCSI01       crmd:     info: 
> do_state_transition:       State transition S_POLICY_ENGINE -> 
> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE 
> origin=handle_response ]
> Dec 20 09:35:04 [57862] swdal1-ISCSI01       crmd:     info: 
> do_te_invoke:      Processing graph 19 (ref=pe_calc-dc-1671528904-75) 
> derived from /var/lib/pacemaker/pengine/pe-input-519.bz2
> Dec 20 09:35:04 [57862] swdal1-ISCSI01       crmd:     info: 
> te_crm_command:    Executing crm-event (7): clear_failcount on 
> swdal1-ISCSI01
> Dec 20 09:35:04 [57862] swdal1-ISCSI01       crmd:     info: 
> handle_failcount_op:       Removing failcount for ping_resource
> Dec 20 09:35:04 [57841] swdal1-ISCSI01        cib:     info: 
> cib_process_request:       Forwarding cib_delete operation for section 
> //node_state[@uname='swdal1-ISCSI01']//lrm_resource[@id='ping_resource']/lrm
> _rsc_op[@id='ping_resource_last_failure_0'] 
> to master (origin=local/crmd/134)
> Dec 20 09:35:04 [57841] swdal1-ISCSI01        cib:    info: 
> cib_process_request:       Completed cib_delete operation for section 
> //node_state[@uname='swdal1-ISCSI01']//lrm_resource[@id='ping_resource']/lrm
> _rsc_op[@id='ping_resource_last_failure_0']: 
> OK (rc=0, origin=swdal1-ISCSI01/crmd/134, version=0.61.0)
> Dec 20 09:35:09 [57841] swdal1-ISCSI01        cib:     info: 
> cib_process_ping:  Reporting our current digest to swdal1-ISCSI01: 
> decc3ad1315820648f242167998a5880 for 0.61.0 (0x55788a8408e0 0)
> Dec 20 09:36:24 [57862] swdal1-ISCSI01       crmd:  warning: 
> action_timer_callback:     Timer popped (timeout=20000, abort_level=0, 
> complete=false)
> Dec 20 09:36:24 [57862] swdal1-ISCSI01       crmd:    error: 
> print_synapse:     [Action    7]: In-flight crm op 
> ping_resource_clear_failcount_0   on swdal1-ISCSI01 (priority: 0, 
> waiting: none)
> Dec 20 09:36:24 [57862] swdal1-ISCSI01       crmd:   notice: 
> abort_transition_graph:    Transition aborted: Action lost 
> (source=action_timer_callback:772, 0)
> Dec 20 09:36:24 [57862] swdal1-ISCSI01       crmd:   notice: 
> run_graph:         Transition 19 (Complete=1, Pending=0, Fired=0, 
> Skipped=12, Incomplete=2, 
> Source=/var/lib/pacemaker/pengine/pe-input-519.bz2): Stopped
> 
> As you can see the 'stop' operation for resource Pool-0 did not even 
> run, as the transition was stopped by the clear_failcount error. This 
> error kept happening until we restarted pacemaker. Here is some more 
> context from one of the times this error has happened:
> 
> Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info: 
> process_pe_message:        Input has not changed since last time, not 
> saving to disk
> Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:   notice: 
> unpack_config:     On loss of CCM Quorum: Ignore
> Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info: 
> determine_online_status:   Node swdal1-ISCSI01 is online
> Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info: 
> determine_online_status:   Node swdal1-ISCSI02 is online
> Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info: 
> determine_op_status:       Operation monitor found resource Pool-0 
> active on swdal1-ISCSI01
> Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info: 
> determine_op_status:       Operation monitor found resource Pool-0 
> active on swdal1-ISCSI01
> Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info: 
> determine_op_status:       Operation monitor found resource Pool-1 
> active on swdal1-ISCSI01
> Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info: 
> determine_op_status:       Operation monitor found resource Pool-1 
> active on swdal1-ISCSI01
> Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info: 
> clone_print:        Clone Set: ping_resource-clone [ping_resource]
> Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info: 
> short_print:            Started: [ swdal1-ISCSI01 swdal1-ISCSI02 ]
> Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info: 
> native_print:      Pool-1  (ocf::oe:zfs):  Started swdal1-ISCSI01
> Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info: 
> native_print:      Pool-0  (ocf::oe:zfs):  Started swdal1-ISCSI01
> Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info: 
> get_failcount_full:        ping_resource:0 has failed 8 times on 
> swdal1-ISCSI01
> Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info: 
> common_apply_stickiness:   ping_resource-clone can fail 999992 more 
> times on swdal1-ISCSI01 before being forced off
> Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info: 
> get_failcount_full:        ping_resource:1 has failed 8 times on 
> swdal1-ISCSI01
> Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info: 
> common_apply_stickiness:   ping_resource-clone can fail 999992 more 
> times on swdal1-ISCSI01 before being forced off
> Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info: 
> check_action_definition:   params:reload   <parameters multiplier="1000" 
> dampen="15s" host_list="10.151.17.50 10.151.16.50 10.151.17.60 
> 10.151.16.60" attempts="4" timeout="3"/>
> Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info: 
> check_action_definition:   Parameters to ping_resource:0_start_0 on 
> swdal1-ISCSI01 changed: was 57524cd0b7204dd60c127ba66fb83cd2 vs. now 
> 1a37c0e0391890df8549f5fda647f4d9 (reload:3.0.9) 
> 0:0;14:28:0:a0f1b96e-5089-4dad-9073-8c8feac4ea3a
> Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info: 
> get_failcount_full:        ping_resource:0 has failed 8 times on 
> swdal1-ISCSI01
> Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info: 
> check_action_definition:   params:reload   <parameters multiplier="1000" 
> dampen="15s" host_list="10.151.17.50 10.151.16.50 10.151.17.60 
> 10.151.16.60" attempts="4" timeout="3" CRM_meta_timeout="15000"/>
> Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info: 
> check_action_definition:   Parameters to ping_resource:0_monitor_5000 on 
> swdal1-ISCSI01 changed: was f3b4adf4d46692f312296263faa50a75 vs. now 
> c0d10fc8996c295dd1213d4ca058c0e7 (reload:3.0.9) 
> 0:0;15:28:0:a0f1b96e-5089-4dad-9073-8c8feac4ea3a
> Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info: 
> get_failcount_full:        ping_resource:0 has failed 8 times on 
> swdal1-ISCSI01
> Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info: 
> check_action_definition:   params:reload   <parameters multiplier="1000" 
> dampen="15s" host_list="10.151.17.50 10.151.16.50 10.151.17.60 
> 10.151.16.60" attempts="4" timeout="3"/>
> Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info: 
> check_action_definition:   Parameters to ping_resource:1_start_0 on 
> swdal1-ISCSI02 changed: was 57524cd0b7204dd60c127ba66fb83cd2 vs. now 
> 1a37c0e0391890df8549f5fda647f4d9 (reload:3.0.9) 
> 0:0;17:7:0:0ea53274-56ef-48f6-9de1-38d635fa2530
> Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info: 
> check_action_definition:   params:reload   <parameters multiplier="1000" 
> dampen="15s" host_list="10.151.17.50 10.151.16.50 10.151.17.60 
> 10.151.16.60" attempts="4" timeout="3" CRM_meta_timeout="15000"/>
> Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info: 
> check_action_definition:   Parameters to ping_resource:1_monitor_5000 on 
> swdal1-ISCSI02 changed: was f3b4adf4d46692f312296263faa50a75 vs. now 
> c0d10fc8996c295dd1213d4ca058c0e7 (reload:3.0.9) 
> 0:0;18:7:0:0ea53274-56ef-48f6-9de1-38d635fa2530
> Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info: 
> RecurringOp:        Start recurring monitor (5s) for ping_resource:0 on 
> swdal1-ISCSI01
> Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info: 
> RecurringOp:        Start recurring monitor (5s) for ping_resource:1 on 
> swdal1-ISCSI02
> Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:   notice: 
> LogActions:        Restart ping_resource:0 (Started swdal1-ISCSI01)
> Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:   notice: 
> LogActions:        Restart ping_resource:1 (Started swdal1-ISCSI02)
> Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info: 
> LogActions:        Leave   Pool-1  (Started swdal1-ISCSI01)
> Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:     info: 
> LogActions:        Leave   Pool-0  (Started swdal1-ISCSI01)
> Dec 20 09:24:23 [57851] swdal1-ISCSI01    pengine:   notice: 
> process_pe_message:        Calculated Transition 11: 
> /var/lib/pacemaker/pengine/pe-input-518.bz2
> Dec 20 09:24:23 [57862] swdal1-ISCSI01       crmd:     info: 
> do_state_transition:       State transition S_POLICY_ENGINE -> 
> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE 
> origin=handle_response ]
> Dec 20 09:24:23 [57862] swdal1-ISCSI01       crmd:     info: 
> do_te_invoke:      Processing graph 11 (ref=pe_calc-dc-1671528262-59) 
> derived from /var/lib/pacemaker/pengine/pe-input-518.bz2
> Dec 20 09:24:23 [57862] swdal1-ISCSI01       crmd:     info: 
> te_crm_command:    Executing crm-event (7): clear_failcount on 
> swdal1-ISCSI01
> Dec 20 09:24:23 [57862] swdal1-ISCSI01       crmd:     info: 
> handle_failcount_op:       Removing failcount for ping_resource
> Dec 20 09:24:23 [57841] swdal1-ISCSI01        cib:     info: 
> cib_process_request:       Forwarding cib_delete operation for section 
> //node_state[@uname='swdal1-ISCSI01']//lrm_resource[@id='ping_resource']/lrm
> _rsc_op[@id='ping_resource_last_failure_0'] 
> to master (origin=local/crmd/118)
> Dec 20 09:24:23 [57841] swdal1-ISCSI01        cib:     info: 
> cib_process_request:       Completed cib_delete operation for section 
> //node_state[@uname='swdal1-ISCSI01']//lrm_resource[@id='ping_resource']/lrm
> _rsc_op[@id='ping_resource_last_failure_0']: 
> OK (rc=0, origin=swdal1-ISCSI01/crmd/118, version=0.60.0)
> Dec 20 09:24:28 [57841] swdal1-ISCSI01        cib:     info: 
> cib_process_ping:  Reporting our current digest to swdal1-ISCSI01: 
> ccf71244504d3deb02d0da64fa72cedc for 0.60.0 (0x55788a83c4b0 0)
> Dec 20 09:25:43 [57862] swdal1-ISCSI01       crmd:  warning: 
> action_timer_callback:     Timer popped (timeout=20000, abort_level=0, 
> complete=false)
> Dec 20 09:25:43 [57862] swdal1-ISCSI01       crmd:    error: 
> print_synapse:     [Action    7]: In-flight crm op 
> ping_resource_clear_failcount_0   on swdal1-ISCSI01 (priority: 0, 
> waiting: none)
> Dec 20 09:25:43 [57862] swdal1-ISCSI01       crmd:   notice: 
> abort_transition_graph:    Transition aborted: Action lost 
> (source=action_timer_callback:772, 0)
> Dec 20 09:25:43 [57862] swdal1-ISCSI01       crmd:   notice: 
> run_graph:         Transition 11 (Complete=1, Pending=0, Fired=0, 
> Skipped=9, Incomplete=2, 
> Source=/var/lib/pacemaker/pengine/pe-input-518.bz2): Stopped
> Dec 20 09:25:43 [57862] swdal1-ISCSI01       crmd:     info: 
> do_state_transition:       State transition S_TRANSITION_ENGINE -> 
> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=notify_crmd ]
> 
> I'd appreciate some information about this topic.
> 
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users 
> 
> ClusterLabs home: https://www.clusterlabs.org/