<div dir="ltr"><div>Hi Ken,<br><br>I rolled the settings back to 100:100 scores without ping and ran the simulation again.<br>I checked pacemaker.log, and the only meaningful entries are the following, which still don't make sense to me:<br>Actions: Stop OST4 ( lustre4 ) blocked<br>crit: Cannot fence lustre4 because of OST4: blocked (OST4_stop_0)<br><br><br><br>Entries in pacemaker.log of the 1st cluster node (lustre-mgs):<br>Dec 19 09:48:13 <a href="http://lustre-mgs.ntslab.ru">lustre-mgs.ntslab.ru</a> pacemaker-based [3833057] (log_info) info: ++ /cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resource[@id='lustre4']: <lrm_rsc_op id="lustre4_last_failure_0" operation_key="lustre4_monitor_5000" operation="monitor" crm-debug-origin="controld_update_resource_history" crm_feature_set="3.17.4" transition-key="48:2:0:c84c4c30-a2cb-4e2b-a6e8-98c14e15e390" transition-magic="2:1;48:2:0:c84c4c30-a2cb-4e2b-a6e8-98c14e15e390" exit-reason="Remote executor did<br>Dec 19 09:48:13 <a href="http://lustre-mgs.ntslab.ru">lustre-mgs.ntslab.ru</a> pacemaker-based [3833057] (log_info) info: ++ /cib/status/node_state[@id='lustre4']/lrm[@id='lustre4']/lrm_resources/lrm_resource[@id='OST4']: <lrm_rsc_op id="OST4_last_failure_0" operation_key="OST4_monitor_20000" operation="monitor" crm-debug-origin="controld_update_resource_history" crm_feature_set="3.17.4" transition-key="52:2:0:c84c4c30-a2cb-4e2b-a6e8-98c14e15e390" transition-magic="8:1;52:2:0:c84c4c30-a2cb-4e2b-a6e8-98c14e15e390" exit-reason="Action was pend<br>Dec 19 09:48:13 <a href="http://lustre-mgs.ntslab.ru">lustre-mgs.ntslab.ru</a> pacemaker-attrd [3833060] (update_attr_on_host) notice: Setting last-failure-lustre4#monitor_5000[lustre-mds1] in instance_attributes: (unset) -> 1702968493 | from lustre-mds2 with no write delay<br>Dec 19 09:48:13 <a href="http://lustre-mgs.ntslab.ru">lustre-mgs.ntslab.ru</a> pacemaker-attrd [3833060] (update_attr_on_host) notice: Setting fail-count-lustre4#monitor_5000[lustre-mds1] in 
instance_attributes: (unset) -> 1 | from lustre-mds2 with no write delay<br>Dec 19 09:48:13 <a href="http://lustre-mgs.ntslab.ru">lustre-mgs.ntslab.ru</a> pacemaker-attrd [3833060] (update_attr_on_host) notice: Setting last-failure-OST4#monitor_20000[lustre4] in instance_attributes: (unset) -> 1702968493 | from lustre-mds2 with no write delay<br>Dec 19 09:48:13 <a href="http://lustre-mgs.ntslab.ru">lustre-mgs.ntslab.ru</a> pacemaker-attrd [3833060] (update_attr_on_host) notice: Setting fail-count-OST4#monitor_20000[lustre4] in instance_attributes: (unset) -> 1 | from lustre-mds2 with no write delay<br>Dec 19 09:48:13 <a href="http://lustre-mgs.ntslab.ru">lustre-mgs.ntslab.ru</a> pacemaker-controld [3833062] (update_peer_state_iter) notice: Node lustre4 state is now lost | nodeid=0 previous=member source=handle_remote_state<br>Dec 19 09:48:13 <a href="http://lustre-mgs.ntslab.ru">lustre-mgs.ntslab.ru</a> pacemaker-attrd [3833060] (attrd_peer_remove) notice: Removing all lustre4 attributes for peer lustre-mds1<br>Dec 19 09:48:13 <a href="http://lustre-mgs.ntslab.ru">lustre-mgs.ntslab.ru</a> pacemaker-controld [3833062] (peer_update_callback) info: Remote node lustre4 is now lost (was member)<br>Dec 19 09:48:13 <a href="http://lustre-mgs.ntslab.ru">lustre-mgs.ntslab.ru</a> pacemaker-attrd [3833060] (reap_crm_member) info: No peers with id=0 and/or uname=lustre4 to purge from the membership cache<br>Dec 19 09:48:13 <a href="http://lustre-mgs.ntslab.ru">lustre-mgs.ntslab.ru</a> pacemaker-based [3833057] (log_info) info: + /cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resource[@id='lustre4']/lrm_rsc_op[@id='lustre4_last_0']: @operation_key=lustre4_stop_0, @operation=stop, @transition-key=9:5:0:c84c4c30-a2cb-4e2b-a6e8-98c14e15e390, @transition-magic=-1:193;9:5:0:c84c4c30-a2cb-4e2b-a6e8-98c14e15e390, @call-id=-1, @rc-code=193, @op-status=-1, @last-rc-change=1702968493<br>Dec 19 09:48:13 <a href="http://lustre-mgs.ntslab.ru">lustre-mgs.ntslab.ru</a> 
pacemaker-based [3833057] (cib_process_request) info: Completed cib_delete operation for section //node_state[@uname='lustre4']/transient_attributes: OK (rc=0, origin=lustre-mds1/crmd/352, version=0.467.217)<br>Dec 19 09:48:13 <a href="http://lustre-mgs.ntslab.ru">lustre-mgs.ntslab.ru</a> pacemaker-based [3833057] (log_info) info: + /cib/status/node_state[@id='lustre4']: @in_ccm=false, @crm-debug-origin=remote_node_down<br>>>>> The logs are silent for 1 MIN !!! And there is nothing regarding "OST4" later in the logs !!!<br>Dec 19 09:49:14 <a href="http://lustre-mgs.ntslab.ru">lustre-mgs.ntslab.ru</a> pacemaker-based [3833057] (cib_process_request) info: Completed cib_delete operation for section /cib/status/node_state[@uname='lustre-mds1']/lrm/lrm_resources/lrm_resource[@id='lustre4']/lrm_rsc_op[@id='lustre4_last_failure_0']: OK (rc=0, origin=lustre-mds1/crmd/367, version=0.467.221)<br>Dec 19 09:49:14 <a href="http://lustre-mgs.ntslab.ru">lustre-mgs.ntslab.ru</a> pacemaker-based [3833057] (log_info) info: -- /cib/status/node_state[@id='2']/transient_attributes[@id='2']/instance_attributes[@id='status-2']/nvpair[@id='status-2-last-failure-lustre4.monitor_5000']<br>Dec 19 09:49:14 <a href="http://lustre-mgs.ntslab.ru">lustre-mgs.ntslab.ru</a> pacemaker-based [3833057] (log_info) info: -- /cib/status/node_state[@id='2']/transient_attributes[@id='2']/instance_attributes[@id='status-2']/nvpair[@id='status-2-fail-count-lustre4.monitor_5000']<br>...<br><br>Perhaps the lustre4 RA was running on the lustre-mds1 cluster node, so its log is below (the same 1 min silence following the node failure):<br>Dec 19 09:48:13 <a href="http://lustre-mds1.ntslab.ru">lustre-mds1.ntslab.ru</a> pacemaker-controld [2457591] (monitor_timeout_cb) info: Timed out waiting for remote poke response from lustre4<br>Dec 19 09:48:13 <a href="http://lustre-mds1.ntslab.ru">lustre-mds1.ntslab.ru</a> pacemaker-controld [2457591] (log_executor_event) error: Result of monitor operation for lustre4 on lustre-mds1: Timed Out after 10s 
(Remote executor did not respond) | graph action unconfirmed; call=8 key=lustre4_monitor_5000<br>Dec 19 09:48:13 <a href="http://lustre-mds1.ntslab.ru">lustre-mds1.ntslab.ru</a> pacemaker-controld [2457591] (remote_lrm_op_callback) error: Lost connection to Pacemaker Remote node lustre4<br>Dec 19 09:48:13 <a href="http://lustre-mds1.ntslab.ru">lustre-mds1.ntslab.ru</a> pacemaker-controld [2457591] (log_executor_event) error: Result of monitor operation for OST4 on lustre4: Internal communication failure (Action was pending when executor connection was dropped) | graph action confirmed; call=28122 key=OST4_monitor_20000<br>Dec 19 09:48:13 <a href="http://lustre-mds1.ntslab.ru">lustre-mds1.ntslab.ru</a> pacemaker-based [2457586] (log_info) info: ++ /cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resource[@id='lustre4']: <lrm_rsc_op id="lustre4_last_failure_0" operation_key="lustre4_monitor_5000" operation="monitor" crm-debug-origin="controld_update_resource_history" crm_feature_set="3.17.4" transition-key="48:2:0:c84c4c30-a2cb-4e2b-a6e8-98c14e15e390" transition-magic="2:1;48:2:0:c84c4c30-a2cb-4e2b-a6e8-98c14e15e390" exit-reason="Remote executor di<br>Dec 19 09:48:13 <a href="http://lustre-mds1.ntslab.ru">lustre-mds1.ntslab.ru</a> pacemaker-based [2457586] (log_info) info: ++ /cib/status/node_state[@id='lustre4']/lrm[@id='lustre4']/lrm_resources/lrm_resource[@id='OST4']: <lrm_rsc_op id="OST4_last_failure_0" operation_key="OST4_monitor_20000" operation="monitor" crm-debug-origin="controld_update_resource_history" crm_feature_set="3.17.4" transition-key="52:2:0:c84c4c30-a2cb-4e2b-a6e8-98c14e15e390" transition-magic="8:1;52:2:0:c84c4c30-a2cb-4e2b-a6e8-98c14e15e390" exit-reason="Action was pen<br>Dec 19 09:48:13 <a href="http://lustre-mds1.ntslab.ru">lustre-mds1.ntslab.ru</a> pacemaker-attrd [2457589] (update_attr_on_host) notice: Setting last-failure-OST4#monitor_20000[lustre4] in instance_attributes: (unset) -> 1702968493 | from lustre-mds2 with no write 
delay<br>Dec 19 09:48:13 <a href="http://lustre-mds1.ntslab.ru">lustre-mds1.ntslab.ru</a> pacemaker-attrd [2457589] (attrd_write_attribute) info: Sent CIB request 9 with 1 change for last-failure-OST4#monitor_20000 (id n/a, set n/a)<br>Dec 19 09:48:13 <a href="http://lustre-mds1.ntslab.ru">lustre-mds1.ntslab.ru</a> pacemaker-attrd [2457589] (update_attr_on_host) notice: Setting fail-count-OST4#monitor_20000[lustre4] in instance_attributes: (unset) -> 1 | from lustre-mds2 with no write delay<br>Dec 19 09:48:13 <a href="http://lustre-mds1.ntslab.ru">lustre-mds1.ntslab.ru</a> pacemaker-attrd [2457589] (attrd_write_attribute) info: Sent CIB request 10 with 1 change for fail-count-OST4#monitor_20000 (id n/a, set n/a)<br>Dec 19 09:48:13 <a href="http://lustre-mds1.ntslab.ru">lustre-mds1.ntslab.ru</a> pacemaker-based [2457586] (log_info) info: ++ <nvpair id="status-lustre4-last-failure-OST4.monitor_20000" name="last-failure-OST4#monitor_20000" value="1702968493"/><br>Dec 19 09:48:13 <a href="http://lustre-mds1.ntslab.ru">lustre-mds1.ntslab.ru</a> pacemaker-attrd [2457589] (attrd_cib_callback) info: CIB update 9 result for last-failure-OST4#monitor_20000: OK | rc=0<br>Dec 19 09:48:13 <a href="http://lustre-mds1.ntslab.ru">lustre-mds1.ntslab.ru</a> pacemaker-attrd [2457589] (attrd_cib_callback) info: * last-failure-OST4#monitor_20000[lustre4]=1702968493<br>Dec 19 09:48:13 <a href="http://lustre-mds1.ntslab.ru">lustre-mds1.ntslab.ru</a> pacemaker-based [2457586] (log_info) info: ++ /cib/status/node_state[@id='lustre4']/transient_attributes[@id='lustre4']/instance_attributes[@id='status-lustre4']: <nvpair id="status-lustre4-fail-count-OST4.monitor_20000" name="fail-count-OST4#monitor_20000" value="1"/><br>Dec 19 09:48:13 <a href="http://lustre-mds1.ntslab.ru">lustre-mds1.ntslab.ru</a> pacemaker-attrd [2457589] (attrd_cib_callback) info: CIB update 10 result for fail-count-OST4#monitor_20000: OK | rc=0<br>Dec 19 09:48:13 <a 
href="http://lustre-mds1.ntslab.ru">lustre-mds1.ntslab.ru</a> pacemaker-attrd [2457589] (attrd_cib_callback) info: * fail-count-OST4#monitor_20000[lustre4]=1<br>Again, the OST4 resource is not mentioned except in the first seconds of the failure; there are no logged attempts to restart it elsewhere.<br><br>The last pacemaker.log, from the 3rd cluster node, shows the same 1 min silence (09:48:13 - 09:49:14), but this time there are more entries regarding OST4:<br>[root@lustre-mds2 ~]# grep OST4 /var/log/pacemaker/pacemaker.log<br>Dec 19 09:48:13 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-based [785103] (log_info) info: ++ /cib/status/node_state[@id='lustre4']/lrm[@id='lustre4']/lrm_resources/lrm_resource[@id='OST4']: <lrm_rsc_op id="OST4_last_failure_0" operation_key="OST4_monitor_20000" operation="monitor" crm-debug-origin="controld_update_resource_history" crm_feature_set="3.17.4" transition-key="52:2:0:c84c4c30-a2cb-4e2b-a6e8-98c14e15e390" transition-magic="8:1;52:2:0:c84c4c30-a2cb-4e2b-a6e8-98c14e15e390" exit-reason="Action was pend<br>Dec 19 09:48:13 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-controld [785108] (abort_transition_graph) info: Transition 3 aborted by operation OST4_monitor_20000 'create' on lustre-mds1: Change in recurring result | magic=8:1;52:2:0:c84c4c30-a2cb-4e2b-a6e8-98c14e15e390 cib=0.467.208 source=process_graph_event:500 complete=true<br>Dec 19 09:48:13 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-controld [785108] (update_failcount) info: Updating failcount for OST4 on lustre4 after failed monitor: rc=1 (update=value++, time=1702968493)<br>Dec 19 09:48:13 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-attrd [785106] (handle_value_expansion) info: Expanded fail-count-OST4#monitor_20000=value++ to 1<br>Dec 19 09:48:13 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-controld [785108] (process_graph_event) notice: Transition 2 action 52 
(OST4_monitor_20000 on lustre-mds1): expected 'ok' but got 'error' | target-rc=0 rc=1 call-id=28122 event='arrived after initial scheduling'<br>Dec 19 09:48:13 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-attrd [785106] (update_attr_on_host) notice: Setting last-failure-OST4#monitor_20000[lustre4] in instance_attributes: (unset) -> 1702968493 | from lustre-mds2 with no write delay<br>Dec 19 09:48:13 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-attrd [785106] (update_attr_on_host) notice: Setting fail-count-OST4#monitor_20000[lustre4] in instance_attributes: (unset) -> 1 | from lustre-mds2 with no write delay<br>Dec 19 09:48:13 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-based [785103] (log_info) info: ++ <nvpair id="status-lustre4-last-failure-OST4.monitor_20000" name="last-failure-OST4#monitor_20000" value="1702968493"/><br>Dec 19 09:48:13 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-based [785103] (log_info) info: ++ /cib/status/node_state[@id='lustre4']/transient_attributes[@id='lustre4']/instance_attributes[@id='status-lustre4']: <nvpair id="status-lustre4-fail-count-OST4.monitor_20000" name="fail-count-OST4#monitor_20000" value="1"/><br>Dec 19 09:48:13 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-controld [785108] (abort_transition_graph) info: Transition 3 aborted by status-lustre4-fail-count-OST4.monitor_20000 doing create fail-count-OST4#monitor_20000=1: Transient attribute change | cib=0.467.213 source=abort_unless_down:297 path=/cib/status/node_state[@id='lustre4']/transient_attributes[@id='lustre4']/instance_attributes[@id='status-lustre4'] complete=true<br>Dec 19 09:48:13 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (unpack_rsc_op_failure) warning: Unexpected result (error: Action was pending when executor connection was dropped) was recorded for monitor of 
OST4 on lustre4 at Dec 19 09:37:12 2023 | exit-status=1 id=OST4_last_failure_0<br>Dec 19 09:48:13 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (pe_get_failcount) info: OST4 has failed 1 time on lustre4<br>Dec 19 09:48:13 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (pcmk__threshold_reached) info: OST4 can fail 999999 more times on lustre4 before reaching migration threshold (1000000)<br>Dec 19 09:48:13 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (pcmk__unassign_resource) info: Unassigning OST4<br>Dec 19 09:48:13 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (update_resource_action_runnable) warning: OST4_stop_0 on lustre4 is unrunnable (node is offline)<br>Dec 19 09:48:13 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (recurring_op_for_active) info: Start 20s-interval monitor for OST4 on lustre3<br>Dec 19 09:48:13 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (log_list_item) notice: Actions: Stop OST4 ( lustre4 ) blocked<br>Dec 19 09:48:13 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (pcmk__create_graph) crit: Cannot fence lustre4 because of OST4: blocked (OST4_stop_0)<br>Dec 19 09:48:13 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (unpack_rsc_op_failure) warning: Unexpected result (error: Action was pending when executor connection was dropped) was recorded for monitor of OST4 on lustre4 at Dec 19 09:37:12 2023 | exit-status=1 id=OST4_last_failure_0<br>Dec 19 09:48:13 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (pe_get_failcount) info: OST4 has failed 1 time on lustre4<br>Dec 19 09:48:13 <a 
href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (pcmk__threshold_reached) info: OST4 can fail 999999 more times on lustre4 before reaching migration threshold (1000000)<br>Dec 19 09:48:13 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (pcmk__unassign_resource) info: Unassigning OST4<br>Dec 19 09:48:13 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (update_resource_action_runnable) warning: OST4_stop_0 on lustre4 is unrunnable (node is offline)<br>Dec 19 09:48:13 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (recurring_op_for_active) info: Start 20s-interval monitor for OST4 on lustre3<br>Dec 19 09:48:13 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (log_list_item) notice: Actions: Stop OST4 ( lustre4 ) blocked<br>Dec 19 09:48:13 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (pcmk__create_graph) crit: Cannot fence lustre4 because of OST4: blocked (OST4_stop_0)<br>Dec 19 09:49:14 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (pcmk__unassign_resource) info: Unassigning OST4<br>Dec 19 09:49:14 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (recurring_op_for_active) info: Start 20s-interval monitor for OST4 on lustre4<br>Dec 19 09:49:14 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (log_list_item) notice: Actions: Start OST4 ( lustre4 )<br>Dec 19 09:49:17 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (unpack_rsc_op_failure) warning: Unexpected result (error: Action was pending when executor connection was dropped) was recorded for monitor of OST4 on lustre4 at Dec 19 09:37:12 2023 | exit-status=1 
id=OST4_last_failure_0<br>Dec 19 09:49:17 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (pcmk__unassign_resource) info: Unassigning OST4<br>Dec 19 09:49:17 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (recurring_op_for_active) info: Start 20s-interval monitor for OST4 on lustre4<br>Dec 19 09:49:17 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (rsc_action_default) info: Leave OST4 (Started lustre4)<br>Dec 19 09:49:17 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (unpack_rsc_op_failure) warning: Unexpected result (error: Action was pending when executor connection was dropped) was recorded for monitor of OST4 on lustre4 at Dec 19 09:37:12 2023 | exit-status=1 id=OST4_last_failure_0<br>Dec 19 09:49:17 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (pcmk__unassign_resource) info: Unassigning OST4<br>Dec 19 09:49:17 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (recurring_op_for_active) info: Start 20s-interval monitor for OST4 on lustre4<br>Dec 19 09:49:17 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (rsc_action_default) info: Leave OST4 (Started lustre4)<br>Dec 19 09:49:22 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (unpack_rsc_op_failure) warning: Unexpected result (error: Action was pending when executor connection was dropped) was recorded for monitor of OST4 on lustre4 at Dec 19 09:37:12 2023 | exit-status=1 id=OST4_last_failure_0<br>Dec 19 09:49:22 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (pcmk__unassign_resource) info: Unassigning OST4<br>Dec 19 09:49:22 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> 
pacemaker-schedulerd[785107] (recurring_op_for_active) info: Start 20s-interval monitor for OST4 on lustre4<br>Dec 19 09:49:22 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (rsc_action_default) info: Leave OST4 (Started lustre4)<br>Dec 19 09:49:22 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (unpack_rsc_op_failure) warning: Unexpected result (error: Action was pending when executor connection was dropped) was recorded for monitor of OST4 on lustre4 at Dec 19 09:37:12 2023 | exit-status=1 id=OST4_last_failure_0<br>Dec 19 09:49:22 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (pcmk__unassign_resource) info: Unassigning OST4<br>Dec 19 09:49:22 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (recurring_op_for_active) info: Start 20s-interval monitor for OST4 on lustre4<br>Dec 19 09:49:22 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (rsc_action_default) info: Leave OST4 (Started lustre4)<br>Dec 19 09:49:27 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (unpack_rsc_op_failure) warning: Unexpected result (error: Action was pending when executor connection was dropped) was recorded for monitor of OST4 on lustre4 at Dec 19 09:37:12 2023 | exit-status=1 id=OST4_last_failure_0<br>Dec 19 09:49:27 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (pcmk__unassign_resource) info: Unassigning OST4<br>Dec 19 09:49:27 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (recurring_op_for_active) info: Start 20s-interval monitor for OST4 on lustre4<br>Dec 19 09:49:27 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (rsc_action_default) info: Leave OST4 (Started 
lustre4)<br>Dec 19 09:49:27 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (unpack_rsc_op_failure) warning: Unexpected result (error: Action was pending when executor connection was dropped) was recorded for monitor of OST4 on lustre4 at Dec 19 09:37:12 2023 | exit-status=1 id=OST4_last_failure_0<br>Dec 19 09:49:27 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (pcmk__unassign_resource) info: Unassigning OST4<br>Dec 19 09:49:27 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (update_resource_action_runnable) warning: OST4_stop_0 on lustre4 is unrunnable (node is offline)<br>Dec 19 09:49:27 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (recurring_op_for_active) info: Start 20s-interval monitor for OST4 on lustre3<br>Dec 19 09:49:27 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (log_list_item) notice: Actions: Stop OST4 ( lustre4 ) blocked<br>Dec 19 09:49:27 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (pcmk__create_graph) crit: Cannot fence lustre4 because of OST4: blocked (OST4_stop_0)<br>Dec 19 09:50:18 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (pcmk__unassign_resource) info: Unassigning OST4<br>Dec 19 09:50:18 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (recurring_op_for_active) info: Start 20s-interval monitor for OST4 on lustre4<br>Dec 19 09:50:18 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (log_list_item) notice: Actions: Start OST4 ( lustre4 )<br>Dec 19 09:50:21 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (unpack_rsc_op_failure) warning: Unexpected result (error: Action was pending 
when executor connection was dropped) was recorded for monitor of OST4 on lustre4 at Dec 19 09:37:12 2023 | exit-status=1 id=OST4_last_failure_0<br>Dec 19 09:50:21 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (pcmk__unassign_resource) info: Unassigning OST4<br>Dec 19 09:50:21 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (recurring_op_for_active) info: Start 20s-interval monitor for OST4 on lustre4<br>Dec 19 09:50:21 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (rsc_action_default) info: Leave OST4 (Started lustre4)<br>Dec 19 09:50:21 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (unpack_rsc_op_failure) warning: Unexpected result (error: Action was pending when executor connection was dropped) was recorded for monitor of OST4 on lustre4 at Dec 19 09:37:12 2023 | exit-status=1 id=OST4_last_failure_0<br><br><br><br><br>Do I need to set migration-threshold=1? None of the basic guides mention it. 
Why, if it might be critical for failover?<br><br><br>OK, let's try it this way:<br>[root@lustre-mgs ~]# pcs resource update OST3 meta migration-threshold=1 failure-timeout=180s<br>Warning: Agent 'ocf:lustre:Lustre' implements unsupported OCF version '1.0.1', supported versions are: '1.0', '1.1'; assumed version '1.0'<br>[root@lustre-mgs ~]# pcs resource update OST4 meta migration-threshold=1 failure-timeout=180s<br>Warning: Agent 'ocf:lustre:Lustre' implements unsupported OCF version '1.0.1', supported versions are: '1.0', '1.1'; assumed version '1.0'<br><br>At 10:26:01 VM lustre4 was switched off.<br>The result is the same - OST4 didn't restart on lustre3<br>[root@lustre-mgs ~]# grep OST4 /var/log/pacemaker/pacemaker.log<br>Dec 19 10:26:13 <a href="http://lustre-mgs.ntslab.ru">lustre-mgs.ntslab.ru</a> pacemaker-based [3833057] (log_info) info: ++ /cib/status/node_state[@id='lustre4']/lrm[@id='lustre4']/lrm_resources/lrm_resource[@id='OST4']: <lrm_rsc_op id="OST4_last_failure_0" operation_key="OST4_monitor_20000" operation="monitor" crm-debug-origin="controld_update_resource_history" crm_feature_set="3.17.4" transition-key="42:48:0:c84c4c30-a2cb-4e2b-a6e8-98c14e15e390" transition-magic="8:1;42:48:0:c84c4c30-a2cb-4e2b-a6e8-98c14e15e390" exit-reason="Action was pe<br>Dec 19 10:26:13 <a href="http://lustre-mgs.ntslab.ru">lustre-mgs.ntslab.ru</a> pacemaker-attrd [3833060] (update_attr_on_host) notice: Setting last-failure-OST4#monitor_20000[lustre4] in instance_attributes: (unset) -> 1702970773 | from lustre-mds2 with no write delay<br>Dec 19 10:26:13 <a href="http://lustre-mgs.ntslab.ru">lustre-mgs.ntslab.ru</a> pacemaker-attrd [3833060] (update_attr_on_host) notice: Setting fail-count-OST4#monitor_20000[lustre4] in instance_attributes: (unset) -> 1 | from lustre-mds2 with no write delay<br>Dec 19 10:26:13 <a href="http://lustre-mgs.ntslab.ru">lustre-mgs.ntslab.ru</a> pacemaker-based [3833057] (log_info) info: ++ 
/cib/status/node_state[@id='lustre4']/transient_attributes[@id='lustre4']/instance_attributes[@id='status-lustre4']: <nvpair id="status-lustre4-last-failure-OST4.monitor_20000" name="last-failure-OST4#monitor_20000" value="1702970773"/><br>Dec 19 10:26:13 <a href="http://lustre-mgs.ntslab.ru">lustre-mgs.ntslab.ru</a> pacemaker-based [3833057] (log_info) info: ++ /cib/status/node_state[@id='lustre4']/transient_attributes[@id='lustre4']/instance_attributes[@id='status-lustre4']: <nvpair id="status-lustre4-fail-count-OST4.monitor_20000" name="fail-count-OST4#monitor_20000" value="1"/><br>[root@lustre-mds1 ~]# grep OST4 /var/log/pacemaker/pacemaker.log<br>Dec 19 10:26:13 <a href="http://lustre-mds1.ntslab.ru">lustre-mds1.ntslab.ru</a> pacemaker-controld [2457591] (log_executor_event) error: Result of monitor operation for OST4 on lustre4: Internal communication failure (Action was pending when executor connection was dropped) | graph action confirmed; call=74 key=OST4_monitor_20000<br>Dec 19 10:26:13 <a href="http://lustre-mds1.ntslab.ru">lustre-mds1.ntslab.ru</a> pacemaker-based [2457586] (log_info) info: ++ /cib/status/node_state[@id='lustre4']/lrm[@id='lustre4']/lrm_resources/lrm_resource[@id='OST4']: <lrm_rsc_op id="OST4_last_failure_0" operation_key="OST4_monitor_20000" operation="monitor" crm-debug-origin="controld_update_resource_history" crm_feature_set="3.17.4" transition-key="42:48:0:c84c4c30-a2cb-4e2b-a6e8-98c14e15e390" transition-magic="8:1;42:48:0:c84c4c30-a2cb-4e2b-a6e8-98c14e15e390" exit-reason="Action was p<br>Dec 19 10:26:13 <a href="http://lustre-mds1.ntslab.ru">lustre-mds1.ntslab.ru</a> pacemaker-attrd [2457589] (update_attr_on_host) notice: Setting last-failure-OST4#monitor_20000[lustre4] in instance_attributes: (unset) -> 1702970773 | from lustre-mds2 with no write delay<br>Dec 19 10:26:13 <a href="http://lustre-mds1.ntslab.ru">lustre-mds1.ntslab.ru</a> pacemaker-attrd [2457589] (attrd_write_attribute) info: Sent CIB request 80 with 1 change for 
last-failure-OST4#monitor_20000 (id n/a, set n/a)<br>Dec 19 10:26:13 <a href="http://lustre-mds1.ntslab.ru">lustre-mds1.ntslab.ru</a> pacemaker-attrd [2457589] (update_attr_on_host) notice: Setting fail-count-OST4#monitor_20000[lustre4] in instance_attributes: (unset) -> 1 | from lustre-mds2 with no write delay<br>Dec 19 10:26:13 <a href="http://lustre-mds1.ntslab.ru">lustre-mds1.ntslab.ru</a> pacemaker-attrd [2457589] (attrd_write_attribute) info: Sent CIB request 81 with 1 change for fail-count-OST4#monitor_20000 (id n/a, set n/a)<br>Dec 19 10:26:13 <a href="http://lustre-mds1.ntslab.ru">lustre-mds1.ntslab.ru</a> pacemaker-based [2457586] (log_info) info: ++ /cib/status/node_state[@id='lustre4']/transient_attributes[@id='lustre4']/instance_attributes[@id='status-lustre4']: <nvpair id="status-lustre4-last-failure-OST4.monitor_20000" name="last-failure-OST4#monitor_20000" value="1702970773"/><br>Dec 19 10:26:13 <a href="http://lustre-mds1.ntslab.ru">lustre-mds1.ntslab.ru</a> pacemaker-attrd [2457589] (attrd_cib_callback) info: CIB update 80 result for last-failure-OST4#monitor_20000: OK | rc=0<br>Dec 19 10:26:13 <a href="http://lustre-mds1.ntslab.ru">lustre-mds1.ntslab.ru</a> pacemaker-attrd [2457589] (attrd_cib_callback) info: * last-failure-OST4#monitor_20000[lustre4]=1702970773<br>Dec 19 10:26:13 <a href="http://lustre-mds1.ntslab.ru">lustre-mds1.ntslab.ru</a> pacemaker-based [2457586] (log_info) info: ++ /cib/status/node_state[@id='lustre4']/transient_attributes[@id='lustre4']/instance_attributes[@id='status-lustre4']: <nvpair id="status-lustre4-fail-count-OST4.monitor_20000" name="fail-count-OST4#monitor_20000" value="1"/><br>Dec 19 10:26:13 <a href="http://lustre-mds1.ntslab.ru">lustre-mds1.ntslab.ru</a> pacemaker-attrd [2457589] (attrd_cib_callback) info: CIB update 81 result for fail-count-OST4#monitor_20000: OK | rc=0<br>Dec 19 10:26:13 <a href="http://lustre-mds1.ntslab.ru">lustre-mds1.ntslab.ru</a> pacemaker-attrd [2457589] (attrd_cib_callback) info: * 
fail-count-OST4#monitor_20000[lustre4]=1<br>[root@lustre-mds2 ~]# grep OST4 /var/log/pacemaker/pacemaker.log<br>Dec 19 10:26:13 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-based [785103] (log_info) info: ++ /cib/status/node_state[@id='lustre4']/lrm[@id='lustre4']/lrm_resources/lrm_resource[@id='OST4']: <lrm_rsc_op id="OST4_last_failure_0" operation_key="OST4_monitor_20000" operation="monitor" crm-debug-origin="controld_update_resource_history" crm_feature_set="3.17.4" transition-key="42:48:0:c84c4c30-a2cb-4e2b-a6e8-98c14e15e390" transition-magic="8:1;42:48:0:c84c4c30-a2cb-4e2b-a6e8-98c14e15e390" exit-reason="Action was pe<br>Dec 19 10:26:13 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-controld [785108] (abort_transition_graph) info: Transition 54 aborted by operation OST4_monitor_20000 'create' on lustre-mds1: Change in recurring result | magic=8:1;42:48:0:c84c4c30-a2cb-4e2b-a6e8-98c14e15e390 cib=0.469.3 source=process_graph_event:500 complete=true<br>Dec 19 10:26:13 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-controld [785108] (update_failcount) info: Updating failcount for OST4 on lustre4 after failed monitor: rc=1 (update=value++, time=1702970773)<br>Dec 19 10:26:13 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-attrd [785106] (handle_value_expansion) info: Expanded fail-count-OST4#monitor_20000=value++ to 1<br>Dec 19 10:26:13 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-controld [785108] (process_graph_event) notice: Transition 48 action 42 (OST4_monitor_20000 on lustre-mds1): expected 'ok' but got 'error' | target-rc=0 rc=1 call-id=74 event='arrived after initial scheduling'<br>Dec 19 10:26:13 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-attrd [785106] (update_attr_on_host) notice: Setting last-failure-OST4#monitor_20000[lustre4] in instance_attributes: (unset) -> 1702970773 | from 
lustre-mds2 with no write delay<br>Dec 19 10:26:13 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-attrd [785106] (update_attr_on_host) notice: Setting fail-count-OST4#monitor_20000[lustre4] in instance_attributes: (unset) -> 1 | from lustre-mds2 with no write delay<br>Dec 19 10:26:13 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-based [785103] (log_info) info: ++ /cib/status/node_state[@id='lustre4']/transient_attributes[@id='lustre4']/instance_attributes[@id='status-lustre4']: <nvpair id="status-lustre4-last-failure-OST4.monitor_20000" name="last-failure-OST4#monitor_20000" value="1702970773"/><br>Dec 19 10:26:13 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-based [785103] (log_info) info: ++ /cib/status/node_state[@id='lustre4']/transient_attributes[@id='lustre4']/instance_attributes[@id='status-lustre4']: <nvpair id="status-lustre4-fail-count-OST4.monitor_20000" name="fail-count-OST4#monitor_20000" value="1"/><br>Dec 19 10:26:13 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-controld [785108] (abort_transition_graph) info: Transition 54 aborted by status-lustre4-last-failure-OST4.monitor_20000 doing create last-failure-OST4#monitor_20000=1702970773: Transient attribute change | cib=0.469.8 source=abort_unless_down:297 path=/cib/status/node_state[@id='lustre4']/transient_attributes[@id='lustre4']/instance_attributes[@id='status-lustre4'] complete=true<br>Dec 19 10:26:13 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-controld [785108] (abort_transition_graph) info: Transition 54 aborted by status-lustre4-fail-count-OST4.monitor_20000 doing create fail-count-OST4#monitor_20000=1: Transient attribute change | cib=0.469.9 source=abort_unless_down:297 path=/cib/status/node_state[@id='lustre4']/transient_attributes[@id='lustre4']/instance_attributes[@id='status-lustre4'] complete=true<br>Dec 19 10:26:13 <a 
href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (pe_get_failcount) info: OST4 has failed 1 time on lustre4<br>Dec 19 10:26:13 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (pe_get_failcount) info: OST4 has failed 1 time on lustre4<br>Dec 19 10:26:13 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (pe_get_failcount) info: OST4 has failed 1 time on lustre4<br>Dec 19 10:26:13 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (pe_get_failcount) info: OST4 has failed 1 time on lustre4<br>Dec 19 10:26:13 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (unpack_rsc_op_failure) warning: Unexpected result (error: Action was pending when executor connection was dropped) was recorded for monitor of OST4 on lustre4 at Dec 19 09:55:27 2023 | exit-status=1 id=OST4_last_failure_0<br>Dec 19 10:26:13 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (pe_get_failcount) info: OST4 has failed 1 time on lustre4<br>Dec 19 10:26:13 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (pe_get_failcount) info: OST4 has failed 1 time on lustre4<br>Dec 19 10:26:13 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (pe_get_failcount) info: OST4 has failed 1 time on lustre4<br>Dec 19 10:26:13 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (pcmk__threshold_reached) warning: OST4 cannot run on lustre4 due to reaching migration threshold (clean up resource to allow again)| failures=1 migration-threshold=1<br>Dec 19 10:26:13 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (pcmk__unassign_resource) info: Unassigning OST4<br>Dec 19 10:26:13 <a 
href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (update_resource_action_runnable) warning: OST4_stop_0 on lustre4 is unrunnable (node is offline)<br>Dec 19 10:26:13 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (recurring_op_for_active) info: Start 20s-interval monitor for OST4 on lustre3<br>Dec 19 10:26:13 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (log_list_item) notice: Actions: Stop OST4 ( lustre4 ) blocked<br>Dec 19 10:26:13 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (pcmk__create_graph) crit: Cannot fence lustre4 because of OST4: blocked (OST4_stop_0)<br>Dec 19 10:26:13 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (pe_get_failcount) info: OST4 has failed 1 time on lustre4<br>Dec 19 10:26:13 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (pe_get_failcount) info: OST4 has failed 1 time on lustre4<br>Dec 19 10:26:13 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (pe_get_failcount) info: OST4 has failed 1 time on lustre4<br>Dec 19 10:26:13 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (pe_get_failcount) info: OST4 has failed 1 time on lustre4<br>Dec 19 10:26:13 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (unpack_rsc_op_failure) warning: Unexpected result (error: Action was pending when executor connection was dropped) was recorded for monitor of OST4 on lustre4 at Dec 19 09:55:27 2023 | exit-status=1 id=OST4_last_failure_0<br>Dec 19 10:26:13 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (pe_get_failcount) info: OST4 has failed 1 time on lustre4<br>Dec 19 10:26:13 <a 
href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (pe_get_failcount) info: OST4 has failed 1 time on lustre4<br>Dec 19 10:26:13 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (pe_get_failcount) info: OST4 has failed 1 time on lustre4<br>Dec 19 10:26:13 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (pcmk__threshold_reached) warning: OST4 cannot run on lustre4 due to reaching migration threshold (clean up resource to allow again)| failures=1 migration-threshold=1<br>Dec 19 10:26:13 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (pcmk__unassign_resource) info: Unassigning OST4<br>Dec 19 10:26:13 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (update_resource_action_runnable) warning: OST4_stop_0 on lustre4 is unrunnable (node is offline)<br>Dec 19 10:26:13 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (recurring_op_for_active) info: Start 20s-interval monitor for OST4 on lustre3<br>Dec 19 10:26:13 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (log_list_item) notice: Actions: Stop OST4 ( lustre4 ) blocked<br>Dec 19 10:26:13 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (pcmk__create_graph) crit: Cannot fence lustre4 because of OST4: blocked (OST4_stop_0)<br>Dec 19 10:27:14 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (pcmk__unassign_resource) info: Unassigning OST4<br>Dec 19 10:27:14 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (recurring_op_for_active) info: Start 20s-interval monitor for OST4 on lustre4<br>Dec 19 10:27:14 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> 
pacemaker-schedulerd[785107] (log_list_item) notice: Actions: Start OST4 ( lustre4 )<br>Dec 19 10:27:18 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (unpack_rsc_op_failure) warning: Unexpected result (error: Action was pending when executor connection was dropped) was recorded for monitor of OST4 on lustre4 at Dec 19 09:55:27 2023 | exit-status=1 id=OST4_last_failure_0<br>Dec 19 10:27:18 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (pcmk__unassign_resource) info: Unassigning OST4<br>Dec 19 10:27:18 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (recurring_op_for_active) info: Start 20s-interval monitor for OST4 on lustre4<br>Dec 19 10:27:18 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (rsc_action_default) info: Leave OST4 (Started lustre4)<br>Dec 19 10:27:18 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (unpack_rsc_op_failure) warning: Unexpected result (error: Action was pending when executor connection was dropped) was recorded for monitor of OST4 on lustre4 at Dec 19 09:55:27 2023 | exit-status=1 id=OST4_last_failure_0<br>Dec 19 10:27:18 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (pcmk__unassign_resource) info: Unassigning OST4<br>Dec 19 10:27:18 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (recurring_op_for_active) info: Start 20s-interval monitor for OST4 on lustre4<br>Dec 19 10:27:18 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (rsc_action_default) info: Leave OST4 (Started lustre4)<br>Dec 19 10:27:21 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (unpack_rsc_op_failure) warning: Unexpected result (error: Action was pending when 
executor connection was dropped) was recorded for monitor of OST4 on lustre4 at Dec 19 09:55:27 2023 | exit-status=1 id=OST4_last_failure_0<br>Dec 19 10:27:21 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (pcmk__unassign_resource) info: Unassigning OST4<br>Dec 19 10:27:21 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (recurring_op_for_active) info: Start 20s-interval monitor for OST4 on lustre4<br>Dec 19 10:27:21 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (rsc_action_default) info: Leave OST4 (Started lustre4)<br>Dec 19 10:27:21 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (unpack_rsc_op_failure) warning: Unexpected result (error: Action was pending when executor connection was dropped) was recorded for monitor of OST4 on lustre4 at Dec 19 09:55:27 2023 | exit-status=1 id=OST4_last_failure_0<br>Dec 19 10:27:21 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (pcmk__unassign_resource) info: Unassigning OST4<br>Dec 19 10:27:21 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (recurring_op_for_active) info: Start 20s-interval monitor for OST4 on lustre4<br>Dec 19 10:27:21 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (rsc_action_default) info: Leave OST4 (Started lustre4)<br>Dec 19 10:27:26 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (unpack_rsc_op_failure) warning: Unexpected result (error: Action was pending when executor connection was dropped) was recorded for monitor of OST4 on lustre4 at Dec 19 09:55:27 2023 | exit-status=1 id=OST4_last_failure_0<br>Dec 19 10:27:26 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (pcmk__unassign_resource) 
info: Unassigning OST4<br>Dec 19 10:27:26 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (recurring_op_for_active) info: Start 20s-interval monitor for OST4 on lustre4<br>Dec 19 10:27:26 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (rsc_action_default) info: Leave OST4 (Started lustre4)<br>Dec 19 10:27:26 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (unpack_rsc_op_failure) warning: Unexpected result (error: Action was pending when executor connection was dropped) was recorded for monitor of OST4 on lustre4 at Dec 19 09:55:27 2023 | exit-status=1 id=OST4_last_failure_0<br>Dec 19 10:27:26 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (pcmk__unassign_resource) info: Unassigning OST4<br>Dec 19 10:27:26 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (update_resource_action_runnable) warning: OST4_stop_0 on lustre4 is unrunnable (node is offline)<br>Dec 19 10:27:26 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (recurring_op_for_active) info: Start 20s-interval monitor for OST4 on lustre3<br>Dec 19 10:27:26 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (log_list_item) notice: Actions: Stop OST4 ( lustre4 ) blocked<br>Dec 19 10:27:26 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (pcmk__create_graph) crit: Cannot fence lustre4 because of OST4: blocked (OST4_stop_0)<br>Dec 19 10:28:19 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (pcmk__unassign_resource) info: Unassigning OST4<br>Dec 19 10:28:19 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (recurring_op_for_active) info: Start 20s-interval monitor for 
OST4 on lustre4<br>Dec 19 10:28:19 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (log_list_item) notice: Actions: Start OST4 ( lustre4 )<br>Dec 19 10:28:22 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (unpack_rsc_op_failure) warning: Unexpected result (error: Action was pending when executor connection was dropped) was recorded for monitor of OST4 on lustre4 at Dec 19 09:55:27 2023 | exit-status=1 id=OST4_last_failure_0<br>Dec 19 10:28:22 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (pcmk__unassign_resource) info: Unassigning OST4<br>Dec 19 10:28:22 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (recurring_op_for_active) info: Start 20s-interval monitor for OST4 on lustre4<br>Dec 19 10:28:22 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (rsc_action_default) info: Leave OST4 (Started lustre4)<br>Dec 19 10:28:22 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (unpack_rsc_op_failure) warning: Unexpected result (error: Action was pending when executor connection was dropped) was recorded for monitor of OST4 on lustre4 at Dec 19 09:55:27 2023 | exit-status=1 id=OST4_last_failure_0<br>Dec 19 10:28:22 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (pcmk__unassign_resource) info: Unassigning OST4<br>Dec 19 10:28:22 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (recurring_op_for_active) info: Start 20s-interval monitor for OST4 on lustre4<br>Dec 19 10:28:22 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (rsc_action_default) info: Leave OST4 (Started lustre4)<br>Dec 19 10:28:22 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> 
pacemaker-schedulerd[785107] (pcmk__unassign_resource) info: Unassigning OST4<br>Dec 19 10:28:22 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (recurring_op_for_active) info: Start 20s-interval monitor for OST4 on lustre4<br>Dec 19 10:28:22 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (log_list_item) notice: Actions: Start OST4 ( lustre4 )<br>Dec 19 10:28:26 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (unpack_rsc_op_failure) warning: Unexpected result (error: Action was pending when executor connection was dropped) was recorded for monitor of OST4 on lustre4 at Dec 19 09:55:27 2023 | exit-status=1 id=OST4_last_failure_0<br>Dec 19 10:28:26 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (pcmk__unassign_resource) info: Unassigning OST4<br>Dec 19 10:28:26 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (recurring_op_for_active) info: Start 20s-interval monitor for OST4 on lustre4<br>Dec 19 10:28:26 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (rsc_action_default) info: Leave OST4 (Started lustre4)<br>Dec 19 10:28:26 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (unpack_rsc_op_failure) warning: Unexpected result (error: Action was pending when executor connection was dropped) was recorded for monitor of OST4 on lustre4 at Dec 19 09:55:27 2023 | exit-status=1 id=OST4_last_failure_0<br>Dec 19 10:28:26 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (pcmk__unassign_resource) info: Unassigning OST4<br>Dec 19 10:28:26 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (update_resource_action_runnable) warning: OST4_stop_0 on lustre4 is unrunnable (node is 
offline)<br>Dec 19 10:28:26 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (recurring_op_for_active) info: Start 20s-interval monitor for OST4 on lustre3<br>Dec 19 10:28:26 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (log_list_item) notice: Actions: Stop OST4 ( lustre4 ) blocked<br>Dec 19 10:28:26 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (pcmk__create_graph) crit: Cannot fence lustre4 because of OST4: blocked (OST4_stop_0)<br>Dec 19 10:28:27 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (pcmk__unassign_resource) info: Unassigning OST4<br>Dec 19 10:28:27 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (recurring_op_for_active) info: Start 20s-interval monitor for OST4 on lustre4<br>Dec 19 10:28:27 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (log_list_item) notice: Actions: Start OST4 ( lustre4 )<br>Dec 19 10:28:30 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (unpack_rsc_op_failure) warning: Unexpected result (error: Action was pending when executor connection was dropped) was recorded for monitor of OST4 on lustre4 at Dec 19 09:55:27 2023 | exit-status=1 id=OST4_last_failure_0<br>Dec 19 10:28:30 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (pcmk__unassign_resource) info: Unassigning OST4<br>Dec 19 10:28:30 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (recurring_op_for_active) info: Start 20s-interval monitor for OST4 on lustre4<br>Dec 19 10:28:30 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (rsc_action_default) info: Leave OST4 (Started lustre4)<br>Dec 19 10:28:31 <a 
href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (unpack_rsc_op_failure) warning: Unexpected result (error: Action was pending when executor connection was dropped) was recorded for monitor of OST4 on lustre4 at Dec 19 09:55:27 2023 | exit-status=1 id=OST4_last_failure_0<br>Dec 19 10:28:31 <a href="http://lustre-mds2.ntslab.ru">lustre-mds2.ntslab.ru</a> pacemaker-schedulerd[785107] (pcmk__unassign_resource) info: Unassigning OST4<br><br><br><br>Sorry for so many log lines, but I don't understand what's going on<br></div><div><br></div><div><br></div><div>best regards,</div><div>Artem<br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, 19 Dec 2023 at 00:13, Ken Gaillot <<a href="mailto:kgaillot@redhat.com">kgaillot@redhat.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On Mon, 2023-12-18 at 23:39 +0300, Artem wrote:<br>
> Hello experts.<br>
> <br>
> I previously played with a dummy resource and it worked as expected.<br>
> Now I'm switching to a Lustre OST resource and cannot make it work,<br>
> nor can I understand why.<br>
> <br>
> <br>
> ### Initial setup:<br>
> # pcs resource defaults update resource-stickiness=110<br>
> # for i in {1..4}; do pcs cluster node add-remote lustre$i<br>
> reconnect_interval=60; done <br>
> # for i in {1..4}; do pcs constraint location lustre$i prefers<br>
> lustre-mgs lustre-mds1 lustre-mds2; done<br>
> # pcs resource create OST3 ocf:lustre:Lustre target=/dev/disk/by-<br>
> id/wwn-0x6000c291b7f7147f826bb95153e2eaca mountpoint=/lustre/oss3<br>
> # pcs resource create OST4 ocf:lustre:Lustre target=/dev/disk/by-<br>
> id/wwn-0x6000c292c41eaae60bccdd3a752913b3 mountpoint=/lustre/oss4<br>
> (I also tried ocf:heartbeat:Filesystem device=... directory=...<br>
> fstype=lustre force_unmount=safe --> same behavior)<br>
> <br>
> # pcs constraint location OST3 prefers lustre3=100<br>
> # pcs constraint location OST3 prefers lustre4=100<br>
> # pcs constraint location OST4 prefers lustre3=100<br>
> # pcs constraint location OST4 prefers lustre4=100<br>
> # for i in lustre-mgs lustre-mds1 lustre-mds2 lustre{1..2}; do pcs<br>
> constraint location OST3 avoids $i; done<br>
> # for i in lustre-mgs lustre-mds1 lustre-mds2 lustre{1..2}; do pcs<br>
> constraint location OST4 avoids $i; done<br>
> <br>
> ### Checking all is good<br>
> # crm_simulate --simulate --live-check --show-scores<br>
> pcmk__primitive_assign: OST4 allocation score on lustre3: 100<br>
> pcmk__primitive_assign: OST4 allocation score on lustre4: 210<br>
> # pcs status<br>
> * OST3 (ocf::lustre:Lustre): Started lustre3<br>
> * OST4 (ocf::lustre:Lustre): Started lustre4<br>
> <br>
> ### VM with lustre4 (OST4) is OFF<br>
> <br>
> # crm_simulate --simulate --live-check --show-scores<br>
> pcmk__primitive_assign: OST4 allocation score on lustre3: 100<br>
> pcmk__primitive_assign: OST4 allocation score on lustre4: 100<br>
> Start OST4 ( lustre3 )<br>
> Resource action: OST4 start on lustre3<br>
> Resource action: OST4 monitor=20000 on lustre3<br>
> # pcs status<br>
> * OST3 (ocf::lustre:Lustre): Started lustre3<br>
> * OST4 (ocf::lustre:Lustre): Stopped<br>
> <br>
> 1) I see crm_simulate guessed that it has to restart failed OST4 on<br>
> lustre3. After making such a decision I suspect it evaluates 100:100<br>
> scores of both lustre3 and lustre4, but lustre3 is already running a<br>
> service. So it decides to run OST4 again on lustre4, which is failed.<br>
> Thus it cannot restart on surviving nodes. Right?<br>
<br>
No. I'd start with figuring out this case. There's no reason, given the<br>
configuration above, why OST4 would be stopped. In fact the simulation<br>
shows it should be started, so that suggests that maybe the actual<br>
start failed.<br>
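<br>
As a rough illustration of why the simulation picks lustre3, here is a minimal sketch of the score arithmetic visible in the crm_simulate output above. It is a simplified model, not Pacemaker's actual scheduler: the helper names `allocation_scores` and `preferred_node` are hypothetical, and the only rules assumed are that stickiness is added on the node where the resource is currently active and that offline nodes are not eligible.<br>

```python
def allocation_scores(location, stickiness, active_on=None):
    """Simplified model: location score, plus stickiness on the active node."""
    scores = dict(location)
    if active_on in scores:
        scores[active_on] += stickiness
    return scores


def preferred_node(scores, online):
    """Pick the highest-scoring node that is online and not banned (score < 0)."""
    candidates = {n: s for n, s in scores.items() if n in online and s >= 0}
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


# Values taken from the configuration quoted above:
# OST4 prefers lustre3=100 and lustre4=100, resource-stickiness=110.
location = {"lustre3": 100, "lustre4": 100}

# Healthy cluster, OST4 active on lustre4: stickiness applies there.
healthy = allocation_scores(location, 110, active_on="lustre4")
print(healthy)

# lustre4 powered off, OST4 no longer active anywhere: no stickiness.
after_failure = allocation_scores(location, 110, active_on=None)
print(after_failure, preferred_node(after_failure, online={"lustre3"}))
```

Under these assumptions the healthy case gives 100 and 210, and the failure case gives 100 and 100 with lustre3 as the only eligible node, which matches the quoted simulation output.<br>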
<br>
Do the logs show any errors around this time?<br>
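<br>
One way to narrow the logs down is to keep only the warning, error and crit entries that mention the resource. The sketch below is only an illustration: the line layout in the regular expression is inferred from the pacemaker.log excerpts in this thread and may not hold for other versions or logging setups, and the function name `severe_entries` is hypothetical.<br>

```python
import re

# Assumed layout, inferred from the excerpts in this thread:
#   "Dec 19 10:26:13 host daemon[pid] (function) level: message"
# Some daemons log a space before "[pid]", so whitespace there is optional.
LOG_LINE = re.compile(
    r"^(?P<time>\w{3}\s+\d+\s+[\d:]{8})\s+(?P<host>\S+)\s+"
    r"(?P<daemon>[\w-]+)\s*\[(?P<pid>\d+)\]\s+"
    r"\((?P<function>\w+)\)\s+(?P<level>\w+):\s+(?P<message>.*)$"
)

SEVERE = {"warning", "error", "crit"}


def severe_entries(lines, resource):
    """Return (time, level, message) for severe lines mentioning the resource."""
    hits = []
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # skip lines that do not follow the assumed layout
        if match["level"] in SEVERE and resource in match["message"]:
            hits.append((match["time"], match["level"], match["message"]))
    return hits


# Sample lines copied from the log excerpts in this thread.
sample = [
    "Dec 19 10:26:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] "
    "(pe_get_failcount) info: OST4 has failed 1 time on lustre4",
    "Dec 19 10:26:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] "
    "(update_resource_action_runnable) warning: OST4_stop_0 on lustre4 is "
    "unrunnable (node is offline)",
    "Dec 19 10:26:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] "
    "(pcmk__create_graph) crit: Cannot fence lustre4 because of OST4: "
    "blocked (OST4_stop_0)",
]

for time, level, message in severe_entries(sample, "OST4"):
    print(time, level, message)
```

To run it against a real file, pass an open file object instead of `sample`, for example `severe_entries(open("/var/log/pacemaker/pacemaker.log"), "OST4")`.<br>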
<br>
> 2) Ok, let's try not to give a specific score - nothing changed, see<br>
> below:<br>
> ### did remove old constraints; clear all resources; cleanup all<br>
> resources; cluster stop; cluster start<br>
> <br>
> # pcs constraint location OST3 prefers lustre3 lustre4<br>
> # pcs constraint location OST4 prefers lustre3 lustre4<br>
> # for i in lustre-mgs lustre-mds1 lustre-mds2 lustre{1..2}; do pcs<br>
> constraint location OST3 avoids $i; done<br>
> # for i in lustre-mgs lustre-mds1 lustre-mds2 lustre{1..2}; do pcs<br>
> constraint location OST4 avoids $i; done<br>
> # crm_simulate --simulate --live-check --show-scores<br>
> pcmk__primitive_assign: OST4 allocation score on lustre3: INFINITY<br>
> pcmk__primitive_assign: OST4 allocation score on lustre4: INFINITY<br>
> # pcs status<br>
> * OST3 (ocf::lustre:Lustre): Started lustre3<br>
> * OST4 (ocf::lustre:Lustre): Started lustre4<br>
> <br>
> ### VM with lustre4 (OST4) is OFF<br>
> <br>
> # crm_simulate --simulate --live-check --show-scores<br>
> pcmk__primitive_assign: OST4 allocation score on lustre3: INFINITY<br>
> pcmk__primitive_assign: OST4 allocation score on lustre4: INFINITY<br>
> Start OST4 ( lustre3 )<br>
> Resource action: OST4 start on lustre3<br>
> Resource action: OST4 monitor=20000 on lustre3<br>
> # pcs status<br>
> * OST3 (ocf::lustre:Lustre): Started lustre3<br>
> * OST4 (ocf::lustre:Lustre): Stopped<br>
> <br>
> 3) Ok, let's try to set different scores with preference to nodes and<br>
> affect it with pingd:<br>
> ### did remove old constraints; clear all resources; cleanup all<br>
> resources; cluster stop; cluster start<br>
> <br>
> # pcs constraint location OST3 prefers lustre3=100<br>
> # pcs constraint location OST3 prefers lustre4=90<br>
> # pcs constraint location OST4 prefers lustre3=90<br>
> # pcs constraint location OST4 prefers lustre4=100<br>
> # for i in lustre-mgs lustre-mds1 lustre-mds2 lustre{1..2}; do pcs<br>
> constraint location OST3 avoids $i; done<br>
> # for i in lustre-mgs lustre-mds1 lustre-mds2 lustre{1..2}; do pcs<br>
> constraint location OST4 avoids $i; done<br>
> # pcs resource create ping ocf:pacemaker:ping dampen=5s<br>
> host_list=192.168.34.250 op monitor interval=3s timeout=7s meta<br>
> target-role="started" globally-unique="false" clone<br>
> # for i in lustre-mgs lustre-mds{1..2} lustre{1..4}; do pcs<br>
> constraint location ping-clone prefers $i; done<br>
> # pcs constraint location OST3 rule score=0 pingd lt 1 or not_defined<br>
> pingd<br>
> # pcs constraint location OST4 rule score=0 pingd lt 1 or not_defined<br>
> pingd<br>
> # pcs constraint location OST3 rule score=125 defined pingd<br>
> # pcs constraint location OST4 rule score=125 defined pingd<br>
> <br>
> ### same home base:<br>
> # crm_simulate --simulate --live-check --show-scores<br>
> pcmk__primitive_assign: OST4 allocation score on lustre3: 90<br>
> pcmk__primitive_assign: OST4 allocation score on lustre4: 210<br>
> # pcs status<br>
> * OST3 (ocf::lustre:Lustre): Started lustre3<br>
> * OST4 (ocf::lustre:Lustre): Started lustre4<br>
> <br>
> ### VM with lustre4 (OST4) is OFF. <br>
> <br>
> # crm_simulate --simulate --live-check --show-scores<br>
> pcmk__primitive_assign: OST4 allocation score on lustre3: 90<br>
> pcmk__primitive_assign: OST4 allocation score on lustre4: 100<br>
> Start OST4 ( lustre3 )<br>
> Resource action: OST4 start on lustre3<br>
> Resource action: OST4 monitor=20000 on lustre3<br>
> # pcs status<br>
> * OST3 (ocf::lustre:Lustre): Started lustre3<br>
> * OST4 (ocf::lustre:Lustre): Stopped<br>
> <br>
> Again lustre3 seems unable to overrule due to lower score and pingd<br>
> DOESN'T help at all!<br>
> <br>
> <br>
> 4) Can I make a reliable HA failover without pingd to keep things as<br>
> simple as possible?<br>
> 5) Pings might help to affect cluster decisions in case GW is lost,<br>
> but it's not working as all the guides say. Why?<br>
> <br>
> <br>
> Thanks in advance,<br>
> Artem<br>
> _______________________________________________<br>
> Manage your subscription:<br>
> <a href="https://lists.clusterlabs.org/mailman/listinfo/users" rel="noreferrer" target="_blank">https://lists.clusterlabs.org/mailman/listinfo/users</a><br>
> <br>
> ClusterLabs home: <a href="https://www.clusterlabs.org/" rel="noreferrer" target="_blank">https://www.clusterlabs.org/</a><br>
-- <br>
Ken Gaillot <<a href="mailto:kgaillot@redhat.com" target="_blank">kgaillot@redhat.com</a>><br>
<br>
</blockquote></div>