[ClusterLabs] RemoteOFFLINE status, permanently
Ken Gaillot
kgaillot at redhat.com
Mon Dec 4 15:21:16 EST 2023
On Wed, 2023-11-29 at 12:56 +0300, Artem wrote:
> Hello,
>
> I deployed a Lustre cluster with 3 nodes (metadata) as
> pacemaker/corosync and 4 nodes as Remote Agents (for data). Initially
> all went well, I've set up MGS and MDS resources, checked failover
> and failback, remote agents were online.
>
> Then I tried to create a resource for OST on two nodes which are
> remote agents. I also set a location constraint preference for them,
> a colocation constraint (OST1 and OST2, score=-50) and an ordering
> constraint (MDS then OST[12]). Then I read that colocation and ordering
> constraints should not be used for RA, so I deleted these constraints.
> At some stage I used reconnect_interval=5s, but then found a bug report
> advising to set it higher, so I reverted to the defaults.
>
> Only then did I check pcs status, and noticed that the RAs were Offline.
> I tried to remove the RAs, add them again, restart the cluster, destroy
> and recreate it, and reboot the nodes; nothing helped. Even at the very
> beginning of cluster setup, the agents were persistently RemoteOFFLINE,
> before the OST resource was created and located preferably on RA
> (lustre1 and lustre2). I found nothing helpful in
> /var/log/pacemaker/pacemaker.log. Please help me investigate and fix
> it.
>
>
> [root@lustre-mgs ~]# rpm -qa | grep -E "corosync|pacemaker|pcs"
> pacemaker-cli-2.1.6-8.el8.x86_64
> pacemaker-schemas-2.1.6-8.el8.noarch
> pcs-0.10.17-2.el8.x86_64
> pacemaker-libs-2.1.6-8.el8.x86_64
> corosync-3.1.7-1.el8.x86_64
> pacemaker-cluster-libs-2.1.6-8.el8.x86_64
> pacemaker-2.1.6-8.el8.x86_64
> corosynclib-3.1.7-1.el8.x86_64
>
> [root@lustre-mgs ~]# ssh lustre1 "rpm -qa | grep resource-agents"
> resource-agents-4.9.0-49.el8.x86_64
>
> [root@lustre-mgs ~]# pcs status
> Cluster name: cl-lustre
> Cluster Summary:
> * Stack: corosync (Pacemaker is running)
> * Current DC: lustre-mds1 (version 2.1.6-8.el8-6fdc9deea29) -
> partition with quorum
> * Last updated: Wed Nov 29 12:40:37 2023 on lustre-mgs
> * Last change: Wed Nov 29 12:11:21 2023 by root via cibadmin on
> lustre-mgs
> * 7 nodes configured
> * 6 resource instances configured
> Node List:
> * Online: [ lustre-mds1 lustre-mds2 lustre-mgs ]
> * RemoteOFFLINE: [ lustre1 lustre2 lustre3 lustre4 ]
> Full List of Resources:
> * lustre2 (ocf::pacemaker:remote): Stopped
> * lustre3 (ocf::pacemaker:remote): Stopped
> * lustre4 (ocf::pacemaker:remote): Stopped
> * lustre1 (ocf::pacemaker:remote): Stopped
> * MGT (ocf::heartbeat:Filesystem): Started lustre-mgs
> * MDT00 (ocf::heartbeat:Filesystem): Started lustre-mds1
> Daemon Status:
> corosync: active/disabled
> pacemaker: active/enabled
> pcsd: active/enabled
>
> [root@lustre-mgs ~]# pcs cluster verify --full
> [root@lustre-mgs ~]#
>
> [root@lustre-mgs ~]# pcs constraint show --full
> Warning: This command is deprecated and will be removed. Please use
> 'pcs constraint config' instead.
> Location Constraints:
> Resource: MDT00
> Enabled on:
> Node: lustre-mds1 (score:100) (id:location-MDT00-lustre-mds1-
> 100)
> Node: lustre-mds2 (score:100) (id:location-MDT00-lustre-mds2-
> 100)
> Resource: MGT
> Enabled on:
> Node: lustre-mgs (score:100) (id:location-MGT-lustre-mgs-100)
> Node: lustre-mds2 (score:50) (id:location-MGT-lustre-mds2-50)
> Ordering Constraints:
> start MGT then start MDT00 (kind:Optional) (id:order-MGT-MDT00-
> Optional)
> Colocation Constraints:
> Ticket Constraints:
>
> [root@lustre-mgs ~]# pcs resource show lustre1
> Warning: This command is deprecated and will be removed. Please use
> 'pcs resource config' instead.
> Resource: lustre1 (class=ocf provider=pacemaker type=remote)
> Attributes: lustre1-instance_attributes
> server=lustre1
> Operations:
> migrate_from: lustre1-migrate_from-interval-0s
> interval=0s
> timeout=60s
> migrate_to: lustre1-migrate_to-interval-0s
> interval=0s
> timeout=60s
> monitor: lustre1-monitor-interval-60s
> interval=60s
> timeout=30s
> reload: lustre1-reload-interval-0s
> interval=0s
> timeout=60s
> reload-agent: lustre1-reload-agent-interval-0s
> interval=0s
> timeout=60s
> start: lustre1-start-interval-0s
> interval=0s
> timeout=60s
> stop: lustre1-stop-interval-0s
> interval=0s
> timeout=60s
>
> I also changed some properties:
> pcs property set stonith-enabled=false
> pcs property set symmetric-cluster=false
Hi,
An asymmetric cluster requires that all resources be enabled on
particular nodes with location constraints. Since you don't have any
for your remote connections, they can't start anywhere.
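As a rough sketch of what that could look like here (not something tested on your cluster): the script below enables each remote connection resource on the three full cluster nodes using "pcs constraint location ... prefers". The score of 100 and the choice to allow every connection on all three nodes are assumptions for illustration, and the run() helper and DRY_RUN variable are hypothetical names. By default it only prints the commands; set DRY_RUN=0 to actually run them.

```shell
#!/bin/sh
# Sketch: give each Pacemaker Remote connection resource a location
# constraint so it is allowed to start in an asymmetric (opt-in) cluster.
# Names are taken from the pcs status output quoted in this thread.
remotes="lustre1 lustre2 lustre3 lustre4"
cluster_nodes="lustre-mgs lustre-mds1 lustre-mds2"
score=100   # assumption: any positive score enables the resource on a node

# Hypothetical helper: print the command (default) or execute it (DRY_RUN=0).
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

enable_remotes() {
    for rsc in $remotes; do
        prefs=""
        for node in $cluster_nodes; do
            prefs="$prefs $node=$score"
        done
        # $prefs is intentionally unquoted so each node=score is its own argument
        run pcs constraint location "$rsc" prefers $prefs
    done
}

enable_remotes
```

Once constraints like these exist, the connection resources should have somewhere they are allowed to run, and "pcs status" is the place to check whether the remote nodes come online.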
> pcs property set batch-limit=100
> pcs resource defaults update resource-stickness=1000
> pcs cluster config update
>
> [root@lustre-mgs ~]# ssh lustre1 "systemctl status pcsd pacemaker-remote resource-agents-deps.target"
> ● pcsd.service - PCS GUI and remote configuration interface
> Loaded: loaded (/usr/lib/systemd/system/pcsd.service; enabled;
> vendor preset: disabled)
> Active: active (running) since Tue 2023-11-28 19:01:49 MSK; 17h
> ago
> Docs: man:pcsd(8)
> man:pcs(8)
> Main PID: 1752 (pcsd)
> Tasks: 1 (limit: 408641)
> Memory: 28.0M
> CGroup: /system.slice/pcsd.service
> └─1752 /usr/libexec/platform-python -Es /usr/sbin/pcsd
> Nov 28 19:01:49 lustre1.ntslab.ru systemd[1]: Starting PCS GUI and
> remote configuration interface...
> Nov 28 19:01:49 lustre1.ntslab.ru systemd[1]: Started PCS GUI and
> remote configuration interface.
>
> ● pacemaker_remote.service - Pacemaker Remote executor daemon
> Loaded: loaded (/usr/lib/systemd/system/pacemaker_remote.service;
> enabled; vendor preset: disabled)
> Active: active (running) since Wed 2023-11-29 11:08:14 MSK; 1h
> 37min ago
> Docs: man:pacemaker-remoted
> https://clusterlabs.org/pacemaker/doc/
> Main PID: 3040 (pacemaker-remot)
> Tasks: 1
> Memory: 1.4M
> CGroup: /system.slice/pacemaker_remote.service
> └─3040 /usr/sbin/pacemaker-remoted
> Nov 29 11:08:14 lustre1.ntslab.ru systemd[1]: Started Pacemaker
> Remote executor daemon.
>
> ● resource-agents-deps.target - resource-agents dependencies
> Loaded: loaded (/usr/lib/systemd/system/resource-agents-
> deps.target; static; vendor preset: disabled)
> Active: active since Tue 2023-11-28 19:01:47 MSK; 17h ago
>
>
> Attempt to re-add:
> [root@lustre-mgs ~]# date;pcs cluster node remove-remote lustre1
> Wed Nov 29 12:49:59 MSK 2023
> Requesting 'pacemaker_remote disable', 'pacemaker_remote stop' on
> 'lustre1'
> lustre1: successful run of 'pacemaker_remote disable'
> lustre1: successful run of 'pacemaker_remote stop'
> Requesting remove 'pacemaker authkey' from 'lustre1'
> lustre1: successful removal of the file 'pacemaker authkey'
> Deleting Resource - lustre1
> [root@lustre-mgs ~]# date;pcs cluster node add-remote lustre1
> Wed Nov 29 12:50:08 MSK 2023
> No addresses specified for host 'lustre1', using 'lustre1'
> Sending 'pacemaker authkey' to 'lustre1'
> lustre1: successful distribution of the file 'pacemaker authkey'
> Requesting 'pacemaker_remote enable', 'pacemaker_remote start' on
> 'lustre1'
> lustre1: successful run of 'pacemaker_remote enable'
> lustre1: successful run of 'pacemaker_remote start'
> [root@lustre-mgs ~]# date; pcs status
> Wed Nov 29 12:50:35 MSK 2023
> Cluster name: cl-lustre
> Cluster Summary:
> * Stack: corosync (Pacemaker is running)
> * Current DC: lustre-mds1 (version 2.1.6-8.el8-6fdc9deea29) -
> partition with quorum
> * Last updated: Wed Nov 29 12:50:35 2023 on lustre-mgs
> * Last change: Wed Nov 29 12:50:11 2023 by root via cibadmin on
> lustre-mgs
> * 7 nodes configured
> * 6 resource instances configured
> Node List:
> * Online: [ lustre-mds1 lustre-mds2 lustre-mgs ]
> * RemoteOFFLINE: [ lustre1 lustre2 lustre3 lustre4 ]
> Full List of Resources:
> * lustre2 (ocf::pacemaker:remote): Stopped
> * lustre3 (ocf::pacemaker:remote): Stopped
> * lustre4 (ocf::pacemaker:remote): Stopped
> * MGT (ocf::heartbeat:Filesystem): Started lustre-mgs
> * MDT00 (ocf::heartbeat:Filesystem): Started lustre-mds1
> * lustre1 (ocf::pacemaker:remote): Stopped
> Daemon Status:
> corosync: active/disabled
> pacemaker: active/enabled
> pcsd: active/enabled
>
> [root@lustre-mgs ~]# grep lustre1 /var/log/pacemaker/pacemaker.log
> Nov 29 12:50:01 lustre-mgs.ntslab.ru pacemaker-based [2481]
> (cib_process_request) info: Forwarding cib_delete operation for
> section //primitive[@id='lustre1'] to all (origin=local/cibadmin/2)
> Nov 29 12:50:01 lustre-mgs.ntslab.ru pacemaker-based [2481]
> (log_info) info: --
> /cib/configuration/resources/primitive[@id='lustre1']
> Nov 29 12:50:01 lustre-mgs.ntslab.ru pacemaker-based [2481]
> (cib_process_request) info: Completed cib_delete operation for
> section //primitive[@id='lustre1']: OK (rc=0, origin=lustre-
> mgs/cibadmin/2, version=0.25.0)
> Nov 29 12:50:01 lustre-mgs.ntslab.ru pacemaker-fenced [2482]
> (stonith_device_remove) info: Device 'lustre1' not found (0
> active devices)
> Nov 29 12:50:01 lustre-mgs.ntslab.ru pacemaker-based [2481]
> (log_info) info: --
> /cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resour
> ce[@id='lustre1']
> Nov 29 12:50:01 lustre-mgs.ntslab.ru pacemaker-based [2481]
> (cib_process_request) info: Completed cib_delete operation for
> section //node_state[@uname='lustre-
> mds1']/lrm/lrm_resources/lrm_resource[@id='lustre1']: OK (rc=0,
> origin=lustre-mds1/crmd/157, version=0.25.0)
> Nov 29 12:50:01 lustre-mgs.ntslab.ru pacemaker-based [2481]
> (log_info) info: --
> /cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resour
> ce[@id='lustre1']
> Nov 29 12:50:01 lustre-mgs.ntslab.ru pacemaker-based [2481]
> (cib_process_request) info: Completed cib_delete operation for
> section //node_state[@uname='lustre-
> mds1']/lrm/lrm_resources/lrm_resource[@id='lustre1']: OK (rc=0,
> origin=lustre-mds1/crmd/158, version=0.25.1)
> Nov 29 12:50:01 lustre-mgs.ntslab.ru pacemaker-based [2481]
> (cib_process_request) info: Forwarding cib_delete operation for
> section //node_state[@uname='lustre-
> mgs']/lrm/lrm_resources/lrm_resource[@id='lustre1'] to all
> (origin=local/crmd/39)
> Nov 29 12:50:01 lustre-mgs.ntslab.ru pacemaker-based [2481]
> (log_info) info: --
> /cib/status/node_state[@id='3']/lrm[@id='3']/lrm_resources/lrm_resour
> ce[@id='lustre1']
> Nov 29 12:50:01 lustre-mgs.ntslab.ru pacemaker-based [2481]
> (cib_process_request) info: Completed cib_delete operation for
> section //node_state[@uname='lustre-
> mds2']/lrm/lrm_resources/lrm_resource[@id='lustre1']: OK (rc=0,
> origin=lustre-mds2/crmd/35, version=0.25.1)
> Nov 29 12:50:01 lustre-mgs.ntslab.ru pacemaker-based [2481]
> (log_info) info: --
> /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resour
> ce[@id='lustre1']
> Nov 29 12:50:01 lustre-mgs.ntslab.ru pacemaker-based [2481]
> (cib_process_request) info: Completed cib_delete operation for
> section //node_state[@uname='lustre-
> mgs']/lrm/lrm_resources/lrm_resource[@id='lustre1']: OK (rc=0,
> origin=lustre-mgs/crmd/39, version=0.25.1)
> Nov 29 12:50:01 lustre-mgs.ntslab.ru pacemaker-controld [2486]
> (delete_resource) info: Removing resource lustre1 from executor
> for tengine
> Nov 29 12:50:01 lustre-mgs.ntslab.ru pacemaker-controld [2486]
> (controld_delete_resource_history) info: Clearing resource
> history for lustre1 on lustre-mgs (via CIB call 40) |
> xpath=//node_state[@uname='lustre-
> mgs']/lrm/lrm_resources/lrm_resource[@id='lustre1']
> Nov 29 12:50:01 lustre-mgs.ntslab.ru pacemaker-controld [2486]
> (notify_deleted) info: Notifying tengine on lustre-mds1 that
> lustre1 was deleted
> Nov 29 12:50:01 lustre-mgs.ntslab.ru pacemaker-based [2481]
> (log_info) info: --
> /cib/status/node_state[@id='3']/lrm[@id='3']/lrm_resources/lrm_resour
> ce[@id='lustre1']
> Nov 29 12:50:01 lustre-mgs.ntslab.ru pacemaker-based [2481]
> (cib_process_request) info: Completed cib_delete operation for
> section //node_state[@uname='lustre-
> mds2']/lrm/lrm_resources/lrm_resource[@id='lustre1']: OK (rc=0,
> origin=lustre-mds2/crmd/36, version=0.25.2)
> Nov 29 12:50:01 lustre-mgs.ntslab.ru pacemaker-based [2481]
> (cib_process_request) info: Forwarding cib_delete operation for
> section //node_state[@uname='lustre-
> mgs']/lrm/lrm_resources/lrm_resource[@id='lustre1'] to all
> (origin=local/crmd/40)
> Nov 29 12:50:01 lustre-mgs.ntslab.ru pacemaker-based [2481]
> (log_info) info: --
> /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resour
> ce[@id='lustre1']
> Nov 29 12:50:01 lustre-mgs.ntslab.ru pacemaker-based [2481]
> (cib_process_request) info: Completed cib_delete operation for
> section //node_state[@uname='lustre-
> mgs']/lrm/lrm_resources/lrm_resource[@id='lustre1']: OK (rc=0,
> origin=lustre-mgs/crmd/40, version=0.25.3)
> Nov 29 12:50:03 lustre-mgs.ntslab.ru pacemaker-controld [2486]
> (reap_crm_member) info: No peers with id=0 and/or uname=lustre1
> to purge from the membership cache
> Nov 29 12:50:03 lustre-mgs.ntslab.ru pacemaker-fenced [2482]
> (reap_crm_member) info: No peers with id=0 and/or uname=lustre1
> to purge from the membership cache
> Nov 29 12:50:03 lustre-mgs.ntslab.ru pacemaker-attrd [2484]
> (attrd_client_peer_remove) info: Client e1142409-f793-4839-a938-
> f512958a925e is requesting all values for lustre1 be removed
> Nov 29 12:50:03 lustre-mgs.ntslab.ru pacemaker-attrd [2484]
> (attrd_peer_remove) notice: Removing all lustre1 attributes for
> peer lustre-mgs
> Nov 29 12:50:03 lustre-mgs.ntslab.ru pacemaker-attrd [2484]
> (reap_crm_member) info: No peers with id=0 and/or uname=lustre1
> to purge from the membership cache
> Nov 29 12:50:11 lustre-mgs.ntslab.ru pacemaker-based [2481]
> (log_info) info: ++ /cib/configuration/resources: <primitive
> class="ocf" id="lustre1" provider="pacemaker" type="remote"/>
> Nov 29 12:50:11 lustre-mgs.ntslab.ru pacemaker-based [2481]
> (log_info) info: ++
> <instance_attributes id="lustre1-instance_attributes">
> Nov 29 12:50:11 lustre-mgs.ntslab.ru pacemaker-based [2481]
> (log_info) info: ++ <nvpair
> id="lustre1-instance_attributes-server" name="server"
> value="lustre1"/>
> Nov 29 12:50:11 lustre-mgs.ntslab.ru pacemaker-based [2481]
> (log_info) info: ++ <op
> id="lustre1-migrate_from-interval-0s" interval="0s"
> name="migrate_from" timeout="60s"/>
> Nov 29 12:50:11 lustre-mgs.ntslab.ru pacemaker-based [2481]
> (log_info) info: ++ <op
> id="lustre1-migrate_to-interval-0s" interval="0s" name="migrate_to"
> timeout="60s"/>
> Nov 29 12:50:11 lustre-mgs.ntslab.ru pacemaker-based [2481]
> (log_info) info: ++ <op
> id="lustre1-monitor-interval-60s" interval="60s" name="monitor"
> timeout="30s"/>
> Nov 29 12:50:11 lustre-mgs.ntslab.ru pacemaker-based [2481]
> (log_info) info: ++ <op
> id="lustre1-reload-interval-0s" interval="0s" name="reload"
> timeout="60s"/>
> Nov 29 12:50:11 lustre-mgs.ntslab.ru pacemaker-based [2481]
> (log_info) info: ++ <op
> id="lustre1-reload-agent-interval-0s" interval="0s" name="reload-
> agent" timeout="60s"/>
> Nov 29 12:50:11 lustre-mgs.ntslab.ru pacemaker-based [2481]
> (log_info) info: ++ <op
> id="lustre1-start-interval-0s" interval="0s" name="start"
> timeout="60s"/>
> Nov 29 12:50:11 lustre-mgs.ntslab.ru pacemaker-based [2481]
> (log_info) info: ++ <op
> id="lustre1-stop-interval-0s" interval="0s" name="stop"
> timeout="60s"/>
> Nov 29 12:50:11 lustre-mgs.ntslab.ru pacemaker-execd [2483]
> (process_lrmd_get_rsc_info) info: Agent information for 'lustre1'
> not in cache
> Nov 29 12:50:11 lustre-mgs.ntslab.ru pacemaker-controld [2486]
> (do_lrm_rsc_op) notice: Requesting local execution of probe
> operation for lustre1 on lustre-mgs | transition_key=5:88:7:288b2e10-
> 0bee-498d-b9eb-4bc5f0f8d5bf op_key=lustre1_monitor_0
> Nov 29 12:50:11 lustre-mgs.ntslab.ru pacemaker-controld [2486]
> (log_executor_event) notice: Result of probe operation for lustre1
> on lustre-mgs: not running (Remote connection inactive) | graph
> action confirmed; call=7 key=lustre1_monitor_0 rc=7
> Nov 29 12:50:11 lustre-mgs.ntslab.ru pacemaker-based [2481]
> (log_info) info: ++
> /cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources:
> <lrm_resource id="lustre1" class="ocf" provider="pacemaker"
> type="remote"/>
> Nov 29 12:50:11 lustre-mgs.ntslab.ru pacemaker-based [2481]
> (log_info) info: ++
> <lrm_rsc_op id="lustre1_last_0"
> operation_key="lustre1_monitor_0" operation="monitor" crm-debug-
> origin="controld_update_resource_history" crm_feature_set="3.17.4"
> transition-key="3:88:7:288b2e10-0bee-498d-b9eb-4bc5f0f8d5bf"
> transition-magic="-1:193;3:88:7:288b2e10-0bee-498d-b9eb-4bc5f0f8d5bf"
> exit-reason="" on_node="lustre-mds1" call-id="-1" rc-code="193" op-st
> Nov 29 12:50:11 lustre-mgs.ntslab.ru pacemaker-based [2481]
> (log_info) info: +
> /cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resou
> rce[@id='lustre1']/lrm_rsc_op[@id='lustre1_last_0']: @transition-
> magic=0:7;3:88:7:288b2e10-0bee-498d-b9eb-4bc5f0f8d5bf, @call-id=7,
> @rc-code=7, @op-status=0
> Nov 29 12:50:11 lustre-mgs.ntslab.ru pacemaker-based [2481]
> (log_info) info: ++
> /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources:
> <lrm_resource id="lustre1" class="ocf" provider="pacemaker"
> type="remote"/>
> Nov 29 12:50:11 lustre-mgs.ntslab.ru pacemaker-based [2481]
> (log_info) info: ++
> <lrm_rsc_op id="lustre1_last_0"
> operation_key="lustre1_monitor_0" operation="monitor" crm-debug-
> origin="controld_update_resource_history" crm_feature_set="3.17.4"
> transition-key="5:88:7:288b2e10-0bee-498d-b9eb-4bc5f0f8d5bf"
> transition-magic="-1:193;5:88:7:288b2e10-0bee-498d-b9eb-4bc5f0f8d5bf"
> exit-reason="" on_node="lustre-mgs" call-id="-1" rc-code="193" op-sta
> Nov 29 12:50:11 lustre-mgs.ntslab.ru pacemaker-based [2481]
> (log_info) info: ++
> /cib/status/node_state[@id='3']/lrm[@id='3']/lrm_resources:
> <lrm_resource id="lustre1" class="ocf" provider="pacemaker"
> type="remote"/>
> Nov 29 12:50:11 lustre-mgs.ntslab.ru pacemaker-based [2481]
> (log_info) info: ++
> <lrm_rsc_op id="lustre1_last_0"
> operation_key="lustre1_monitor_0" operation="monitor" crm-debug-
> origin="controld_update_resource_history" crm_feature_set="3.17.4"
> transition-key="4:88:7:288b2e10-0bee-498d-b9eb-4bc5f0f8d5bf"
> transition-magic="-1:193;4:88:7:288b2e10-0bee-498d-b9eb-4bc5f0f8d5bf"
> exit-reason="" on_node="lustre-mds2" call-id="-1" rc-code="193" op-st
> Nov 29 12:50:11 lustre-mgs.ntslab.ru pacemaker-based [2481]
> (log_info) info: +
> /cib/status/node_state[@id='3']/lrm[@id='3']/lrm_resources/lrm_resou
> rce[@id='lustre1']/lrm_rsc_op[@id='lustre1_last_0']: @transition-
> magic=0:7;4:88:7:288b2e10-0bee-498d-b9eb-4bc5f0f8d5bf, @call-id=7,
> @rc-code=7, @op-status=0
> Nov 29 12:50:11 lustre-mgs.ntslab.ru pacemaker-based [2481]
> (log_info) info: +
> /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resou
> rce[@id='lustre1']/lrm_rsc_op[@id='lustre1_last_0']: @transition-
> magic=0:7;5:88:7:288b2e10-0bee-498d-b9eb-4bc5f0f8d5bf, @call-id=7,
> @rc-code=7, @op-status=0
--
Ken Gaillot <kgaillot at redhat.com>