[ClusterLabs] RemoteOFFLINE status, permanently
Artem
tyomikh at gmail.com
Wed Nov 29 04:56:02 EST 2023
Hello,
I deployed a Lustre cluster with 3 metadata nodes as full pacemaker/corosync
members and 4 data nodes as remote agents. Initially all went well: I set up
the MGS and MDS resources, checked failover and failback, and the remote
agents were online.
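For reference, the initial setup was roughly the following (reconstructed
from memory; the scores match the constraint config further below, but the
device paths and mount points are placeholders, not the exact values I used):

  pcs host auth lustre-mgs lustre-mds1 lustre-mds2
  pcs cluster setup cl-lustre lustre-mgs lustre-mds1 lustre-mds2 --start --enable
  pcs resource create MGT ocf:heartbeat:Filesystem device=<mgt-device> directory=<mgt-mountpoint> fstype=lustre
  pcs resource create MDT00 ocf:heartbeat:Filesystem device=<mdt00-device> directory=<mdt00-mountpoint> fstype=lustre
  pcs constraint location MGT prefers lustre-mgs=100 lustre-mds2=50
  pcs constraint location MDT00 prefers lustre-mds1=100 lustre-mds2=100
  pcs constraint order start MGT then start MDT00 kind=Optional
  pcs cluster node add-remote lustre1
  pcs cluster node add-remote lustre2
  pcs cluster node add-remote lustre3
  pcs cluster node add-remote lustre4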
Then I tried to create OST resources on two of the nodes that are remote
agents. I also set a location constraint preference for them, a colocation
constraint (OST1 with OST2, score=-50) and an ordering constraint (MDS then
OST[12]). I then read that colocation and ordering constraints should not be
used with remote agents, so I deleted those constraints. At some stage I also
set reconnect_interval=5s, but then found a bug report advising a higher
value, so I reverted to the default.
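Approximately, that step looked like this (again from memory; device paths,
mount points and constraint ids are placeholders):

  pcs resource create OST1 ocf:heartbeat:Filesystem device=<ost1-device> directory=<ost1-mountpoint> fstype=lustre
  pcs resource create OST2 ocf:heartbeat:Filesystem device=<ost2-device> directory=<ost2-mountpoint> fstype=lustre
  pcs constraint location OST1 prefers lustre1=100
  pcs constraint location OST2 prefers lustre2=100
  pcs constraint colocation add OST2 with OST1 -50
  pcs constraint order start MDT00 then start OST1
  pcs constraint order start MDT00 then start OST2
  # removed again after reading they should not be used with remote nodes:
  pcs constraint remove <colocation-constraint-id> <ordering-constraint-ids>
  # reconnect_interval experiment, later reverted by clearing the attribute:
  pcs resource update lustre1 reconnect_interval=5s
  pcs resource update lustre2 reconnect_interval=5s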
Only then did I check pcs status and notice that the remote agents were
offline. I tried removing and re-adding the remote agents, restarting the
cluster, destroying and recreating it, and rebooting the nodes - nothing
helped: right from the beginning of cluster setup the agents are persistently
RemoteOFFLINE, even before the OST resources are created and preferentially
located on the remote agents (lustre1 and lustre2). I found nothing helpful
in /var/log/pacemaker/pacemaker.log. Please help me investigate and fix this.
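I can provide more output if needed, e.g. from connectivity and authkey
checks like these (assuming the default pacemaker_remote TCP port 3121):

  # is pacemaker-remoted listening and reachable from a cluster node?
  ssh lustre1 'ss -tlnp | grep 3121'
  nc -zv lustre1 3121
  # does /etc/pacemaker/authkey match between cluster and remote nodes?
  md5sum /etc/pacemaker/authkey
  ssh lustre1 'md5sum /etc/pacemaker/authkey'
  # firewall state on the remote node
  ssh lustre1 'firewall-cmd --list-all'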
[root@lustre-mgs ~]# rpm -qa | grep -E "corosync|pacemaker|pcs"
pacemaker-cli-2.1.6-8.el8.x86_64
pacemaker-schemas-2.1.6-8.el8.noarch
pcs-0.10.17-2.el8.x86_64
pacemaker-libs-2.1.6-8.el8.x86_64
corosync-3.1.7-1.el8.x86_64
pacemaker-cluster-libs-2.1.6-8.el8.x86_64
pacemaker-2.1.6-8.el8.x86_64
corosynclib-3.1.7-1.el8.x86_64
[root@lustre-mgs ~]# ssh lustre1 "rpm -qa | grep resource-agents"
resource-agents-4.9.0-49.el8.x86_64
[root@lustre-mgs ~]# pcs status
Cluster name: cl-lustre
Cluster Summary:
  * Stack: corosync (Pacemaker is running)
  * Current DC: lustre-mds1 (version 2.1.6-8.el8-6fdc9deea29) - partition with quorum
  * Last updated: Wed Nov 29 12:40:37 2023 on lustre-mgs
  * Last change: Wed Nov 29 12:11:21 2023 by root via cibadmin on lustre-mgs
  * 7 nodes configured
  * 6 resource instances configured

Node List:
  * Online: [ lustre-mds1 lustre-mds2 lustre-mgs ]
  * RemoteOFFLINE: [ lustre1 lustre2 lustre3 lustre4 ]

Full List of Resources:
  * lustre2 (ocf::pacemaker:remote): Stopped
  * lustre3 (ocf::pacemaker:remote): Stopped
  * lustre4 (ocf::pacemaker:remote): Stopped
  * lustre1 (ocf::pacemaker:remote): Stopped
  * MGT (ocf::heartbeat:Filesystem): Started lustre-mgs
  * MDT00 (ocf::heartbeat:Filesystem): Started lustre-mds1

Daemon Status:
  corosync: active/disabled
  pacemaker: active/enabled
  pcsd: active/enabled
[root@lustre-mgs ~]# pcs cluster verify --full
[root@lustre-mgs ~]#
[root@lustre-mgs ~]# pcs constraint show --full
Warning: This command is deprecated and will be removed. Please use 'pcs constraint config' instead.
Location Constraints:
  Resource: MDT00
    Enabled on:
      Node: lustre-mds1 (score:100) (id:location-MDT00-lustre-mds1-100)
      Node: lustre-mds2 (score:100) (id:location-MDT00-lustre-mds2-100)
  Resource: MGT
    Enabled on:
      Node: lustre-mgs (score:100) (id:location-MGT-lustre-mgs-100)
      Node: lustre-mds2 (score:50) (id:location-MGT-lustre-mds2-50)
Ordering Constraints:
  start MGT then start MDT00 (kind:Optional) (id:order-MGT-MDT00-Optional)
Colocation Constraints:
Ticket Constraints:
[root@lustre-mgs ~]# pcs resource show lustre1
Warning: This command is deprecated and will be removed. Please use 'pcs resource config' instead.
Resource: lustre1 (class=ocf provider=pacemaker type=remote)
  Attributes: lustre1-instance_attributes
    server=lustre1
  Operations:
    migrate_from: lustre1-migrate_from-interval-0s
      interval=0s
      timeout=60s
    migrate_to: lustre1-migrate_to-interval-0s
      interval=0s
      timeout=60s
    monitor: lustre1-monitor-interval-60s
      interval=60s
      timeout=30s
    reload: lustre1-reload-interval-0s
      interval=0s
      timeout=60s
    reload-agent: lustre1-reload-agent-interval-0s
      interval=0s
      timeout=60s
    start: lustre1-start-interval-0s
      interval=0s
      timeout=60s
    stop: lustre1-stop-interval-0s
      interval=0s
      timeout=60s
I also changed some cluster properties and resource defaults (see also the note after this list):
pcs property set stonith-enabled=false
pcs property set symmetric-cluster=false
pcs property set batch-limit=100
pcs resource defaults update resource-stickiness=1000
pcs cluster config update
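One thing I am not sure about: with symmetric-cluster=false the cluster is
opt-in, and as far as I understand every resource - including the
ocf:pacemaker:remote connection resources themselves - then needs an enabling
location constraint before it is allowed to run anywhere. If that is relevant
here, I guess the connection resources would need something like the sketch
below (node list and score 0 are just an example):

  # allow the remote connection resources to run on the full cluster members
  pcs constraint location lustre1 prefers lustre-mgs=0 lustre-mds1=0 lustre-mds2=0
  pcs constraint location lustre2 prefers lustre-mgs=0 lustre-mds1=0 lustre-mds2=0
  pcs constraint location lustre3 prefers lustre-mgs=0 lustre-mds1=0 lustre-mds2=0
  pcs constraint location lustre4 prefers lustre-mgs=0 lustre-mds1=0 lustre-mds2=0

Is that correct, or is it unrelated to the RemoteOFFLINE state?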
[root@lustre-mgs ~]# ssh lustre1 "systemctl status pcsd pacemaker_remote resource-agents-deps.target"
● pcsd.service - PCS GUI and remote configuration interface
   Loaded: loaded (/usr/lib/systemd/system/pcsd.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2023-11-28 19:01:49 MSK; 17h ago
     Docs: man:pcsd(8)
           man:pcs(8)
 Main PID: 1752 (pcsd)
    Tasks: 1 (limit: 408641)
   Memory: 28.0M
   CGroup: /system.slice/pcsd.service
           └─1752 /usr/libexec/platform-python -Es /usr/sbin/pcsd

Nov 28 19:01:49 lustre1.ntslab.ru systemd[1]: Starting PCS GUI and remote configuration interface...
Nov 28 19:01:49 lustre1.ntslab.ru systemd[1]: Started PCS GUI and remote configuration interface.

● pacemaker_remote.service - Pacemaker Remote executor daemon
   Loaded: loaded (/usr/lib/systemd/system/pacemaker_remote.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2023-11-29 11:08:14 MSK; 1h 37min ago
     Docs: man:pacemaker-remoted
           https://clusterlabs.org/pacemaker/doc/
 Main PID: 3040 (pacemaker-remot)
    Tasks: 1
   Memory: 1.4M
   CGroup: /system.slice/pacemaker_remote.service
           └─3040 /usr/sbin/pacemaker-remoted

Nov 29 11:08:14 lustre1.ntslab.ru systemd[1]: Started Pacemaker Remote executor daemon.

● resource-agents-deps.target - resource-agents dependencies
   Loaded: loaded (/usr/lib/systemd/system/resource-agents-deps.target; static; vendor preset: disabled)
   Active: active since Tue 2023-11-28 19:01:47 MSK; 17h ago
Attempt to remove and re-add the remote node:
[root@lustre-mgs ~]# date; pcs cluster node remove-remote lustre1
Wed Nov 29 12:49:59 MSK 2023
Requesting 'pacemaker_remote disable', 'pacemaker_remote stop' on 'lustre1'
lustre1: successful run of 'pacemaker_remote disable'
lustre1: successful run of 'pacemaker_remote stop'
Requesting remove 'pacemaker authkey' from 'lustre1'
lustre1: successful removal of the file 'pacemaker authkey'
Deleting Resource - lustre1
[root@lustre-mgs ~]# date; pcs cluster node add-remote lustre1
Wed Nov 29 12:50:08 MSK 2023
No addresses specified for host 'lustre1', using 'lustre1'
Sending 'pacemaker authkey' to 'lustre1'
lustre1: successful distribution of the file 'pacemaker authkey'
Requesting 'pacemaker_remote enable', 'pacemaker_remote start' on 'lustre1'
lustre1: successful run of 'pacemaker_remote enable'
lustre1: successful run of 'pacemaker_remote start'
[root@lustre-mgs ~]# date; pcs status
Wed Nov 29 12:50:35 MSK 2023
Cluster name: cl-lustre
Cluster Summary:
  * Stack: corosync (Pacemaker is running)
  * Current DC: lustre-mds1 (version 2.1.6-8.el8-6fdc9deea29) - partition with quorum
  * Last updated: Wed Nov 29 12:50:35 2023 on lustre-mgs
  * Last change: Wed Nov 29 12:50:11 2023 by root via cibadmin on lustre-mgs
  * 7 nodes configured
  * 6 resource instances configured

Node List:
  * Online: [ lustre-mds1 lustre-mds2 lustre-mgs ]
  * RemoteOFFLINE: [ lustre1 lustre2 lustre3 lustre4 ]

Full List of Resources:
  * lustre2 (ocf::pacemaker:remote): Stopped
  * lustre3 (ocf::pacemaker:remote): Stopped
  * lustre4 (ocf::pacemaker:remote): Stopped
  * MGT (ocf::heartbeat:Filesystem): Started lustre-mgs
  * MDT00 (ocf::heartbeat:Filesystem): Started lustre-mds1
  * lustre1 (ocf::pacemaker:remote): Stopped

Daemon Status:
  corosync: active/disabled
  pacemaker: active/enabled
  pcsd: active/enabled
[root@lustre-mgs ~]# grep lustre1 /var/log/pacemaker/pacemaker.log
Nov 29 12:50:01 lustre-mgs.ntslab.ru pacemaker-based [2481]
(cib_process_request) info: Forwarding cib_delete operation for section
//primitive[@id='lustre1'] to all (origin=local/cibadmin/2)
Nov 29 12:50:01 lustre-mgs.ntslab.ru pacemaker-based [2481] (log_info)
info: -- /cib/configuration/resources/primitive[@id='lustre1']
Nov 29 12:50:01 lustre-mgs.ntslab.ru pacemaker-based [2481]
(cib_process_request) info: Completed cib_delete operation for section
//primitive[@id='lustre1']: OK (rc=0, origin=lustre-mgs/cibadmin/2,
version=0.25.0)
Nov 29 12:50:01 lustre-mgs.ntslab.ru pacemaker-fenced [2482]
(stonith_device_remove) info: Device 'lustre1' not found (0 active
devices)
Nov 29 12:50:01 lustre-mgs.ntslab.ru pacemaker-based [2481] (log_info)
info: --
/cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resource[@id='lustre1']
Nov 29 12:50:01 lustre-mgs.ntslab.ru pacemaker-based [2481]
(cib_process_request) info: Completed cib_delete operation for section
//node_state[@uname='lustre-mds1']/lrm/lrm_resources/lrm_resource[@id='lustre1']:
OK (rc=0, origin=lustre-mds1/crmd/157, version=0.25.0)
Nov 29 12:50:01 lustre-mgs.ntslab.ru pacemaker-based [2481] (log_info)
info: --
/cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resource[@id='lustre1']
Nov 29 12:50:01 lustre-mgs.ntslab.ru pacemaker-based [2481]
(cib_process_request) info: Completed cib_delete operation for section
//node_state[@uname='lustre-mds1']/lrm/lrm_resources/lrm_resource[@id='lustre1']:
OK (rc=0, origin=lustre-mds1/crmd/158, version=0.25.1)
Nov 29 12:50:01 lustre-mgs.ntslab.ru pacemaker-based [2481]
(cib_process_request) info: Forwarding cib_delete operation for section
//node_state[@uname='lustre-mgs']/lrm/lrm_resources/lrm_resource[@id='lustre1']
to all (origin=local/crmd/39)
Nov 29 12:50:01 lustre-mgs.ntslab.ru pacemaker-based [2481] (log_info)
info: --
/cib/status/node_state[@id='3']/lrm[@id='3']/lrm_resources/lrm_resource[@id='lustre1']
Nov 29 12:50:01 lustre-mgs.ntslab.ru pacemaker-based [2481]
(cib_process_request) info: Completed cib_delete operation for section
//node_state[@uname='lustre-mds2']/lrm/lrm_resources/lrm_resource[@id='lustre1']:
OK (rc=0, origin=lustre-mds2/crmd/35, version=0.25.1)
Nov 29 12:50:01 lustre-mgs.ntslab.ru pacemaker-based [2481] (log_info)
info: --
/cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='lustre1']
Nov 29 12:50:01 lustre-mgs.ntslab.ru pacemaker-based [2481]
(cib_process_request) info: Completed cib_delete operation for section
//node_state[@uname='lustre-mgs']/lrm/lrm_resources/lrm_resource[@id='lustre1']:
OK (rc=0, origin=lustre-mgs/crmd/39, version=0.25.1)
Nov 29 12:50:01 lustre-mgs.ntslab.ru pacemaker-controld [2486]
(delete_resource) info: Removing resource lustre1 from executor for
tengine
Nov 29 12:50:01 lustre-mgs.ntslab.ru pacemaker-controld [2486]
(controld_delete_resource_history) info: Clearing resource history for
lustre1 on lustre-mgs (via CIB call 40) |
xpath=//node_state[@uname='lustre-mgs']/lrm/lrm_resources/lrm_resource[@id='lustre1']
Nov 29 12:50:01 lustre-mgs.ntslab.ru pacemaker-controld [2486]
(notify_deleted) info: Notifying tengine on lustre-mds1 that lustre1
was deleted
Nov 29 12:50:01 lustre-mgs.ntslab.ru pacemaker-based [2481] (log_info)
info: --
/cib/status/node_state[@id='3']/lrm[@id='3']/lrm_resources/lrm_resource[@id='lustre1']
Nov 29 12:50:01 lustre-mgs.ntslab.ru pacemaker-based [2481]
(cib_process_request) info: Completed cib_delete operation for section
//node_state[@uname='lustre-mds2']/lrm/lrm_resources/lrm_resource[@id='lustre1']:
OK (rc=0, origin=lustre-mds2/crmd/36, version=0.25.2)
Nov 29 12:50:01 lustre-mgs.ntslab.ru pacemaker-based [2481]
(cib_process_request) info: Forwarding cib_delete operation for section
//node_state[@uname='lustre-mgs']/lrm/lrm_resources/lrm_resource[@id='lustre1']
to all (origin=local/crmd/40)
Nov 29 12:50:01 lustre-mgs.ntslab.ru pacemaker-based [2481] (log_info)
info: --
/cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='lustre1']
Nov 29 12:50:01 lustre-mgs.ntslab.ru pacemaker-based [2481]
(cib_process_request) info: Completed cib_delete operation for section
//node_state[@uname='lustre-mgs']/lrm/lrm_resources/lrm_resource[@id='lustre1']:
OK (rc=0, origin=lustre-mgs/crmd/40, version=0.25.3)
Nov 29 12:50:03 lustre-mgs.ntslab.ru pacemaker-controld [2486]
(reap_crm_member) info: No peers with id=0 and/or uname=lustre1 to
purge from the membership cache
Nov 29 12:50:03 lustre-mgs.ntslab.ru pacemaker-fenced [2482]
(reap_crm_member) info: No peers with id=0 and/or uname=lustre1 to
purge from the membership cache
Nov 29 12:50:03 lustre-mgs.ntslab.ru pacemaker-attrd [2484]
(attrd_client_peer_remove) info: Client
e1142409-f793-4839-a938-f512958a925e is requesting all values for lustre1
be removed
Nov 29 12:50:03 lustre-mgs.ntslab.ru pacemaker-attrd [2484]
(attrd_peer_remove) notice: Removing all lustre1 attributes for peer
lustre-mgs
Nov 29 12:50:03 lustre-mgs.ntslab.ru pacemaker-attrd [2484]
(reap_crm_member) info: No peers with id=0 and/or uname=lustre1 to
purge from the membership cache
Nov 29 12:50:11 lustre-mgs.ntslab.ru pacemaker-based [2481] (log_info)
info: ++ /cib/configuration/resources: <primitive class="ocf"
id="lustre1" provider="pacemaker" type="remote"/>
Nov 29 12:50:11 lustre-mgs.ntslab.ru pacemaker-based [2481] (log_info)
info: ++ <instance_attributes
id="lustre1-instance_attributes">
Nov 29 12:50:11 lustre-mgs.ntslab.ru pacemaker-based [2481] (log_info)
info: ++ <nvpair
id="lustre1-instance_attributes-server" name="server" value="lustre1"/>
Nov 29 12:50:11 lustre-mgs.ntslab.ru pacemaker-based [2481] (log_info)
info: ++ <op
id="lustre1-migrate_from-interval-0s" interval="0s" name="migrate_from"
timeout="60s"/>
Nov 29 12:50:11 lustre-mgs.ntslab.ru pacemaker-based [2481] (log_info)
info: ++ <op
id="lustre1-migrate_to-interval-0s" interval="0s" name="migrate_to"
timeout="60s"/>
Nov 29 12:50:11 lustre-mgs.ntslab.ru pacemaker-based [2481] (log_info)
info: ++ <op
id="lustre1-monitor-interval-60s" interval="60s" name="monitor"
timeout="30s"/>
Nov 29 12:50:11 lustre-mgs.ntslab.ru pacemaker-based [2481] (log_info)
info: ++ <op
id="lustre1-reload-interval-0s" interval="0s" name="reload" timeout="60s"/>
Nov 29 12:50:11 lustre-mgs.ntslab.ru pacemaker-based [2481] (log_info)
info: ++ <op
id="lustre1-reload-agent-interval-0s" interval="0s" name="reload-agent"
timeout="60s"/>
Nov 29 12:50:11 lustre-mgs.ntslab.ru pacemaker-based [2481] (log_info)
info: ++ <op
id="lustre1-start-interval-0s" interval="0s" name="start" timeout="60s"/>
Nov 29 12:50:11 lustre-mgs.ntslab.ru pacemaker-based [2481] (log_info)
info: ++ <op
id="lustre1-stop-interval-0s" interval="0s" name="stop" timeout="60s"/>
Nov 29 12:50:11 lustre-mgs.ntslab.ru pacemaker-execd [2483]
(process_lrmd_get_rsc_info) info: Agent information for 'lustre1' not
in cache
Nov 29 12:50:11 lustre-mgs.ntslab.ru pacemaker-controld [2486]
(do_lrm_rsc_op) notice: Requesting local execution of probe
operation for lustre1 on lustre-mgs |
transition_key=5:88:7:288b2e10-0bee-498d-b9eb-4bc5f0f8d5bf
op_key=lustre1_monitor_0
Nov 29 12:50:11 lustre-mgs.ntslab.ru pacemaker-controld [2486]
(log_executor_event) notice: Result of probe operation for lustre1 on
lustre-mgs: not running (Remote connection inactive) | graph action
confirmed; call=7 key=lustre1_monitor_0 rc=7
Nov 29 12:50:11 lustre-mgs.ntslab.ru pacemaker-based [2481] (log_info)
info: ++ /cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources:
<lrm_resource id="lustre1" class="ocf" provider="pacemaker" type="remote"/>
Nov 29 12:50:11 lustre-mgs.ntslab.ru pacemaker-based [2481] (log_info)
info: ++
<lrm_rsc_op id="lustre1_last_0" operation_key="lustre1_monitor_0"
operation="monitor" crm-debug-origin="controld_update_resource_history"
crm_feature_set="3.17.4"
transition-key="3:88:7:288b2e10-0bee-498d-b9eb-4bc5f0f8d5bf"
transition-magic="-1:193;3:88:7:288b2e10-0bee-498d-b9eb-4bc5f0f8d5bf"
exit-reason="" on_node="lustre-mds1" call-id="-1" rc-code="193" op-st
Nov 29 12:50:11 lustre-mgs.ntslab.ru pacemaker-based [2481] (log_info)
info: +
/cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resource[@id='lustre1']/lrm_rsc_op[@id='lustre1_last_0']:
@transition-magic=0:7;3:88:7:288b2e10-0bee-498d-b9eb-4bc5f0f8d5bf,
@call-id=7, @rc-code=7, @op-status=0
Nov 29 12:50:11 lustre-mgs.ntslab.ru pacemaker-based [2481] (log_info)
info: ++ /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources:
<lrm_resource id="lustre1" class="ocf" provider="pacemaker" type="remote"/>
Nov 29 12:50:11 lustre-mgs.ntslab.ru pacemaker-based [2481] (log_info)
info: ++
<lrm_rsc_op id="lustre1_last_0" operation_key="lustre1_monitor_0"
operation="monitor" crm-debug-origin="controld_update_resource_history"
crm_feature_set="3.17.4"
transition-key="5:88:7:288b2e10-0bee-498d-b9eb-4bc5f0f8d5bf"
transition-magic="-1:193;5:88:7:288b2e10-0bee-498d-b9eb-4bc5f0f8d5bf"
exit-reason="" on_node="lustre-mgs" call-id="-1" rc-code="193" op-sta
Nov 29 12:50:11 lustre-mgs.ntslab.ru pacemaker-based [2481] (log_info)
info: ++ /cib/status/node_state[@id='3']/lrm[@id='3']/lrm_resources:
<lrm_resource id="lustre1" class="ocf" provider="pacemaker" type="remote"/>
Nov 29 12:50:11 lustre-mgs.ntslab.ru pacemaker-based [2481] (log_info)
info: ++
<lrm_rsc_op id="lustre1_last_0" operation_key="lustre1_monitor_0"
operation="monitor" crm-debug-origin="controld_update_resource_history"
crm_feature_set="3.17.4"
transition-key="4:88:7:288b2e10-0bee-498d-b9eb-4bc5f0f8d5bf"
transition-magic="-1:193;4:88:7:288b2e10-0bee-498d-b9eb-4bc5f0f8d5bf"
exit-reason="" on_node="lustre-mds2" call-id="-1" rc-code="193" op-st
Nov 29 12:50:11 lustre-mgs.ntslab.ru pacemaker-based [2481] (log_info)
info: +
/cib/status/node_state[@id='3']/lrm[@id='3']/lrm_resources/lrm_resource[@id='lustre1']/lrm_rsc_op[@id='lustre1_last_0']:
@transition-magic=0:7;4:88:7:288b2e10-0bee-498d-b9eb-4bc5f0f8d5bf,
@call-id=7, @rc-code=7, @op-status=0
Nov 29 12:50:11 lustre-mgs.ntslab.ru pacemaker-based [2481] (log_info)
info: +
/cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='lustre1']/lrm_rsc_op[@id='lustre1_last_0']:
@transition-magic=0:7;5:88:7:288b2e10-0bee-498d-b9eb-4bc5f0f8d5bf,
@call-id=7, @rc-code=7, @op-status=0