[ClusterLabs] [Question] About movement of pacemaker_remote.
renayama19661014 at ybb.ne.jp
Mon Mar 16 00:48:14 UTC 2015
Hi David,

Thank you for your comments.
> Pacemaker is attempting to restore connection to the remote node here, are you
> sure the remote is accessible? The "Timed Out" error means that pacemaker was
> unable to establish the connection during the timeout period.
I extended the timeout to 60 seconds, but the result was the same.
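One way the longer timeout can be expressed in crm syntax is on the start operation of the connection resource; a minimal sketch (the "op start" line is the assumed change, the other lines match the snmp2 primitive shown later in this mail):

primitive snmp2 ocf:pacemaker:remote \
params \
server="snmp2" \
op start interval="0s" timeout="60s" \
op monitor interval="3s" timeout="15s" \
op stop interval="0s" timeout="60s" on-fail="stop"

With the 60-second timeout in place, the start still fails as below.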
Mar 16 09:36:13 sl7-01 crmd[5480]: notice: te_rsc_command: Initiating action 7: probe_complete probe_complete-sl7-01 on sl7-01 (local) - no waiting
Mar 16 09:36:13 sl7-01 crmd[5480]: info: te_rsc_command: Action 7 confirmed - no wait
Mar 16 09:36:13 sl7-01 crmd[5480]: notice: te_rsc_command: Initiating action 20: start snmp2_start_0 on sl7-01 (local)
Mar 16 09:36:13 sl7-01 crmd[5480]: info: do_lrm_rsc_op: Performing key=20:9:0:421ed4a6-85e6-4be8-8c22-d5ff051fc00a op=snmp2_start_0
Mar 16 09:36:13 sl7-01 crmd[5480]: info: crm_remote_tcp_connect_async: Attempting to connect to remote server at 192.168.40.110:3121
Mar 16 09:36:13 sl7-01 cib[5475]: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=sl7-01/crmd/77, version=0.6.1)
Mar 16 09:36:13 sl7-01 cib[5648]: info: cib_file_write_with_digest: Wrote version 0.6.0 of the CIB to disk (digest: afb3e14165d2f8318bac8f1027a49338)
Mar 16 09:36:13 sl7-01 cib[5648]: info: cib_file_read_and_verify: Reading cluster configuration file /var/lib/pacemaker/cib/cib.D4k0LO
Mar 16 09:36:13 sl7-01 cib[5648]: info: cib_file_read_and_verify: Verifying cluster configuration signature from /var/lib/pacemaker/cib/cib.zZMn4s
Mar 16 09:36:13 sl7-01 crmd[5480]: info: lrmd_tcp_connect_cb: Remote lrmd client TLS connection established with server snmp2:3121
Mar 16 09:36:13 sl7-01 crmd[5480]: error: lrmd_tls_recv_reply: Unable to receive expected reply, disconnecting.
Mar 16 09:36:13 sl7-01 crmd[5480]: error: lrmd_tls_send_recv: Remote lrmd server disconnected while waiting for reply with id 63.
Mar 16 09:36:13 sl7-01 crmd[5480]: info: lrmd_tls_connection_destroy: TLS connection destroyed
Mar 16 09:36:13 sl7-01 crmd[5480]: info: lrmd_api_disconnect: Disconnecting from lrmd service
Mar 16 09:36:14 sl7-01 crmd[5480]: info: crm_remote_tcp_connect_async: Attempting to connect to remote server at 192.168.40.110:3121
Mar 16 09:36:15 sl7-01 crmd[5480]: info: lrmd_tcp_connect_cb: Remote lrmd client TLS connection established with server snmp2:3121
Mar 16 09:36:15 sl7-01 crmd[5480]: error: lrmd_tls_recv_reply: Unable to receive expected reply, disconnecting.
Mar 16 09:36:15 sl7-01 crmd[5480]: error: lrmd_tls_send_recv: Remote lrmd server disconnected while waiting for reply with id 64.
(snip)
Mar 16 09:37:11 sl7-01 crmd[5480]: info: lrmd_api_disconnect: Disconnecting from lrmd service
Mar 16 09:37:11 sl7-01 crmd[5480]: info: action_synced_wait: Managed remote_meta-data_0 process 5675 exited with rc=0
Mar 16 09:37:11 sl7-01 crmd[5480]: error: process_lrm_event: Operation snmp2_start_0: Timed Out (node=sl7-01, call=8, timeout=60000ms)
(snip)
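Since the TLS connection is established and then dropped right away, one additional thing that can be checked is whether the pacemaker_remote authkey matches between the cluster node and the remote node. A minimal check, assuming the default key location /etc/pacemaker/authkey on both hosts:

[root at sl7-01 ~]# md5sum /etc/pacemaker/authkey
[root at snmp2 ~]# md5sum /etc/pacemaker/authkey

(the two digests should be identical)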
On the node where pacemaker_remoted was restarted, pacemaker_remoted appears to be listening correctly.
[root at snmp2 ~]# netstat -an | more
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN
tcp 0 0 192.168.40.110:22 192.168.40.1:40510 ESTABLISHED
tcp 0 0 :::3121 :::* LISTEN
tcp 0 0 :::22 :::* LISTEN
udp 0 0 0.0.0.0:631 0.0.0.0:*
(snip)
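The port is also reachable from the cluster node (the log above shows the TLS connection being established). For example, a simple TCP probe, assuming nc is installed on sl7-01:

[root at sl7-01 ~]# nc -zv 192.168.40.110 3121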
> Maybe this is expected. It's impossible to tell without looking at the
> configuration in more detail.
We use the following crm configuration file.
Is there any other information you need?
-----------------------------
property no-quorum-policy="ignore" \
stonith-enabled="false" \
startup-fencing="false"
rsc_defaults resource-stickiness="INFINITY" \
migration-threshold="1"
primitive snmp1 ocf:pacemaker:remote \
params \
server="snmp1" \
op monitor interval="3s" timeout="15s" \
op stop interval="0s" timeout="60s" on-fail="ignore"
primitive snmp2 ocf:pacemaker:remote \
params \
server="snmp2" \
op monitor interval="3s" timeout="15s" \
op stop interval="0s" timeout="60s" on-fail="stop"
primitive Host-rsc1 ocf:heartbeat:Dummy \
op start interval="0s" timeout="60s" on-fail="restart" \
op monitor interval="10s" timeout="60s" on-fail="restart" \
op stop interval="0s" timeout="60s" on-fail="ignore"
primitive Remote-rsc1 ocf:heartbeat:Dummy \
op start interval="0s" timeout="60s" on-fail="restart" \
op monitor interval="10s" timeout="60s" on-fail="restart" \
op stop interval="0s" timeout="60s" on-fail="ignore"
primitive Remote-rsc2 ocf:heartbeat:Dummy \
op start interval="0s" timeout="60s" on-fail="restart" \
op monitor interval="10s" timeout="60s" on-fail="restart" \
op stop interval="0s" timeout="60s" on-fail="ignore"
location loc1 Remote-rsc1 \
rule 200: #uname eq snmp1 \
rule 100: #uname eq snmp2
location loc2 Remote-rsc2 \
rule 200: #uname eq snmp2 \
rule 100: #uname eq snmp1
location loc3 Host-rsc1 \
rule 200: #uname eq sl7-01
-----------------------------
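Name resolution for the server= values works from the cluster node (the log above shows the connection attempt going to 192.168.40.110 for snmp2). If useful, it can be double-checked with a standard lookup, for example:

[root at sl7-01 ~]# getent hosts snmp2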
Best Regards,
Hideo Yamauchi.
----- Original Message -----
> From: David Vossel <dvossel at redhat.com>
> To: renayama19661014 at ybb.ne.jp; Cluster Labs - All topics related to open-source clustering welcomed <users at clusterlabs.org>
> Cc:
> Date: 2015/3/14, Sat 08:14
> Subject: Re: [ClusterLabs] [Question] About movement of pacemaker_remote.
>
>
>
> ----- Original Message -----
>> Hi All,
>>
>> We are verifying the behavior of pacemaker_remote. (stonith is disabled.)
>>
>> There are two questions.
>>
>> * Question 1: pacemaker_remote does not recover from a failure. Is this the
>> expected behavior?
>>
>> - Step1 - Start a cluster.
>> -----------------------
>> [root at sl7-01 ~]# crm_mon -1 -Af
>> Last updated: Thu Mar 12 14:25:05 2015
>> Last change: Thu Mar 12 14:24:31 2015
>> Stack: corosync
>> Current DC: sl7-01 (2130706433) - partition WITHOUT quorum
>> Version: 1.1.12-ce09802
>> 3 Nodes configured
>> 5 Resources configured
>>
>>
>> Online: [ sl7-01 ]
>> RemoteOnline: [ snmp1 snmp2 ]
>>
>> Host-rsc1 (ocf::heartbeat:Dummy): Started sl7-01
>> Remote-rsc1 (ocf::heartbeat:Dummy): Started snmp1
>> Remote-rsc2 (ocf::heartbeat:Dummy): Started snmp2
>> snmp1 (ocf::pacemaker:remote): Started sl7-01
>> snmp2 (ocf::pacemaker:remote): Started sl7-01
>>
>> -----------------------
>>
>> - Step2 - Cause pacemaker_remoted to fail.
>>
>> -----------------------
>> [root at snmp2 ~]# /usr/sbin/pacemaker_remoted &
>> [1] 24202
>> [root at snmp2 ~]# kill -TERM 24202
>>
>> [root at sl7-01 ~]# crm_mon -1 -Af
>> Last updated: Thu Mar 12 14:25:55 2015
>> Last change: Thu Mar 12 14:24:31 2015
>> Stack: corosync
>> Current DC: sl7-01 (2130706433) - partition WITHOUT quorum
>> Version: 1.1.12-ce09802
>> 3 Nodes configured
>> 5 Resources configured
>>
>>
>> Online: [ sl7-01 ]
>> RemoteOnline: [ snmp1 ]
>> RemoteOFFLINE: [ snmp2 ]
>>
>> Host-rsc1 (ocf::heartbeat:Dummy): Started sl7-01
>> Remote-rsc1 (ocf::heartbeat:Dummy): Started snmp1
>> snmp1 (ocf::pacemaker:remote): Started sl7-01
>> snmp2 (ocf::pacemaker:remote): FAILED sl7-01
>>
>> Migration summary:
>> * Node sl7-01:
>> snmp2: migration-threshold=1 fail-count=1 last-failure='Thu Mar 12
>> 14:25:40 2015'
>> * Node snmp1:
>>
>> Failed actions:
>> snmp2_monitor_3000 on sl7-01 'unknown error' (1): call=6, status=Error,
>> exit-reason='none', last-rc-change='Thu Mar 12 14:25:40 2015',
>> queued=0ms, exec=0ms
>> -----------------------
>>
>> - Step3 - Restart pacemaker_remoted and clean up the remote resource, but the
>> node remains offline.
>>
>> -----------------------
>> [root at snmp2 ~]# /usr/sbin/pacemaker_remoted &
>> [2] 24248
>>
>> [root at sl7-01 ~]# crm_resource -C -r snmp2
>> Cleaning up snmp2 on sl7-01
>> Cleaning up snmp2 on snmp1
>> Waiting for 1 replies from the CRMd. OK
>>
>>
>> [root at sl7-01 ~]# crm_mon -1 -Af
>> Last updated: Thu Mar 12 14:26:46 2015
>> Last change: Thu Mar 12 14:26:26 2015
>> Stack: corosync
>> Current DC: sl7-01 (2130706433) - partition WITHOUT quorum
>> Version: 1.1.12-ce09802
>> 3 Nodes configured
>> 5 Resources configured
>>
>>
>> Online: [ sl7-01 ]
>> RemoteOnline: [ snmp1 ]
>> RemoteOFFLINE: [ snmp2 ]
>>
>> Host-rsc1 (ocf::heartbeat:Dummy): Started sl7-01
>> Remote-rsc1 (ocf::heartbeat:Dummy): Started snmp1
>> snmp1 (ocf::pacemaker:remote): Started sl7-01
>> snmp2 (ocf::pacemaker:remote): FAILED sl7-01
>>
>> Migration summary:
>> * Node sl7-01:
>> snmp2: migration-threshold=1 fail-count=1000000 last-failure='Thu Mar 12
>> 14:26:44 2015'
>> * Node snmp1:
>>
>> Failed actions:
>> snmp2_start_0 on sl7-01 'unknown error' (1): call=8, status=Timed Out,
>> exit-reason='none', last-rc-change='Thu Mar 12 14:26:26 2015',
>> queued=0ms, exec=0ms
>> snmp2_start_0 on sl7-01 'unknown error' (1): call=8, status=Timed Out,
>> exit-reason='none', last-rc-change='Thu Mar 12 14:26:26 2015',
>> queued=0ms, exec=0ms
>> -----------------------
>
> Pacemaker is attempting to restore connection to the remote node here, are you
> sure the remote is accessible? The "Timed Out" error means that pacemaker was
> unable to establish the connection during the timeout period.
>
>>
>> * Question 2: When pacemaker_remoted fails in a configuration where stonith is
>> disabled, is the following procedure the correct way to move the resource?
>> Is there a procedure to move it without deleting the node?
>>
>> - Step1 - Start a cluster.
>> -----------------------
>> [root at sl7-01 ~]# crm_mon -1 -Af
>> Last updated: Thu Mar 12 14:30:27 2015
>> Last change: Thu Mar 12 14:29:14 2015
>> Stack: corosync
>> Current DC: sl7-01 (2130706433) - partition WITHOUT quorum
>> Version: 1.1.12-ce09802
>> 3 Nodes configured
>> 5 Resources configured
>>
>>
>> Online: [ sl7-01 ]
>> RemoteOnline: [ snmp1 snmp2 ]
>>
>> Host-rsc1 (ocf::heartbeat:Dummy): Started sl7-01
>> Remote-rsc1 (ocf::heartbeat:Dummy): Started snmp1
>> Remote-rsc2 (ocf::heartbeat:Dummy): Started snmp2
>> snmp1 (ocf::pacemaker:remote): Started sl7-01
>> snmp2 (ocf::pacemaker:remote): Started sl7-01
>>
>> -----------------------
>>
>> - Step2 - Cause pacemaker_remoted to fail.
>> -----------------------
>> [root at snmp2 ~]# kill -TERM 24248
>>
>> [root at sl7-01 ~]# crm_mon -1 -Af
>> Last updated: Thu Mar 12 14:31:59 2015
>> Last change: Thu Mar 12 14:29:14 2015
>> Stack: corosync
>> Current DC: sl7-01 (2130706433) - partition WITHOUT quorum
>> Version: 1.1.12-ce09802
>> 3 Nodes configured
>> 5 Resources configured
>>
>>
>> Online: [ sl7-01 ]
>> RemoteOnline: [ snmp1 ]
>> RemoteOFFLINE: [ snmp2 ]
>>
>> Host-rsc1 (ocf::heartbeat:Dummy): Started sl7-01
>> Remote-rsc1 (ocf::heartbeat:Dummy): Started snmp1
>> snmp1 (ocf::pacemaker:remote): Started sl7-01
>> snmp2 (ocf::pacemaker:remote): FAILED sl7-01
>>
>> Migration summary:
>> * Node sl7-01:
>> snmp2: migration-threshold=1 fail-count=1 last-failure='Thu Mar 12
>> 14:31:42 2015'
>> * Node snmp1:
>>
>> Failed actions:
>> snmp2_monitor_3000 on sl7-01 'unknown error' (1): call=6, status=Error,
>> exit-reason='none', last-rc-change='Thu Mar 12 14:31:42 2015',
>> queued=0ms, exec=0ms
>> -----------------------
>>
>> - Step3 - We delete the failed node. Then the resource moves.
>
> Maybe this is expected. It's impossible to tell without looking at the
> configuration in more detail.
>
> -- David
>
>>
>> -----------------------
>> [root at sl7-01 ~]# crm
>> crm(live)# node
>> crm(live)node# delete snmp2
>> INFO: node snmp2 deleted
>>
>> [root at sl7-01 ~]# crm_mon -1 -Af
>> Last updated: Thu Mar 12 14:35:00 2015
>> Last change: Thu Mar 12 14:34:20 2015
>> Stack: corosync
>> Current DC: sl7-01 (2130706433) - partition WITHOUT quorum
>> Version: 1.1.12-ce09802
>> 3 Nodes configured
>> 5 Resources configured
>>
>>
>> Online: [ sl7-01 ]
>> RemoteOnline: [ snmp1 ]
>> RemoteOFFLINE: [ snmp2 ]
>>
>> Host-rsc1 (ocf::heartbeat:Dummy): Started sl7-01
>> Remote-rsc1 (ocf::heartbeat:Dummy): Started snmp1
>> Remote-rsc2 (ocf::heartbeat:Dummy): Started snmp1
>> snmp1 (ocf::pacemaker:remote): Started sl7-01
>>
>> Migration summary:
>> * Node sl7-01:
>> snmp2: migration-threshold=1 fail-count=1 last-failure='Thu Mar 12
>> 14:51:44 2015'
>> * Node snmp1:
>>
>> Failed actions:
>> snmp2_monitor_3000 on sl7-01 'unknown error' (1): call=6, status=Error,
>> exit-reason='none', last-rc-change='Thu Mar 12 14:51:44 2015',
>> queued=0ms, exec=0ms
>> -----------------------
>>
>> Best Regards,
>> Hideo Yamauchi.
>>
>>
>> _______________________________________________
>> Users mailing list: Users at clusterlabs.org
>> http://clusterlabs.org/mailman/listinfo/users
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>>
>