[ClusterLabs] [Question] About movement of pacemaker_remote.

renayama19661014 at ybb.ne.jp renayama19661014 at ybb.ne.jp
Thu Mar 26 00:08:40 UTC 2015


Hi David,

Is there a mistake in my configuration?
Can this behavior of pacemaker_remote be improved through settings?
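
For example, I wonder whether a reconnect setting on the remote connection
resource would help. The following is only a sketch of the idea; I am not sure
whether the ocf:pacemaker:remote agent in this version supports the
reconnect_interval parameter, so please treat it as an assumption:

-----------------------------
primitive snmp2 ocf:pacemaker:remote \
        params \
                server="snmp2" \
                reconnect_interval="30s" \
        op monitor interval="3s" timeout="15s" \
        op stop interval="0s" timeout="60s" on-fail="stop"
-----------------------------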

Best Regards,
Hideo Yamauchi.



----- Original Message -----
> From: "renayama19661014 at ybb.ne.jp" <renayama19661014 at ybb.ne.jp>
> To: Cluster Labs - All topics related to open-source clustering welcomed <users at clusterlabs.org>
> Cc: 
> Date: 2015/3/16, Mon 09:48
> Subject: Re: [ClusterLabs] [Question] About movement of pacemaker_remote.
> 
> Hi David,
> 
> Thank you for comments.
> 
> 
>>  Pacemaker is attempting to restore connection to the remote node here, are
>>  you sure the remote is accessible? The "Timed Out" error means that
>>  pacemaker was unable to establish the connection during the timeout period.
> 
> 
> I extended the timeout to 60 seconds, but the result was the same.
> 
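> For reference, setting a 60-second start timeout on the connection resource
> would look roughly like this in crm syntax (a sketch only, not necessarily
> the exact lines we used):
> 
> -----------------------------
> primitive snmp2 ocf:pacemaker:remote \
>         params \
>                 server="snmp2" \
>         op start interval="0s" timeout="60s" \
>         op monitor interval="3s" timeout="15s" \
>         op stop interval="0s" timeout="60s" on-fail="stop"
> -----------------------------
> 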
> Mar 16 09:36:13 sl7-01 crmd[5480]: notice: te_rsc_command: Initiating action 7: probe_complete probe_complete-sl7-01 on sl7-01 (local) - no waiting
> Mar 16 09:36:13 sl7-01 crmd[5480]: info: te_rsc_command: Action 7 confirmed - no wait
> Mar 16 09:36:13 sl7-01 crmd[5480]: notice: te_rsc_command: Initiating action 20: start snmp2_start_0 on sl7-01 (local)
> Mar 16 09:36:13 sl7-01 crmd[5480]: info: do_lrm_rsc_op: Performing key=20:9:0:421ed4a6-85e6-4be8-8c22-d5ff051fc00a op=snmp2_start_0
> Mar 16 09:36:13 sl7-01 crmd[5480]: info: crm_remote_tcp_connect_async: Attempting to connect to remote server at 192.168.40.110:3121
> Mar 16 09:36:13 sl7-01 cib[5475]: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=sl7-01/crmd/77, version=0.6.1)
> Mar 16 09:36:13 sl7-01 cib[5648]: info: cib_file_write_with_digest: Wrote version 0.6.0 of the CIB to disk (digest: afb3e14165d2f8318bac8f1027a49338)
> Mar 16 09:36:13 sl7-01 cib[5648]: info: cib_file_read_and_verify: Reading cluster configuration file /var/lib/pacemaker/cib/cib.D4k0LO
> Mar 16 09:36:13 sl7-01 cib[5648]: info: cib_file_read_and_verify: Verifying cluster configuration signature from /var/lib/pacemaker/cib/cib.zZMn4s
> Mar 16 09:36:13 sl7-01 crmd[5480]: info: lrmd_tcp_connect_cb: Remote lrmd client TLS connection established with server snmp2:3121
> Mar 16 09:36:13 sl7-01 crmd[5480]: error: lrmd_tls_recv_reply: Unable to receive expected reply, disconnecting.
> Mar 16 09:36:13 sl7-01 crmd[5480]: error: lrmd_tls_send_recv: Remote lrmd server disconnected while waiting for reply with id 63.
> Mar 16 09:36:13 sl7-01 crmd[5480]: info: lrmd_tls_connection_destroy: TLS connection destroyed
> Mar 16 09:36:13 sl7-01 crmd[5480]: info: lrmd_api_disconnect: Disconnecting from lrmd service
> Mar 16 09:36:14 sl7-01 crmd[5480]: info: crm_remote_tcp_connect_async: Attempting to connect to remote server at 192.168.40.110:3121
> Mar 16 09:36:15 sl7-01 crmd[5480]: info: lrmd_tcp_connect_cb: Remote lrmd client TLS connection established with server snmp2:3121
> Mar 16 09:36:15 sl7-01 crmd[5480]: error: lrmd_tls_recv_reply: Unable to receive expected reply, disconnecting.
> Mar 16 09:36:15 sl7-01 crmd[5480]: error: lrmd_tls_send_recv: Remote lrmd server disconnected while waiting for reply with id 64.
> (snip)
> Mar 16 09:37:11 sl7-01 crmd[5480]: info: lrmd_api_disconnect: Disconnecting from lrmd service
> Mar 16 09:37:11 sl7-01 crmd[5480]: info: action_synced_wait: Managed remote_meta-data_0 process 5675 exited with rc=0
> Mar 16 09:37:11 sl7-01 crmd[5480]: error: process_lrm_event: Operation snmp2_start_0: Timed Out (node=sl7-01, call=8, timeout=60000ms)
> (snip)
> 
> On the node where pacemaker_remoted was restarted, pacemaker_remoted appears
> to be listening correctly.
> 
> [root at snmp2 ~]# netstat -an | more
> Active Internet connections (servers and established)
> Proto Recv-Q Send-Q Local Address               Foreign Address             State
> tcp        0      0 0.0.0.0:22                  0.0.0.0:*                   LISTEN
> tcp        0      0 192.168.40.110:22           192.168.40.1:40510          ESTABLISHED
> tcp        0      0 :::3121                     :::*                        LISTEN
> tcp        0      0 :::22                       :::*                        LISTEN
> udp        0      0 0.0.0.0:631                 0.0.0.0:*
> (snip)
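> 
> As an additional check, reachability and the TLS key could also be verified
> from the cluster node side. This is only a generic suggestion; it assumes a
> netcat that supports -z and the default authkey path /etc/pacemaker/authkey
> on both machines:
> 
> [root at sl7-01 ~]# nc -zv 192.168.40.110 3121     # is the TCP port reachable?
> [root at sl7-01 ~]# md5sum /etc/pacemaker/authkey   # compare with the same file on snmp2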
> 
>>  Maybe this is expected. It's impossible to tell without looking at the
>>  configuration in more detail.
> 
> 
> We are using the following crm configuration file.
> Is there any other information you need?
> 
> -----------------------------
> property no-quorum-policy="ignore" \
>         stonith-enabled="false" \
>         startup-fencing="false"
> 
> rsc_defaults resource-stickiness="INFINITY" \
>         migration-threshold="1"
> 
> primitive snmp1 ocf:pacemaker:remote \
>         params \
>                 server="snmp1" \
>         op monitor interval="3s" timeout="15s" \
>         op stop interval="0s" timeout="60s" on-fail="ignore"
> 
> primitive snmp2 ocf:pacemaker:remote \
>         params \
>                 server="snmp2" \
>         op monitor interval="3s" timeout="15s" \
>         op stop interval="0s" timeout="60s" on-fail="stop"
> 
> primitive Host-rsc1 ocf:heartbeat:Dummy \
>         op start interval="0s" timeout="60s" on-fail="restart" \
>         op monitor interval="10s" timeout="60s" on-fail="restart" \
>         op stop interval="0s" timeout="60s" on-fail="ignore"
> 
> primitive Remote-rsc1 ocf:heartbeat:Dummy \
>         op start interval="0s" timeout="60s" on-fail="restart" \
>         op monitor interval="10s" timeout="60s" on-fail="restart" \
>         op stop interval="0s" timeout="60s" on-fail="ignore"
> 
> primitive Remote-rsc2 ocf:heartbeat:Dummy \
>         op start interval="0s" timeout="60s" on-fail="restart" \
>         op monitor interval="10s" timeout="60s" on-fail="restart" \
>         op stop interval="0s" timeout="60s" on-fail="ignore"
> 
> location loc1 Remote-rsc1 \
>         rule 200: #uname eq snmp1 \
>         rule 100: #uname eq snmp2
> location loc2 Remote-rsc2 \
>         rule 200: #uname eq snmp2 \
>         rule 100: #uname eq snmp1
> location loc3 Host-rsc1 \
>         rule 200: #uname eq sl7-01
> -----------------------------
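> 
> For reference, we load this file with the crm shell in the usual way (the
> file name below is only an example):
> 
> [root at sl7-01 ~]# crm configure load update remote-test.crm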
> 
> 
> Best Regards,
> Hideo Yamauchi.
> 
> 
> 
> 
> ----- Original Message -----
>>  From: David Vossel <dvossel at redhat.com>
>>  To: renayama19661014 at ybb.ne.jp; Cluster Labs - All topics related to open-source clustering welcomed <users at clusterlabs.org>
>>  Cc: 
>>  Date: 2015/3/14, Sat 08:14
>>  Subject: Re: [ClusterLabs] [Question] About movement of pacemaker_remote.
>> 
>> 
>> 
>>  ----- Original Message -----
>>>   Hi All,
>>> 
>>>   We are verifying the behavior of pacemaker_remote. (stonith is disabled.)
>>> 
>>>   There are two questions.
>>> 
>>>   * Question 1: pacemaker_remote does not recover from the failure. Is this
>>>   the correct behavior?
>>> 
>>>   - Step1 - Start a cluster.
>>>   -----------------------
>>>   [root at sl7-01 ~]# crm_mon -1 -Af
>>>   Last updated: Thu Mar 12 14:25:05 2015
>>>   Last change: Thu Mar 12 14:24:31 2015
>>>   Stack: corosync
>>>   Current DC: sl7-01 (2130706433) - partition WITHOUT quorum
>>>   Version: 1.1.12-ce09802
>>>   3 Nodes configured
>>>   5 Resources configured
>>> 
>>> 
>>>   Online: [ sl7-01 ]
>>>   RemoteOnline: [ snmp1 snmp2 ]
>>> 
>>>    Host-rsc1      (ocf::heartbeat:Dummy): Started sl7-01
>>>    Remote-rsc1    (ocf::heartbeat:Dummy): Started snmp1
>>>    Remote-rsc2    (ocf::heartbeat:Dummy): Started snmp2
>>>    snmp1  (ocf::pacemaker:remote):        Started sl7-01
>>>    snmp2  (ocf::pacemaker:remote):        Started sl7-01
>>> 
>>>   -----------------------
>>> 
>>>   - Step2 - Make pacemaker_remote fail.
>>> 
>>>   -----------------------
>>>   [root at snmp2 ~]# /usr/sbin/pacemaker_remoted &
>>>   [1] 24202
>>>   [root at snmp2 ~]# kill -TERM 24202
>>> 
>>>   [root at sl7-01 ~]# crm_mon -1 -Af
>>>   Last updated: Thu Mar 12 14:25:55 2015
>>>   Last change: Thu Mar 12 14:24:31 2015
>>>   Stack: corosync
>>>   Current DC: sl7-01 (2130706433) - partition WITHOUT quorum
>>>   Version: 1.1.12-ce09802
>>>   3 Nodes configured
>>>   5 Resources configured
>>> 
>>> 
>>>   Online: [ sl7-01 ]
>>>   RemoteOnline: [ snmp1 ]
>>>   RemoteOFFLINE: [ snmp2 ]
>>> 
>>>    Host-rsc1      (ocf::heartbeat:Dummy): Started sl7-01
>>>    Remote-rsc1    (ocf::heartbeat:Dummy): Started snmp1
>>>    snmp1  (ocf::pacemaker:remote):        Started sl7-01
>>>    snmp2  (ocf::pacemaker:remote):        FAILED sl7-01
>>> 
>>>   Migration summary:
>>>   * Node sl7-01:
>>>      snmp2: migration-threshold=1 fail-count=1 last-failure='Thu Mar 12 14:25:40 2015'
>>>   * Node snmp1:
>>> 
>>>   Failed actions:
>>>       snmp2_monitor_3000 on sl7-01 'unknown error' (1): call=6, status=Error,
>>>       exit-reason='none', last-rc-change='Thu Mar 12 14:25:40 2015',
>>>       queued=0ms, exec=0ms
>>>   -----------------------
>>> 
>>>   - Step3 - Restart pacemaker_remote and clean up the resource, but the node
>>>   remains offline.
>>> 
>>>   -----------------------
>>>   [root at snmp2 ~]# /usr/sbin/pacemaker_remoted &
>>>   [2] 24248
>>> 
>>>   [root at sl7-01 ~]# crm_resource -C -r snmp2
>>>   Cleaning up snmp2 on sl7-01
>>>   Cleaning up snmp2 on snmp1
>>>   Waiting for 1 replies from the CRMd. OK
>>> 
>>> 
>>>   [root at sl7-01 ~]# crm_mon -1 -Af
>>>   Last updated: Thu Mar 12 14:26:46 2015
>>>   Last change: Thu Mar 12 14:26:26 2015
>>>   Stack: corosync
>>>   Current DC: sl7-01 (2130706433) - partition WITHOUT quorum
>>>   Version: 1.1.12-ce09802
>>>   3 Nodes configured
>>>   5 Resources configured
>>> 
>>> 
>>>   Online: [ sl7-01 ]
>>>   RemoteOnline: [ snmp1 ]
>>>   RemoteOFFLINE: [ snmp2 ]
>>> 
>>>    Host-rsc1      (ocf::heartbeat:Dummy): Started sl7-01
>>>    Remote-rsc1    (ocf::heartbeat:Dummy): Started snmp1
>>>    snmp1  (ocf::pacemaker:remote):        Started sl7-01
>>>    snmp2  (ocf::pacemaker:remote):        FAILED sl7-01
>>> 
>>>   Migration summary:
>>>   * Node sl7-01:
>>>      snmp2: migration-threshold=1 fail-count=1000000 last-failure='Thu Mar 12 14:26:44 2015'
>>>   * Node snmp1:
>>> 
>>>   Failed actions:
>>>       snmp2_start_0 on sl7-01 'unknown error' (1): call=8, status=Timed Out,
>>>       exit-reason='none', last-rc-change='Thu Mar 12 14:26:26 2015',
>>>       queued=0ms, exec=0ms
>>>       snmp2_start_0 on sl7-01 'unknown error' (1): call=8, status=Timed Out,
>>>       exit-reason='none', last-rc-change='Thu Mar 12 14:26:26 2015',
>>>       queued=0ms, exec=0ms
>>>   -----------------------
>> 
>>  Pacemaker is attempting to restore connection to the remote node here, are
>>  you sure the remote is accessible? The "Timed Out" error means that
>>  pacemaker was unable to establish the connection during the timeout period.
>> 
>>> 
>>>   * Question 2: When pacemaker_remote fails in a configuration where stonith
>>>   is disabled, is the following procedure the correct way to move the
>>>   resource? Is there a way to move it without deleting the node?
>>> 
>>>   - Step1 - Start a cluster.
>>>   -----------------------
>>>   [root at sl7-01 ~]# crm_mon -1 -Af
>>>   Last updated: Thu Mar 12 14:30:27 2015
>>>   Last change: Thu Mar 12 14:29:14 2015
>>>   Stack: corosync
>>>   Current DC: sl7-01 (2130706433) - partition WITHOUT quorum
>>>   Version: 1.1.12-ce09802
>>>   3 Nodes configured
>>>   5 Resources configured
>>> 
>>> 
>>>   Online: [ sl7-01 ]
>>>   RemoteOnline: [ snmp1 snmp2 ]
>>> 
>>>    Host-rsc1      (ocf::heartbeat:Dummy): Started sl7-01
>>>    Remote-rsc1    (ocf::heartbeat:Dummy): Started snmp1
>>>    Remote-rsc2    (ocf::heartbeat:Dummy): Started snmp2
>>>    snmp1  (ocf::pacemaker:remote):        Started sl7-01
>>>    snmp2  (ocf::pacemaker:remote):        Started sl7-01
>>> 
>>>   -----------------------
>>> 
>>>   - Step2 - Make pacemaker_remote fail.
>>>   -----------------------
>>>   [root at snmp2 ~]# kill -TERM 24248
>>> 
>>>   [root at sl7-01 ~]# crm_mon -1 -Af
>>>   Last updated: Thu Mar 12 14:31:59 2015
>>>   Last change: Thu Mar 12 14:29:14 2015
>>>   Stack: corosync
>>>   Current DC: sl7-01 (2130706433) - partition WITHOUT quorum
>>>   Version: 1.1.12-ce09802
>>>   3 Nodes configured
>>>   5 Resources configured
>>> 
>>> 
>>>   Online: [ sl7-01 ]
>>>   RemoteOnline: [ snmp1 ]
>>>   RemoteOFFLINE: [ snmp2 ]
>>> 
>>>    Host-rsc1      (ocf::heartbeat:Dummy): Started sl7-01
>>>    Remote-rsc1    (ocf::heartbeat:Dummy): Started snmp1
>>>    snmp1  (ocf::pacemaker:remote):        Started sl7-01
>>>    snmp2  (ocf::pacemaker:remote):        FAILED sl7-01
>>> 
>>>   Migration summary:
>>>   * Node sl7-01:
>>>      snmp2: migration-threshold=1 fail-count=1 last-failure='Thu Mar 12 14:31:42 2015'
>>>   * Node snmp1:
>>> 
>>>   Failed actions:
>>>       snmp2_monitor_3000 on sl7-01 'unknown error' (1): call=6, status=Error,
>>>       exit-reason='none', last-rc-change='Thu Mar 12 14:31:42 2015',
>>>       queued=0ms, exec=0ms
>>>   -----------------------
>>> 
>>>   - Step3 - We delete the failed node. Then the resource moves.
>> 
>>  Maybe this is expected. It's impossible to tell without looking at the
>>  configuration in more detail.
>> 
>>  -- David
>> 
>>> 
>>>   -----------------------
>>>   [root at sl7-01 ~]# crm
>>>   crm(live)# node
>>>   crm(live)node# delete snmp2
>>>   INFO: node snmp2 deleted
>>> 
>>>   [root at sl7-01 ~]# crm_mon -1 -Af
>>>   Last updated: Thu Mar 12 14:35:00 2015
>>>   Last change: Thu Mar 12 14:34:20 2015
>>>   Stack: corosync
>>>   Current DC: sl7-01 (2130706433) - partition WITHOUT quorum
>>>   Version: 1.1.12-ce09802
>>>   3 Nodes configured
>>>   5 Resources configured
>>> 
>>> 
>>>   Online: [ sl7-01 ]
>>>   RemoteOnline: [ snmp1 ]
>>>   RemoteOFFLINE: [ snmp2 ]
>>> 
>>>    Host-rsc1      (ocf::heartbeat:Dummy): Started sl7-01
>>>    Remote-rsc1    (ocf::heartbeat:Dummy): Started snmp1
>>>    Remote-rsc2    (ocf::heartbeat:Dummy): Started snmp1
>>>    snmp1  (ocf::pacemaker:remote):        Started sl7-01
>>> 
>>>   Migration summary:
>>>   * Node sl7-01:
>>>      snmp2: migration-threshold=1 fail-count=1 last-failure='Thu Mar 12 14:51:44 2015'
>>>   * Node snmp1:
>>> 
>>>   Failed actions:
>>>       snmp2_monitor_3000 on sl7-01 'unknown error' (1): call=6, status=Error,
>>>       exit-reason='none', last-rc-change='Thu Mar 12 14:51:44 2015',
>>>       queued=0ms, exec=0ms
>>>   -----------------------
>>> 
>>>   Best Regards,
>>>   Hideo Yamauchi.
>>> 
>>> 
>> 
> 
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 



