[ClusterLabs] Antw: Re: [Question] About movement of pacemaker_remote.

renayama19661014 at ybb.ne.jp renayama19661014 at ybb.ne.jp
Mon Apr 13 06:59:42 UTC 2015


Hi Andrew,

Thank you for your comments.

>> Step 4) We clear snmp2 of remote by crm_resource command,
> 
> Was pacemaker_remoted running at this point?


Yes.

On the node where pacemaker_remote was restarted, the following log appears.


------------------------------
Apr 13 15:47:29 snmp2 pacemaker_remoted[1494]:     info: main: Starting ----> #### RESTARTED pacemaker_remote.
Apr 13 15:47:42 snmp2 pacemaker_remoted[1494]:   notice: lrmd_remote_listen: LRMD client connection established. 0x24f4ca0 id: 5b56e54e-b9da-4804-afda-5c72038d089c
Apr 13 15:47:43 snmp2 pacemaker_remoted[1494]:     info: lrmd_remote_client_msg: Client disconnect detected in tls msg dispatcher.
Apr 13 15:47:43 snmp2 pacemaker_remoted[1494]:   notice: lrmd_remote_client_destroy: LRMD client disconnecting remote client - name: remote-lrmd-snmp2:3121 id: 5b56e54e-b9da-4804-afda-5c72038d089c
Apr 13 15:47:44 snmp2 pacemaker_remoted[1494]:   notice: lrmd_remote_listen: LRMD client connection established. 0x24f4ca0 id: 907cd1fc-6c1d-40f1-8c60-34bc8b66715f
Apr 13 15:47:44 snmp2 pacemaker_remoted[1494]:     info: lrmd_remote_client_msg: Client disconnect detected in tls msg dispatcher.
Apr 13 15:47:44 snmp2 pacemaker_remoted[1494]:   notice: lrmd_remote_client_destroy: LRMD client disconnecting remote client - name: remote-lrmd-snmp2:3121 id: 907cd1fc-6c1d-40f1-8c60-34bc8b66715f
Apr 13 15:47:45 snmp2 pacemaker_remoted[1494]:   notice: lrmd_remote_listen: LRMD client connection established. 0x24f4ca0 id: 8b38c0dd-9338-478a-8f23-523aee4cc1a6
Apr 13 15:47:46 snmp2 pacemaker_remoted[1494]:     info: lrmd_remote_client_msg: Client disconnect detected in tls msg dispatcher.
(snip)
After that, these log messages repeat.


------------------------------
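(Note: a very similar connect-then-disconnect pattern can also be produced by an authkey mismatch between the cluster node and the remote node, so when reproducing this it is worth confirming that both hosts share the same /etc/pacemaker/authkey and that TCP port 3121 is reachable, for example:

------------------------------
# run on both sl7-01 and snmp2 - the checksums should match
md5sum /etc/pacemaker/authkey

# run on sl7-01 - confirm pacemaker_remoted on snmp2 accepts TCP connections
nc -z snmp2 3121 && echo reachable
------------------------------

This is only a sanity check; it does not by itself explain the behaviour above.)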


> I mentioned this earlier today, we need to improve the experience in this area.
> 
> Probably a good excuse to fix on-fail=ignore for start actions. 
> 
>> but remote cannot participate in a cluster.



I changed the crm file as follows (on-fail=ignore for start):


(snip)
primitive snmp1 ocf:pacemaker:remote \
        params \
                server="snmp1" \
        op start interval="0s" timeout="60s" on-fail="ignore" \
        op monitor interval="3s" timeout="15s" \
        op stop interval="0s" timeout="60s" on-fail="ignore"

primitive snmp2 ocf:pacemaker:remote \
        params \
                server="snmp2" \
        op start interval="0s" timeout="60s" on-fail="ignore" \
        op monitor interval="3s" timeout="15s" \
        op stop interval="0s" timeout="60s" on-fail="stop"

(snip)

However, the result was the same.
Even after running crm_resource -C for the rebooted pacemaker_remote node, the node does not rejoin the cluster.
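For reference, the cleanup was of the following form (crm_failcount is shown only as an alternative way of clearing the fail-count; option spellings may differ between versions, see crm_failcount --help):

------------------------------
# clear the failure history of the connection resource
crm_resource -C -r snmp2

# alternatively, delete the fail-count attribute directly
crm_failcount -r snmp2 -N sl7-01 -D

# then re-check the remote node state
crm_mon -1 -Af
------------------------------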

[root at sl7-01 ~]# crm_mon -1 -Af
Last updated: Mon Apr 13 15:51:58 2015
Last change: Mon Apr 13 15:47:41 2015
Stack: corosync
Current DC: sl7-01 - partition WITHOUT quorum
Version: 1.1.12-3e93bc1
3 Nodes configured
5 Resources configured


Online: [ sl7-01 ]
RemoteOnline: [ snmp1 ]
RemoteOFFLINE: [ snmp2 ]

 Host-rsc1      (ocf::heartbeat:Dummy): Started sl7-01 
 Remote-rsc1    (ocf::heartbeat:Dummy): Started snmp1 
 Remote-rsc2    (ocf::heartbeat:Dummy): Started snmp1 (failure ignored)
 snmp1  (ocf::pacemaker:remote):        Started sl7-01 

Node Attributes:
* Node sl7-01:
* Node snmp1:

Migration summary:
* Node sl7-01: 
   snmp2: migration-threshold=1 fail-count=1000000 last-failure='Mon Apr 13 15:48:40 2015'
* Node snmp1: 

Failed actions:
    snmp2_start_0 on sl7-01 'unknown error' (1): call=8, status=Timed Out, exit-reason='none', last-rc-change='Mon Apr 13 15:47:42 2015', queued=0ms, exec=0ms



Best Regards,
Hideo Yamauchi.



----- Original Message -----
> From: Andrew Beekhof <andrew at beekhof.net>
> To: renayama19661014 at ybb.ne.jp; Cluster Labs - All topics related to open-source clustering welcomed <users at clusterlabs.org>
> Cc: 
> Date: 2015/4/13, Mon 14:11
> Subject: Re: [ClusterLabs] Antw: Re: [Question] About movement of pacemaker_remote.
> 
> 
>>  On 8 Apr 2015, at 12:27 pm, renayama19661014 at ybb.ne.jp wrote:
>> 
>>  Hi All,
>> 
>>  Let me confirm the first question once again.
>> 
>>  I confirmed the next movement in Pacemaker1.1.13-rc1.
>>  Stonith does not set it.
>> 
>>  -------------------------------------------------------------
>>  property no-quorum-policy="ignore" \
>>          stonith-enabled="false" \
>>          startup-fencing="false" \
>> 
>>  rsc_defaults resource-stickiness="INFINITY" \
>>          migration-threshold="1"
>> 
>>  primitive snmp1 ocf:pacemaker:remote \
>>          params \
>>                  server="snmp1" \
>>          op start interval="0s" timeout="60s" on-fail="restart" \
>>          op monitor interval="3s" timeout="15s" \
>>          op stop interval="0s" timeout="60s" on-fail="ignore"
>> 
>>  primitive snmp2 ocf:pacemaker:remote \
>>          params \
>>                  server="snmp2" \
>>          op start interval="0s" timeout="60s" on-fail="restart" \
>>          op monitor interval="3s" timeout="15s" \
>>          op stop interval="0s" timeout="60s" on-fail="stop"
>> 
>>  primitive Host-rsc1 ocf:heartbeat:Dummy \
>>          op start interval="0s" timeout="60s" on-fail="restart" \
>>          op monitor interval="10s" timeout="60s" on-fail="restart" \
>>          op stop interval="0s" timeout="60s" on-fail="ignore"
>> 
>>  primitive Remote-rsc1 ocf:heartbeat:Dummy \
>>          op start interval="0s" timeout="60s" on-fail="restart" \
>>          op monitor interval="10s" timeout="60s" on-fail="restart" \
>>          op stop interval="0s" timeout="60s" on-fail="ignore"
>> 
>>  primitive Remote-rsc2 ocf:heartbeat:Dummy \
>>          op start interval="0s" timeout="60s" on-fail="restart" \
>>          op monitor interval="10s" timeout="60s" on-fail="restart" \
>>          op stop interval="0s" timeout="60s" on-fail="ignore"
>> 
>>  location loc1 Remote-rsc1 \
>>          rule 200: #uname eq snmp1 \
>>          rule 100: #uname eq snmp2
>>  location loc2 Remote-rsc2 \
>>          rule 200: #uname eq snmp2 \
>>          rule 100: #uname eq snmp1
>>  location loc3 Host-rsc1 \
>>          rule 200: #uname eq sl7-01
>> 
>>  -------------------------------------------------------------
>> 
>>  Step 1) We use two remote nodes and constitute a cluster.
>>  -------------------------------------------------------------
>>  Version: 1.1.12-3e93bc1
>>  3 Nodes configured
>>  5 Resources configured
>> 
>> 
>>  Online: [ sl7-01 ]
>>  RemoteOnline: [ snmp1 snmp2 ]
>> 
>>   Host-rsc1      (ocf::heartbeat:Dummy): Started sl7-01 
>>   Remote-rsc1    (ocf::heartbeat:Dummy): Started snmp1 
>>   Remote-rsc2    (ocf::heartbeat:Dummy): Started snmp2 
>>   snmp1  (ocf::pacemaker:remote):        Started sl7-01 
>>   snmp2  (ocf::pacemaker:remote):        Started sl7-01 
>> 
>>  Node Attributes:
>>  * Node sl7-01:
>>  * Node snmp1:
>>  * Node snmp2:
>> 
>>  Migration summary:
>>  * Node sl7-01: 
>>  * Node snmp1: 
>>  * Node snmp2: 
>>  -------------------------------------------------------------
>> 
>>  Step 2) We stop pacemaker_remoted in one remote.
>>  -------------------------------------------------------------
>>  Current DC: sl7-01 - partition WITHOUT quorum
>>  Version: 1.1.12-3e93bc1
>>  3 Nodes configured
>>  5 Resources configured
>> 
>> 
>>  Online: [ sl7-01 ]
>>  RemoteOnline: [ snmp1 ]
>>  RemoteOFFLINE: [ snmp2 ]
>> 
>>   Host-rsc1      (ocf::heartbeat:Dummy): Started sl7-01 
>>   Remote-rsc1    (ocf::heartbeat:Dummy): Started snmp1 
>>   snmp1  (ocf::pacemaker:remote):        Started sl7-01 
>>   snmp2  (ocf::pacemaker:remote):        FAILED sl7-01 
>> 
>>  Node Attributes:
>>  * Node sl7-01:
>>  * Node snmp1:
>> 
>>  Migration summary:
>>  * Node sl7-01: 
>>     snmp2: migration-threshold=1 fail-count=1 last-failure='Fri Apr  3 12:56:12 2015'
>>  * Node snmp1: 
>> 
>>  Failed actions:
>>      snmp2_monitor_3000 on sl7-01 'unknown error' (1): call=6, status=Error, exit-reason='none', last-rc-change='Fri Apr  3 12:56:12 2015', queued=0ms, exec=0ms
> 
> Ideally we’d have fencing configured and reboot the remote node here.
> But for the sake of argument, ok :)
> 
> 
>>  -------------------------------------------------------------
>> 
>>  Step 3) We reboot pacemaker_remoted which stopped.
> 
> As in you reboot the node on which pacemaker_remoted is stopped and 
> pacemaker_remoted is configured to start at boot?
> 
>> 
>>  Step 4) We clear snmp2 of remote by crm_resource command,
> 
> Was pacemaker_remoted running at this point?
> I mentioned this earlier today, we need to improve the experience in this area.
> 
> Probably a good excuse to fix on-fail=ignore for start actions. 
> 
>>  but remote cannot participate in a cluster.
>>  -------------------------------------------------------------
>>  Version: 1.1.12-3e93bc1
>>  3 Nodes configured
>>  5 Resources configured
>> 
>> 
>>  Online: [ sl7-01 ]
>>  RemoteOnline: [ snmp1 ]
>>  RemoteOFFLINE: [ snmp2 ]
>> 
>>   Host-rsc1      (ocf::heartbeat:Dummy): Started sl7-01 
>>   Remote-rsc1    (ocf::heartbeat:Dummy): Started snmp1 
>>   snmp1  (ocf::pacemaker:remote):        Started sl7-01 
>>   snmp2  (ocf::pacemaker:remote):        FAILED sl7-01 
>> 
>>  Node Attributes:
>>  * Node sl7-01:
>>  * Node snmp1:
>> 
>>  Migration summary:
>>  * Node sl7-01: 
>>     snmp2: migration-threshold=1 fail-count=1000000 last-failure='Wed Apr  8 11:21:09 2015'
>>  * Node snmp1: 
>> 
>>  Failed actions:
>>      snmp2_start_0 on sl7-01 'unknown error' (1): call=8, status=Timed Out, exit-reason='none', last-rc-change='Wed Apr  8 11:20:11 2015', queued=0ms, exec=0ms
>>  -------------------------------------------------------------
>> 
>> 
>>  Node of pacemaker and the remote node output the following log repeatedly.
>> 
>>  -------------------------------------------------------------
>>  Apr  8 11:20:38 sl7-01 crmd[17101]: info: crm_remote_tcp_connect_async: Attempting to connect to remote server at 192.168.40.110:3121
>>  Apr  8 11:20:38 sl7-01 crmd[17101]: info: lrmd_tcp_connect_cb: Remote lrmd client TLS connection established with server snmp2:3121
>>  Apr  8 11:20:38 sl7-01 crmd[17101]: error: lrmd_tls_recv_reply: Unable to receive expected reply, disconnecting.
>>  Apr  8 11:20:38 sl7-01 crmd[17101]: error: lrmd_tls_send_recv: Remote lrmd server disconnected while waiting for reply with id 101.
>>  Apr  8 11:20:38 sl7-01 crmd[17101]: info: lrmd_tls_connection_destroy: TLS connection destroyed
>>  Apr  8 11:20:38 sl7-01 crmd[17101]: info: lrmd_api_disconnect: Disconnecting from lrmd service
>>  -------------------------------------------------------------
>>  Apr  8 11:20:36 snmp2 pacemaker_remoted[1502]:   notice: lrmd_remote_client_destroy: LRMD client disconnecting remote client - name: remote-lrmd-snmp2:3121 id: 8fbbc3cd-daa5-406b-942d-21be868cfc62
>>  Apr  8 11:20:37 snmp2 pacemaker_remoted[1502]:   notice: lrmd_remote_listen: LRMD client connection established. 0xbb7ca0 id: a59392c9-6575-40ed-9b53-98a68de00409
>>  Apr  8 11:20:38 snmp2 pacemaker_remoted[1502]:     info: lrmd_remote_client_msg: Client disconnect detected in tls msg dispatcher.
>>  Apr  8 11:20:38 snmp2 pacemaker_remoted[1502]:   notice: lrmd_remote_client_destroy: LRMD client disconnecting remote client - name: remote-lrmd-snmp2:3121 id: a59392c9-6575-40ed-9b53-98a68de00409
>>  Apr  8 11:20:39 snmp2 pacemaker_remoted[1502]:   notice: lrmd_remote_listen: LRMD client connection established. 0xbb7ca0 id: 0e58614c-b1c5-4e37-a917-1f8e3de5de24
>>  Apr  8 11:20:39 snmp2 pacemaker_remoted[1502]:     info: lrmd_remote_client_msg: Client disconnect detected in tls msg dispatcher.
>>  Apr  8 11:20:39 snmp2 pacemaker_remoted[1502]:   notice: lrmd_remote_client_destroy: LRMD client disconnecting remote client - name: remote-lrmd-snmp2:3121 id: 0e58614c-b1c5-4e37-a917-1f8e3de5de24
>>  Apr  8 11:20:40 snmp2 pacemaker_remoted[1502]:   notice: lrmd_remote_listen: LRMD client connection established. 0xbb7ca0 id: 518bcca5-5f83-47fb-93ea-2ece33690111
>>  -------------------------------------------------------------
>> 
>>  Is this movement right?
>> 
>>  Best Regards,
>>  Hideo Yamauchi.
>> 
>> 
>> 
>>  ----- Original Message -----
>>>  From: Ulrich Windl <Ulrich.Windl at rz.uni-regensburg.de>
>>>  To: users at clusterlabs.org
>>>  Cc: 
>>>  Date: 2015/4/2, Thu 22:30
>>>  Subject: [ClusterLabs] Antw: Re: [Question] About movement of pacemaker_remote.
>>> 
>>>>>>  David Vossel <dvossel at redhat.com> wrote on 02.04.2015 at 14:58 in message
>>>  <796820123.6644200.1427979523554.JavaMail.zimbra at redhat.com>:
>>> 
>>>> 
>>>>  ----- Original Message -----
>>>>> 
>>>>>>  On 14 Mar 2015, at 10:14 am, David Vossel 
>>>  <dvossel at redhat.com> wrote:
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>  ----- Original Message -----
>>>>>>> 
>>>>>>>  Failed actions:
>>>>>>>       snmp2_start_0 on sl7-01 'unknown error' (1): call=8, status=Timed Out,
>>>>>>>       exit-reason='none', last-rc-change='Thu Mar 12 14:26:26 2015',
>>>>>>>       queued=0ms, exec=0ms
>>>>>>>       snmp2_start_0 on sl7-01 'unknown error' (1): call=8, status=Timed Out,
>>>>>>>       exit-reason='none', last-rc-change='Thu Mar 12 14:26:26 2015',
>>>>>>>       queued=0ms, exec=0ms
>>>>>>>  -----------------------
>>>>>> 
>>>>>>  Pacemaker is attempting to restore connection to the remote node
>>>>>>  here, are you sure the remote is accessible? The "Timed Out" error
>>>>>>  means that pacemaker was unable to establish the connection during
>>>>>>  the timeout period.
>>>>> 
>>>>>  Random question: Are we smart enough not to try and start
>>>>>  pacemaker-remote resources for nodes we've just fenced?
>>>> 
>>>>  we try and re-connect to remote nodes after fencing. if the fence operation
>>>>  was 'off' instead of 'reboot', this would make no sense. I'm not entirely
>>>>  sure how to handle this. We want the remote-node re-integrated into the
>>>>  cluster, but i'd like to optimize the case where we know the node will not
>>>>  be coming back online.
>>> 
>>>  Beware: Even if the fencing action is "off" (for software), a human
>>>  may decide to boot the node anyway, also starting the cluster software.
>>> 
>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>>  _______________________________________________
>>>>>  Users mailing list: Users at clusterlabs.org 
>>>>>  http://clusterlabs.org/mailman/listinfo/users 
>>>>> 
>>>>>  Project Home: http://www.clusterlabs.org 
>>>>>  Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf 
>>>>>  Bugs: http://bugs.clusterlabs.org 
>>>>> 
>>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>> 
> 



