[ClusterLabs] Antw: Re: [Question] About movement of pacemaker_remote.
Andrew Beekhof
andrew at beekhof.net
Mon Apr 13 05:11:57 UTC 2015
> On 8 Apr 2015, at 12:27 pm, renayama19661014 at ybb.ne.jp wrote:
>
> Hi All,
>
> Let me confirm my first question once again.
>
> I observed the following behavior in Pacemaker 1.1.13-rc1.
> Stonith is not configured.
>
> -------------------------------------------------------------
> property no-quorum-policy="ignore" \
> stonith-enabled="false" \
> startup-fencing="false"
>
> rsc_defaults resource-stickiness="INFINITY" \
> migration-threshold="1"
>
> primitive snmp1 ocf:pacemaker:remote \
> params \
> server="snmp1" \
> op start interval="0s" timeout="60s" on-fail="restart" \
> op monitor interval="3s" timeout="15s" \
> op stop interval="0s" timeout="60s" on-fail="ignore"
>
> primitive snmp2 ocf:pacemaker:remote \
> params \
> server="snmp2" \
> op start interval="0s" timeout="60s" on-fail="restart" \
> op monitor interval="3s" timeout="15s" \
> op stop interval="0s" timeout="60s" on-fail="stop"
>
> primitive Host-rsc1 ocf:heartbeat:Dummy \
> op start interval="0s" timeout="60s" on-fail="restart" \
> op monitor interval="10s" timeout="60s" on-fail="restart" \
> op stop interval="0s" timeout="60s" on-fail="ignore"
>
> primitive Remote-rsc1 ocf:heartbeat:Dummy \
> op start interval="0s" timeout="60s" on-fail="restart" \
> op monitor interval="10s" timeout="60s" on-fail="restart" \
> op stop interval="0s" timeout="60s" on-fail="ignore"
>
> primitive Remote-rsc2 ocf:heartbeat:Dummy \
> op start interval="0s" timeout="60s" on-fail="restart" \
> op monitor interval="10s" timeout="60s" on-fail="restart" \
> op stop interval="0s" timeout="60s" on-fail="ignore"
>
> location loc1 Remote-rsc1 \
> rule 200: #uname eq snmp1 \
> rule 100: #uname eq snmp2
> location loc2 Remote-rsc2 \
> rule 200: #uname eq snmp2 \
> rule 100: #uname eq snmp1
> location loc3 Host-rsc1 \
> rule 200: #uname eq sl7-01
>
> -------------------------------------------------------------
>
> Step 1) We build a cluster using two remote nodes.
> -------------------------------------------------------------
> Version: 1.1.12-3e93bc1
> 3 Nodes configured
> 5 Resources configured
>
>
> Online: [ sl7-01 ]
> RemoteOnline: [ snmp1 snmp2 ]
>
> Host-rsc1 (ocf::heartbeat:Dummy): Started sl7-01
> Remote-rsc1 (ocf::heartbeat:Dummy): Started snmp1
> Remote-rsc2 (ocf::heartbeat:Dummy): Started snmp2
> snmp1 (ocf::pacemaker:remote): Started sl7-01
> snmp2 (ocf::pacemaker:remote): Started sl7-01
>
> Node Attributes:
> * Node sl7-01:
> * Node snmp1:
> * Node snmp2:
>
> Migration summary:
> * Node sl7-01:
> * Node snmp1:
> * Node snmp2:
> -------------------------------------------------------------
>
> Step 2) We stop pacemaker_remoted on one remote node.
> -------------------------------------------------------------
> Current DC: sl7-01 - partition WITHOUT quorum
> Version: 1.1.12-3e93bc1
> 3 Nodes configured
> 5 Resources configured
>
>
> Online: [ sl7-01 ]
> RemoteOnline: [ snmp1 ]
> RemoteOFFLINE: [ snmp2 ]
>
> Host-rsc1 (ocf::heartbeat:Dummy): Started sl7-01
> Remote-rsc1 (ocf::heartbeat:Dummy): Started snmp1
> snmp1 (ocf::pacemaker:remote): Started sl7-01
> snmp2 (ocf::pacemaker:remote): FAILED sl7-01
>
> Node Attributes:
> * Node sl7-01:
> * Node snmp1:
>
> Migration summary:
> * Node sl7-01:
> snmp2: migration-threshold=1 fail-count=1 last-failure='Fri Apr 3 12:56:12 2015'
> * Node snmp1:
>
> Failed actions:
> snmp2_monitor_3000 on sl7-01 'unknown error' (1): call=6, status=Error, exit-reason='none', last-rc-change='Fri Apr 3 12:56:12 2015', queued=0ms, exec=0ms
Ideally we’d have fencing configured and reboot the remote node here.
But for the sake of argument, ok :)
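For the record, a minimal fencing setup for this kind of node might look something like the sketch below. The fence agent, address and credentials are assumptions on my part, not taken from your configuration:

    # sketch only -- fence agent, IPMI address and credentials are placeholders
    primitive fence-snmp2 stonith:fence_ipmilan \
        params pcmk_host_list="snmp2" ipaddr="192.168.40.111" \
        login="admin" passwd="secret" \
        op monitor interval="60s"
    property stonith-enabled="true"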
> -------------------------------------------------------------
>
> Step 3) We restart the pacemaker_remoted that was stopped.
As in you reboot the node on which pacemaker_remoted is stopped and pacemaker_remoted is configured to start at boot?
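If it is not already, on a systemd-based remote node that would be something like:

    # assumes a systemd-based distribution; enables pacemaker_remoted at boot
    systemctl enable pacemaker_remote
    systemctl start pacemaker_remote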
>
> Step 4) We clear the snmp2 remote resource with the crm_resource command,
Was pacemaker_remoted running at this point?
I mentioned this earlier today, we need to improve the experience in this area.
Probably a good excuse to fix on-fail=ignore for start actions.
> but the remote node cannot rejoin the cluster.
> -------------------------------------------------------------
> Version: 1.1.12-3e93bc1
> 3 Nodes configured
> 5 Resources configured
>
>
> Online: [ sl7-01 ]
> RemoteOnline: [ snmp1 ]
> RemoteOFFLINE: [ snmp2 ]
>
> Host-rsc1 (ocf::heartbeat:Dummy): Started sl7-01
> Remote-rsc1 (ocf::heartbeat:Dummy): Started snmp1
> snmp1 (ocf::pacemaker:remote): Started sl7-01
> snmp2 (ocf::pacemaker:remote): FAILED sl7-01
>
> Node Attributes:
> * Node sl7-01:
> * Node snmp1:
>
> Migration summary:
> * Node sl7-01:
> snmp2: migration-threshold=1 fail-count=1000000 last-failure='Wed Apr 8 11:21:09 2015'
> * Node snmp1:
>
> Failed actions:
> snmp2_start_0 on sl7-01 'unknown error' (1): call=8, status=Timed Out, exit-reason='none', last-rc-change='Wed Apr 8 11:20:11 2015', queued=0ms, exec=0ms
> -------------------------------------------------------------
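Regarding the crm_resource command above: I assume you ran something equivalent to the following (a hypothetical invocation, since the exact command was not shown), which resets the fail-count so the cluster attempts the connection again:

    # hypothetical -- clears the failed state of the snmp2 connection resource
    crm_resource --cleanup --resource snmp2 --node sl7-01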
>
>
> The pacemaker node and the remote node repeatedly output the following logs.
>
> -------------------------------------------------------------
> Apr 8 11:20:38 sl7-01 crmd[17101]: info: crm_remote_tcp_connect_async: Attempting to connect to remote server at 192.168.40.110:3121
> Apr 8 11:20:38 sl7-01 crmd[17101]: info: lrmd_tcp_connect_cb: Remote lrmd client TLS connection established with server snmp2:3121
> Apr 8 11:20:38 sl7-01 crmd[17101]: error: lrmd_tls_recv_reply: Unable to receive expected reply, disconnecting.
> Apr 8 11:20:38 sl7-01 crmd[17101]: error: lrmd_tls_send_recv: Remote lrmd server disconnected while waiting for reply with id 101.
> Apr 8 11:20:38 sl7-01 crmd[17101]: info: lrmd_tls_connection_destroy: TLS connection destroyed
> Apr 8 11:20:38 sl7-01 crmd[17101]: info: lrmd_api_disconnect: Disconnecting from lrmd service
> -------------------------------------------------------------
> Apr 8 11:20:36 snmp2 pacemaker_remoted[1502]: notice: lrmd_remote_client_destroy: LRMD client disconnecting remote client - name: remote-lrmd-snmp2:3121 id: 8fbbc3cd-daa5-406b-942d-21be868cfc62
> Apr 8 11:20:37 snmp2 pacemaker_remoted[1502]: notice: lrmd_remote_listen: LRMD client connection established. 0xbb7ca0 id: a59392c9-6575-40ed-9b53-98a68de00409
> Apr 8 11:20:38 snmp2 pacemaker_remoted[1502]: info: lrmd_remote_client_msg: Client disconnect detected in tls msg dispatcher.
> Apr 8 11:20:38 snmp2 pacemaker_remoted[1502]: notice: lrmd_remote_client_destroy: LRMD client disconnecting remote client - name: remote-lrmd-snmp2:3121 id: a59392c9-6575-40ed-9b53-98a68de00409
> Apr 8 11:20:39 snmp2 pacemaker_remoted[1502]: notice: lrmd_remote_listen: LRMD client connection established. 0xbb7ca0 id: 0e58614c-b1c5-4e37-a917-1f8e3de5de24
> Apr 8 11:20:39 snmp2 pacemaker_remoted[1502]: info: lrmd_remote_client_msg: Client disconnect detected in tls msg dispatcher.
> Apr 8 11:20:39 snmp2 pacemaker_remoted[1502]: notice: lrmd_remote_client_destroy: LRMD client disconnecting remote client - name: remote-lrmd-snmp2:3121 id: 0e58614c-b1c5-4e37-a917-1f8e3de5de24
> Apr 8 11:20:40 snmp2 pacemaker_remoted[1502]: notice: lrmd_remote_listen: LRMD client connection established. 0xbb7ca0 id: 518bcca5-5f83-47fb-93ea-2ece33690111
> -------------------------------------------------------------
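Two generic checks worth running while that loop is happening (not specific to your setup): confirm the remote port is reachable, and that both sides share an identical authkey:

    # from the cluster node: is pacemaker_remoted listening on TCP 3121?
    nc -z snmp2 3121 && echo reachable
    # on both the cluster node and the remote node: keys must match exactly
    md5sum /etc/pacemaker/authkey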
>
> Is this behavior correct?
>
> Best Regards,
> Hideo Yamauchi.
>
>
>
> ----- Original Message -----
>> From: Ulrich Windl <Ulrich.Windl at rz.uni-regensburg.de>
>> To: users at clusterlabs.org
>> Cc:
>> Date: 2015/4/2, Thu 22:30
>> Subject: [ClusterLabs] Antw: Re: [Question] About movement of pacemaker_remote.
>>
>>>>> David Vossel <dvossel at redhat.com> wrote on 02.04.2015 at 14:58 in
>>>>> message <796820123.6644200.1427979523554.JavaMail.zimbra at redhat.com>:
>>
>>>
>>> ----- Original Message -----
>>>>
>>>>> On 14 Mar 2015, at 10:14 am, David Vossel <dvossel at redhat.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>> ----- Original Message -----
>>>>>>
>>>>>> Failed actions:
>>>>>> snmp2_start_0 on sl7-01 'unknown error' (1): call=8, status=Timed Out,
>>>>>> exit-reason='none', last-rc-change='Thu Mar 12 14:26:26 2015',
>>>>>> queued=0ms, exec=0ms
>>>>>> snmp2_start_0 on sl7-01 'unknown error' (1): call=8, status=Timed Out,
>>>>>> exit-reason='none', last-rc-change='Thu Mar 12 14:26:26 2015',
>>>>>> queued=0ms, exec=0ms
>>>>>> -----------------------
>>>>>
>>>>> Pacemaker is attempting to restore connection to the remote node here,
>>>>> are you sure the remote is accessible? The "Timed Out" error means that
>>>>> pacemaker was unable to establish the connection during the timeout
>>>>> period.
>>>>
>>>> Random question: Are we smart enough not to try and start
>>>> pacemaker-remote resources for nodes we've just fenced?
>>>
>>> We try and re-connect to remote nodes after fencing. If the fence
>>> operation was 'off' instead of 'reboot', this would make no sense. I'm
>>> not entirely sure how to handle this. We want the remote node
>>> re-integrated into the cluster, but I'd like to optimize the case where
>>> we know the node will not be coming back online.
>>
>> Beware: Even if the fencing action is "off" (for software), a human
>> may decide to boot the node anyway, also starting the cluster software.
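One related knob: newer versions of ocf:pacemaker:remote support a
reconnect_interval parameter, which makes the cluster keep retrying the
connection at that interval after a failure instead of giving up once
migration-threshold is reached. A sketch, assuming a release that supports
the parameter:

    # assumes reconnect_interval is available in your Pacemaker release
    primitive snmp2 ocf:pacemaker:remote \
        params server="snmp2" reconnect_interval="60s" \
        op monitor interval="3s" timeout="15s"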