[ClusterLabs] Antw: Re: [Question] About movement of pacemaker_remote.

Tue Apr 7 22:27:49 EDT 2015

Hi All,

Let me ask my first question once again.

I observed the following behavior with Pacemaker 1.1.13-rc1.
STONITH is not configured.

-------------------------------------------------------------
property no-quorum-policy="ignore" \
        stonith-enabled="false" \
        startup-fencing="false"

rsc_defaults resource-stickiness="INFINITY" \
        migration-threshold="1"

primitive snmp1 ocf:pacemaker:remote \
        params \
                server="snmp1" \
        op start interval="0s" timeout="60s" on-fail="restart" \
        op monitor interval="3s" timeout="15s" \
        op stop interval="0s" timeout="60s" on-fail="ignore"

primitive snmp2 ocf:pacemaker:remote \
        params \
                server="snmp2" \
        op start interval="0s" timeout="60s" on-fail="restart" \
        op monitor interval="3s" timeout="15s" \
        op stop interval="0s" timeout="60s" on-fail="stop"

primitive Host-rsc1 ocf:heartbeat:Dummy \
        op start interval="0s" timeout="60s" on-fail="restart" \
        op monitor interval="10s" timeout="60s" on-fail="restart" \
        op stop interval="0s" timeout="60s" on-fail="ignore"

primitive Remote-rsc1 ocf:heartbeat:Dummy \
        op start interval="0s" timeout="60s" on-fail="restart" \
        op monitor interval="10s" timeout="60s" on-fail="restart" \
        op stop interval="0s" timeout="60s" on-fail="ignore"

primitive Remote-rsc2 ocf:heartbeat:Dummy \
        op start interval="0s" timeout="60s" on-fail="restart" \
        op monitor interval="10s" timeout="60s" on-fail="restart" \
        op stop interval="0s" timeout="60s" on-fail="ignore"

location loc1 Remote-rsc1 \
        rule 200: #uname eq snmp1 \
        rule 100: #uname eq snmp2
location loc2 Remote-rsc2 \
        rule 200: #uname eq snmp2 \
        rule 100: #uname eq snmp1
location loc3 Host-rsc1 \
        rule 200: #uname eq sl7-01

-------------------------------------------------------------

Step 1) We build a cluster with two remote nodes.
-------------------------------------------------------------
Version: 1.1.12-3e93bc1
3 Nodes configured
5 Resources configured

Online: [ sl7-01 ]
RemoteOnline: [ snmp1 snmp2 ]

 Host-rsc1      (ocf::heartbeat:Dummy): Started sl7-01 
 Remote-rsc1    (ocf::heartbeat:Dummy): Started snmp1 
 Remote-rsc2    (ocf::heartbeat:Dummy): Started snmp2 
 snmp1  (ocf::pacemaker:remote):        Started sl7-01 
 snmp2  (ocf::pacemaker:remote):        Started sl7-01 

Node Attributes:
* Node sl7-01:
* Node snmp1:
* Node snmp2:

Migration summary:
* Node sl7-01: 
* Node snmp1: 
* Node snmp2: 
-------------------------------------------------------------

Step 2) We stop pacemaker_remoted on one remote node (snmp2).
-------------------------------------------------------------
Current DC: sl7-01 - partition WITHOUT quorum
Version: 1.1.12-3e93bc1
3 Nodes configured
5 Resources configured

Online: [ sl7-01 ]
RemoteOnline: [ snmp1 ]
RemoteOFFLINE: [ snmp2 ]

 Host-rsc1      (ocf::heartbeat:Dummy): Started sl7-01 
 Remote-rsc1    (ocf::heartbeat:Dummy): Started snmp1 
 snmp1  (ocf::pacemaker:remote):        Started sl7-01 
 snmp2  (ocf::pacemaker:remote):        FAILED sl7-01 

Node Attributes:
* Node sl7-01:
* Node snmp1:

Migration summary:
* Node sl7-01: 
   snmp2: migration-threshold=1 fail-count=1 last-failure='Fri Apr  3 12:56:12 2015'
* Node snmp1: 

Failed actions:
    snmp2_monitor_3000 on sl7-01 'unknown error' (1): call=6, status=Error, exit-reason='none', last-rc-change='Fri Apr  3 12:56:12 2015', queued=0ms, exec=0ms
-------------------------------------------------------------
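
For reference, here is a minimal Python sketch of how the "Migration summary" entry above can be read. It is only an illustration and not part of Pacemaker: the helper names parse_migration_entry and threshold_reached are hypothetical, and the rule that a resource may no longer run on a node once fail-count reaches migration-threshold (with 0 meaning "no limit") is my understanding of the documented behavior.

```python
import re

# Hypothetical helper, not part of Pacemaker: matches one crm_mon
# "Migration summary" entry such as the snmp2 line shown above.
SUMMARY_RE = re.compile(
    r"^\s*(?P<rsc>\S+): migration-threshold=(?P<threshold>\d+) "
    r"fail-count=(?P<failcount>\d+)"
)


def parse_migration_entry(line):
    """Return (resource, threshold, fail_count), or None if the line does not match."""
    m = SUMMARY_RE.match(line)
    if m is None:
        return None
    return m.group("rsc"), int(m.group("threshold")), int(m.group("failcount"))


def threshold_reached(threshold, fail_count):
    """True when fail-count has reached migration-threshold.

    Assumption: a threshold of 0 means the feature is disabled.
    """
    return threshold > 0 and fail_count >= threshold


# Entry copied from the Step 2 output above.
line = "   snmp2: migration-threshold=1 fail-count=1 last-failure='Fri Apr  3 12:56:12 2015'"
rsc, threshold, fail_count = parse_migration_entry(line)
print(rsc, threshold_reached(threshold, fail_count))
```

With migration-threshold="1" from rsc_defaults, a single monitor failure is already enough to reach the threshold on sl7-01, which is the only cluster node that can host the snmp2 connection resource.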

Step 3) We restart the pacemaker_remoted that was stopped.

Step 4) We clean up the snmp2 remote resource with the crm_resource command, but the remote node cannot rejoin the cluster.
-------------------------------------------------------------
Version: 1.1.12-3e93bc1
3 Nodes configured
5 Resources configured

Online: [ sl7-01 ]
RemoteOnline: [ snmp1 ]
RemoteOFFLINE: [ snmp2 ]

 Host-rsc1      (ocf::heartbeat:Dummy): Started sl7-01 
 Remote-rsc1    (ocf::heartbeat:Dummy): Started snmp1 
 snmp1  (ocf::pacemaker:remote):        Started sl7-01 
 snmp2  (ocf::pacemaker:remote):        FAILED sl7-01 

Node Attributes:
* Node sl7-01:
* Node snmp1:

Migration summary:
* Node sl7-01: 
   snmp2: migration-threshold=1 fail-count=1000000 last-failure='Wed Apr  8 11:21:09 2015'
* Node snmp1: 

Failed actions:
    snmp2_start_0 on sl7-01 'unknown error' (1): call=8, status=Timed Out, exit-reason='none', last-rc-change='Wed Apr  8 11:20:11 2015', queued=0ms, exec=0ms
-------------------------------------------------------------
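
To compare the failure in Step 2 with the one in Step 4, the "Failed actions" lines can be split into fields. The following is a rough Python sketch under the assumption that the lines keep the layout shown in the crm_mon output above; parse_failed_action is a hypothetical helper, not a Pacemaker API.

```python
import re

# Hypothetical pattern for one crm_mon "Failed actions" line, e.g.
#   snmp2_start_0 on sl7-01 'unknown error' (1): call=8, status=Timed Out, ...
ACTION_RE = re.compile(
    r"^\s*(?P<rsc>\S+)_(?P<op>[a-z]+)_(?P<interval>\d+) on (?P<node>\S+) "
    r"'(?P<rc_text>[^']*)' \((?P<rc>\d+)\): (?P<rest>.*)$"
)


def parse_failed_action(line):
    """Return a dict of the fields in one failed-action line, or None."""
    m = ACTION_RE.match(line)
    if m is None:
        return None
    fields = m.groupdict()
    rest = fields.pop("rest")
    # The tail is a comma-separated list of key=value pairs.
    for item in rest.split(", "):
        key, _, value = item.partition("=")
        fields[key] = value.strip("'")
    return fields


# Lines copied from the Step 2 and Step 4 outputs above.
step2 = ("    snmp2_monitor_3000 on sl7-01 'unknown error' (1): call=6, status=Error, "
         "exit-reason='none', last-rc-change='Fri Apr  3 12:56:12 2015', queued=0ms, exec=0ms")
step4 = ("    snmp2_start_0 on sl7-01 'unknown error' (1): call=8, status=Timed Out, "
         "exit-reason='none', last-rc-change='Wed Apr  8 11:20:11 2015', queued=0ms, exec=0ms")

for line in (step2, step4):
    action = parse_failed_action(line)
    print(action["rsc"], action["op"], action["status"])
```

Read this way, Step 2 is a failed recurring monitor (status=Error), while Step 4 is a start that timed out after the cleanup.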

The Pacemaker cluster node and the remote node repeatedly output the following logs.

-------------------------------------------------------------
Apr  8 11:20:38 sl7-01 crmd[17101]: info: crm_remote_tcp_connect_async: Attempting to connect to remote server at 192.168.40.110:3121
Apr  8 11:20:38 sl7-01 crmd[17101]: info: lrmd_tcp_connect_cb: Remote lrmd client TLS connection established with server snmp2:3121
Apr  8 11:20:38 sl7-01 crmd[17101]: error: lrmd_tls_recv_reply: Unable to receive expected reply, disconnecting.
Apr  8 11:20:38 sl7-01 crmd[17101]: error: lrmd_tls_send_recv: Remote lrmd server disconnected while waiting for reply with id 101.
Apr  8 11:20:38 sl7-01 crmd[17101]: info: lrmd_tls_connection_destroy: TLS connection destroyed
Apr  8 11:20:38 sl7-01 crmd[17101]: info: lrmd_api_disconnect: Disconnecting from lrmd service
-------------------------------------------------------------
Apr  8 11:20:36 snmp2 pacemaker_remoted[1502]:   notice: lrmd_remote_client_destroy: LRMD client disconnecting remote client - name: remote-lrmd-snmp2:3121 id: 8fbbc3cd-daa5-406b-942d-21be868cfc62
Apr  8 11:20:37 snmp2 pacemaker_remoted[1502]:   notice: lrmd_remote_listen: LRMD client connection established. 0xbb7ca0 id: a59392c9-6575-40ed-9b53-98a68de00409
Apr  8 11:20:38 snmp2 pacemaker_remoted[1502]:     info: lrmd_remote_client_msg: Client disconnect detected in tls msg dispatcher.
Apr  8 11:20:38 snmp2 pacemaker_remoted[1502]:   notice: lrmd_remote_client_destroy: LRMD client disconnecting remote client - name: remote-lrmd-snmp2:3121 id: a59392c9-6575-40ed-9b53-98a68de00409
Apr  8 11:20:39 snmp2 pacemaker_remoted[1502]:   notice: lrmd_remote_listen: LRMD client connection established. 0xbb7ca0 id: 0e58614c-b1c5-4e37-a917-1f8e3de5de24
Apr  8 11:20:39 snmp2 pacemaker_remoted[1502]:     info: lrmd_remote_client_msg: Client disconnect detected in tls msg dispatcher.
Apr  8 11:20:39 snmp2 pacemaker_remoted[1502]:   notice: lrmd_remote_client_destroy: LRMD client disconnecting remote client - name: remote-lrmd-snmp2:3121 id: 0e58614c-b1c5-4e37-a917-1f8e3de5de24
Apr  8 11:20:40 snmp2 pacemaker_remoted[1502]:   notice: lrmd_remote_listen: LRMD client connection established. 0xbb7ca0 id: 518bcca5-5f83-47fb-93ea-2ece33690111
-------------------------------------------------------------
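
The repeated connect/disconnect pattern in the pacemaker_remoted log can be counted with a small script. This is only a sketch in Python, assuming the message texts stay exactly as in the excerpt above; summarize_connections is a hypothetical helper name.

```python
import re

# Patterns based on the pacemaker_remoted messages quoted above.
ESTABLISHED_RE = re.compile(
    r"lrmd_remote_listen: LRMD client connection established\..* id: (?P<id>[0-9a-f-]+)"
)
DESTROYED_RE = re.compile(
    r"lrmd_remote_client_destroy: .* id: (?P<id>[0-9a-f-]+)"
)


def summarize_connections(lines):
    """Return (established, dropped): connections opened, and how many of
    those were also destroyed within the same log excerpt."""
    established = []
    destroyed = set()
    for line in lines:
        m = ESTABLISHED_RE.search(line)
        if m:
            established.append(m.group("id"))
            continue
        m = DESTROYED_RE.search(line)
        if m:
            destroyed.add(m.group("id"))
    dropped = [cid for cid in established if cid in destroyed]
    return len(established), len(dropped)


# Four lines copied from the snmp2 log above.
sample = [
    "Apr  8 11:20:36 snmp2 pacemaker_remoted[1502]:   notice: lrmd_remote_client_destroy: LRMD client disconnecting remote client - name: remote-lrmd-snmp2:3121 id: 8fbbc3cd-daa5-406b-942d-21be868cfc62",
    "Apr  8 11:20:37 snmp2 pacemaker_remoted[1502]:   notice: lrmd_remote_listen: LRMD client connection established. 0xbb7ca0 id: a59392c9-6575-40ed-9b53-98a68de00409",
    "Apr  8 11:20:38 snmp2 pacemaker_remoted[1502]:   notice: lrmd_remote_client_destroy: LRMD client disconnecting remote client - name: remote-lrmd-snmp2:3121 id: a59392c9-6575-40ed-9b53-98a68de00409",
    "Apr  8 11:20:39 snmp2 pacemaker_remoted[1502]:   notice: lrmd_remote_listen: LRMD client connection established. 0xbb7ca0 id: 0e58614c-b1c5-4e37-a917-1f8e3de5de24",
]
print(summarize_connections(sample))
```

In the excerpt above, each new connection is destroyed about one second after it is established, and the next attempt follows immediately.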

Is this behavior correct?

Best Regards,
Hideo Yamauchi.

----- Original Message -----
> From: Ulrich Windl <Ulrich.Windl at rz.uni-regensburg.de>
> To: users at clusterlabs.org
> Cc: 
> Date: 2015/4/2, Thu 22:30
> Subject: [ClusterLabs] Antw: Re: [Question] About movement of pacemaker_remote.
> 
>>>>  David Vossel <dvossel at redhat.com> wrote on 02.04.2015 at 14:58 in message
> <796820123.6644200.1427979523554.JavaMail.zimbra at redhat.com>:
> 
>> 
>>  ----- Original Message -----
>>> 
>>>  > On 14 Mar 2015, at 10:14 am, David Vossel <dvossel at redhat.com> wrote:
>>>  > 
>>>  > 
>>>  > 
>>>  > ----- Original Message -----
>>>  >> 
>>>  >> Failed actions:
>>>  >>     snmp2_start_0 on sl7-01 'unknown error' (1): call=8, status=Timed Out,
>>>  >>     exit-reason='none', last-rc-change='Thu Mar 12 14:26:26 2015',
>>>  >>     queued=0ms, exec=0ms
>>>  >>     snmp2_start_0 on sl7-01 'unknown error' (1): call=8, status=Timed Out,
>>>  >>     exit-reason='none', last-rc-change='Thu Mar 12 14:26:26 2015',
>>>  >>     queued=0ms, exec=0ms
>>>  >> -----------------------
>>>  > 
>>>  > Pacemaker is attempting to restore connection to the remote node here, are
>>>  > you sure the remote is accessible? The "Timed Out" error means that pacemaker
>>>  > was unable to establish the connection during the timeout period.
>>> 
>>>  Random question: Are we smart enough not to try and start pacemaker-remote
>>>  resources for nodes we've just fenced?
>> 
>>  We try to re-connect to remote nodes after fencing. If the fence operation
>>  was 'off' instead of 'reboot', this would make no sense. I'm not entirely
>>  sure how to handle this. We want the remote-node re-integrated into the
>>  cluster, but I'd like to optimize the case where we know the node will not be
>>  coming back online.
> 
> Beware: Even if the fencing action is "off" (for software), a human 
> may decide to boot the node anyway, also starting the cluster software.
> 
>> 
>>> 
>>> 
>>> 
>>>  _______________________________________________
>>>  Users mailing list: Users at clusterlabs.org 
>>>  http://clusterlabs.org/mailman/listinfo/users 
>>> 
>>>  Project Home: http://www.clusterlabs.org 
>>>  Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>  Bugs: http://bugs.clusterlabs.org 
>>> 
>> 
> 
>