[ClusterLabs] Antw: Re: [Question] About movement of pacemaker_remote.
renayama19661014 at ybb.ne.jp
Mon Apr 13 06:59:42 UTC 2015
Hi Andrew,
Thank you for your comments.
>> Step 4) We clear snmp2 of remote by crm_resource command,
>
> Was pacemaker_remoted running at this point?
Yes.
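(For reference, and assuming systemd is in use on the remote host, the daemon state can be double-checked with a command like the one below; this check is only illustrative.)
------------------------------
# On snmp2: confirm pacemaker_remoted is active after the restart
systemctl status pacemaker_remote
------------------------------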
On the node where pacemaker_remote was restarted, the following log appears:
------------------------------
Apr 13 15:47:29 snmp2 pacemaker_remoted[1494]: info: main: Starting ----> #### RESTARTED pacemaker_remote.
Apr 13 15:47:42 snmp2 pacemaker_remoted[1494]: notice: lrmd_remote_listen: LRMD client connection established. 0x24f4ca0 id: 5b56e54e-b9da-4804-afda-5c72038d089c
Apr 13 15:47:43 snmp2 pacemaker_remoted[1494]: info: lrmd_remote_client_msg: Client disconnect detected in tls msg dispatcher.
Apr 13 15:47:43 snmp2 pacemaker_remoted[1494]: notice: lrmd_remote_client_destroy: LRMD client disconnecting remote client - name: remote-lrmd-snmp2:3121 id: 5b56e54e-b9da-4804-afda-5c72038d089c
Apr 13 15:47:44 snmp2 pacemaker_remoted[1494]: notice: lrmd_remote_listen: LRMD client connection established. 0x24f4ca0 id: 907cd1fc-6c1d-40f1-8c60-34bc8b66715f
Apr 13 15:47:44 snmp2 pacemaker_remoted[1494]: info: lrmd_remote_client_msg: Client disconnect detected in tls msg dispatcher.
Apr 13 15:47:44 snmp2 pacemaker_remoted[1494]: notice: lrmd_remote_client_destroy: LRMD client disconnecting remote client - name: remote-lrmd-snmp2:3121 id: 907cd1fc-6c1d-40f1-8c60-34bc8b66715f
Apr 13 15:47:45 snmp2 pacemaker_remoted[1494]: notice: lrmd_remote_listen: LRMD client connection established. 0x24f4ca0 id: 8b38c0dd-9338-478a-8f23-523aee4cc1a6
Apr 13 15:47:46 snmp2 pacemaker_remoted[1494]: info: lrmd_remote_client_msg: Client disconnect detected in tls msg dispatcher.
(snip)
After that, this log repeats.
------------------------------
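For reference, a quick way to sanity-check a repeating connect/disconnect loop like this from the remote side is to confirm that pacemaker_remoted is listening on TCP 3121 and that the TLS pre-shared key matches the one on the cluster node; the commands below are only an illustrative sketch, not part of the original test.
------------------------------
# On snmp2: confirm pacemaker_remoted is listening on TCP 3121
ss -tlnp | grep 3121
# Compare the pre-shared key with the one on the cluster node;
# a key mismatch can also produce a TLS connect/disconnect loop
md5sum /etc/pacemaker/authkey
------------------------------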
> I mentioned this earlier today, we need to improve the experience in this area.
>
> Probably a good excuse to fix on-fail=ignore for start actions.
>
>> but remote cannot participate in a cluster.
I changed the crm file as follows (on-fail="ignore" for the start operation):
(snip)
primitive snmp1 ocf:pacemaker:remote \
params \
server="snmp1" \
op start interval="0s" timeout="60s" on-fail="ignore" \
op monitor interval="3s" timeout="15s" \
op stop interval="0s" timeout="60s" on-fail="ignore"
primitive snmp2 ocf:pacemaker:remote \
params \
server="snmp2" \
op start interval="0s" timeout="60s" on-fail="ignore" \
op monitor interval="3s" timeout="15s" \
op stop interval="0s" timeout="60s" on-fail="stop"
(snip)
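(For completeness, an edit like this can be applied to the running configuration with crmsh; the file name below is only an example.)
------------------------------
# Load the edited resource definitions into the live CIB
crm configure load update snmp-remote.crm
------------------------------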
However, the result was the same.
Even after running crm_resource -C for the rebooted pacemaker_remote node, the node does not rejoin the cluster.
[root@sl7-01 ~]# crm_mon -1 -Af
Last updated: Mon Apr 13 15:51:58 2015
Last change: Mon Apr 13 15:47:41 2015
Stack: corosync
Current DC: sl7-01 - partition WITHOUT quorum
Version: 1.1.12-3e93bc1
3 Nodes configured
5 Resources configured
Online: [ sl7-01 ]
RemoteOnline: [ snmp1 ]
RemoteOFFLINE: [ snmp2 ]
Host-rsc1 (ocf::heartbeat:Dummy): Started sl7-01
Remote-rsc1 (ocf::heartbeat:Dummy): Started snmp1
Remote-rsc2 (ocf::heartbeat:Dummy): Started snmp1 (failure ignored)
snmp1 (ocf::pacemaker:remote): Started sl7-01
Node Attributes:
* Node sl7-01:
* Node snmp1:
Migration summary:
* Node sl7-01:
snmp2: migration-threshold=1 fail-count=1000000 last-failure='Mon Apr 13 15:48:40 2015'
* Node snmp1:
Failed actions:
snmp2_start_0 on sl7-01 'unknown error' (1): call=8, status=Timed Out, exit-reason='none', last-rc-change='Mon Apr 13 15:47:42 2015', queued=0ms, exec=0ms
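For reference, the failed start record and fail-count shown above were cleared with crm_resource; a cleanup of this form (the resource name is taken from the configuration above) looks like:
------------------------------
# Clear the failed start action and reset the fail-count for the connection resource
crm_resource -C -r snmp2
------------------------------
Even after this cleanup, the start of snmp2 keeps timing out, as the Failed actions entry shows.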
Best Regards,
Hideo Yamauchi.
----- Original Message -----
> From: Andrew Beekhof <andrew at beekhof.net>
> To: renayama19661014 at ybb.ne.jp; Cluster Labs - All topics related to open-source clustering welcomed <users at clusterlabs.org>
> Cc:
> Date: 2015/4/13, Mon 14:11
> Subject: Re: [ClusterLabs] Antw: Re: [Question] About movement of pacemaker_remote.
>
>
>> On 8 Apr 2015, at 12:27 pm, renayama19661014 at ybb.ne.jp wrote:
>>
>> Hi All,
>>
>> Let me confirm the first question once again.
>>
>> I confirmed the next movement in Pacemaker1.1.13-rc1.
>> Stonith does not set it.
>>
>> -------------------------------------------------------------
>> property no-quorum-policy="ignore" \
>> stonith-enabled="false" \
>> startup-fencing="false" \
>>
>> rsc_defaults resource-stickiness="INFINITY" \
>> migration-threshold="1"
>>
>> primitive snmp1 ocf:pacemaker:remote \
>> params \
>> server="snmp1" \
>> op start interval="0s" timeout="60s" on-fail="restart" \
>> op monitor interval="3s" timeout="15s" \
>> op stop interval="0s" timeout="60s" on-fail="ignore"
>>
>> primitive snmp2 ocf:pacemaker:remote \
>> params \
>> server="snmp2" \
>> op start interval="0s" timeout="60s" on-fail="restart" \
>> op monitor interval="3s" timeout="15s" \
>> op stop interval="0s" timeout="60s" on-fail="stop"
>>
>> primitive Host-rsc1 ocf:heartbeat:Dummy \
>> op start interval="0s" timeout="60s" on-fail="restart" \
>> op monitor interval="10s" timeout="60s" on-fail="restart" \
>> op stop interval="0s" timeout="60s" on-fail="ignore"
>>
>> primitive Remote-rsc1 ocf:heartbeat:Dummy \
>> op start interval="0s" timeout="60s" on-fail="restart" \
>> op monitor interval="10s" timeout="60s" on-fail="restart" \
>> op stop interval="0s" timeout="60s" on-fail="ignore"
>>
>> primitive Remote-rsc2 ocf:heartbeat:Dummy \
>> op start interval="0s" timeout="60s" on-fail="restart" \
>> op monitor interval="10s" timeout="60s" on-fail="restart" \
>> op stop interval="0s" timeout="60s" on-fail="ignore"
>>
>> location loc1 Remote-rsc1 \
>> rule 200: #uname eq snmp1 \
>> rule 100: #uname eq snmp2
>> location loc2 Remote-rsc2 \
>> rule 200: #uname eq snmp2 \
>> rule 100: #uname eq snmp1
>> location loc3 Host-rsc1 \
>> rule 200: #uname eq sl7-01
>>
>> -------------------------------------------------------------
>>
>> Step 1) We use two remote nodes and constitute a cluster.
>> -------------------------------------------------------------
>> Version: 1.1.12-3e93bc1
>> 3 Nodes configured
>> 5 Resources configured
>>
>>
>> Online: [ sl7-01 ]
>> RemoteOnline: [ snmp1 snmp2 ]
>>
>> Host-rsc1 (ocf::heartbeat:Dummy): Started sl7-01
>> Remote-rsc1 (ocf::heartbeat:Dummy): Started snmp1
>> Remote-rsc2 (ocf::heartbeat:Dummy): Started snmp2
>> snmp1 (ocf::pacemaker:remote): Started sl7-01
>> snmp2 (ocf::pacemaker:remote): Started sl7-01
>>
>> Node Attributes:
>> * Node sl7-01:
>> * Node snmp1:
>> * Node snmp2:
>>
>> Migration summary:
>> * Node sl7-01:
>> * Node snmp1:
>> * Node snmp2:
>> -------------------------------------------------------------
>>
>> Step 2) We stop pacemaker_remoted in one remote.
>> -------------------------------------------------------------
>> Current DC: sl7-01 - partition WITHOUT quorum
>> Version: 1.1.12-3e93bc1
>> 3 Nodes configured
>> 5 Resources configured
>>
>>
>> Online: [ sl7-01 ]
>> RemoteOnline: [ snmp1 ]
>> RemoteOFFLINE: [ snmp2 ]
>>
>> Host-rsc1 (ocf::heartbeat:Dummy): Started sl7-01
>> Remote-rsc1 (ocf::heartbeat:Dummy): Started snmp1
>> snmp1 (ocf::pacemaker:remote): Started sl7-01
>> snmp2 (ocf::pacemaker:remote): FAILED sl7-01
>>
>> Node Attributes:
>> * Node sl7-01:
>> * Node snmp1:
>>
>> Migration summary:
>> * Node sl7-01:
>> snmp2: migration-threshold=1 fail-count=1 last-failure='Fri Apr 3 12:56:12 2015'
>> * Node snmp1:
>>
>> Failed actions:
>> snmp2_monitor_3000 on sl7-01 'unknown error' (1): call=6, status=Error, exit-reason='none', last-rc-change='Fri Apr 3 12:56:12 2015', queued=0ms, exec=0ms
>
> Ideally we’d have fencing configured and reboot the remote node here.
> But for the sake of argument, ok :)
>
>
>> -------------------------------------------------------------
>>
>> Step 3) We reboot pacemaker_remoted which stopped.
>
> As in you reboot the node on which pacemaker_remoted is stopped and
> pacemaker_remoted is configured to start at boot?
>
>>
>> Step 4) We clear snmp2 of remote by crm_resource command,
>
> Was pacemaker_remoted running at this point?
> I mentioned this earlier today, we need to improve the experience in this area.
>
> Probably a good excuse to fix on-fail=ignore for start actions.
>
>> but remote cannot participate in a cluster.
>> -------------------------------------------------------------
>> Version: 1.1.12-3e93bc1
>> 3 Nodes configured
>> 5 Resources configured
>>
>>
>> Online: [ sl7-01 ]
>> RemoteOnline: [ snmp1 ]
>> RemoteOFFLINE: [ snmp2 ]
>>
>> Host-rsc1 (ocf::heartbeat:Dummy): Started sl7-01
>> Remote-rsc1 (ocf::heartbeat:Dummy): Started snmp1
>> snmp1 (ocf::pacemaker:remote): Started sl7-01
>> snmp2 (ocf::pacemaker:remote): FAILED sl7-01
>>
>> Node Attributes:
>> * Node sl7-01:
>> * Node snmp1:
>>
>> Migration summary:
>> * Node sl7-01:
>> snmp2: migration-threshold=1 fail-count=1000000 last-failure='Wed Apr 8 11:21:09 2015'
>> * Node snmp1:
>>
>> Failed actions:
>> snmp2_start_0 on sl7-01 'unknown error' (1): call=8, status=Timed Out, exit-reason='none', last-rc-change='Wed Apr 8 11:20:11 2015', queued=0ms, exec=0ms
>> -------------------------------------------------------------
>>
>>
>> Node of pacemaker and the remote node output the following log repeatedly.
>>
>> -------------------------------------------------------------
>> Apr 8 11:20:38 sl7-01 crmd[17101]: info: crm_remote_tcp_connect_async: Attempting to connect to remote server at 192.168.40.110:3121
>> Apr 8 11:20:38 sl7-01 crmd[17101]: info: lrmd_tcp_connect_cb: Remote lrmd client TLS connection established with server snmp2:3121
>> Apr 8 11:20:38 sl7-01 crmd[17101]: error: lrmd_tls_recv_reply: Unable to receive expected reply, disconnecting.
>> Apr 8 11:20:38 sl7-01 crmd[17101]: error: lrmd_tls_send_recv: Remote lrmd server disconnected while waiting for reply with id 101.
>> Apr 8 11:20:38 sl7-01 crmd[17101]: info: lrmd_tls_connection_destroy: TLS connection destroyed
>> Apr 8 11:20:38 sl7-01 crmd[17101]: info: lrmd_api_disconnect: Disconnecting from lrmd service
>> -------------------------------------------------------------
>> Apr 8 11:20:36 snmp2 pacemaker_remoted[1502]: notice: lrmd_remote_client_destroy: LRMD client disconnecting remote client - name: remote-lrmd-snmp2:3121 id: 8fbbc3cd-daa5-406b-942d-21be868cfc62
>> Apr 8 11:20:37 snmp2 pacemaker_remoted[1502]: notice: lrmd_remote_listen: LRMD client connection established. 0xbb7ca0 id: a59392c9-6575-40ed-9b53-98a68de00409
>> Apr 8 11:20:38 snmp2 pacemaker_remoted[1502]: info: lrmd_remote_client_msg: Client disconnect detected in tls msg dispatcher.
>> Apr 8 11:20:38 snmp2 pacemaker_remoted[1502]: notice: lrmd_remote_client_destroy: LRMD client disconnecting remote client - name: remote-lrmd-snmp2:3121 id: a59392c9-6575-40ed-9b53-98a68de00409
>> Apr 8 11:20:39 snmp2 pacemaker_remoted[1502]: notice: lrmd_remote_listen: LRMD client connection established. 0xbb7ca0 id: 0e58614c-b1c5-4e37-a917-1f8e3de5de24
>> Apr 8 11:20:39 snmp2 pacemaker_remoted[1502]: info: lrmd_remote_client_msg: Client disconnect detected in tls msg dispatcher.
>> Apr 8 11:20:39 snmp2 pacemaker_remoted[1502]: notice: lrmd_remote_client_destroy: LRMD client disconnecting remote client - name: remote-lrmd-snmp2:3121 id: 0e58614c-b1c5-4e37-a917-1f8e3de5de24
>> Apr 8 11:20:40 snmp2 pacemaker_remoted[1502]: notice: lrmd_remote_listen: LRMD client connection established. 0xbb7ca0 id: 518bcca5-5f83-47fb-93ea-2ece33690111
>> -------------------------------------------------------------
>>
>> Is this movement right?
>>
>> Best Regards,
>> Hideo Yamauchi.
>>
>>
>>
>> ----- Original Message -----
>>> From: Ulrich Windl <Ulrich.Windl at rz.uni-regensburg.de>
>>> To: users at clusterlabs.org
>>> Cc:
>>> Date: 2015/4/2, Thu 22:30
>>> Subject: [ClusterLabs] Antw: Re: [Question] About movement of
> pacemaker_remote.
>>>
>>>>>> David Vossel <dvossel at redhat.com> wrote on 02.04.2015 at 14:58 in message
>>>>>> <796820123.6644200.1427979523554.JavaMail.zimbra at redhat.com>:
>>>
>>>>
>>>> ----- Original Message -----
>>>>>
>>>>>> On 14 Mar 2015, at 10:14 am, David Vossel
>>> <dvossel at redhat.com> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> ----- Original Message -----
>>>>>>>
>>>>>>> Failed actions:
>>>>>>> snmp2_start_0 on sl7-01 'unknown error' (1): call=8, status=Timed Out, exit-reason='none', last-rc-change='Thu Mar 12 14:26:26 2015', queued=0ms, exec=0ms
>>>>>>> snmp2_start_0 on sl7-01 'unknown error' (1): call=8, status=Timed Out, exit-reason='none', last-rc-change='Thu Mar 12 14:26:26 2015', queued=0ms, exec=0ms
>>>>>>> -----------------------
>>>>>>
>>>>>> Pacemaker is attempting to restore connection to the remote node here, are you
>>>>>> sure the remote is accessible? The "Timed Out" error means that pacemaker was
>>>>>> unable to establish the connection during the timeout period.
>>>>>
>>>>> Random question: Are we smart enough not to try and start pacemaker-remote
>>>>> resources for nodes we've just fenced?
>>>>
>>>> we try and re-connect to remote nodes after fencing. if the fence operation
>>>> was 'off' instead of 'reboot', this would make no sense. I'm not entirely
>>>> sure how to handle this. We want the remote-node re-integrated into the
>>>> cluster, but i'd like to optimize the case where we know the node will not
>>>> be coming back online.
>>>
>>> Beware: Even if the fencing action is "off" (for software), a human
>>> may decide to boot the node anyway, also starting the cluster software.
>>>
>>>>
>>>>>
>>>>>
>>>>>