[ClusterLabs] Antw: Re: [Question] About movement of pacemaker_remote.
renayama19661014 at ybb.ne.jp
Wed Apr 15 01:37:04 UTC 2015
Hi David,
Thank you for your comments.
> please turn on debug logging in /etc/sysconfig/pacemaker for both the pacemaker
> nodes and the nodes running pacemaker remote.
>
> set the following
>
> PCMK_logfile=/var/log/pacemaker.log
> PCMK_debug=yes
> PCMK_trace_files=lrmd_client.c,lrmd.c,tls_backend.c,remote.c
>
> Provide the logs with the new debug settings enabled during the time period
> that pacemaker is unable to reconnect to pacemaker_remote.
I uploaded the logs (log_zip.zip) from both nodes (sl7-01 and snmp2) here:
* https://onedrive.live.com/?cid=3A14D57622C66876&id=3A14D57622C66876%21117
I restarted pacemaker_remote on snmp2.
Afterwards I ran crm_resource -C snmp2.
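Concretely, the reproduction steps were roughly as follows (a sketch of the kind of commands involved; I spell the cleanup out with -r here, and the systemd unit name is as packaged on my systems):

# On snmp2: restart the pacemaker_remote daemon.
systemctl restart pacemaker_remote

# On sl7-01: clear the failure state of the remote-node resource so that
# the cluster retries the connection.
crm_resource -C -r snmp2
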
-------------------------------------------------
[root at sl7-01 ~]# crm_mon -1 -Af
Last updated: Wed Apr 15 09:54:16 2015
Last change: Wed Apr 15 09:46:17 2015
Stack: corosync
Current DC: sl7-01 - partition WITHOUT quorum
Version: 1.1.12-3e93bc1
3 Nodes configured
5 Resources configured
Online: [ sl7-01 ]
RemoteOnline: [ snmp1 ]
RemoteOFFLINE: [ snmp2 ]
Host-rsc1 (ocf::heartbeat:Dummy): Started sl7-01
Remote-rsc1 (ocf::heartbeat:Dummy): Started snmp1
Remote-rsc2 (ocf::heartbeat:Dummy): Started snmp1 (failure ignored)
snmp1 (ocf::pacemaker:remote): Started sl7-01
Node Attributes:
* Node sl7-01:
* Node snmp1:
Migration summary:
* Node sl7-01:
snmp2: migration-threshold=1 fail-count=1000000 last-failure='Wed Apr 15 09:47:16 2015'
* Node snmp1:
Failed actions:
snmp2_start_0 on sl7-01 'unknown error' (1): call=8, status=Timed Out, exit-reason='none', last-rc-change='Wed Apr 15 09:46:18 2015', queued=0ms, exec=0ms
-------------------------------------------------
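For what it is worth, each reconnection attempt does seem to reach pacemaker_remoted on snmp2 (the snmp2 log from my previous mail, quoted below, shows "LRMD client connection established" each time just before the disconnect), so plain reachability of port 3121 does not appear to be the problem. That can also be checked by hand with something like the following (only an illustrative check; nc may not be installed by default):

# From sl7-01: verify that the pacemaker_remoted TLS port on snmp2 accepts TCP connections.
nc -vz snmp2 3121
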
Best Regards,
Hideo Yamauchi.
----- Original Message -----
> From: David Vossel <dvossel at redhat.com>
> To: renayama19661014 at ybb.ne.jp; Cluster Labs - All topics related to open-source clustering welcomed <users at clusterlabs.org>
> Cc:
> Date: 2015/4/15, Wed 07:22
> Subject: Re: [ClusterLabs] Antw: Re: [Question] About movement of pacemaker_remote.
>
>
>
> ----- Original Message -----
>> Hi Andrew,
>>
>> Thank you for comments.
>>
>> >> Step 4) We clean up the remote resource snmp2 with the crm_resource command,
>> >
>> > Was pacemaker_remoted running at this point?
>
> please turn on debug logging in /etc/sysconfig/pacemaker for both the pacemaker
> nodes and the nodes running pacemaker remote.
>
> set the following
>
> PCMK_logfile=/var/log/pacemaker.log
> PCMK_debug=yes
> PCMK_trace_files=lrmd_client.c,lrmd.c,tls_backend.c,remote.c
>
> Provide the logs with the new debug settings enabled during the time period
> that pacemaker is unable to reconnect to pacemaker_remote.
>
> Thanks,
> --David
>
>>
>>
>> Yes.
>>
>> On the node where pacemaker_remote was restarted, the log looks like this:
>>
>>
>> ------------------------------
>> Apr 13 15:47:29 snmp2 pacemaker_remoted[1494]: info: main: Starting   ----> #### RESTARTED pacemaker_remote.
>> Apr 13 15:47:42 snmp2 pacemaker_remoted[1494]: notice: lrmd_remote_listen: LRMD client connection established. 0x24f4ca0 id: 5b56e54e-b9da-4804-afda-5c72038d089c
>> Apr 13 15:47:43 snmp2 pacemaker_remoted[1494]: info: lrmd_remote_client_msg: Client disconnect detected in tls msg dispatcher.
>> Apr 13 15:47:43 snmp2 pacemaker_remoted[1494]: notice: lrmd_remote_client_destroy: LRMD client disconnecting remote client - name: remote-lrmd-snmp2:3121 id: 5b56e54e-b9da-4804-afda-5c72038d089c
>> Apr 13 15:47:44 snmp2 pacemaker_remoted[1494]: notice: lrmd_remote_listen: LRMD client connection established. 0x24f4ca0 id: 907cd1fc-6c1d-40f1-8c60-34bc8b66715f
>> Apr 13 15:47:44 snmp2 pacemaker_remoted[1494]: info: lrmd_remote_client_msg: Client disconnect detected in tls msg dispatcher.
>> Apr 13 15:47:44 snmp2 pacemaker_remoted[1494]: notice: lrmd_remote_client_destroy: LRMD client disconnecting remote client - name: remote-lrmd-snmp2:3121 id: 907cd1fc-6c1d-40f1-8c60-34bc8b66715f
>> Apr 13 15:47:45 snmp2 pacemaker_remoted[1494]: notice: lrmd_remote_listen: LRMD client connection established. 0x24f4ca0 id: 8b38c0dd-9338-478a-8f23-523aee4cc1a6
>> Apr 13 15:47:46 snmp2 pacemaker_remoted[1494]: info: lrmd_remote_client_msg: Client disconnect detected in tls msg dispatcher.
>> (snip)
>> After that, the same log messages repeat.
>>
>>
>> ------------------------------
>>
>>
>> > I mentioned this earlier today, we need to improve the experience in this area.
>> >
>> > Probably a good excuse to fix on-fail=ignore for start actions.
>> >
>> >> but the remote node cannot rejoin the cluster.
>>
>>
>>
>> I changed the crm file as follows (on-fail=ignore for start).
>>
>>
>> (snip)
>> primitive snmp1 ocf:pacemaker:remote \
>> params \
>> server="snmp1" \
>> op start interval="0s" timeout="60s" on-fail="ignore" \
>> op monitor interval="3s" timeout="15s" \
>> op stop interval="0s" timeout="60s" on-fail="ignore"
>>
>> primitive snmp2 ocf:pacemaker:remote \
>> params \
>> server="snmp2" \
>> op start interval="0s" timeout="60s" on-fail="ignore" \
>> op monitor interval="3s" timeout="15s" \
>> op stop interval="0s" timeout="60s" on-fail="stop"
>>
>> (snip)
>>
>> However, the result was the same.
>> Even when I run crm_resource -C for the remote node whose pacemaker_remote was restarted, the node does not rejoin the cluster.
>>
>> [root at sl7-01 ~]# crm_mon -1 -Af
>> Last updated: Mon Apr 13 15:51:58 2015
>> Last change: Mon Apr 13 15:47:41 2015
>> Stack: corosync
>> Current DC: sl7-01 - partition WITHOUT quorum
>> Version: 1.1.12-3e93bc1
>> 3 Nodes configured
>> 5 Resources configured
>>
>>
>> Online: [ sl7-01 ]
>> RemoteOnline: [ snmp1 ]
>> RemoteOFFLINE: [ snmp2 ]
>>
>> Host-rsc1 (ocf::heartbeat:Dummy): Started sl7-01
>> Remote-rsc1 (ocf::heartbeat:Dummy): Started snmp1
>> Remote-rsc2 (ocf::heartbeat:Dummy): Started snmp1 (failure ignored)
>> snmp1 (ocf::pacemaker:remote): Started sl7-01
>>
>> Node Attributes:
>> * Node sl7-01:
>> * Node snmp1:
>>
>> Migration summary:
>> * Node sl7-01:
>> snmp2: migration-threshold=1 fail-count=1000000 last-failure='Mon Apr 13 15:48:40 2015'
>> * Node snmp1:
>>
>> Failed actions:
>> snmp2_start_0 on sl7-01 'unknown error' (1): call=8, status=Timed Out, exit-reason='none', last-rc-change='Mon Apr 13 15:47:42 2015', queued=0ms, exec=0ms
>>
>>
>>
>> Best Regards,
>> Hideo Yamauchi.
>>
>>
>>
>> ----- Original Message -----
>> > From: Andrew Beekhof <andrew at beekhof.net>
>> > To: renayama19661014 at ybb.ne.jp; Cluster Labs - All topics related to
>> > open-source clustering welcomed <users at clusterlabs.org>
>> > Cc:
>> > Date: 2015/4/13, Mon 14:11
>> > Subject: Re: [ClusterLabs] Antw: Re: [Question] About movement of
>> > pacemaker_remote.
>> >
>> >
>> >> On 8 Apr 2015, at 12:27 pm, renayama19661014 at ybb.ne.jp wrote:
>> >>
>> >> Hi All,
>> >>
>> >> Let me confirm the first question once again.
>> >>
>> >> I confirmed the following behavior with Pacemaker 1.1.13-rc1.
>> >> Stonith is not configured.
>> >>
>> >> -------------------------------------------------------------
>> >> property no-quorum-policy="ignore" \
>> >> stonith-enabled="false" \
>> >> startup-fencing="false" \
>> >>
>> >> rsc_defaults resource-stickiness="INFINITY" \
>> >> migration-threshold="1"
>> >>
>> >> primitive snmp1 ocf:pacemaker:remote \
>> >> params \
>> >> server="snmp1" \
>> >> op start interval="0s" timeout="60s" on-fail="restart" \
>> >> op monitor interval="3s" timeout="15s" \
>> >> op stop interval="0s" timeout="60s" on-fail="ignore"
>> >>
>> >> primitive snmp2 ocf:pacemaker:remote \
>> >> params \
>> >> server="snmp2" \
>> >> op start interval="0s" timeout="60s" on-fail="restart" \
>> >> op monitor interval="3s" timeout="15s" \
>> >> op stop interval="0s" timeout="60s" on-fail="stop"
>> >>
>> >> primitive Host-rsc1 ocf:heartbeat:Dummy \
>> >> op start interval="0s" timeout="60s" on-fail="restart" \
>> >> op monitor interval="10s" timeout="60s" on-fail="restart" \
>> >> op stop interval="0s" timeout="60s" on-fail="ignore"
>> >>
>> >> primitive Remote-rsc1 ocf:heartbeat:Dummy \
>> >> op start interval="0s" timeout="60s" on-fail="restart" \
>> >> op monitor interval="10s" timeout="60s" on-fail="restart" \
>> >> op stop interval="0s" timeout="60s" on-fail="ignore"
>> >>
>> >> primitive Remote-rsc2 ocf:heartbeat:Dummy \
>> >> op start interval="0s" timeout="60s" on-fail="restart" \
>> >> op monitor interval="10s" timeout="60s" on-fail="restart" \
>> >> op stop interval="0s" timeout="60s" on-fail="ignore"
>> >>
>> >> location loc1 Remote-rsc1 \
>> >> rule 200: #uname eq snmp1 \
>> >> rule 100: #uname eq snmp2
>> >> location loc2 Remote-rsc2 \
>> >> rule 200: #uname eq snmp2 \
>> >> rule 100: #uname eq snmp1
>> >> location loc3 Host-rsc1 \
>> >> rule 200: #uname eq sl7-01
>> >>
>> >> -------------------------------------------------------------
>> >>
>> >> Step 1) We build a cluster that uses two remote nodes.
>> >> -------------------------------------------------------------
>> >> Version: 1.1.12-3e93bc1
>> >> 3 Nodes configured
>> >> 5 Resources configured
>> >>
>> >>
>> >> Online: [ sl7-01 ]
>> >> RemoteOnline: [ snmp1 snmp2 ]
>> >>
>> >> Host-rsc1 (ocf::heartbeat:Dummy): Started sl7-01
>> >> Remote-rsc1 (ocf::heartbeat:Dummy): Started snmp1
>> >> Remote-rsc2 (ocf::heartbeat:Dummy): Started snmp2
>> >> snmp1 (ocf::pacemaker:remote): Started sl7-01
>> >> snmp2 (ocf::pacemaker:remote): Started sl7-01
>> >>
>> >> Node Attributes:
>> >> * Node sl7-01:
>> >> * Node snmp1:
>> >> * Node snmp2:
>> >>
>> >> Migration summary:
>> >> * Node sl7-01:
>> >> * Node snmp1:
>> >> * Node snmp2:
>> >> -------------------------------------------------------------
>> >>
>> >> Step 2) We stop pacemaker_remoted on one of the remote nodes.
>> >> -------------------------------------------------------------
>> >> Current DC: sl7-01 - partition WITHOUT quorum
>> >> Version: 1.1.12-3e93bc1
>> >> 3 Nodes configured
>> >> 5 Resources configured
>> >>
>> >>
>> >> Online: [ sl7-01 ]
>> >> RemoteOnline: [ snmp1 ]
>> >> RemoteOFFLINE: [ snmp2 ]
>> >>
>> >> Host-rsc1 (ocf::heartbeat:Dummy): Started sl7-01
>> >> Remote-rsc1 (ocf::heartbeat:Dummy): Started snmp1
>> >> snmp1 (ocf::pacemaker:remote): Started sl7-01
>> >> snmp2 (ocf::pacemaker:remote): FAILED sl7-01
>> >>
>> >> Node Attributes:
>> >> * Node sl7-01:
>> >> * Node snmp1:
>> >>
>> >> Migration summary:
>> >> * Node sl7-01:
>> >> snmp2: migration-threshold=1 fail-count=1 last-failure='Fri Apr 3 12:56:12 2015'
>> >> * Node snmp1:
>> >>
>> >> Failed actions:
>> >> snmp2_monitor_3000 on sl7-01 'unknown error' (1): call=6, status=Error, exit-reason='none', last-rc-change='Fri Apr 3 12:56:12 2015', queued=0ms, exec=0ms
>> >
>> > Ideally we’d have fencing configured and reboot the remote node here.
>> > But for the sake of argument, ok :)
>> >
>> >
>> >> -------------------------------------------------------------
>> >>
>> >> Step 3) We reboot pacemaker_remoted which stopped.
>> >
>> > As in you reboot the node on which pacemaker_remoted is stopped and
>> > pacemaker_remoted is configured to start at boot?
>> >
>> >>
>> >> Step 4) We clean up the remote resource snmp2 with the crm_resource command,
>> >
>> > Was pacemaker_remoted running at this point?
>> > I mentioned this earlier today, we need to improve the experience in this area.
>> >
>> > Probably a good excuse to fix on-fail=ignore for start actions.
>> >
>> >> but the remote node cannot rejoin the cluster.
>> >> -------------------------------------------------------------
>> >> Version: 1.1.12-3e93bc1
>> >> 3 Nodes configured
>> >> 5 Resources configured
>> >>
>> >>
>> >> Online: [ sl7-01 ]
>> >> RemoteOnline: [ snmp1 ]
>> >> RemoteOFFLINE: [ snmp2 ]
>> >>
>> >> Host-rsc1 (ocf::heartbeat:Dummy): Started sl7-01
>> >> Remote-rsc1 (ocf::heartbeat:Dummy): Started snmp1
>> >> snmp1 (ocf::pacemaker:remote): Started sl7-01
>> >> snmp2 (ocf::pacemaker:remote): FAILED sl7-01
>> >>
>> >> Node Attributes:
>> >> * Node sl7-01:
>> >> * Node snmp1:
>> >>
>> >> Migration summary:
>> >> * Node sl7-01:
>> >> snmp2: migration-threshold=1 fail-count=1000000 last-failure='Wed Apr 8 11:21:09 2015'
>> >> * Node snmp1:
>> >>
>> >> Failed actions:
>> >> snmp2_start_0 on sl7-01 'unknown error' (1): call=8, status=Timed Out, exit-reason='none', last-rc-change='Wed Apr 8 11:20:11 2015', queued=0ms, exec=0ms
>> >> -------------------------------------------------------------
>> >>
>> >>
>> >> The pacemaker node and the remote node output the following logs repeatedly.
>> >>
>> >> -------------------------------------------------------------
>> >> Apr 8 11:20:38 sl7-01 crmd[17101]: info: crm_remote_tcp_connect_async: Attempting to connect to remote server at 192.168.40.110:3121
>> >> Apr 8 11:20:38 sl7-01 crmd[17101]: info: lrmd_tcp_connect_cb: Remote lrmd client TLS connection established with server snmp2:3121
>> >> Apr 8 11:20:38 sl7-01 crmd[17101]: error: lrmd_tls_recv_reply: Unable to receive expected reply, disconnecting.
>> >> Apr 8 11:20:38 sl7-01 crmd[17101]: error: lrmd_tls_send_recv: Remote lrmd server disconnected while waiting for reply with id 101.
>> >> Apr 8 11:20:38 sl7-01 crmd[17101]: info: lrmd_tls_connection_destroy: TLS connection destroyed
>> >> Apr 8 11:20:38 sl7-01 crmd[17101]: info: lrmd_api_disconnect: Disconnecting from lrmd service
>> >> -------------------------------------------------------------
>> >> Apr 8 11:20:36 snmp2 pacemaker_remoted[1502]: notice: lrmd_remote_client_destroy: LRMD client disconnecting remote client - name: remote-lrmd-snmp2:3121 id: 8fbbc3cd-daa5-406b-942d-21be868cfc62
>> >> Apr 8 11:20:37 snmp2 pacemaker_remoted[1502]: notice: lrmd_remote_listen: LRMD client connection established. 0xbb7ca0 id: a59392c9-6575-40ed-9b53-98a68de00409
>> >> Apr 8 11:20:38 snmp2 pacemaker_remoted[1502]: info: lrmd_remote_client_msg: Client disconnect detected in tls msg dispatcher.
>> >> Apr 8 11:20:38 snmp2 pacemaker_remoted[1502]: notice: lrmd_remote_client_destroy: LRMD client disconnecting remote client - name: remote-lrmd-snmp2:3121 id: a59392c9-6575-40ed-9b53-98a68de00409
>> >> Apr 8 11:20:39 snmp2 pacemaker_remoted[1502]: notice: lrmd_remote_listen: LRMD client connection established. 0xbb7ca0 id: 0e58614c-b1c5-4e37-a917-1f8e3de5de24
>> >> Apr 8 11:20:39 snmp2 pacemaker_remoted[1502]: info: lrmd_remote_client_msg: Client disconnect detected in tls msg dispatcher.
>> >> Apr 8 11:20:39 snmp2 pacemaker_remoted[1502]: notice: lrmd_remote_client_destroy: LRMD client disconnecting remote client - name: remote-lrmd-snmp2:3121 id: 0e58614c-b1c5-4e37-a917-1f8e3de5de24
>> >> Apr 8 11:20:40 snmp2 pacemaker_remoted[1502]: notice: lrmd_remote_listen: LRMD client connection established. 0xbb7ca0 id: 518bcca5-5f83-47fb-93ea-2ece33690111
>> >> -------------------------------------------------------------
>> >>
>> >> Is this behavior correct?
>> >>
>> >> Best Regards,
>> >> Hideo Yamauchi.
>> >>
>> >>
>> >>
>> >> ----- Original Message -----
>> >>> From: Ulrich Windl <Ulrich.Windl at rz.uni-regensburg.de>
>> >>> To: users at clusterlabs.org
>> >>> Cc:
>> >>> Date: 2015/4/2, Thu 22:30
>> >>> Subject: [ClusterLabs] Antw: Re: [Question] About movement of pacemaker_remote.
>> >>>
>> >>>>>> David Vossel <dvossel at redhat.com> wrote on 02.04.2015 at 14:58 in message
>> >>>>>> <796820123.6644200.1427979523554.JavaMail.zimbra at redhat.com>:
>> >>>
>> >>>>
>> >>>> ----- Original Message -----
>> >>>>>
>> >>>>>> On 14 Mar 2015, at 10:14 am, David Vossel <dvossel at redhat.com> wrote:
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>> ----- Original Message -----
>> >>>>>>>
>> >>>>>>> Failed actions:
>> >>>>>>> snmp2_start_0 on sl7-01 'unknown error' (1): call=8, status=Timed Out, exit-reason='none', last-rc-change='Thu Mar 12 14:26:26 2015', queued=0ms, exec=0ms
>> >>>>>>> snmp2_start_0 on sl7-01 'unknown error' (1): call=8, status=Timed Out, exit-reason='none', last-rc-change='Thu Mar 12 14:26:26 2015', queued=0ms, exec=0ms
>> >>>>>>> -----------------------
>> >>>>>>
>> >>>>>> Pacemaker is attempting to restore connection to the remote node here, are
>> >>>>>> you sure the remote is accessible? The "Timed Out" error means that pacemaker
>> >>>>>> was unable to establish the connection during the timeout period.
>> >>>>>
>> >>>>> Random question: Are we smart enough not to try and start pacemaker-remote
>> >>>>> resources for nodes we've just fenced?
>> >>>>
>> >>>> we try and re-connect to remote nodes after fencing. if the fence operation
>> >>>> was 'off' instead of 'reboot', this would make no sense. I'm not entirely
>> >>>> sure how to handle this. We want the remote-node re-integrated into the
>> >>>> cluster, but i'd like to optimize the case where we know the node will not
>> >>>> be coming back online.
>> >>>
>> >>> Beware: Even if the fencing action is "off" (for software), a human
>> >>> may decide to boot the node anyway, also starting the cluster software.
>> >>>
>> >>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >
>>
>> _______________________________________________
>> Users mailing list: Users at clusterlabs.org
>> http://clusterlabs.org/mailman/listinfo/users
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>>
>