[ClusterLabs] Antw: Re: [Question] About movement of pacemaker_remote.

Tue Apr 14 22:22:46 UTC 2015

----- Original Message -----
> Hi Andrew,
> 
> Thank you for your comments.
> 
> >> Step 4) We clean up the snmp2 remote resource with the crm_resource command,
> > 
> > Was pacemaker_remoted running at this point?

Please turn on debug logging in /etc/sysconfig/pacemaker on both the Pacemaker
cluster nodes and the nodes running pacemaker_remote.

Set the following:

PCMK_logfile=/var/log/pacemaker.log
PCMK_debug=yes
PCMK_trace_files=lrmd_client.c,lrmd.c,tls_backend.c,remote.c
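
To confirm that each node really carries these three values before reproducing
the problem, the file can be parsed and compared against them. The following is
only a minimal sketch and is not part of Pacemaker: the names parse_sysconfig
and missing_debug_settings are hypothetical, and it assumes the file consists
of plain KEY=value lines with # comments.

```python
# Hypothetical helper, not shipped with Pacemaker: checks that the debug
# settings requested above are present in a sysconfig-style file.

REQUIRED = {
    "PCMK_logfile": "/var/log/pacemaker.log",
    "PCMK_debug": "yes",
    "PCMK_trace_files": "lrmd_client.c,lrmd.c,tls_backend.c,remote.c",
}


def parse_sysconfig(text):
    """Return a dict of KEY=value pairs, skipping blank lines and # comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Tolerate optional shell-style quoting around the value.
        settings[key.strip()] = value.strip().strip('"').strip("'")
    return settings


def missing_debug_settings(text, required=REQUIRED):
    """Return the required keys whose value is absent or different."""
    found = parse_sysconfig(text)
    return sorted(key for key, value in required.items() if found.get(key) != value)


# Example on an in-memory sample in which PCMK_debug is still commented out.
sample = "PCMK_logfile=/var/log/pacemaker.log\n# PCMK_debug=yes\n"
print(missing_debug_settings(sample))
```

On a real node, the text would be read from /etc/sysconfig/pacemaker; an empty
list means all three settings match.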

Then provide the logs, captured with these debug settings enabled, covering the
time period in which Pacemaker is unable to reconnect to pacemaker_remote.
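
To find the right time period to send, it can help to count how often the
remote daemon logs a new connection followed by a disconnect. The following is
a rough sketch under stated assumptions: summarize_churn is a hypothetical
helper, the message fragments are copied from the pacemaker_remoted excerpts
quoted in this thread, and each log entry is assumed to sit on one line that
starts with a syslog-style timestamp. The file written via PCMK_logfile may use
a different prefix, in which case the pattern needs adjusting.

```python
import re
from collections import Counter

# Message fragments copied from the pacemaker_remoted excerpts in this thread.
EVENTS = {
    "connect": "LRMD client connection established",
    "disconnect": "Client disconnect detected in tls msg dispatcher",
}

# Assumed syslog-style timestamp at the start of a line, e.g. "Apr 13 15:47:42".
STAMP = re.compile(r"^(\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s")


def summarize_churn(lines):
    """Count connect/disconnect events and note the first and last timestamp."""
    counts = Counter()
    first = last = None
    for line in lines:
        match = STAMP.match(line)
        if not match or "pacemaker_remoted" not in line:
            continue
        for name, fragment in EVENTS.items():
            if fragment in line:
                counts[name] += 1
                first = first or match.group(1)
                last = match.group(1)
    return counts, first, last


# Example with two entries taken from the excerpt quoted in this thread.
example = [
    "Apr 13 15:47:42 snmp2 pacemaker_remoted[1494]:   notice: lrmd_remote_listen: "
    "LRMD client connection established. 0x24f4ca0 id: 5b56e54e-b9da-4804-afda-5c72038d089c",
    "Apr 13 15:47:43 snmp2 pacemaker_remoted[1494]:     info: lrmd_remote_client_msg: "
    "Client disconnect detected in tls msg dispatcher.",
]
print(summarize_churn(example))
```

Roughly equal connect and disconnect counts about a second apart would match
the reconnect loop described in this thread; the first and last timestamps give
the window of logs to attach.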

Thanks,
--David

> 
> 
> Yes.
> 
> On the node where pacemaker_remote was restarted, the following is logged.
> 
> 
> ------------------------------
> Apr 13 15:47:29 snmp2 pacemaker_remoted[1494]:     info: main: Starting ----> #### RESTARTED pacemaker_remote.
> Apr 13 15:47:42 snmp2 pacemaker_remoted[1494]:   notice: lrmd_remote_listen: LRMD client connection established. 0x24f4ca0 id: 5b56e54e-b9da-4804-afda-5c72038d089c
> Apr 13 15:47:43 snmp2 pacemaker_remoted[1494]:     info: lrmd_remote_client_msg: Client disconnect detected in tls msg dispatcher.
> Apr 13 15:47:43 snmp2 pacemaker_remoted[1494]:   notice: lrmd_remote_client_destroy: LRMD client disconnecting remote client - name: remote-lrmd-snmp2:3121 id: 5b56e54e-b9da-4804-afda-5c72038d089c
> Apr 13 15:47:44 snmp2 pacemaker_remoted[1494]:   notice: lrmd_remote_listen: LRMD client connection established. 0x24f4ca0 id: 907cd1fc-6c1d-40f1-8c60-34bc8b66715f
> Apr 13 15:47:44 snmp2 pacemaker_remoted[1494]:     info: lrmd_remote_client_msg: Client disconnect detected in tls msg dispatcher.
> Apr 13 15:47:44 snmp2 pacemaker_remoted[1494]:   notice: lrmd_remote_client_destroy: LRMD client disconnecting remote client - name: remote-lrmd-snmp2:3121 id: 907cd1fc-6c1d-40f1-8c60-34bc8b66715f
> Apr 13 15:47:45 snmp2 pacemaker_remoted[1494]:   notice: lrmd_remote_listen: LRMD client connection established. 0x24f4ca0 id: 8b38c0dd-9338-478a-8f23-523aee4cc1a6
> Apr 13 15:47:46 snmp2 pacemaker_remoted[1494]:     info: lrmd_remote_client_msg: Client disconnect detected in tls msg dispatcher.
> (snip)
> After that, the same log entries repeat.
> 
> 
> ------------------------------
> 
> 
> > I mentioned this earlier today, we need to improve the experience in this
> > area.
> > 
> > Probably a good excuse to fix on-fail=ignore for start actions.
> > 
> >> but the remote node cannot rejoin the cluster.
> 
> 
> 
> I changed the crm file as follows (on-fail=ignore for start).
> 
> 
> (snip)
> primitive snmp1 ocf:pacemaker:remote \
>         params \
>                 server="snmp1" \
>         op start interval="0s" timeout="60s" on-fail="ignore" \
>         op monitor interval="3s" timeout="15s" \
>         op stop interval="0s" timeout="60s" on-fail="ignore"
> 
> primitive snmp2 ocf:pacemaker:remote \
>         params \
>                 server="snmp2" \
>         op start interval="0s" timeout="60s" on-fail="ignore" \
>         op monitor interval="3s" timeout="15s" \
>         op stop interval="0s" timeout="60s" on-fail="stop"
> 
> (snip)
> 
> However, the result was the same.
> Even after we run crm_resource -C for the node whose pacemaker_remote was
> restarted, the node does not rejoin the cluster.
> 
> [root@sl7-01 ~]# crm_mon -1 -Af
> Last updated: Mon Apr 13 15:51:58 2015
> Last change: Mon Apr 13 15:47:41 2015
> Stack: corosync
> Current DC: sl7-01 - partition WITHOUT quorum
> Version: 1.1.12-3e93bc1
> 3 Nodes configured
> 5 Resources configured
> 
> 
> Online: [ sl7-01 ]
> RemoteOnline: [ snmp1 ]
> RemoteOFFLINE: [ snmp2 ]
> 
>  Host-rsc1      (ocf::heartbeat:Dummy): Started sl7-01
>  Remote-rsc1    (ocf::heartbeat:Dummy): Started snmp1
>  Remote-rsc2    (ocf::heartbeat:Dummy): Started snmp1 (failure ignored)
>  snmp1  (ocf::pacemaker:remote):        Started sl7-01
> 
> Node Attributes:
> * Node sl7-01:
> * Node snmp1:
> 
> Migration summary:
> * Node sl7-01:
>    snmp2: migration-threshold=1 fail-count=1000000 last-failure='Mon Apr 13 15:48:40 2015'
> * Node snmp1:
> 
> Failed actions:
>     snmp2_start_0 on sl7-01 'unknown error' (1): call=8, status=Timed Out, exit-reason='none', last-rc-change='Mon Apr 13 15:47:42 2015', queued=0ms, exec=0ms
> 
> 
> 
> Best Regards,
> Hideo Yamauchi.
> 
> 
> 
> ----- Original Message -----
> > From: Andrew Beekhof <andrew at beekhof.net>
> > To: renayama19661014 at ybb.ne.jp; Cluster Labs - All topics related to
> > open-source clustering welcomed <users at clusterlabs.org>
> > Cc:
> > Date: 2015/4/13, Mon 14:11
> > Subject: Re: [ClusterLabs] Antw: Re: [Question] About movement of
> > pacemaker_remote.
> > 
> > 
> >>  On 8 Apr 2015, at 12:27 pm, renayama19661014 at ybb.ne.jp wrote:
> >> 
> >>  Hi All,
> >> 
> >>  Let me ask my first question once again.
> >> 
> >>  I observed the following behavior in Pacemaker 1.1.13-rc1.
> >>  STONITH is not configured.
> >> 
> >>  -------------------------------------------------------------
> >>  property no-quorum-policy="ignore" \
> >>          stonith-enabled="false" \
> >>          startup-fencing="false" \
> >> 
> >>  rsc_defaults resource-stickiness="INFINITY" \
> >>          migration-threshold="1"
> >> 
> >>  primitive snmp1 ocf:pacemaker:remote \
> >>          params \
> >>                  server="snmp1" \
> >>          op start interval="0s" timeout="60s" on-fail="restart" \
> >>          op monitor interval="3s" timeout="15s" \
> >>          op stop interval="0s" timeout="60s" on-fail="ignore"
> >> 
> >>  primitive snmp2 ocf:pacemaker:remote \
> >>          params \
> >>                  server="snmp2" \
> >>          op start interval="0s" timeout="60s" on-fail="restart" \
> >>          op monitor interval="3s" timeout="15s" \
> >>          op stop interval="0s" timeout="60s" on-fail="stop"
> >> 
> >>  primitive Host-rsc1 ocf:heartbeat:Dummy \
> >>          op start interval="0s" timeout="60s" on-fail="restart" \
> >>          op monitor interval="10s" timeout="60s" on-fail="restart" \
> >>          op stop interval="0s" timeout="60s" on-fail="ignore"
> >> 
> >>  primitive Remote-rsc1 ocf:heartbeat:Dummy \
> >>          op start interval="0s" timeout="60s" on-fail="restart" \
> >>          op monitor interval="10s" timeout="60s" on-fail="restart" \
> >>          op stop interval="0s" timeout="60s" on-fail="ignore"
> >> 
> >>  primitive Remote-rsc2 ocf:heartbeat:Dummy \
> >>          op start interval="0s" timeout="60s" on-fail="restart" \
> >>          op monitor interval="10s" timeout="60s" on-fail="restart" \
> >>          op stop interval="0s" timeout="60s" on-fail="ignore"
> >> 
> >>  location loc1 Remote-rsc1 \
> >>          rule 200: #uname eq snmp1 \
> >>          rule 100: #uname eq snmp2
> >>  location loc2 Remote-rsc2 \
> >>          rule 200: #uname eq snmp2 \
> >>          rule 100: #uname eq snmp1
> >>  location loc3 Host-rsc1 \
> >>          rule 200: #uname eq sl7-01
> >> 
> >>  -------------------------------------------------------------
> >> 
> >>  Step 1) We build a cluster with two remote nodes.
> >>  -------------------------------------------------------------
> >>  Version: 1.1.12-3e93bc1
> >>  3 Nodes configured
> >>  5 Resources configured
> >> 
> >> 
> >>  Online: [ sl7-01 ]
> >>  RemoteOnline: [ snmp1 snmp2 ]
> >> 
> >>   Host-rsc1      (ocf::heartbeat:Dummy): Started sl7-01
> >>   Remote-rsc1    (ocf::heartbeat:Dummy): Started snmp1
> >>   Remote-rsc2    (ocf::heartbeat:Dummy): Started snmp2
> >>   snmp1  (ocf::pacemaker:remote):        Started sl7-01
> >>   snmp2  (ocf::pacemaker:remote):        Started sl7-01
> >> 
> >>  Node Attributes:
> >>  * Node sl7-01:
> >>  * Node snmp1:
> >>  * Node snmp2:
> >> 
> >>  Migration summary:
> >>  * Node sl7-01:
> >>  * Node snmp1:
> >>  * Node snmp2:
> >>  -------------------------------------------------------------
> >> 
> >>  Step 2) We stop pacemaker_remoted on one remote node.
> >>  -------------------------------------------------------------
> >>  Current DC: sl7-01 - partition WITHOUT quorum
> >>  Version: 1.1.12-3e93bc1
> >>  3 Nodes configured
> >>  5 Resources configured
> >> 
> >> 
> >>  Online: [ sl7-01 ]
> >>  RemoteOnline: [ snmp1 ]
> >>  RemoteOFFLINE: [ snmp2 ]
> >> 
> >>   Host-rsc1      (ocf::heartbeat:Dummy): Started sl7-01
> >>   Remote-rsc1    (ocf::heartbeat:Dummy): Started snmp1
> >>   snmp1  (ocf::pacemaker:remote):        Started sl7-01
> >>   snmp2  (ocf::pacemaker:remote):        FAILED sl7-01
> >> 
> >>  Node Attributes:
> >>  * Node sl7-01:
> >>  * Node snmp1:
> >> 
> >>  Migration summary:
> >>  * Node sl7-01:
> >>     snmp2: migration-threshold=1 fail-count=1 last-failure='Fri Apr  3 12:56:12 2015'
> >>  * Node snmp1:
> >> 
> >>  Failed actions:
> >>      snmp2_monitor_3000 on sl7-01 'unknown error' (1): call=6, status=Error, exit-reason='none', last-rc-change='Fri Apr  3 12:56:12 2015', queued=0ms, exec=0ms
> > 
> > Ideally we’d have fencing configured and reboot the remote node here.
> > But for the sake of argument, ok :)
> > 
> > 
> >>  -------------------------------------------------------------
> >> 
> >>  Step 3) We restart the pacemaker_remoted that was stopped.
> > 
> > As in you reboot the node on which pacemaker_remoted is stopped and
> > pacemaker_remoted is configured to start at boot?
> > 
> >> 
> >>  Step 4) We clean up the snmp2 remote resource with the crm_resource command,
> > 
> > Was pacemaker_remoted running at this point?
> > I mentioned this earlier today, we need to improve the experience in this
> > area.
> > 
> > Probably a good excuse to fix on-fail=ignore for start actions.
> > 
> >>  but the remote node cannot rejoin the cluster.
> >>  -------------------------------------------------------------
> >>  Version: 1.1.12-3e93bc1
> >>  3 Nodes configured
> >>  5 Resources configured
> >> 
> >> 
> >>  Online: [ sl7-01 ]
> >>  RemoteOnline: [ snmp1 ]
> >>  RemoteOFFLINE: [ snmp2 ]
> >> 
> >>   Host-rsc1      (ocf::heartbeat:Dummy): Started sl7-01
> >>   Remote-rsc1    (ocf::heartbeat:Dummy): Started snmp1
> >>   snmp1  (ocf::pacemaker:remote):        Started sl7-01
> >>   snmp2  (ocf::pacemaker:remote):        FAILED sl7-01
> >> 
> >>  Node Attributes:
> >>  * Node sl7-01:
> >>  * Node snmp1:
> >> 
> >>  Migration summary:
> >>  * Node sl7-01:
> >>     snmp2: migration-threshold=1 fail-count=1000000 last-failure='Wed Apr  8 11:21:09 2015'
> >>  * Node snmp1:
> >> 
> >>  Failed actions:
> >>      snmp2_start_0 on sl7-01 'unknown error' (1): call=8, status=Timed Out, exit-reason='none', last-rc-change='Wed Apr  8 11:20:11 2015', queued=0ms, exec=0ms
> >>  -------------------------------------------------------------
> >> 
> >> 
> >>  The Pacemaker node and the remote node repeatedly output the following
> >>  logs.
> >> 
> >>  -------------------------------------------------------------
> >>  Apr  8 11:20:38 sl7-01 crmd[17101]: info: crm_remote_tcp_connect_async: Attempting to connect to remote server at 192.168.40.110:3121
> >>  Apr  8 11:20:38 sl7-01 crmd[17101]: info: lrmd_tcp_connect_cb: Remote lrmd client TLS connection established with server snmp2:3121
> >>  Apr  8 11:20:38 sl7-01 crmd[17101]: error: lrmd_tls_recv_reply: Unable to receive expected reply, disconnecting.
> >>  Apr  8 11:20:38 sl7-01 crmd[17101]: error: lrmd_tls_send_recv: Remote lrmd server disconnected while waiting for reply with id 101.
> >>  Apr  8 11:20:38 sl7-01 crmd[17101]: info: lrmd_tls_connection_destroy: TLS connection destroyed
> >>  Apr  8 11:20:38 sl7-01 crmd[17101]: info: lrmd_api_disconnect: Disconnecting from lrmd service
> >>  -------------------------------------------------------------
> >>  Apr  8 11:20:36 snmp2 pacemaker_remoted[1502]:   notice: lrmd_remote_client_destroy: LRMD client disconnecting remote client - name: remote-lrmd-snmp2:3121 id: 8fbbc3cd-daa5-406b-942d-21be868cfc62
> >>  Apr  8 11:20:37 snmp2 pacemaker_remoted[1502]:   notice: lrmd_remote_listen: LRMD client connection established. 0xbb7ca0 id: a59392c9-6575-40ed-9b53-98a68de00409
> >>  Apr  8 11:20:38 snmp2 pacemaker_remoted[1502]:     info: lrmd_remote_client_msg: Client disconnect detected in tls msg dispatcher.
> >>  Apr  8 11:20:38 snmp2 pacemaker_remoted[1502]:   notice: lrmd_remote_client_destroy: LRMD client disconnecting remote client - name: remote-lrmd-snmp2:3121 id: a59392c9-6575-40ed-9b53-98a68de00409
> >>  Apr  8 11:20:39 snmp2 pacemaker_remoted[1502]:   notice: lrmd_remote_listen: LRMD client connection established. 0xbb7ca0 id: 0e58614c-b1c5-4e37-a917-1f8e3de5de24
> >>  Apr  8 11:20:39 snmp2 pacemaker_remoted[1502]:     info: lrmd_remote_client_msg: Client disconnect detected in tls msg dispatcher.
> >>  Apr  8 11:20:39 snmp2 pacemaker_remoted[1502]:   notice: lrmd_remote_client_destroy: LRMD client disconnecting remote client - name: remote-lrmd-snmp2:3121 id: 0e58614c-b1c5-4e37-a917-1f8e3de5de24
> >>  Apr  8 11:20:40 snmp2 pacemaker_remoted[1502]:   notice: lrmd_remote_listen: LRMD client connection established. 0xbb7ca0 id: 518bcca5-5f83-47fb-93ea-2ece33690111
> >>  -------------------------------------------------------------
> >> 
> >>  Is this behavior correct?
> >> 
> >>  Best Regards,
> >>  Hideo Yamauchi.
> >> 
> >> 
> >> 
> >>  ----- Original Message -----
> >>>  From: Ulrich Windl <Ulrich.Windl at rz.uni-regensburg.de>
> >>>  To: users at clusterlabs.org
> >>>  Cc:
> >>>  Date: 2015/4/2, Thu 22:30
> >>>  Subject: [ClusterLabs] Antw: Re: [Question] About movement of
> > pacemaker_remote.
> >>> 
> >>>>>>  David Vossel <dvossel at redhat.com> wrote on 02.04.2015 at 14:58
> >>>  in message <796820123.6644200.1427979523554.JavaMail.zimbra at redhat.com>:
> >>> 
> >>>> 
> >>>>  ----- Original Message -----
> >>>>> 
> >>>>>>  On 14 Mar 2015, at 10:14 am, David Vossel <dvossel at redhat.com> wrote:
> >>>>>> 
> >>>>>> 
> >>>>>> 
> >>>>>>  ----- Original Message -----
> >>>>>>> 
> >>>>>>>  Failed actions:
> >>>>>>>       snmp2_start_0 on sl7-01 'unknown error' (1): call=8, status=Timed Out,
> >>>>>>>       exit-reason='none', last-rc-change='Thu Mar 12 14:26:26 2015',
> >>>>>>>       queued=0ms, exec=0ms
> >>>>>>>       snmp2_start_0 on sl7-01 'unknown error' (1): call=8, status=Timed Out,
> >>>>>>>       exit-reason='none', last-rc-change='Thu Mar 12 14:26:26 2015',
> >>>>>>>       queued=0ms, exec=0ms
> >>>>>>>  -----------------------
> >>>>>> 
> >>>>>>  Pacemaker is attempting to restore the connection to the remote node
> >>>>>>  here. Are you sure the remote is accessible? The "Timed Out" error
> >>>>>>  means that pacemaker was unable to establish the connection during
> >>>>>>  the timeout period.
> >>>>> 
> >>>>>  Random question: Are we smart enough not to try and start
> >>>>>  pacemaker-remote resources for nodes we've just fenced?
> >>>> 
> >>>>  We try to reconnect to remote nodes after fencing. If the fence
> >>>>  operation was 'off' instead of 'reboot', this would make no sense.
> >>>>  I'm not entirely sure how to handle this. We want the remote node
> >>>>  re-integrated into the cluster, but I'd like to optimize the case
> >>>>  where we know the node will not be coming back online.
> >>> 
> >>>  Beware: Even if the fencing action is "off" (for software), a human
> >>>  may decide to boot the node anyway, also starting the cluster software.
> >>> 
> >>>> 
> >>>>> 
> >>>>> 
> >>>>> 
> >>>>>  _______________________________________________
> >>>>>  Users mailing list: Users at clusterlabs.org
> >>>>>  http://clusterlabs.org/mailman/listinfo/users
> >>>>> 
> >>>>>  Project Home: http://www.clusterlabs.org
> >>>>>  Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> >>>>>  Bugs: http://bugs.clusterlabs.org
> >>>>> 