[ClusterLabs] [Question] About movement of pacemaker_remote.

David Vossel dvossel at redhat.com
Fri Mar 13 19:14:19 EDT 2015



----- Original Message -----
> Hi All,
> 
> We are verifying the behavior of pacemaker_remote (stonith is disabled).
> 
> There are two questions.
> 
> * Question 1: pacemaker_remote does not recover from a failure. Is this the
> expected behavior?
> 
> - Step1 - Start a cluster.
> -----------------------
> [root at sl7-01 ~]# crm_mon -1 -Af
> Last updated: Thu Mar 12 14:25:05 2015
> Last change: Thu Mar 12 14:24:31 2015
> Stack: corosync
> Current DC: sl7-01 (2130706433) - partition WITHOUT quorum
> Version: 1.1.12-ce09802
> 3 Nodes configured
> 5 Resources configured
> 
> 
> Online: [ sl7-01 ]
> RemoteOnline: [ snmp1 snmp2 ]
> 
>  Host-rsc1      (ocf::heartbeat:Dummy): Started sl7-01
>  Remote-rsc1    (ocf::heartbeat:Dummy): Started snmp1
>  Remote-rsc2    (ocf::heartbeat:Dummy): Started snmp2
>  snmp1  (ocf::pacemaker:remote):        Started sl7-01
>  snmp2  (ocf::pacemaker:remote):        Started sl7-01
> 
> -----------------------
> 
> - Step2 - Cause pacemaker_remote to fail.
> 
> -----------------------
> [root at snmp2 ~]# /usr/sbin/pacemaker_remoted &
> [1] 24202
> [root at snmp2 ~]# kill -TERM 24202
> 
> [root at sl7-01 ~]# crm_mon -1 -Af
> Last updated: Thu Mar 12 14:25:55 2015
> Last change: Thu Mar 12 14:24:31 2015
> Stack: corosync
> Current DC: sl7-01 (2130706433) - partition WITHOUT quorum
> Version: 1.1.12-ce09802
> 3 Nodes configured
> 5 Resources configured
> 
> 
> Online: [ sl7-01 ]
> RemoteOnline: [ snmp1 ]
> RemoteOFFLINE: [ snmp2 ]
> 
>  Host-rsc1      (ocf::heartbeat:Dummy): Started sl7-01
>  Remote-rsc1    (ocf::heartbeat:Dummy): Started snmp1
>  snmp1  (ocf::pacemaker:remote):        Started sl7-01
>  snmp2  (ocf::pacemaker:remote):        FAILED sl7-01
> 
> Migration summary:
> * Node sl7-01:
>    snmp2: migration-threshold=1 fail-count=1 last-failure='Thu Mar 12
>    14:25:40 2015'
> * Node snmp1:
> 
> Failed actions:
>     snmp2_monitor_3000 on sl7-01 'unknown error' (1): call=6, status=Error,
>     exit-reason='none', last-rc-change='Thu Mar 12 14:25:40 2015',
>     queued=0ms, exec=0ms
> -----------------------
> 
> - Step3 - Restart pacemaker_remote and clean up the resource, but the node
> remains offline.
> 
> -----------------------
> [root at snmp2 ~]# /usr/sbin/pacemaker_remoted &
> [2] 24248
> 
> [root at sl7-01 ~]# crm_resource -C -r snmp2
> Cleaning up snmp2 on sl7-01
> Cleaning up snmp2 on snmp1
> Waiting for 1 replies from the CRMd. OK
> 
> 
> [root at sl7-01 ~]# crm_mon -1 -Af
> Last updated: Thu Mar 12 14:26:46 2015
> Last change: Thu Mar 12 14:26:26 2015
> Stack: corosync
> Current DC: sl7-01 (2130706433) - partition WITHOUT quorum
> Version: 1.1.12-ce09802
> 3 Nodes configured
> 5 Resources configured
> 
> 
> Online: [ sl7-01 ]
> RemoteOnline: [ snmp1 ]
> RemoteOFFLINE: [ snmp2 ]
> 
>  Host-rsc1      (ocf::heartbeat:Dummy): Started sl7-01
>  Remote-rsc1    (ocf::heartbeat:Dummy): Started snmp1
>  snmp1  (ocf::pacemaker:remote):        Started sl7-01
>  snmp2  (ocf::pacemaker:remote):        FAILED sl7-01
> 
> Migration summary:
> * Node sl7-01:
>    snmp2: migration-threshold=1 fail-count=1000000 last-failure='Thu Mar 12
>    14:26:44 2015'
> * Node snmp1:
> 
> Failed actions:
>     snmp2_start_0 on sl7-01 'unknown error' (1): call=8, status=Timed Out,
>     exit-reason='none', last-rc-change='Thu Mar 12 14:26:26 2015',
>     queued=0ms, exec=0ms
>     snmp2_start_0 on sl7-01 'unknown error' (1): call=8, status=Timed Out,
>     exit-reason='none', last-rc-change='Thu Mar 12 14:26:26 2015',
>     queued=0ms, exec=0ms
> -----------------------

Pacemaker is attempting to re-establish the connection to the remote node here;
are you sure the remote is accessible? The "Timed Out" error means that Pacemaker
was unable to establish the connection within the timeout period.
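
If the start keeps timing out even after pacemaker_remoted has been restarted, it
is worth confirming from the cluster node that the daemon is actually listening
and reachable again before running the cleanup. A rough check, assuming the
default pacemaker_remote port 3121 and that ss, timeout and bash are available
on these hosts:

-----------------------
# On the remote host: is pacemaker_remoted listening again on the default TCP port 3121?
[root at snmp2 ~]# ss -tlnp | grep 3121

# From the cluster node: is that port actually reachable?
# (bash's /dev/tcp is used here so the check does not depend on nc being installed)
[root at sl7-01 ~]# timeout 3 bash -c '< /dev/tcp/snmp2/3121' && echo "snmp2:3121 reachable"
-----------------------

If the port is not reachable, a firewall on the remote host blocking 3121 is a
common cause.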

> 
> * Question 2: When pacemaker_remote fails in a configuration where stonith is
> disabled, is the following procedure the right way to move the resources?
> Is there a way to move them without deleting the node?
> 
> - Step1 - Start a cluster.
> -----------------------
> [root at sl7-01 ~]# crm_mon -1 -Af
> Last updated: Thu Mar 12 14:30:27 2015
> Last change: Thu Mar 12 14:29:14 2015
> Stack: corosync
> Current DC: sl7-01 (2130706433) - partition WITHOUT quorum
> Version: 1.1.12-ce09802
> 3 Nodes configured
> 5 Resources configured
> 
> 
> Online: [ sl7-01 ]
> RemoteOnline: [ snmp1 snmp2 ]
> 
>  Host-rsc1      (ocf::heartbeat:Dummy): Started sl7-01
>  Remote-rsc1    (ocf::heartbeat:Dummy): Started snmp1
>  Remote-rsc2    (ocf::heartbeat:Dummy): Started snmp2
>  snmp1  (ocf::pacemaker:remote):        Started sl7-01
>  snmp2  (ocf::pacemaker:remote):        Started sl7-01
> 
> -----------------------
> 
> - Step2 - Cause pacemaker_remote to fail.
> -----------------------
> [root at snmp2 ~]# kill -TERM 24248
> 
> [root at sl7-01 ~]# crm_mon -1 -Af
> Last updated: Thu Mar 12 14:31:59 2015
> Last change: Thu Mar 12 14:29:14 2015
> Stack: corosync
> Current DC: sl7-01 (2130706433) - partition WITHOUT quorum
> Version: 1.1.12-ce09802
> 3 Nodes configured
> 5 Resources configured
> 
> 
> Online: [ sl7-01 ]
> RemoteOnline: [ snmp1 ]
> RemoteOFFLINE: [ snmp2 ]
> 
>  Host-rsc1      (ocf::heartbeat:Dummy): Started sl7-01
>  Remote-rsc1    (ocf::heartbeat:Dummy): Started snmp1
>  snmp1  (ocf::pacemaker:remote):        Started sl7-01
>  snmp2  (ocf::pacemaker:remote):        FAILED sl7-01
> 
> Migration summary:
> * Node sl7-01:
>    snmp2: migration-threshold=1 fail-count=1 last-failure='Thu Mar 12
>    14:31:42 2015'
> * Node snmp1:
> 
> Failed actions:
>     snmp2_monitor_3000 on sl7-01 'unknown error' (1): call=6, status=Error,
>     exit-reason='none', last-rc-change='Thu Mar 12 14:31:42 2015',
>     queued=0ms, exec=0ms
> -----------------------
> 
> - Step3 - We delete the failed node, and then the resource moves.

Maybe this is expected. It's impossible to tell without looking at the configuration
in more detail.
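
If you can post the relevant pieces, the output of something like the following
(using the crm shell already shown above) would make it much easier to say; in
particular the definition of the snmp2 connection resource and whether
stonith-enabled is really set to false:

-----------------------
[root at sl7-01 ~]# crm configure show                      # full CIB configuration
[root at sl7-01 ~]# crm configure show snmp2                # just the remote connection resource
[root at sl7-01 ~]# crm configure show | grep -i stonith    # confirm stonith-enabled=false
[root at sl7-01 ~]# crm_mon -1 -Afr                         # status including inactive resources
-----------------------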

-- David

> 
> -----------------------
> [root at sl7-01 ~]# crm
> crm(live)# node
> crm(live)node# delete snmp2
> INFO: node snmp2 deleted
> 
> [root at sl7-01 ~]# crm_mon -1 -Af
> Last updated: Thu Mar 12 14:35:00 2015
> Last change: Thu Mar 12 14:34:20 2015
> Stack: corosync
> Current DC: sl7-01 (2130706433) - partition WITHOUT quorum
> Version: 1.1.12-ce09802
> 3 Nodes configured
> 5 Resources configured
> 
> 
> Online: [ sl7-01 ]
> RemoteOnline: [ snmp1 ]
> RemoteOFFLINE: [ snmp2 ]
> 
>  Host-rsc1      (ocf::heartbeat:Dummy): Started sl7-01
>  Remote-rsc1    (ocf::heartbeat:Dummy): Started snmp1
>  Remote-rsc2    (ocf::heartbeat:Dummy): Started snmp1
>  snmp1  (ocf::pacemaker:remote):        Started sl7-01
> 
> Migration summary:
> * Node sl7-01:
>    snmp2: migration-threshold=1 fail-count=1 last-failure='Thu Mar 12
>    14:51:44 2015'
> * Node snmp1:
> 
> Failed actions:
>     snmp2_monitor_3000 on sl7-01 'unknown error' (1): call=6, status=Error,
>     exit-reason='none', last-rc-change='Thu Mar 12 14:51:44 2015',
>     queued=0ms, exec=0ms
> -----------------------
> 
> Best Regards,
> Hideo Yamauchi.
> 
> 
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 



