[ClusterLabs] [Question] About the behavior of pacemaker_remote.
renayama19661014 at ybb.ne.jp
Thu Mar 12 06:00:21 UTC 2015
Hi All,
We have been testing the behavior of pacemaker_remote (with stonith disabled).
We have two questions.
* Question 1: The pacemaker_remote connection does not recover after a failure. Is this the correct behavior?
- Step1 - Start the cluster.
-----------------------
[root at sl7-01 ~]# crm_mon -1 -Af
Last updated: Thu Mar 12 14:25:05 2015
Last change: Thu Mar 12 14:24:31 2015
Stack: corosync
Current DC: sl7-01 (2130706433) - partition WITHOUT quorum
Version: 1.1.12-ce09802
3 Nodes configured
5 Resources configured
Online: [ sl7-01 ]
RemoteOnline: [ snmp1 snmp2 ]
Host-rsc1 (ocf::heartbeat:Dummy): Started sl7-01
Remote-rsc1 (ocf::heartbeat:Dummy): Started snmp1
Remote-rsc2 (ocf::heartbeat:Dummy): Started snmp2
snmp1 (ocf::pacemaker:remote): Started sl7-01
snmp2 (ocf::pacemaker:remote): Started sl7-01
-----------------------
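(The configuration behind this output is not shown above; a minimal sketch of what we are using is roughly the following, in crm shell syntax. Stonith disabled, migration-threshold=1 and the 3-second monitor match the status above; the constraint IDs and the server= parameters are filled in for illustration and the real definitions may differ in detail.)
-----------------------
# Minimal sketch of the assumed configuration (crm shell syntax).
crm configure property stonith-enabled=false
crm configure rsc_defaults migration-threshold=1
crm configure primitive snmp1 ocf:pacemaker:remote params server=snmp1 op monitor interval=3s
crm configure primitive snmp2 ocf:pacemaker:remote params server=snmp2 op monitor interval=3s
crm configure primitive Host-rsc1 ocf:heartbeat:Dummy
crm configure primitive Remote-rsc1 ocf:heartbeat:Dummy
crm configure primitive Remote-rsc2 ocf:heartbeat:Dummy
# Illustrative placement constraints so the resources land as shown above.
crm configure location loc-host-rsc1 Host-rsc1 inf: sl7-01
crm configure location loc-remote-rsc1 Remote-rsc1 inf: snmp1
crm configure location loc-remote-rsc2 Remote-rsc2 inf: snmp2
-----------------------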
- Step2 - Make pacemaker_remote fail.
-----------------------
[root at snmp2 ~]# /usr/sbin/pacemaker_remoted &
[1] 24202
[root at snmp2 ~]# kill -TERM 24202
[root at sl7-01 ~]# crm_mon -1 -Af
Last updated: Thu Mar 12 14:25:55 2015
Last change: Thu Mar 12 14:24:31 2015
Stack: corosync
Current DC: sl7-01 (2130706433) - partition WITHOUT quorum
Version: 1.1.12-ce09802
3 Nodes configured
5 Resources configured
Online: [ sl7-01 ]
RemoteOnline: [ snmp1 ]
RemoteOFFLINE: [ snmp2 ]
Host-rsc1 (ocf::heartbeat:Dummy): Started sl7-01
Remote-rsc1 (ocf::heartbeat:Dummy): Started snmp1
snmp1 (ocf::pacemaker:remote): Started sl7-01
snmp2 (ocf::pacemaker:remote): FAILED sl7-01
Migration summary:
* Node sl7-01:
snmp2: migration-threshold=1 fail-count=1 last-failure='Thu Mar 12 14:25:40 2015'
* Node snmp1:
Failed actions:
snmp2_monitor_3000 on sl7-01 'unknown error' (1): call=6, status=Error, exit-reason='none', last-rc-change='Thu Mar 12 14:25:40 2015', queued=0ms, exec=0ms
-----------------------
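(Since pacemaker_remoted was started by hand here rather than via systemd, a quick check like the following, run on the remote host, confirms that no daemon process is left before looking at crm_mon. This is only an illustrative check, not part of the procedure.)
-----------------------
# On snmp2: confirm that pacemaker_remoted is no longer running.
ps -ef | grep pacemaker_remoted | grep -v grep
# (no output means the daemon is gone)
-----------------------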
- Step3 - Restart pacemaker_remote and clean up the connection resource (crm_resource -C), but the node remains offline.
-----------------------
[root at snmp2 ~]# /usr/sbin/pacemaker_remoted &
[2] 24248
[root at sl7-01 ~]# crm_resource -C -r snmp2
Cleaning up snmp2 on sl7-01
Cleaning up snmp2 on snmp1
Waiting for 1 replies from the CRMd. OK
[root at sl7-01 ~]# crm_mon -1 -Af
Last updated: Thu Mar 12 14:26:46 2015
Last change: Thu Mar 12 14:26:26 2015
Stack: corosync
Current DC: sl7-01 (2130706433) - partition WITHOUT quorum
Version: 1.1.12-ce09802
3 Nodes configured
5 Resources configured
Online: [ sl7-01 ]
RemoteOnline: [ snmp1 ]
RemoteOFFLINE: [ snmp2 ]
Host-rsc1 (ocf::heartbeat:Dummy): Started sl7-01
Remote-rsc1 (ocf::heartbeat:Dummy): Started snmp1
snmp1 (ocf::pacemaker:remote): Started sl7-01
snmp2 (ocf::pacemaker:remote): FAILED sl7-01
Migration summary:
* Node sl7-01:
snmp2: migration-threshold=1 fail-count=1000000 last-failure='Thu Mar 12 14:26:44 2015'
* Node snmp1:
Failed actions:
snmp2_start_0 on sl7-01 'unknown error' (1): call=8, status=Timed Out, exit-reason='none', last-rc-change='Thu Mar 12 14:26:26 2015', queued=0ms, exec=0ms
snmp2_start_0 on sl7-01 'unknown error' (1): call=8, status=Timed Out, exit-reason='none', last-rc-change='Thu Mar 12 14:26:26 2015', queued=0ms, exec=0ms
-----------------------
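(For reference, the additional recovery steps we have in mind after this point are roughly the following. We are not sure whether this is the intended procedure, and the reconnect_interval parameter is only usable if the ocf:pacemaker:remote agent in this build provides it.)
-----------------------
# Sketch only - not confirmed in this environment.
# 1. Clear the INFINITY fail-count and retry the connection start.
crm resource failcount snmp2 delete sl7-01
crm resource cleanup snmp2
# 2. Check whether the agent offers automatic reconnection, and if so
#    add it to the connection resource (the value is only an example).
crm_resource --show-metadata ocf:pacemaker:remote | grep reconnect_interval
crm configure edit snmp2    # add: params reconnect_interval=60s
-----------------------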
* Question 2: When pacemaker_remote fails in a configuration where stonith is disabled, is the following procedure the right way to move the resource? Is there a procedure that moves it without deleting the node?
- Step1 - Start the cluster.
-----------------------
[root at sl7-01 ~]# crm_mon -1 -Af
Last updated: Thu Mar 12 14:30:27 2015
Last change: Thu Mar 12 14:29:14 2015
Stack: corosync
Current DC: sl7-01 (2130706433) - partition WITHOUT quorum
Version: 1.1.12-ce09802
3 Nodes configured
5 Resources configured
Online: [ sl7-01 ]
RemoteOnline: [ snmp1 snmp2 ]
Host-rsc1 (ocf::heartbeat:Dummy): Started sl7-01
Remote-rsc1 (ocf::heartbeat:Dummy): Started snmp1
Remote-rsc2 (ocf::heartbeat:Dummy): Started snmp2
snmp1 (ocf::pacemaker:remote): Started sl7-01
snmp2 (ocf::pacemaker:remote): Started sl7-01
-----------------------
- Step2 - Make pacemaker_remote fail.
-----------------------
[root at snmp2 ~]# kill -TERM 24248
[root at sl7-01 ~]# crm_mon -1 -Af
Last updated: Thu Mar 12 14:31:59 2015
Last change: Thu Mar 12 14:29:14 2015
Stack: corosync
Current DC: sl7-01 (2130706433) - partition WITHOUT quorum
Version: 1.1.12-ce09802
3 Nodes configured
5 Resources configured
Online: [ sl7-01 ]
RemoteOnline: [ snmp1 ]
RemoteOFFLINE: [ snmp2 ]
Host-rsc1 (ocf::heartbeat:Dummy): Started sl7-01
Remote-rsc1 (ocf::heartbeat:Dummy): Started snmp1
snmp1 (ocf::pacemaker:remote): Started sl7-01
snmp2 (ocf::pacemaker:remote): FAILED sl7-01
Migration summary:
* Node sl7-01:
snmp2: migration-threshold=1 fail-count=1 last-failure='Thu Mar 12 14:31:42 2015'
* Node snmp1:
Failed actions:
snmp2_monitor_3000 on sl7-01 'unknown error' (1): call=6, status=Error, exit-reason='none', last-rc-change='Thu Mar 12 14:31:42 2015', queued=0ms, exec=0ms
-----------------------
- Step3 - Delete the failed node. The resource then moves.
-----------------------
[root at sl7-01 ~]# crm
crm(live)# node
crm(live)node# delete snmp2
INFO: node snmp2 deleted
[root at sl7-01 ~]# crm_mon -1 -Af
Last updated: Thu Mar 12 14:35:00 2015
Last change: Thu Mar 12 14:34:20 2015
Stack: corosync
Current DC: sl7-01 (2130706433) - partition WITHOUT quorum
Version: 1.1.12-ce09802
3 Nodes configured
5 Resources configured
Online: [ sl7-01 ]
RemoteOnline: [ snmp1 ]
RemoteOFFLINE: [ snmp2 ]
Host-rsc1 (ocf::heartbeat:Dummy): Started sl7-01
Remote-rsc1 (ocf::heartbeat:Dummy): Started snmp1
Remote-rsc2 (ocf::heartbeat:Dummy): Started snmp1
snmp1 (ocf::pacemaker:remote): Started sl7-01
Migration summary:
* Node sl7-01:
snmp2: migration-threshold=1 fail-count=1 last-failure='Thu Mar 12 14:51:44 2015'
* Node snmp1:
Failed actions:
snmp2_monitor_3000 on sl7-01 'unknown error' (1): call=6, status=Error, exit-reason='none', last-rc-change='Thu Mar 12 14:51:44 2015', queued=0ms, exec=0ms
-----------------------
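(For reference, the kind of non-destructive alternative we are looking for is something like the following, i.e. recovering Remote-rsc2 without removing the snmp2 node definition. We have not been able to confirm whether either variant actually releases the resource while stonith is disabled, and the constraint created by --ban would later need to be removed with crm_resource --clear.)
-----------------------
# Sketch only - alternatives to deleting the node, not confirmed to
# release Remote-rsc2 while stonith is disabled.
# a) Clean up and explicitly stop the failed connection resource:
crm resource cleanup snmp2
crm resource stop snmp2
# b) Or keep the node defined and ban the dependent resource from it:
crm_resource --ban -r Remote-rsc2 --node snmp2
-----------------------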
Best Regards,
Hideo Yamauchi.