[ClusterLabs] [Question:pacemaker_remote] Resources do not move when the cluster cannot carry out an operation on a remote node (STONITH is not carried out either)

renayama19661014 at ybb.ne.jp
Mon Aug 17 22:19:11 EDT 2015


Hi Andrew,


The fix still seems to have a problem.

The cluster keeps waiting for the demote, and the master-group resource cannot move:
[root at bl460g8n3 ~]# crm_mon -1 -Af
Last updated: Tue Aug 18 11:13:39 2015          Last change: Tue Aug 18 11:11:01 2015 by root via crm_resource on bl460g8n4
Stack: corosync
Current DC: bl460g8n3 (version 1.1.13-7d0cac0) - partition with quorum
4 nodes and 10 resources configured

Online: [ bl460g8n3 bl460g8n4 ]
GuestOnline: [ pgsr02 at bl460g8n4 ]

 prmDB2 (ocf::heartbeat:VirtualDomain): Started bl460g8n4
 Resource Group: grpStonith1
     prmStonith1-2      (stonith:external/ipmi):        Started bl460g8n4
 Resource Group: grpStonith2
     prmStonith2-2      (stonith:external/ipmi):        Started bl460g8n3
 Master/Slave Set: msPostgresql [pgsql]
     Masters: [ pgsr02 ]

Node Attributes:
* Node bl460g8n3:
* Node bl460g8n4:
* Node pgsr02 at bl460g8n4:
    + master-pgsql                      : 10        

Migration Summary:
* Node bl460g8n3:
   pgsr01: migration-threshold=1 fail-count=1 last-failure='Tue Aug 18 11:12:03 2015'
* Node bl460g8n4:
* Node pgsr02 at bl460g8n4:

Failed Actions:
* pgsr01_monitor_30000 on bl460g8n3 'unknown error' (1): call=2, status=Error, exitreason='none',
    last-rc-change='Tue Aug 18 11:12:03 2015', queued=0ms, exec=0ms
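
For reference, the transition the cluster is stuck on can be replayed from the live CIB with crm_simulate (a sketch; run on one of the cluster nodes, and the output should resemble the pengine log below):

```shell
# Replay the current cluster state through the policy engine;
# blocked actions (such as the pending demote on pgsr01) show up
# in the resulting transition summary.
crm_simulate --live-check --show-scores
```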

(snip)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: Container prmDB1 and the resources within it have failed 1 times on bl460g8n3
Aug 18 11:12:07 bl460g8n3 pengine[10325]: warning: Forcing prmDB1 away from bl460g8n3 after 1 failures (max=1)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: pgsr01 has failed 1 times on bl460g8n3
Aug 18 11:12:07 bl460g8n3 pengine[10325]: warning: Forcing pgsr01 away from bl460g8n3 after 1 failures (max=1)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: prmDB1: Rolling back scores from pgsr01
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: Resource prmDB1 cannot run anywhere
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: Resource pgsr01 cannot run anywhere
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: pgsql:0: Rolling back scores from vip-master
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: Resource pgsql:0 cannot run anywhere
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: Promoting pgsql:1 (Master pgsr02)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: msPostgresql: Promoted 1 instances of a possible 1 to master
Aug 18 11:12:07 bl460g8n3 pengine[10325]: warning: Action vip-master_stop_0 on pgsr01 is unrunnable (offline)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info:  Start recurring monitor (10s) for vip-master on pgsr02
Aug 18 11:12:07 bl460g8n3 pengine[10325]: warning: Action vip-rep_stop_0 on pgsr01 is unrunnable (offline)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info:  Start recurring monitor (10s) for vip-rep on pgsr02
Aug 18 11:12:07 bl460g8n3 pengine[10325]: warning: Action pgsql:0_demote_0 on pgsr01 is unrunnable (offline)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: warning: Action pgsql:0_stop_0 on pgsr01 is unrunnable (offline)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: warning: Action pgsql:0_demote_0 on pgsr01 is unrunnable (offline)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: warning: Action pgsql:0_stop_0 on pgsr01 is unrunnable (offline)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info:  Start recurring monitor (9s) for pgsql:1 on pgsr02
Aug 18 11:12:07 bl460g8n3 pengine[10325]: warning: Action pgsql:0_demote_0 on pgsr01 is unrunnable (offline)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: warning: Action pgsql:0_stop_0 on pgsr01 is unrunnable (offline)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: warning: Action pgsql:0_demote_0 on pgsr01 is unrunnable (offline)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: warning: Action pgsql:0_stop_0 on pgsr01 is unrunnable (offline)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info:  Start recurring monitor (9s) for pgsql:1 on pgsr02
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: Implying node pgsr01 is down when container prmDB1 is stopped ((nil))
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: Leave   prmDB1  (Stopped)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: Leave   prmDB2  (Started bl460g8n4)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: Leave   prmStonith1-2   (Started bl460g8n4)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: Leave   prmStonith2-2   (Started bl460g8n3)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: notice: Stop    vip-master    (Started pgsr01 - blocked)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: notice: Stop    vip-rep       (Started pgsr01 - blocked)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: notice: Demote  pgsql:0       (Master -> Stopped pgsr01 - blocked)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: Leave   pgsql:1 (Master pgsr02)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: Leave   pgsr01  (Stopped)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: Leave   pgsr02  (Started bl460g8n4)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: crit: Cannot shut down node 'pgsr01' because of pgsql:0: blocked failed
Aug 18 11:12:07 bl460g8n3 pengine[10325]: crit: Cannot shut down node 'pgsr01' because of vip-rep: blocked failed
Aug 18 11:12:07 bl460g8n3 pengine[10325]: crit: Cannot shut down node 'pgsr01' because of vip-master: blocked failed
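
Since STONITH is expected to fire here, it may also be worth double-checking the fencing configuration (a sketch using standard Pacemaker commands, run on a cluster node):

```shell
# Confirm fencing is enabled cluster-wide
crm_attribute --type crm_config --name stonith-enabled --query

# List the fence devices registered with the fencing daemon
stonith_admin --list-registered
```

If I understand guest-node fencing correctly, the guests (pgsr01/pgsr02) are normally "fenced" by recovering their container resources (prmDB1/prmDB2), while the external/ipmi devices only cover the KVM hosts.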

 * http://bugs.clusterlabs.org/show_bug.cgi?id=5247#c3

Best Regards,
Hideo Yamauchi.


----- Original Message -----
>From: Andrew Beekhof <andrew at beekhof.net>
>To: renayama19661014 at ybb.ne.jp; Cluster Labs - All topics related to open-source clustering welcomed <users at clusterlabs.org> 
>Date: 2015/8/18, Tue 10:17
>Subject: Re: [ClusterLabs] [Question:pacemaker_remote] Resources do not move when the cluster cannot carry out an operation on a remote node (STONITH is not carried out either)
> 
>Should be fixed now. Thanks for the report!
>
>> On 12 Aug 2015, at 1:20 pm, renayama19661014 at ybb.ne.jp wrote:
>> 
>> Hi All,
>> 
>> We tested the behavior of pacemaker_remote (version: pacemaker-ad1f397a8228a63949f86c96597da5cecc3ed977).
>> 
>> The cluster consists of the following nodes:
>>  * bl460g8n3(KVM host)
>>  * bl460g8n4(KVM host)
>>  * pgsr01(Guest on the bl460g8n3 host)
>>  * pgsr02(Guest on the bl460g8n4 host)
>> 
>> 
>> Step 1) I set up a cluster with simple resources.
>> 
>> [root at bl460g8n3 ~]# crm_mon -1 -Af
>> Last updated: Wed Aug 12 11:52:27 2015          Last change: Wed Aug 12 11:51:47 2015 by root via crm_resource on bl460g8n4
>> Stack: corosync
>> Current DC: bl460g8n3 (version 1.1.13-ad1f397) - partition with quorum
>> 4 nodes and 10 resources configured
>> 
>> Online: [ bl460g8n3 bl460g8n4 ]
>> GuestOnline: [ pgsr01 at bl460g8n3 pgsr02 at bl460g8n4 ]
>> 
>>  prmDB1 (ocf::heartbeat:VirtualDomain): Started bl460g8n3
>>  prmDB2 (ocf::heartbeat:VirtualDomain): Started bl460g8n4
>>  Resource Group: grpStonith1
>>      prmStonith1-2      (stonith:external/ipmi):        Started bl460g8n4
>>  Resource Group: grpStonith2
>>      prmStonith2-2      (stonith:external/ipmi):        Started bl460g8n3
>>  Resource Group: master-group
>>      vip-master (ocf::heartbeat:Dummy): Started pgsr02
>>      vip-rep    (ocf::heartbeat:Dummy): Started pgsr02
>>  Master/Slave Set: msPostgresql [pgsql]
>>      Masters: [ pgsr02 ]
>>      Slaves: [ pgsr01 ]
>> 
>> Node Attributes:
>> * Node bl460g8n3:
>> * Node bl460g8n4:
>> * Node pgsr01 at bl460g8n3:
>>     + master-pgsql                      : 5        
>> * Node pgsr02 at bl460g8n4:
>>     + master-pgsql                      : 10        
>> 
>> Migration Summary:
>> * Node bl460g8n4:
>> * Node bl460g8n3:
>> * Node pgsr02 at bl460g8n4:
>> * Node pgsr01 at bl460g8n3:
>> 
>> 
>> Step 2) I make pacemaker_remote fail on pgsr02.
>> 
>> [root at pgsr02 ~]# ps -ef |grep remote
>> root      1171     1  0 11:52 ?        00:00:00 /usr/sbin/pacemaker_remoted
>> root      1428  1377  0 11:53 pts/0    00:00:00 grep --color=auto remote
>> [root at pgsr02 ~]# kill -9 1171
>> 
>> 
>> Step 3) After the failure, the master-group resource does not start on pgsr01.
>> 
>> [root at bl460g8n3 ~]# crm_mon -1 -Af
>> Last updated: Wed Aug 12 11:54:04 2015          Last change: Wed Aug 12 11:51:47 2015 by root via crm_resource on bl460g8n4
>> Stack: corosync
>> Current DC: bl460g8n3 (version 1.1.13-ad1f397) - partition with quorum
>> 4 nodes and 10 resources configured
>> 
>> Online: [ bl460g8n3 bl460g8n4 ]
>> GuestOnline: [ pgsr01 at bl460g8n3 ]
>> 
>>  prmDB1 (ocf::heartbeat:VirtualDomain): Started bl460g8n3
>>  prmDB2 (ocf::heartbeat:VirtualDomain): FAILED bl460g8n4
>>  Resource Group: grpStonith1
>>      prmStonith1-2      (stonith:external/ipmi):        Started bl460g8n4
>>  Resource Group: grpStonith2
>>      prmStonith2-2      (stonith:external/ipmi):        Started bl460g8n3
>>  Master/Slave Set: msPostgresql [pgsql]
>>      Masters: [ pgsr01 ]
>> 
>> Node Attributes:
>> * Node bl460g8n3:
>> * Node bl460g8n4:
>> * Node pgsr01 at bl460g8n3:
>>     + master-pgsql                      : 10        
>> 
>> Migration Summary:
>> * Node bl460g8n4:
>>    pgsr02: migration-threshold=1 fail-count=1 last-failure='Wed Aug 12 11:53:39 2015'
>> * Node bl460g8n3:
>> * Node pgsr01 at bl460g8n3:
>> 
>> Failed Actions:
>> * pgsr02_monitor_30000 on bl460g8n4 'unknown error' (1): call=2, status=Error, exitreason='none',
>>     last-rc-change='Wed Aug 12 11:53:39 2015', queued=0ms, exec=0ms
>> 
>> 
>> The cause seems to be that STONITH is not carried out for some reason.
>> The demote operation, which the cluster cannot complete, appears to block the start on pgsr01.
>> --------------------------------------------------------------------------------------
>> Aug 12 12:08:40 bl460g8n3 crmd[9427]: notice: Graph 10 with 20 actions: batch-limit=20 jobs, network-delay=0ms
>> Aug 12 12:08:40 bl460g8n3 crmd[9427]: notice: [Action    4]: Pending rsc op prmDB2_stop_0                       on bl460g8n4 (priority: 0, waiting:  70)
>> Aug 12 12:08:40 bl460g8n3 crmd[9427]: notice: [Action   36]: Completed pseudo op master-group_stop_0            on N/A (priority: 0, waiting: none)
>> Aug 12 12:08:40 bl460g8n3 crmd[9427]: notice: [Action   34]: Completed pseudo op master-group_start_0           on N/A (priority: 0, waiting: none)
>> Aug 12 12:08:40 bl460g8n3 crmd[9427]: notice: [Action   82]: Completed rsc op pgsql_post_notify_demote_0        on pgsr01 (priority: 1000000, waiting: none)
>> Aug 12 12:08:40 bl460g8n3 crmd[9427]: notice: [Action   81]: Completed rsc op pgsql_pre_notify_demote_0         on pgsr01 (priority: 0, waiting: none)
>> Aug 12 12:08:40 bl460g8n3 crmd[9427]: notice: [Action   78]: Completed rsc op pgsql_post_notify_stop_0          on pgsr01 (priority: 1000000, waiting: none)
>> Aug 12 12:08:40 bl460g8n3 crmd[9427]: notice: [Action   77]: Completed rsc op pgsql_pre_notify_stop_0           on pgsr01 (priority: 0, waiting: none)
>> Aug 12 12:08:40 bl460g8n3 crmd[9427]: notice: [Action   67]: Completed pseudo op msPostgresql_confirmed-post_notify_demoted_0 on N/A (priority: 1000000, waiting: none)
>> Aug 12 12:08:40 bl460g8n3 crmd[9427]: notice: [Action   66]: Completed pseudo op msPostgresql_post_notify_demoted_0 on N/A (priority: 1000000, waiting: none)
>> Aug 12 12:08:40 bl460g8n3 crmd[9427]: notice: [Action   65]: Completed pseudo op msPostgresql_confirmed-pre_notify_demote_0 on N/A (priority: 0, waiting: none)
>> Aug 12 12:08:40 bl460g8n3 crmd[9427]: notice: [Action   64]: Completed pseudo op msPostgresql_pre_notify_demote_0 on N/A (priority: 0, waiting: none)
>> Aug 12 12:08:40 bl460g8n3 crmd[9427]: notice: [Action   63]: Completed pseudo op msPostgresql_demoted_0         on N/A (priority: 1000000, waiting: none)
>> Aug 12 12:08:40 bl460g8n3 crmd[9427]: notice: [Action   62]: Completed pseudo op msPostgresql_demote_0          on N/A (priority: 0, waiting: none)
>> Aug 12 12:08:40 bl460g8n3 crmd[9427]: notice: [Action   55]: Completed pseudo op msPostgresql_confirmed-post_notify_stopped_0 on N/A (priority: 1000000, waiting: none)
>> Aug 12 12:08:40 bl460g8n3 crmd[9427]: notice: [Action   54]: Completed pseudo op msPostgresql_post_notify_stopped_0 on N/A (priority: 1000000, waiting: none)
>> Aug 12 12:08:40 bl460g8n3 crmd[9427]: notice: [Action   53]: Completed pseudo op msPostgresql_confirmed-pre_notify_stop_0 on N/A (priority: 0, waiting: none)
>> Aug 12 12:08:40 bl460g8n3 crmd[9427]: notice: [Action   52]: Completed pseudo op msPostgresql_pre_notify_stop_0 on N/A (priority: 0, waiting: none)
>> Aug 12 12:08:40 bl460g8n3 crmd[9427]: notice: [Action   51]: Completed pseudo op msPostgresql_stopped_0         on N/A (priority: 1000000, waiting: none)
>> Aug 12 12:08:40 bl460g8n3 crmd[9427]: notice: [Action   50]: Completed pseudo op msPostgresql_stop_0            on N/A (priority: 0, waiting: none)
>> Aug 12 12:08:40 bl460g8n3 crmd[9427]: notice: [Action   70]: Pending rsc op pgsr02_stop_0                       on bl460g8n4 (priority: 0, waiting: none)
>> Aug 12 12:08:40 bl460g8n3 crmd[9427]: notice:  * [Input 38]: Unresolved dependency rsc op pgsql_demote_0 on pgsr02
>> Aug 12 12:08:40 bl460g8n3 crmd[9427]: info: FSA: Input I_TE_SUCCESS from notify_crmd() received in state S_TRANSITION_ENGINE
>> Aug 12 12:08:40 bl460g8n3 crmd[9427]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
>> --------------------------------------------------------------------------------------
>> 
>> Is there a setting that makes the cluster carry out STONITH correctly?
>> Or is this a bug in pacemaker_remote?
>> 
>>  * I filed this issue in Bugzilla (http://bugs.clusterlabs.org/show_bug.cgi?id=5247).
>>  * I also attached a crm_report archive to the Bugzilla entry.
>> 
>> Best Regards,
>> Hideo Yamauchi.
>> 
>> _______________________________________________
>> Users mailing list: Users at clusterlabs.org
>> http://clusterlabs.org/mailman/listinfo/users
>> 
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>