[Pacemaker] cannot handle the failcount for a clone resource using the crm tool

Dejan Muhamedagic dejanmm at fastmail.fm
Thu May 21 18:53:25 EDT 2009


Hi Junko-san,

On Thu, May 21, 2009 at 06:32:52PM +0900, Junko IKEDA wrote:
> Hi,
> 
> I have 4 nodes (dl380g5a, dl380g5b, dl380g5c, dl380g5d), 
> and run 1 clone resource with the following configuration.
> 	clone_max="2"
> 	clone_node_max="1"
> 
> (1) Initial state
> 	dummy:0	dl380g5a
> 	dummy:1	dl380g5b
> 
> (2) dummy:1 breaks down and moves to dl380g5c
> 	dummy:0	dl380g5a
> 	dummy:1	dl380g5c
> 
> (3) dummy:1 breaks down again and moves to dl380g5d
> 	dummy:0	dl380g5a
> 	dummy:1	dl380g5d
> 
> (4) Now, the failcounts for dummy:1 are:
> 	dl380g5c = 1
> 	dl380g5d = 1
> 
> I tried to delete the failcount using crm,
> but it seems that the delete switch doesn't work for a clone resource.
> 
> crm(live)resource# failcount dummy:1 show dl380g5c
> scope=status  name=fail-count-dummy:1 value=1
> crm(live)resource# failcount dummy:1 delete dl380g5c

I can see this in the logs:

crm_attribute -N dl380g5c -n fail-count-dummy:1 -D -t status -d 0

Well, that should have deleted the failcount. Unfortunately, I
can't see anything in the logs that would explain why it didn't.
I think that you should file a bug.
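
In the meantime, a few things you could try from the shell to
narrow it down (just a sketch, using the node and resource names
from your report):

# query the raw status attribute directly
crm_attribute -N dl380g5c -n fail-count-dummy:1 -G -t status

# try the delete again without passing a default value
crm_attribute -N dl380g5c -n fail-count-dummy:1 -D -t status

# or clean up the clone, which should also reset its failcounts
crm resource cleanup clone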

> crm(live)resource# failcount dummy:1 show dl380g5c
> scope=status  name=fail-count-dummy:1 value=1
> 
> Setting the value to "0" worked.
> crm(live)resource# failcount dummy:1 set dl380g5c 0
> crm(live)resource# failcount dummy:1 show dl380g5c
> scope=status  name=fail-count-dummy:1 value=0
> 
> Does this happen only with clone resources?

Not sure what you mean.

> And another thing:
> after setting the value to "0",
> the failcount was deleted not only for dl380g5c but also for dl380g5d.

The set value command I can see in the logs is this:

crm_attribute -N dl380g5c -n fail-count-dummy:1 -v 0 -t status -d 0

That worked fine. In dl380g5d/pengine/pe-input-4.bz2 I can still
see that the fail-count for dummy:1 on dl380g5b is set to 1. Then,
in dl380g5d/pengine/pe-input-5.bz2 it is not set to 0 but gone
altogether. I'm really not sure what triggered the latter
transition. Andrew?
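
If you want to look yourself, the pe-input files from the hb_report
can be replayed offline with ptest (a sketch; run it wherever the
pengine tools are installed):

bzcat dl380g5d/pengine/pe-input-5.bz2 > pe-input-5.xml
ptest -x pe-input-5.xml -VV -s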

> I expected that "failcount <rsc> show _<node>_" would let me specify one node.
> Is there anything wrong with my configuration?

Sorry, you lost me here as well.

BTW, I can't find the changeset id from the hb_report in the
repository:

CRM Version: 1.0.3 (2e35b8ac90a327c77ff869e1189fc70234213906)
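
If you want to check against a local clone of the Pacemaker
Mercurial repository, something like this should tell you whether
that changeset is known there:

hg log -r 2e35b8ac90a327c77ff869e1189fc70234213906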

Thanks,

Dejan

> See also the attached hb_report.
> 
> Best Regards,
> Junko Ikeda
> 
> NTT DATA INTELLILINK CORPORATION
> 


> (1) Initial state
> 	dummy:0	dl380g5a
> 	dummy:1	dl380g5b
> 
> ============
> Last updated: Thu May 21 17:45:16 2009
> Current DC: dl380g5d (1a7cfd3b-c885-45a3-b893-b09adb286e5c) - partition with quorum
> Version: 1.0.3-2e35b8ac90a327c77ff869e1189fc70234213906
> 4 Nodes configured, unknown expected votes
> 1 Resources configured.
> ============
> 
> Online: [ dl380g5a dl380g5b dl380g5c dl380g5d ]
> 
> Clone Set: clone
>         Started: [ dl380g5a dl380g5b ]
> 
> Operations:
> * Node dl380g5a:
>    dummy:0: migration-threshold=1
>     + (3) start: rc=0 (ok)
>     + (4) monitor: interval=10000ms rc=0 (ok)
> * Node dl380g5d:
> * Node dl380g5b:
>    dummy:1: migration-threshold=1
>     + (3) start: rc=0 (ok)
>     + (4) monitor: interval=10000ms rc=0 (ok)
> * Node dl380g5c:
> 
> 
> (2) dummy:1 breaks down and moves to dl380g5c
> 	dummy:0	dl380g5a
> 	dummy:1	dl380g5c
> 
> ============
> Last updated: Thu May 21 17:46:21 2009
> Current DC: dl380g5d (1a7cfd3b-c885-45a3-b893-b09adb286e5c) - partition with quorum
> Version: 1.0.3-2e35b8ac90a327c77ff869e1189fc70234213906
> 4 Nodes configured, unknown expected votes
> 1 Resources configured.
> ============
> 
> Online: [ dl380g5a dl380g5b dl380g5c dl380g5d ]
> 
> Clone Set: clone
>         Started: [ dl380g5a dl380g5c ]
> 
> Operations:
> * Node dl380g5a:
>    dummy:0: migration-threshold=1
>     + (3) start: rc=0 (ok)
>     + (4) monitor: interval=10000ms rc=0 (ok)
> * Node dl380g5d:
> * Node dl380g5b:
>    dummy:1: migration-threshold=1 fail-count=1
>     + (3) start: rc=0 (ok)
>     + (4) monitor: interval=10000ms rc=7 (not running)
>     + (5) stop: rc=0 (ok)
> * Node dl380g5c:
>    dummy:1: migration-threshold=1
>     + (3) start: rc=0 (ok)
>     + (4) monitor: interval=10000ms rc=0 (ok)
> 
> Failed actions:
>     dummy:1_monitor_10000 (node=dl380g5b, call=4, rc=7, status=complete): not running
> 
> 
> (3) dummy:1 breaks down again and moves to dl380g5d
> 	dummy:0	dl380g5a
> 	dummy:1	dl380g5d
> 
> ============
> Last updated: Thu May 21 17:46:51 2009
> Current DC: dl380g5d (1a7cfd3b-c885-45a3-b893-b09adb286e5c) - partition with quorum
> Version: 1.0.3-2e35b8ac90a327c77ff869e1189fc70234213906
> 4 Nodes configured, unknown expected votes
> 1 Resources configured.
> ============
> 
> Online: [ dl380g5a dl380g5b dl380g5c dl380g5d ]
> 
> Clone Set: clone
>         Started: [ dl380g5a dl380g5d ]
> 
> Operations:
> * Node dl380g5a:
>    dummy:0: migration-threshold=1
>     + (3) start: rc=0 (ok)
>     + (4) monitor: interval=10000ms rc=0 (ok)
> * Node dl380g5d:
>    dummy:1: migration-threshold=1
>     + (3) start: rc=0 (ok)
>     + (4) monitor: interval=10000ms rc=0 (ok)
> * Node dl380g5b:
>    dummy:1: migration-threshold=1 fail-count=1
>     + (3) start: rc=0 (ok)
>     + (4) monitor: interval=10000ms rc=7 (not running)
>     + (5) stop: rc=0 (ok)
> * Node dl380g5c:
>    dummy:1: migration-threshold=1 fail-count=1
>     + (3) start: rc=0 (ok)
>     + (4) monitor: interval=10000ms rc=7 (not running)
>     + (5) stop: rc=0 (ok)
> 
> Failed actions:
>     dummy:1_monitor_10000 (node=dl380g5b, call=4, rc=7, status=complete): not running
>     dummy:1_monitor_10000 (node=dl380g5c, call=4, rc=7, status=complete): not running
> 
> 
> (4) Now, the failcounts for dummy:1 are:
> 	dl380g5c = 1
> 	dl380g5d = 1
> 
> ============
> Last updated: Thu May 21 17:48:06 2009
> Current DC: dl380g5d (1a7cfd3b-c885-45a3-b893-b09adb286e5c) - partition with quorum
> Version: 1.0.3-2e35b8ac90a327c77ff869e1189fc70234213906
> 4 Nodes configured, unknown expected votes
> 1 Resources configured.
> ============
> 
> Online: [ dl380g5a dl380g5b dl380g5c dl380g5d ]
> 
> Clone Set: clone
>         Started: [ dl380g5a dl380g5d ]
> 
> Operations:
> * Node dl380g5a:
>    dummy:0: migration-threshold=1
>     + (3) start: rc=0 (ok)
>     + (4) monitor: interval=10000ms rc=0 (ok)
> * Node dl380g5d:
>    dummy:1: migration-threshold=1
>     + (3) start: rc=0 (ok)
>     + (4) monitor: interval=10000ms rc=0 (ok)
> * Node dl380g5b:
>    dummy:1: migration-threshold=1
>     + (3) start: rc=0 (ok)
>     + (4) monitor: interval=10000ms rc=7 (not running)
>     + (5) stop: rc=0 (ok)
> * Node dl380g5c:
>    dummy:1: migration-threshold=1
>     + (3) start: rc=0 (ok)
>     + (4) monitor: interval=10000ms rc=7 (not running)
>     + (5) stop: rc=0 (ok)
> 
> Failed actions:
>     dummy:1_monitor_10000 (node=dl380g5b, call=4, rc=7, status=complete): not running
>     dummy:1_monitor_10000 (node=dl380g5c, call=4, rc=7, status=complete): not running

> _______________________________________________
> Pacemaker mailing list
> Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker




