[Pacemaker] [Problem] A monitor with a long start-delay does not stop.

Andrew Beekhof andrew at beekhof.net
Thu Oct 7 03:58:55 EDT 2010


On Thu, Oct 7, 2010 at 8:39 AM,  <renayama19661014 at ybb.ne.jp> wrote:
> Hi,
>
> I ran the following steps to verify the report in this mailing list post:
>
>  * http://www.gossamer-threads.com/lists/linuxha/pacemaker/66939
>
>
> Step1) I prepared a cib.xml whose monitor operations set a long start-delay (on the order of five minutes).
> Step2) I started both nodes and loaded the cib.
>
> ============
> Last updated: Thu Oct  7 14:58:09 2010
> Stack: Heartbeat
> Current DC: srv02 (1f8dd092-d82b-47eb-86c4-e011a2cd11b3) - partition WITHOUT quorum
> Version: 1.0.9-860b32388908c6a345786d4ecd2e2a3bec780dd2
> 2 Nodes configured, unknown expected votes
> 1 Resources configured.
> ============
>
> Online: [ srv01 srv02 ]
>
>  Resource Group: grpDummy
>     prmFsPostgreSQLDB1-3       (ocf::heartbeat:Dummy): Started srv01
>     prmIpPostgreSQLDB2 (ocf::heartbeat:Dummy): Started srv01
>
> Step3) I caused monitor errors on the resources, one after another.
>
> ============
> Last updated: Thu Oct  7 15:20:01 2010
> Stack: Heartbeat
> Current DC: srv02 (d3fe8b08-20d9-4990-aebb-56a0675af5bd) - partition WITHOUT quorum
> Version: 1.0.9-860b32388908c6a345786d4ecd2e2a3bec780dd2
> 2 Nodes configured, unknown expected votes
> 1 Resources configured.
> ============
>
> Online: [ srv01 srv02 ]
>
>  Resource Group: grpDummy
>     prmFsPostgreSQLDB1-3       (ocf::heartbeat:Dummy): Started srv02
>     prmIpPostgreSQLDB2 (ocf::heartbeat:Dummy): Started srv02
>
> Migration summary:
> * Node srv02:
> * Node srv01:
>   prmIpPostgreSQLDB2: migration-threshold=1 fail-count=1
>   prmFsPostgreSQLDB1-3: migration-threshold=1 fail-count=1
>
> Failed actions:
>    prmIpPostgreSQLDB2_monitor_60000 (node=srv01, call=7, rc=7, status=complete): not running
>    prmFsPostgreSQLDB1-3_monitor_30000 (node=srv01, call=5, rc=7, status=complete): not running
>
> Step4) The resources failed over to the srv02 node, but the monitors on srv01 did not stop.
>
> [root@srv01 ~]# !tail
> tail -f /var/log/ha-log
> Oct  7 15:27:27 srv01 lrmd: [15792]: debug: rsc:prmFsPostgreSQLDB1-3:5: monitor
> Oct  7 15:27:27 srv01 Dummy[16572]: DEBUG: prmFsPostgreSQLDB1-3 monitor : 7
> Oct  7 15:27:58 srv01 lrmd: [15792]: debug: rsc:prmFsPostgreSQLDB1-3:5: monitor
> Oct  7 15:27:58 srv01 Dummy[16594]: DEBUG: prmFsPostgreSQLDB1-3 monitor : 7
> Oct  7 15:27:59 srv01 lrmd: [15792]: debug: rsc:prmIpPostgreSQLDB2:8: monitor
> Oct  7 15:27:59 srv01 Dummy[16601]: DEBUG: prmIpPostgreSQLDB2 monitor : 7
> Oct  7 15:27:59 srv01 lrmd: [15792]: debug: rsc:prmIpPostgreSQLDB2:7: monitor
> Oct  7 15:27:59 srv01 Dummy[16608]: DEBUG: prmIpPostgreSQLDB2 monitor : 7
> Oct  7 15:28:28 srv01 lrmd: [15792]: debug: rsc:prmFsPostgreSQLDB1-3:5: monitor
> Oct  7 15:28:28 srv01 Dummy[16628]: DEBUG: prmFsPostgreSQLDB1-3 monitor : 7
>
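
For anyone trying to reproduce this, the lrmd lines quoted above can be checked mechanically. Below is a minimal Python sketch, assuming the ha-log debug format shown in Step4; the names monitor_calls and duplicated_monitors are hypothetical helpers, not part of Pacemaker or lrmd. It groups monitor invocations by node, resource and call id, and reports resources that have more than one call id firing (as prmIpPostgreSQLDB2 does above with calls 7 and 8).

```python
import re
from collections import defaultdict

# Matches lrmd debug lines in the format quoted in Step4, e.g.
#   Oct  7 15:27:27 srv01 lrmd: [15792]: debug: rsc:prmFsPostgreSQLDB1-3:5: monitor
# The format is taken from the log excerpt only; other lrmd versions may differ.
LRMD_MONITOR = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+[\d:]{8})\s+(?P<node>\S+)\s+lrmd:\s+\[\d+\]:\s+"
    r"debug:\s+rsc:(?P<rsc>[^:]+):(?P<call>\d+):\s+monitor\s*$"
)


def monitor_calls(lines):
    """Group monitor timestamps by (node, resource, call id)."""
    seen = defaultdict(list)
    for line in lines:
        m = LRMD_MONITOR.match(line.strip())
        if m:
            key = (m["node"], m["rsc"], int(m["call"]))
            seen[key].append(m["stamp"])
    return dict(seen)


def duplicated_monitors(calls):
    """Return {(node, resource): sorted call ids} where several call ids fire."""
    ids = defaultdict(set)
    for node, rsc, call in calls:
        ids[(node, rsc)].add(call)
    return {key: sorted(v) for key, v in ids.items() if len(v) > 1}
```

Feeding it the lines of /var/log/ha-log from srv01 after the failover should show whether any recurring monitor is still being invoked there, and whether one resource has two monitor operations active at once.
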
> Step5) Afterwards, the fail-count increased unexpectedly.
>
> ============
> Last updated: Thu Oct  7 15:31:21 2010
> Stack: Heartbeat
> Current DC: srv02 (d3fe8b08-20d9-4990-aebb-56a0675af5bd) - partition WITHOUT quorum
> Version: 1.0.9-860b32388908c6a345786d4ecd2e2a3bec780dd2
> 2 Nodes configured, unknown expected votes
> 1 Resources configured.
> ============
>
> Online: [ srv01 srv02 ]
>
>  Resource Group: grpDummy
>     prmFsPostgreSQLDB1-3       (ocf::heartbeat:Dummy): Started srv02
>     prmIpPostgreSQLDB2 (ocf::heartbeat:Dummy): Started srv02
>
> Migration summary:
> * Node srv02:
> * Node srv01:
>   prmIpPostgreSQLDB2: migration-threshold=1 fail-count=2
>   prmFsPostgreSQLDB1-3: migration-threshold=1 fail-count=1
>
> Failed actions:
>    prmIpPostgreSQLDB2_monitor_60000 (node=srv01, call=8, rc=7, status=complete): not running
>    prmFsPostgreSQLDB1-3_monitor_30000 (node=srv01, call=5, rc=7, status=complete): not running
>
>
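
The fail-count change between the Step3 and Step5 snapshots can also be extracted automatically. This is only a sketch, assuming the "Migration summary" layout printed in the crm_mon output quoted above; fail_counts and increased are hypothetical helper names, and the parsing is based on nothing more than the text shown in this message.

```python
import re

# "* Node srv01:" header lines inside the Migration summary section.
NODE = re.compile(r"^\*\s+Node\s+(?P<node>\S+):")
# "  prmIpPostgreSQLDB2: migration-threshold=1 fail-count=2" entries.
FAIL_COUNT = re.compile(
    r"^(?P<rsc>\S+):\s+migration-threshold=\d+\s+fail-count=(?P<fc>\d+)"
)


def fail_counts(text):
    """Return {(node, resource): fail-count} from crm_mon-style output."""
    counts = {}
    node = None
    for raw in text.splitlines():
        line = raw.strip()
        header = NODE.match(line)
        if header:
            node = header["node"]
            continue
        entry = FAIL_COUNT.match(line)
        if entry and node is not None:
            counts[(node, entry["rsc"])] = int(entry["fc"])
    return counts


def increased(before, after):
    """Return {(node, resource): (old, new)} for fail-counts that went up."""
    return {
        key: (before.get(key, 0), new)
        for key, new in after.items()
        if new > before.get(key, 0)
    }
```

Comparing two saved snapshots with increased(fail_counts(step3), fail_counts(step5)) would list only the resources whose fail-count rose after the failover, which for the output above is prmIpPostgreSQLDB2 on srv01 (1 to 2).
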
> The following report may be related.
>
>  * http://www.gossamer-threads.com/lists/linuxha/pacemaker/66939

Funnily enough I was just looking at that message and saw that the
code relevant to this one looked wrong too.

I believe this should fix the issue:
   http://hg.clusterlabs.org/pacemaker/1.1/rev/e06810256413

>
> I filed the logs and other details in Bugzilla.
>
>  * http://developerbugs.linux-foundation.org/show_bug.cgi?id=2505

Oops, I didn't see that. I should have included the bug number in the commit :-(



