[Pacemaker] Restart of resources

Frank Brendel Frank.Brendel at eurolog.com
Tue Jan 28 08:44:22 EST 2014


Does no one have an idea?
Or can someone at least tell me whether this is possible at all?


Thanks
Frank


On 23.01.2014 10:50, Frank Brendel wrote:
> Hi list,
>
> I'm having trouble configuring a resource that is allowed to fail once
> within two minutes, i.e. a single failure should only lead to a restart
> on the same node.
> The documentation states that I have to configure migration-threshold
> and failure-timeout to achieve this.
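>
> For reference, the meta attributes were set on the primitive with
> something like this (exact pcs syntax may vary between versions;
> 'pcs resource update resClamd meta ...' should work as well):
>
> # pcs resource meta resClamd failure-timeout=120s migration-threshold=2
>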
> Here is the configuration for the resource.
>
> # pcs config
> Cluster Name: mycluster
> Corosync Nodes:
>
> Pacemaker Nodes:
>  Node1 Node2 Node3
>
> Resources:
>  Clone: resClamd-clone
>   Meta Attrs: clone-max=3 clone-node-max=1 interleave=true
>   Resource: resClamd (class=lsb type=clamd)
>    Meta Attrs: failure-timeout=120s migration-threshold=2
>    Operations: monitor on-fail=restart interval=60s 
> (resClamd-monitor-on-fail-restart)
>
> Stonith Devices:
> Fencing Levels:
>
> Location Constraints:
> Ordering Constraints:
> Colocation Constraints:
>
> Cluster Properties:
>  cluster-infrastructure: cman
>  dc-version: 1.1.10-14.el6_5.1-368c726
>  last-lrm-refresh: 1390468150
>  stonith-enabled: false
>
> # pcs resource defaults
> resource-stickiness: INFINITY
>
> # pcs status
> Cluster name: mycluster
> Last updated: Thu Jan 23 10:12:49 2014
> Last change: Thu Jan 23 10:11:40 2014 via cibadmin on Node2
> Stack: cman
> Current DC: Node2 - partition with quorum
> Version: 1.1.10-14.el6_5.1-368c726
> 3 Nodes configured
> 3 Resources configured
>
>
> Online: [ Node1 Node2 Node3 ]
>
> Full list of resources:
>
>  Clone Set: resClamd-clone [resClamd]
>      Started: [ Node1 Node2 Node3 ]
>
>
> Stopping the clamd daemon sets the fail count to 1 and the daemon is
> started again, as expected.
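>
> While testing I also keep an eye on the fail counts; a one-shot crm_mon
> with its fail-count option should show them alongside the resource
> status (in addition to 'pcs resource failcount show' further below):
>
> # crm_mon -1 -f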
>
>
> # service clamd stop
> Stopping Clam AntiVirus Daemon:                            [  OK ]
>
> /var/log/messages
> Jan 23 10:15:20 Node1 crmd[6075]:   notice: process_lrm_event: 
> Node1-resClamd_monitor_60000:305 [ clamd is stopped\n ]
> Jan 23 10:15:20 Node1 attrd[6073]:   notice: attrd_cs_dispatch: Update 
> relayed from Node2
> Jan 23 10:15:20 Node1 attrd[6073]:   notice: attrd_trigger_update: 
> Sending flush op to all hosts for: fail-count-resClamd (1)
> Jan 23 10:15:20 Node1 attrd[6073]:   notice: attrd_perform_update: 
> Sent update 177: fail-count-resClamd=1
> Jan 23 10:15:20 Node1 attrd[6073]:   notice: attrd_cs_dispatch: Update 
> relayed from Node2
> Jan 23 10:15:20 Node1 attrd[6073]:   notice: attrd_trigger_update: 
> Sending flush op to all hosts for: last-failure-resClamd (1390468520)
> Jan 23 10:15:20 Node1 attrd[6073]:   notice: attrd_perform_update: 
> Sent update 179: last-failure-resClamd=1390468520
> Jan 23 10:15:20 Node1 crmd[6075]:   notice: process_lrm_event: 
> Node1-resClamd_monitor_60000:305 [ clamd is stopped\n ]
> Jan 23 10:15:21 Node1 crmd[6075]:   notice: process_lrm_event: LRM 
> operation resClamd_stop_0 (call=310, rc=0, cib-update=110, 
> confirmed=true) ok
> Jan 23 10:15:30 Node1 crmd[6075]:   notice: process_lrm_event: 
> LRM operation resClamd_start_0 (call=314, rc=0, cib-update=111, 
> confirmed=true) ok
> Jan 23 10:15:30 Node1 crmd[6075]:   notice: process_lrm_event: 
> LRM operation resClamd_monitor_60000 (call=317, rc=0, cib-update=112, 
> confirmed=false) ok
>
> # pcs status
> Cluster name: mycluster
> Last updated: Thu Jan 23 10:16:48 2014
> Last change: Thu Jan 23 10:11:40 2014 via cibadmin on Node1
> Stack: cman
> Current DC: Node2 - partition with quorum
> Version: 1.1.10-14.el6_5.1-368c726
> 3 Nodes configured
> 3 Resources configured
>
>
> Online: [ Node1 Node2 Node3 ]
>
> Full list of resources:
>
>  Clone Set: resClamd-clone [resClamd]
>      Started: [ Node1 Node2 Node3 ]
>
> Failed actions:
>     resClamd_monitor_60000 on Node1 'not running' (7): call=305, 
> status=complete, last-rc-change='Thu Jan 23 10:15:20 2014', 
> queued=0ms, exec=0ms
>
> # pcs resource failcount show resClamd
> Failcounts for resClamd
>  Node1: 1
>
>
> After 7 minutes (well beyond the 120s failure-timeout, so the first
> failure should have expired by then) I let it fail again; as I
> understand it, the daemon should have been restarted again. But it wasn't.
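>
> For reference, the fail count and the last failure time are stored as
> transient node attributes; something like this should show what the
> cluster has recorded for Node1 (using the 'reboot' lifetime, i.e. the
> status section):
>
> # crm_attribute -N Node1 -n fail-count-resClamd -l reboot -G
> # crm_attribute -N Node1 -n last-failure-resClamd -l reboot -G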
>
>
> # service clamd stop
> Stopping Clam AntiVirus Daemon:                            [  OK ]
>
> Jan 23 10:22:30 Node1 crmd[6075]:   notice: process_lrm_event: LRM 
> operation resClamd_monitor_60000 (call=317, rc=7, cib-update=113, 
> confirmed=false) not running
> Jan 23 10:22:30 Node1 crmd[6075]:   notice: process_lrm_event: 
> Node1-resClamd_monitor_60000:317 [ clamd is stopped\n ]
> Jan 23 10:22:30 Node1 attrd[6073]:   notice: attrd_cs_dispatch: Update 
> relayed from Node2
> Jan 23 10:22:30 Node1 attrd[6073]:   notice: attrd_trigger_update: 
> Sending flush op to all hosts for: fail-count-resClamd (2)
> Jan 23 10:22:30 Node1 attrd[6073]:   notice: attrd_perform_update: 
> Sent update 181: fail-count-resClamd=2
> Jan 23 10:22:30 Node1 attrd[6073]:   notice: attrd_cs_dispatch: Update 
> relayed from Node2
> Jan 23 10:22:30 Node1 attrd[6073]:   notice: attrd_trigger_update: 
> Sending flush op to all hosts for: last-failure-resClamd (1390468950)
> Jan 23 10:22:30 Node1 attrd[6073]:   notice: attrd_perform_update: 
> Sent update 183: last-failure-resClamd=1390468950
> Jan 23 10:22:30 Node1 crmd[6075]:   notice: process_lrm_event: 
> Node1-resClamd_monitor_60000:317 [ clamd is stopped\n ]
> Jan 23 10:22:30 Node1 crmd[6075]:   notice: process_lrm_event: LRM 
> operation resClamd_stop_0 (call=322, rc=0, cib-update=114, 
> confirmed=true) ok
>
> # pcs status
> Cluster name: mycluster
> Last updated: Thu Jan 23 10:22:41 2014
> Last change: Thu Jan 23 10:11:40 2014 via cibadmin on Node1
> Stack: cman
> Current DC: Node2 - partition with quorum
> Version: 1.1.10-14.el6_5.1-368c726
> 3 Nodes configured
> 3 Resources configured
>
>
> Online: [ Node1 Node2 Node3 ]
>
> Full list of resources:
>
>  Clone Set: resClamd-clone [resClamd]
>      Started: [ Node2 Node3 ]
>      Stopped: [ Node1 ]
>
> Failed actions:
>     resClamd_monitor_60000 on Node1 'not running' (7): call=317, 
> status=complete, last-rc-change='Thu Jan 23 10:22:30 2014', 
> queued=0ms, exec=0ms
>
>
> What's wrong with my configuration?
>
>
> Thanks in advance
> Frank
>




