[Pacemaker] Help with config please

Wed Jul 9 04:10:45 EDT 2014

Hi

I am configuring Pacemaker on CentOS 6.5 with these packages:
pacemaker-cli-1.1.10-14.el6_5.3.x86_64
pacemaker-1.1.10-14.el6_5.3.x86_64
pacemaker-libs-1.1.10-14.el6_5.3.x86_64
pacemaker-cluster-libs-1.1.10-14.el6_5.3.x86_64

This is my config:
Cluster Name: ybrp
Corosync Nodes:

Pacemaker Nodes:
 devrp1 devrp2 

Resources: 
 Resource: ybrpip (class=ocf provider=heartbeat type=IPaddr2)
  Attributes: ip=10.172.214.50 cidr_netmask=24 nic=eth0 clusterip_hash=sourceip-sourceport 
  Meta Attrs: stickiness=0,migration-threshold=3,failure-timeout=600s 
  Operations: monitor on-fail=restart interval=5s timeout=20s (ybrpip-monitor-interval-5s)
 Clone: ybrpstat-clone
  Meta Attrs: globally-unique=false clone-max=2 clone-node-max=1 
  Resource: ybrpstat (class=ocf provider=yb type=proxy)
   Operations: monitor on-fail=restart interval=5s timeout=20s (ybrpstat-monitor-interval-5s)

Stonith Devices: 
Fencing Levels: 

Location Constraints:
Ordering Constraints:
  start ybrpstat-clone then start ybrpip (Mandatory) (id:order-ybrpstat-clone-ybrpip-mandatory)
Colocation Constraints:
  ybrpip with ybrpstat-clone (INFINITY) (id:colocation-ybrpip-ybrpstat-clone-INFINITY)

Cluster Properties:
 cluster-infrastructure: cman
 dc-version: 1.1.10-14.el6_5.3-368c726
 last-lrm-refresh: 1404892739
 no-quorum-policy: ignore
 stonith-enabled: false

I have my own resource agent script, and I start and stop the proxy service outside of Pacemaker.

I had an interesting problem: I did a VMware update on the Linux box, which interrupted network activity.

The monitor function in my script does two things: 1) it tests whether the proxy process is running, and 2) it fetches a status page from the proxy and confirms that the response is HTTP 200.
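As a rough sketch of that two-step check (not the actual script): the process name `ybproxy`, the status URL, and the function names `monitor_rc` and `proxy_monitor` are placeholders invented for illustration, and which exit code the real script returns in each case is an assumption. Only the numeric OCF exit codes themselves are standard.

```shell
#!/bin/sh
# Standard OCF exit codes used by resource agents.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_NOT_RUNNING=7

# Hypothetical values, not taken from the real agent.
PROXY_PROCESS="${PROXY_PROCESS:-ybproxy}"
STATUS_URL="${STATUS_URL:-http://127.0.0.1:8080/status}"

# Map the two check results to an OCF exit code (printed on stdout).
#   $1: "yes" if the proxy process was found, anything else otherwise
#   $2: HTTP status code from the status page (empty or 000 if no reply)
monitor_rc() {
    if [ "$1" != "yes" ]; then
        echo "$OCF_NOT_RUNNING"      # step 1 failed: no process
    elif [ "$2" = "200" ]; then
        echo "$OCF_SUCCESS"          # both checks passed
    else
        echo "$OCF_ERR_GENERIC"      # process is up but the page is not 200
    fi
}

# Run both checks and return the resulting OCF code to the caller.
proxy_monitor() {
    running=no
    if pgrep -x "$PROXY_PROCESS" >/dev/null 2>&1; then
        running=yes
    fi

    http_code=""
    if [ "$running" = "yes" ]; then
        # -m 10 caps the request at 10 seconds, well under the 20s op timeout.
        http_code=$(curl -s -o /dev/null -m 10 -w '%{http_code}' "$STATUS_URL")
    fi

    return "$(monitor_rc "$running" "$http_code")"
}
```

Keeping the decision in `monitor_rc` separate from the actual probing makes it easy to see which check result maps to which return code. Either non-zero code is counted by Pacemaker as a failed monitor.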

This is what I got in /var/log/messages

Jul  9 06:16:13 devrp1 crmd[6849]:  warning: update_failcount: Updating failcount for ybrpstat on devrp2 after failed monitor: rc=7 (update=value++, time=1404850573)
Jul  9 06:16:13 devrp1 crmd[6849]:   notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Jul  9 06:16:13 devrp1 pengine[6848]:   notice: unpack_config: On loss of CCM Quorum: Ignore
Jul  9 06:16:13 devrp1 pengine[6848]:  warning: unpack_rsc_op: Processing failed op monitor for ybrpstat:0 on devrp2: not running (7)
Jul  9 06:16:13 devrp1 pengine[6848]:  warning: unpack_rsc_op: Processing failed op start for ybrpstat:1 on devrp1: unknown error (1)
Jul  9 06:16:13 devrp1 pengine[6848]:  warning: common_apply_stickiness: Forcing ybrpstat-clone away from devrp1 after 1000000 failures (max=1000000)
Jul  9 06:16:13 devrp1 pengine[6848]:  warning: common_apply_stickiness: Forcing ybrpstat-clone away from devrp1 after 1000000 failures (max=1000000)
Jul  9 06:16:13 devrp1 pengine[6848]:   notice: LogActions: Restart ybrpip#011(Started devrp2)
Jul  9 06:16:13 devrp1 pengine[6848]:   notice: LogActions: Recover ybrpstat:0#011(Started devrp2)
Jul  9 06:16:13 devrp1 pengine[6848]:   notice: process_pe_message: Calculated Transition 1054: /var/lib/pacemaker/pengine/pe-input-235.bz2
Jul  9 06:16:13 devrp1 pengine[6848]:   notice: unpack_config: On loss of CCM Quorum: Ignore
Jul  9 06:16:13 devrp1 pengine[6848]:  warning: unpack_rsc_op: Processing failed op monitor for ybrpstat:0 on devrp2: not running (7)
Jul  9 06:16:13 devrp1 pengine[6848]:  warning: unpack_rsc_op: Processing failed op start for ybrpstat:1 on devrp1: unknown error (1)
Jul  9 06:16:13 devrp1 pengine[6848]:  warning: common_apply_stickiness: Forcing ybrpstat-clone away from devrp1 after 1000000 failures (max=1000000)
Jul  9 06:16:13 devrp1 pengine[6848]:  warning: common_apply_stickiness: Forcing ybrpstat-clone away from devrp1 after 1000000 failures (max=1000000)
Jul  9 06:16:13 devrp1 pengine[6848]:   notice: LogActions: Restart ybrpip#011(Started devrp2)
Jul  9 06:16:13 devrp1 pengine[6848]:   notice: LogActions: Recover ybrpstat:0#011(Started devrp2)
Jul  9 06:16:13 devrp1 pengine[6848]:   notice: process_pe_message: Calculated Transition 1055: /var/lib/pacemaker/pengine/pe-input-236.bz2
Jul  9 06:16:13 devrp1 pengine[6848]:   notice: unpack_config: On loss of CCM Quorum: Ignore
Jul  9 06:16:13 devrp1 pengine[6848]:  warning: unpack_rsc_op: Processing failed op monitor for ybrpstat:0 on devrp2: not running (7)
Jul  9 06:16:13 devrp1 pengine[6848]:  warning: unpack_rsc_op: Processing failed op start for ybrpstat:1 on devrp1: unknown error (1)
Jul  9 06:16:13 devrp1 pengine[6848]:  warning: common_apply_stickiness: Forcing ybrpstat-clone away from devrp1 after 1000000 failures (max=1000000)
Jul  9 06:16:13 devrp1 pengine[6848]:  warning: common_apply_stickiness: Forcing ybrpstat-clone away from devrp1 after 1000000 failures (max=1000000)
Jul  9 06:16:13 devrp1 pengine[6848]:   notice: LogActions: Restart ybrpip#011(Started devrp2)
Jul  9 06:16:13 devrp1 pengine[6848]:   notice: LogActions: Recover ybrpstat:0#011(Started devrp2)

And it stayed this way for the next 12 hours, until I logged on.

I poked around, and to fix it I ran this:
        /usr/sbin/pcs resource cleanup ybrpip
        /usr/sbin/pcs resource cleanup ybrpstat

Basically, I cleaned up the errors and off it went all by itself.

So my question is: how do I configure Pacemaker, or what do I need to change in the resource script, so that it reports a temporary error back to Pacemaker and Pacemaker keeps trying to check the status of the proxy?

It seems to me it tried once and then failed, although the log says it failed after 1000000 failures. How can I change that to infinite, and where is the interval setting for this? From the config above, it looks to me like it should already be infinite.

Thanks
Alex