[Pacemaker] newbie question(s)

Alex Samad - Yieldbroker Alex.Samad at yieldbroker.com
Tue May 21 01:04:20 EDT 2013


Hi

I have set up a small 2-node cluster that we are using to provide HA for a Java app.

Basically the requirement is to provide HA and later on load balancing.

My initial plan was to use:
- 2 Linux nodes
- the iptables cluster (CLUSTERIP) module to do the load balancing
- cluster software to do the failover

I have left load balancing aside for now; HA has been given higher priority.

So I am using CentOS 6.3, with the Pacemaker 1.1.7 RPMs.

I have 2 nodes and 1 VIP; the VIP determines which node is the active one.
The application is actually live on both nodes, it's really only the VIP that moves.
I use Pacemaker to ensure (1) that the application is running and (2) that the VIP is placed on the right node.

I have created my own resource agent script:
/usr/lib/ocf/resource.d/yb/ybrp

I used one of the other agent scripts as a template, but mine tests:
1) that the application is running, using ps
2) that the application is okay, by making a call to it and checking the result

Start and stop basically touch a lock file,
monitor does the tests, and
status uses the lock file and does the tests as well.
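
In case it helps, the agent is roughly this shape (a simplified sketch only; the lock file path, process name and health URL below are placeholders, not the real values):

#!/bin/sh
# Simplified sketch of an agent like /usr/lib/ocf/resource.d/yb/ybrp
# (placeholder lock file, process name and URL, not the real agent)
OCF_SUCCESS=0; OCF_ERR_GENERIC=1; OCF_NOT_RUNNING=7
LOCKFILE=/var/run/ybrp.lock

app_running() {
    # 1) is the application process there?
    ps -ef | grep -q '[y]brp-app'
}

app_healthy() {
    # 2) does it answer a test call with the expected result?
    curl -sf http://localhost:8080/status >/dev/null
}

case "$1" in
  start)    touch "$LOCKFILE"; exit $OCF_SUCCESS ;;
  stop)     rm -f "$LOCKFILE"; exit $OCF_SUCCESS ;;
  monitor|status)
            [ -e "$LOCKFILE" ] || exit $OCF_NOT_RUNNING
            app_running && app_healthy && exit $OCF_SUCCESS
            exit $OCF_NOT_RUNNING ;;
  meta-data)
            # the real agent prints its OCF metadata XML here
            exit $OCF_SUCCESS ;;
  *)        exit $OCF_ERR_GENERIC ;;
esac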


So here is the output from

crm configure show
node dc1wwwrp01
node dc1wwwrp02
primitive ybrpip ocf:heartbeat:IPaddr2 \
        params ip="10.32.21.10" cidr_netmask="24" \
        op monitor interval="5s"
primitive ybrpstat ocf:yb:ybrp \
        op monitor interval="5s"
group ybrp ybrpip ybrpstat
property $id="cib-bootstrap-options" \
        dc-version="1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14" \
        cluster-infrastructure="openais" \
        expected-quorum-votes="2" \
        stonith-enabled="false" \
        no-quorum-policy="ignore" \
        last-lrm-refresh="1369092192"


Is there anything I should be doing differently? I have seen the colocation option and something about affinity of resources, but I used a group; is that the best-practice way of doing it?
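
For what it's worth, my understanding is that the group above is roughly shorthand for explicit constraints like the following (a sketch only, the constraint names are made up):

colocation ybrpstat-with-ybrpip inf: ybrpstat ybrpip
order ybrpip-before-ybrpstat inf: ybrpip ybrpstat

i.e. ybrpstat runs where ybrpip runs, and ybrpip is started first.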

My next step is to add the iptables CLUSTERIP module. It is controlled by a /proc/.... control file: basically you tell the OS how many nodes there are and which node number(s) this machine is looking after.

So I was going to make a resource per node number, i.e. the node-1 resource prefers node 1 and the node-2 resource prefers node 2, so that when one node goes down the surviving node takes over that resource as well. That can be done by poking a number into the /proc file.
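
Something like this is what I have in mind (a rough sketch only; ocf:yb:ybnode would be a new agent that writes its node number into the /proc control file, and the parameter and constraint names here are made up):

primitive ybnode1 ocf:yb:ybnode \
        params nodenum="1" totalnodes="2" \
        op monitor interval="10s"
primitive ybnode2 ocf:yb:ybnode \
        params nodenum="2" totalnodes="2" \
        op monitor interval="10s"
location ybnode1-prefers-01 ybnode1 100: dc1wwwrp01
location ybnode2-prefers-02 ybnode2 100: dc1wwwrp02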

But I have seen some weird things happen that I can't explain or control. Sometimes things look a bit off when I do a

/usr/sbin/crm_mon -1

I can see the resources have errors next to them and a message along the lines of:

operation monitor failed 'insufficient privileges' (rc=4)

I normally just do a
crm resource cleanup ybrpstat
and things come back to normal, but I need to understand how it gets into that state, why it happens, and what I can do to stop it.
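
For reference while I debug this: rc=4 is the OCF 'insufficient privileges' return code (OCF_ERR_PERM), so at some point the agent's monitor/status action must be exiting with 4. One way to check is to run the action by hand and look at the exit code (a rough sketch, assuming the agent takes the action name as its first argument in the usual OCF way):

export OCF_ROOT=/usr/lib/ocf
/usr/lib/ocf/resource.d/yb/ybrp monitor
echo "exit code: $?"    # 0 = running, 7 = not running, 4 = insufficient privileges

If the resource-agents package provides ocf-tester, that can exercise all the actions as well:

ocf-tester -n ybrp /usr/lib/ocf/resource.d/yb/ybrp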


this is from /var/log/messages

from node1
==========
May 21 09:02:35 dc1wwwrp01 cib[2351]:     info: cib_stats: Processed 1 operations (0.00us average, 0% utilization) in the last 10min
May 21 09:09:28 dc1wwwrp01 crmd[2356]:     info: crm_timer_popped: PEngine Recheck Timer (I_PE_CALC) just popped (900000ms)
May 21 09:09:28 dc1wwwrp01 crmd[2356]:   notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
May 21 09:09:28 dc1wwwrp01 crmd[2356]:     info: do_state_transition: Progressed to state S_POLICY_ENGINE after C_TIMER_POPPED
May 21 09:09:28 dc1wwwrp01 pengine[2355]:   notice: unpack_config: On loss of CCM Quorum: Ignore
May 21 09:09:28 dc1wwwrp01 pengine[2355]:    error: unpack_rsc_op: Preventing ybrpstat from re-starting on dc1wwwrp01: operation monitor failed 'insufficient privileges' (rc=4)
May 21 09:09:28 dc1wwwrp01 pengine[2355]:  warning: unpack_rsc_op: Processing failed op ybrpstat_last_failure_0 on dc1wwwrp01: insufficient privileges (4)
May 21 09:09:28 dc1wwwrp01 pengine[2355]:    error: unpack_rsc_op: Preventing ybrpstat from re-starting on dc1wwwrp02: operation monitor failed 'insufficient privileges' (rc=4)
May 21 09:09:28 dc1wwwrp01 pengine[2355]:  warning: unpack_rsc_op: Processing failed op ybrpstat_last_failure_0 on dc1wwwrp02: insufficient privileges (4)
May 21 09:09:28 dc1wwwrp01 pengine[2355]:   notice: common_apply_stickiness: ybrpstat can fail 999999 more times on dc1wwwrp01 before being forced off
May 21 09:09:28 dc1wwwrp01 pengine[2355]:   notice: common_apply_stickiness: ybrpstat can fail 999999 more times on dc1wwwrp02 before being forced off
May 21 09:09:28 dc1wwwrp01 pengine[2355]:   notice: process_pe_message: Transition 5487: PEngine Input stored in: /var/lib/pengine/pe-input-1485.bz2
May 21 09:09:28 dc1wwwrp01 crmd[2356]:   notice: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
May 21 09:09:28 dc1wwwrp01 crmd[2356]:     info: do_te_invoke: Processing graph 5487 (ref=pe_calc-dc-1369091368-5548) derived from /var/lib/pengine/pe-input-1485.bz2
May 21 09:09:28 dc1wwwrp01 crmd[2356]:   notice: run_graph: ==== Transition 5487 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-1485.bz2): Complete
May 21 09:09:28 dc1wwwrp01 crmd[2356]:   notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
May 21 09:12:35 dc1wwwrp01 cib[2351]:     info: cib_stats: Processed 1 operations (0.00us average, 0% utilization) in the last 10min
May 21 09:23:12 dc1wwwrp01 crm_resource[5165]:    error: unpack_rsc_op: Preventing ybrpstat from re-starting on dc1wwwrp01: operation monitor failed 'insufficient privileges' (rc=4)
May 21 09:23:12 dc1wwwrp01 crm_resource[5165]:    error: unpack_rsc_op: Preventing ybrpstat from re-starting on dc1wwwrp02: operation monitor failed 'insufficient privileges' (rc=4)
May 21 09:23:12 dc1wwwrp01 cib[2351]:     info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='dc1wwwrp01']//lrm_resource[@id='ybrpstat'] (origin=local/crmd/5589, version=0.101.36): ok (rc=0)
May 21 09:23:12 dc1wwwrp01 crmd[2356]:     info: delete_resource: Removing resource ybrpstat for 5165_crm_resource (internal) on dc1wwwrp01
May 21 09:23:12 dc1wwwrp01 crmd[2356]:     info: notify_deleted: Notifying 5165_crm_resource on dc1wwwrp01 that ybrpstat was deleted
May 21 09:23:12 dc1wwwrp01 crmd[2356]:  warning: decode_transition_key: Bad UUID (crm-resource-5165) in sscanf result (3) for 0:0:crm-resource-5165
May 21 09:23:12 dc1wwwrp01 attrd[2354]:   notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-ybrpstat (<null>)
May 21 09:23:12 dc1wwwrp01 cib[2351]:     info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='dc1wwwrp01']//lrm_resource[@id='ybrpstat'] (origin=local/crmd/5590, version=0.101.37): ok (rc=0)
May 21 09:23:12 dc1wwwrp01 crmd[2356]:     info: abort_transition_graph: te_update_diff:320 - Triggered transition abort (complete=1, tag=lrm_rsc_op, id=ybrpstat_last_0, magic=0:0;3:3572:0:c348b36c-f6dd-4a7d-ac5b-01a3b8ce3c34, cib=0.101.37) : Resource op removal
May 21 09:23:12 dc1wwwrp01 crmd[2356]:     info: abort_transition_graph: te_update_diff:320 - Triggered transition abort (complete=1, tag=lrm_rsc_op, id=ybrpstat_last_0, magic=0:0;3:3572:0:c348b36c-f6dd-4a7d-ac5b-01a3b8ce3c34, cib=0.101.37) : Resource op removal

From node2
===========
May 21 09:20:03 dc1wwwrp02 lrmd: [2045]: info: rsc:ybrpip:16: monitor
May 21 09:23:12 dc1wwwrp02 lrmd: [2045]: info: cancel_op: operation monitor[16] on ocf::IPaddr2::ybrpip for client 2048, its parameters: CRM_meta_name=[monitor] cidr_netmask=[24] crm_feature_set=[3.0.6] CRM_meta_timeout=[20000] CRM_meta_interval=[5000] ip=[10.32.21.10]  cancelled
May 21 09:23:12 dc1wwwrp02 lrmd: [2045]: info: rsc:ybrpip:20: stop
May 21 09:23:12 dc1wwwrp02 cib[2043]:     info: apply_xml_diff: Digest mis-match: expected dcee73fe6518ac0d4b3429425d5dfc16, calculated 4a39d2ad25d50af2ec19b5b24252aef8
May 21 09:23:12 dc1wwwrp02 cib[2043]:   notice: cib_process_diff: Diff 0.101.36 -> 0.101.37 not applied to 0.101.36: Failed application of an update diff
May 21 09:23:12 dc1wwwrp02 cib[2043]:     info: cib_server_process_diff: Requesting re-sync from peer
May 21 09:23:12 dc1wwwrp02 cib[2043]:   notice: cib_server_process_diff: Not applying diff 0.101.36 -> 0.101.37 (sync in progress)
May 21 09:23:12 dc1wwwrp02 cib[2043]:   notice: cib_server_process_diff: Not applying diff 0.101.37 -> 0.102.1 (sync in progress)
May 21 09:23:12 dc1wwwrp02 cib[2043]:   notice: cib_server_process_diff: Not applying diff 0.102.1 -> 0.102.2 (sync in progress)
May 21 09:23:12 dc1wwwrp02 cib[2043]:   notice: cib_server_process_diff: Not applying diff 0.102.2 -> 0.102.3 (sync in progress)
May 21 09:23:12 dc1wwwrp02 cib[2043]:   notice: cib_server_process_diff: Not applying diff 0.102.3 -> 0.102.4 (sync in progress)

Any help or suggestions would be much appreciated.


Thanks
Alex