[Pacemaker] newbie question(s)

Florian Crouzat gentoo at floriancrouzat.net
Tue May 21 08:17:48 EDT 2013


On 21/05/2013 07:04, Alex Samad - Yieldbroker wrote:
> Hi
>
> I have set up a small 2-node cluster that we are using to provide HA for a Java app.
>
> Basically the requirement is to provide HA and later on load balancing.
>
> My initial plan was to use:
> 2 Linux nodes
> the iptables cluster module to do the load balancing
> cluster software to do the failover.
>
> I have left the load balancing for now, HA has been given a higher priority.
>
> So I am using CentOS 6.3, with the Pacemaker 1.1.7 RPMs.
>
> I have 2 nodes and 1 VIP; the VIP determines which node is the active one.
> The application is actually live on both nodes, it's really only the VIP that moves.
> I use Pacemaker to ensure the application is running and to place the VIP in the right place.
>
> I have created my own resource script
> /usr/lib/ocf/resource.d/yb/ybrp
>
> I used one of the other script files as a starting point, but mine tests:
> 1) that the application is running, using ps
> 2) that the application is OK, by making a call and testing the result
>
> Start and stop basically touch a lock file
> Monitor does the tests
> Status uses the lock file and does the tests as well
>
>
> So here is the output from
>
> crm configure show
> node dc1wwwrp01
> node dc1wwwrp02
> primitive ybrpip ocf:heartbeat:IPaddr2 \
>          params ip="10.32.21.10" cidr_netmask="24" \
>          op monitor interval="5s"
> primitive ybrpstat ocf:yb:ybrp \
>          op monitor interval="5s"
> group ybrp ybrpip ybrpstat
> property $id="cib-bootstrap-options" \
>          dc-version="1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14" \
>          cluster-infrastructure="openais" \
>          expected-quorum-votes="2" \
>          stonith-enabled="false" \
>          no-quorum-policy="ignore" \
>          last-lrm-refresh="1369092192"
>
>
> Is there anything I should be doing differently? I have seen the colocation option and something about resource affinity, but I used a group; is that the best-practice way of doing it?
>

Probably writing a better OCF script that doesn't rely on 'ps' (eww!) 
but on a real LSB-style 'status' check.
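As a sketch of what the process check could look like without parsing 'ps': 
read the daemon's pidfile and probe the pid directly. The pidfile path, the 
function name, and the assumption that your Java app writes a pidfile are 
all mine, not taken from your agent:

```shell
# Hypothetical monitor helper for the ybrp agent that avoids 'ps'.
# Assumes the application writes its pid to a pidfile -- adjust the path.

OCF_SUCCESS=0
OCF_NOT_RUNNING=7

PIDFILE="${PIDFILE:-/var/run/ybrp.pid}"

ybrp_monitor() {
    # No pidfile: the resource is (cleanly) stopped
    [ -f "$PIDFILE" ] || return $OCF_NOT_RUNNING

    pid=$(cat "$PIDFILE" 2>/dev/null)
    # kill -0 sends no signal; it only checks that the pid exists
    # and that we are allowed to signal it
    if kill -0 "$pid" 2>/dev/null; then
        return $OCF_SUCCESS
    fi
    # Stale pidfile: the process died without cleaning up
    return $OCF_NOT_RUNNING
}
```

Your second test (calling the application and checking the result) can then 
stay as a second step inside monitor, on top of this.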

> my next step is to add in the iptables cluster IP module. It is controlled by a /proc/.... control file.  Basically you tell the OS how many nodes there are and which node number this machine is looking after.
>
> So I was going to make a resource per node number, i.e. node 1's resource prefers node 1 and node 2's prefers node 2, so that when one node goes down the other will take over that resource. That can be done by poking a number into the /proc file.
>
> But I have seen some weird things happen that I can't explain or control. Sometimes things go a bit off when I do a
>
> /usr/sbin/crm_mon -1
>
> I can see the resources have errors next to them and a message along the lines of
>
> operation monitor failed 'insufficient privileges' (rc=4)
>
> I normally just do a
> crm resource cleanup ybrpstat
> and things come back to normal, but I need to understand how and why it gets into that state, and what I can do to stop it

Maybe the user that runs Pacemaker has the right to execute your 'ybrp' 
OCF script but not the 'ps' that you use inside it?
Since it returns 4 (non-zero), it should be quite easy to read 
/usr/lib/ocf/resource.d/yb/ybrp and search for what could return 4 in the 
"status" section.
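For reference, rc=4 is OCF_ERR_PERM ("insufficient privileges") in the OCF 
return-code table, which matches the PE log below. A sketch of the kind of 
check that typically produces it (the statefile argument and the function 
name are illustrative, not taken from the real script):

```shell
# OCF return codes involved here (from the OCF resource agent API):
OCF_SUCCESS=0
OCF_ERR_PERM=4      # "insufficient privileges" -- what the PE logged

check_privileges() {
    # Illustrative: the agent needs to read some state/lock file;
    # if the file exists but the calling user cannot read it,
    # report OCF_ERR_PERM rather than a generic error.
    statefile="$1"
    if [ -e "$statefile" ] && [ ! -r "$statefile" ]; then
        return $OCF_ERR_PERM
    fi
    return $OCF_SUCCESS
}
```

Running the script's monitor action by hand as the same user the cluster 
uses, then checking `echo $?`, should show which step returns 4.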


>
>
> this is from /var/log/messages
>
> from node1
> ==========
> May 21 09:02:35 dc1wwwrp01 cib[2351]:     info: cib_stats: Processed 1 operations (0.00us average, 0% utilization) in the last 10min
> May 21 09:09:28 dc1wwwrp01 crmd[2356]:     info: crm_timer_popped: PEngine Recheck Timer (I_PE_CALC) just popped (900000ms)
> May 21 09:09:28 dc1wwwrp01 crmd[2356]:   notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
> May 21 09:09:28 dc1wwwrp01 crmd[2356]:     info: do_state_transition: Progressed to state S_POLICY_ENGINE after C_TIMER_POPPED
> May 21 09:09:28 dc1wwwrp01 pengine[2355]:   notice: unpack_config: On loss of CCM Quorum: Ignore
> May 21 09:09:28 dc1wwwrp01 pengine[2355]:    error: unpack_rsc_op: Preventing ybrpstat from re-starting on dc1wwwrp01: operation monitor failed 'insufficient privileges' (rc=4)
> May 21 09:09:28 dc1wwwrp01 pengine[2355]:  warning: unpack_rsc_op: Processing failed op ybrpstat_last_failure_0 on dc1wwwrp01: insufficient privileges (4)
> May 21 09:09:28 dc1wwwrp01 pengine[2355]:    error: unpack_rsc_op: Preventing ybrpstat from re-starting on dc1wwwrp02: operation monitor failed 'insufficient privileges' (rc=4)
> May 21 09:09:28 dc1wwwrp01 pengine[2355]:  warning: unpack_rsc_op: Processing failed op ybrpstat_last_failure_0 on dc1wwwrp02: insufficient privileges (4)
> May 21 09:09:28 dc1wwwrp01 pengine[2355]:   notice: common_apply_stickiness: ybrpstat can fail 999999 more times on dc1wwwrp01 before being forced off
> May 21 09:09:28 dc1wwwrp01 pengine[2355]:   notice: common_apply_stickiness: ybrpstat can fail 999999 more times on dc1wwwrp02 before being forced off
> May 21 09:09:28 dc1wwwrp01 pengine[2355]:   notice: process_pe_message: Transition 5487: PEngine Input stored in: /var/lib/pengine/pe-input-1485.bz2
> May 21 09:09:28 dc1wwwrp01 crmd[2356]:   notice: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
> May 21 09:09:28 dc1wwwrp01 crmd[2356]:     info: do_te_invoke: Processing graph 5487 (ref=pe_calc-dc-1369091368-5548) derived from /var/lib/pengine/pe-input-1485.bz2
> May 21 09:09:28 dc1wwwrp01 crmd[2356]:   notice: run_graph: ==== Transition 5487 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-1485.bz2): Complete
> May 21 09:09:28 dc1wwwrp01 crmd[2356]:   notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
> May 21 09:12:35 dc1wwwrp01 cib[2351]:     info: cib_stats: Processed 1 operations (0.00us average, 0% utilization) in the last 10min
> May 21 09:23:12 dc1wwwrp01 crm_resource[5165]:    error: unpack_rsc_op: Preventing ybrpstat from re-starting on dc1wwwrp01: operation monitor failed 'insufficient privileges' (rc=4)
> May 21 09:23:12 dc1wwwrp01 crm_resource[5165]:    error: unpack_rsc_op: Preventing ybrpstat from re-starting on dc1wwwrp02: operation monitor failed 'insufficient privileges' (rc=4)
> May 21 09:23:12 dc1wwwrp01 cib[2351]:     info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='dc1wwwrp01']//lrm_resource[@id='ybrpstat'] (origin=local/crmd/5589, version=0.101.36): ok (rc=0)
> May 21 09:23:12 dc1wwwrp01 crmd[2356]:     info: delete_resource: Removing resource ybrpstat for 5165_crm_resource (internal) on dc1wwwrp01
> May 21 09:23:12 dc1wwwrp01 crmd[2356]:     info: notify_deleted: Notifying 5165_crm_resource on dc1wwwrp01 that ybrpstat was deleted
> May 21 09:23:12 dc1wwwrp01 crmd[2356]:  warning: decode_transition_key: Bad UUID (crm-resource-5165) in sscanf result (3) for 0:0:crm-resource-5165
> May 21 09:23:12 dc1wwwrp01 attrd[2354]:   notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-ybrpstat (<null>)
> May 21 09:23:12 dc1wwwrp01 cib[2351]:     info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='dc1wwwrp01']//lrm_resource[@id='ybrpstat'] (origin=local/crmd/5590, version=0.101.37): ok (rc=0)
> May 21 09:23:12 dc1wwwrp01 crmd[2356]:     info: abort_transition_graph: te_update_diff:320 - Triggered transition abort (complete=1, tag=lrm_rsc_op, id=ybrpstat_last_0, magic=0:0;3:3572:0:c348b36c-f6dd-4a7d-ac5b-01a3b8ce3c34, cib=0.101.37) : Resource op removal
> May 21 09:23:12 dc1wwwrp01 crmd[2356]:     info: abort_transition_graph: te_update_diff:320 - Triggered transition abort (complete=1, tag=lrm_rsc_op, id=ybrpstat_last_0, magic=0:0;3:3572:0:c348b36c-f6dd-4a7d-ac5b-01a3b8ce3c34, cib=0.101.37) : Resource op removal
>
> From node2
> ===========
> May 21 09:20:03 dc1wwwrp02 lrmd: [2045]: info: rsc:ybrpip:16: monitor
> May 21 09:23:12 dc1wwwrp02 lrmd: [2045]: info: cancel_op: operation monitor[16] on ocf::IPaddr2::ybrpip for client 2048, its parameters: CRM_meta_name=[monitor] cidr_netmask=[24] crm_feature_set=[3.0.6] CRM_meta_timeout=[20000] CRM_meta_interval=[5000] ip=[10.32.21.10]  cancelled
> May 21 09:23:12 dc1wwwrp02 lrmd: [2045]: info: rsc:ybrpip:20: stop
> May 21 09:23:12 dc1wwwrp02 cib[2043]:     info: apply_xml_diff: Digest mis-match: expected dcee73fe6518ac0d4b3429425d5dfc16, calculated 4a39d2ad25d50af2ec19b5b24252aef8
> May 21 09:23:12 dc1wwwrp02 cib[2043]:   notice: cib_process_diff: Diff 0.101.36 -> 0.101.37 not applied to 0.101.36: Failed application of an update diff
> May 21 09:23:12 dc1wwwrp02 cib[2043]:     info: cib_server_process_diff: Requesting re-sync from peer
> May 21 09:23:12 dc1wwwrp02 cib[2043]:   notice: cib_server_process_diff: Not applying diff 0.101.36 -> 0.101.37 (sync in progress)
> May 21 09:23:12 dc1wwwrp02 cib[2043]:   notice: cib_server_process_diff: Not applying diff 0.101.37 -> 0.102.1 (sync in progress)
> May 21 09:23:12 dc1wwwrp02 cib[2043]:   notice: cib_server_process_diff: Not applying diff 0.102.1 -> 0.102.2 (sync in progress)
> May 21 09:23:12 dc1wwwrp02 cib[2043]:   notice: cib_server_process_diff: Not applying diff 0.102.2 -> 0.102.3 (sync in progress)
> May 21 09:23:12 dc1wwwrp02 cib[2043]:   notice: cib_server_process_diff: Not applying diff 0.102.3 -> 0.102.4 (sync in progress)
>
> Any help or suggestions much appreciated
>

Also, fencing! With stonith-enabled="false" and no-quorum-policy="ignore" 
on a two-node cluster, a split brain can end up with the VIP active on 
both nodes at once.
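Once you have a working fence device, the crm configuration looks 
something like this (a sketch only: fence_ipmilan is just one common 
agent, and the IPMI parameters are placeholders for your hardware):

```
primitive fence-rp01 stonith:fence_ipmilan \
        params pcmk_host_list="dc1wwwrp01" ipaddr="..." login="..." passwd="..." \
        op monitor interval="60s"
primitive fence-rp02 stonith:fence_ipmilan \
        params pcmk_host_list="dc1wwwrp02" ipaddr="..." login="..." passwd="..." \
        op monitor interval="60s"
location l-fence-rp01 fence-rp01 -inf: dc1wwwrp01
location l-fence-rp02 fence-rp02 -inf: dc1wwwrp02
property stonith-enabled="true"
```

The -inf location constraints keep each fence device off the node it is 
supposed to fence.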

>
> Thanks
> Alex




-- 
Cheers,
Florian Crouzat




More information about the Pacemaker mailing list