[ClusterLabs] Pacemaker crash and reboot in maintenance-mode

Digimer lists at alteeve.ca
Sat May 30 17:19:39 UTC 2015


On 30/05/15 01:16 PM, Oscar Salvador wrote:
> 
> Hi  Guys!
> 
> Yesterday I posted to pacemaker at oss.clusterlabs.org, but I think that
> list is deprecated, so I then posted to this list, but in an awful way.
> I have also added some technical specifications.
> Sorry for posting it again.
> 
> I'm having a strange problem with a pacemaker-heartbeat cluster when I
> put it in maintenance-mode.
> 
> First of all, let me show my configuration:
> 
> [ Pacemaker ]
> 
> node $id="23ade9ce-d274-4e56-aa91-9e95a8c08cf7" test-lb02 \
>         attributes standby="off"
> node $id="52ac429f-2b78-4630-bbd8-fb73a8152ab3" test-lb01
> primitive ClusterMon ocf:pacemaker:ClusterMon \
>         params user="root" update="30" extra_options="-T somemail -F
> somemail -P PACEMAKER" \
>         op monitor interval="60" timeout="20" on-fail="standby"
> primitive IP-rsc_apache ocf:heartbeat:IPaddr2 \
>         params ip="xx.xx.xx.yy" nic="eth0" cidr_netmask="255.255.255.192" \
>         meta migration-threshold="2" target-role="Started" \
>         op monitor interval="20" timeout="20" on-fail="standby"
> primitive Nginx-rsc ocf:heartbeat:nginx \
>         meta migration-threshold="2" is-managed="true"
> target-role="Started" \
>         op monitor interval="20" timeout="20" on-fail="standby"
> clone ClusterMon-clone ClusterMon \
>         meta target-role="Started"
> colocation lb-loc inf: IP-rsc_apache Nginx-rsc
> order lb-ord inf: IP-rsc_apache Nginx-rsc
> property $id="cib-bootstrap-options" \
>         stonith-enabled="no" \
>         dc-version="1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff" \
>         cluster-infrastructure="Heartbeat" \
>         maintenance-mode="false" \
>         cluster-recheck-interval="60s"
> 
> [heartbeat]
> 
> crm yes
> 
> logfile /var/log/ha-log
> 
> logfacility     local0
> 
> keepalive 2
> deadtime 30
> warntime 10
> initdead 120
> 
> auto_failback off
> 
> ucast   eth0 xx.xx.xx.xx
> ucast   eth0 xx.xx.xx.xy
> 
> node    test-lb01
> node    test-lb02
> 
> 
> 
> [STATUS]
> 
> Last updated: Fri May 29 18:04:11 2015
> Last change: Fri May 29 18:01:57 2015 via cibadmin on test-lb01
> Stack: Heartbeat
> Current DC: test-lb01 (52ac429f-2b78-4630-bbd8-fb73a8152ab3) - partition
> with quorum
> Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
> 2 Nodes configured, unknown expected votes
> 4 Resources configured.
> ============
> 
> Online: [ test-lb01 test-lb02 ]
> 
>  IP-rsc_apache    (ocf::heartbeat:IPaddr2):    Started test-lb01
>  Nginx-rsc    (ocf::heartbeat:nginx):    Started test-lb01
>  Clone Set: ClusterMon-clone [ClusterMon]
>      Started: [ test-lb01 test-lb02 ]
> 
> 
> ====================
> 
> 
> So, everything is working. But now, if I enable maintenance-mode on
> the cluster, both nodes get rebooted:
> 
> 
> crm(live)# configure property maintenance-mode="true"
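> 
> For reference, the crm shell is just setting a cluster property here;
> if I'm not mistaken, the equivalent low-level command would be:
> 
>     crm_attribute --type crm_config --name maintenance-mode --update true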
> 
> Then in the logs:
> 
> 
> May 29 18:06:39 test-lb01 crmd: [3240]: info: te_rsc_command: Initiating
> action 2: cancel IP-rsc_apache_monitor_20000 on test-lb01 (local)
> May 29 18:06:39 test-lb01 lrmd: [3237]: info: cancel_op: operation
> monitor[8] on IP-rsc_apache for client 3240, its parameters:
> cidr_netmask=[255.255.255.192] CRM_meta_timeout=[20000]
> CRM_meta_name=[monitor] CRM_meta_on_fail=[standby]
> crm_feature_set=[3.0.6] CRM_meta_interval=[20000] ip=[xx.xx.xx.yy]
> nic=[eth0]  cancelled
> May 29 18:06:39 test-lb01 crmd: [3240]: info: te_rsc_command: Initiating
> action 3: cancel Nginx-rsc_monitor_20000 on test-lb01 (local)
> May 29 18:06:39 test-lb01 lrmd: [3237]: info: cancel_op: operation
> monitor[10] on Nginx-rsc for client 3240, its parameters:
> crm_feature_set=[3.0.6] CRM_meta_on_fail=[standby]
> CRM_meta_name=[monitor] CRM_meta_interval=[20000]
> CRM_meta_timeout=[20000]  cancelled
> May 29 18:06:39 test-lb01 crmd: [3240]: info: te_rsc_command: Initiating
> action 4: cancel ClusterMon:0_monitor_60000 on test-lb01 (local)
> May 29 18:06:39 test-lb01 lrmd: [3237]: info: cancel_op: operation
> monitor[6] on ClusterMon:0 for client 3240, its parameters:
> CRM_meta_timeout=[20000] CRM_meta_name=[monitor]
> CRM_meta_on_fail=[standby] crm_feature_set=[3.0.6] CRM_meta_notify=[false]
> extra_options=[-T somemail -F somemail -P PACEMAKuser=[root]
> CRM_meta_clone=[0] CRM_meta_clone_max=[2] CRM_meta_clone_node_max=[1]
> CRM_meta_interval=[60000] CRM_meta_globally_unique=[false] update=[30] 
> cancelled
> 
> May 29 18:06:39 test-lb02 lrmd: [3223]: info: cancel_op: operation
> monitor[6] on ClusterMon:1 for client 3226, its parameters:
> CRM_meta_timeout=[20000] CRM_meta_name=[monitor]
> CRM_meta_on_fail=[standby] crm_feature_set=[3.0.6]
> CRM_meta_notify=[false] extra_options=[-T somemail -F some mail -P
> PACEMAKuser=[root] CRM_meta_clone=[1] CRM_meta_clone_max=[2]
> CRM_meta_clone_node_max=[1] CRM_meta_interval=[60000]
> CRM_meta_globally_unique=[false] update=[30]  cancelled
> 
> 
> May 29 18:06:39 test-lb01 crmd: [3240]: info: te_rsc_command: Initiating
> action 1: cancel ClusterMon:1_monitor_60000 on test-lb02
> May 29 18:06:39 test-lb01 crmd: [3240]: info: process_lrm_event: LRM
> operation IP-rsc_apache_monitor_20000 (call=8, status=1, cib-update=0,
> confirmed=true) Cancelled
> May 29 18:06:39 test-lb01 crmd: [3240]: info: process_lrm_event: LRM
> operation Nginx-rsc_monitor_20000 (call=10, status=1, cib-update=0,
> confirmed=true) Cancelled
> May 29 18:06:39 test-lb01 crmd: [3240]: info: process_lrm_event: LRM
> operation ClusterMon:0_monitor_60000 (call=6, status=1, cib-update=0,
> confirmed=true) Cancelled
> May 29 18:06:39 test-lb02 crmd: [3226]: info: process_lrm_event: LRM
> operation ClusterMon:1_monitor_60000 (call=6, status=1, cib-update=0,
> confirmed=true) Cancelled
> 
> 
> But after 60s, when the timer fires, the whole cluster goes down:
> 
> 
> May 29 18:07:40 test-lb01 crmd: [3240]: info: crm_timer_popped: PEngine
> Recheck Timer (I_PE_CALC) just popped (60000ms)
> May 29 18:07:40 test-lb01 crmd: [3240]: notice: do_state_transition:
> State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC
> cause=C_TIMER_POPPED origin=crm_timer_popped ]
> May 29 18:07:40 test-lb01 crmd: [3240]: info: do_state_transition:
> Progressed to state S_POLICY_ENGINE after C_TIMER_POPPED
> May 29 18:07:40 test-lb01 crmd: [3240]: notice: do_state_transition:
> State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [
> input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
> May 29 18:07:40 test-lb01 crmd: [3240]: WARN: destroy_action: Cancelling
> timer for action 2 (src=98)
> May 29 18:07:40 test-lb01 crmd: [3240]: WARN: destroy_action: Cancelling
> timer for action 3 (src=99)
> May 29 18:07:40 test-lb01 crmd: [3240]: WARN: destroy_action: Cancelling
> timer for action 4 (src=100)
> May 29 18:07:40 test-lb01 crmd: [3240]: info: do_te_invoke: Processing
> graph 8 (ref=pe_calc-dc-1432915660-42) derived from
> /var/lib/pengine/pe-input-242.bz2
> May 29 18:07:40 test-lb01 crmd: [3240]: info: te_rsc_command: Initiating
> action 2: cancel IP-rsc_apache_monitor_20000 on test-lb01 (local)
> May 29 18:07:40 test-lb01 crmd: [3240]: info: cancel_op: No pending op
> found for IP-rsc_apache:8
> May 29 18:07:40 test-lb01 lrmd: [3237]: info: on_msg_cancel_op: no
> operation with id 8
> May 29 18:07:40 test-lb01 cib: [3236]: info: cib_process_request:
> Operation complete: op cib_delete for section
> //node_state[@uname='test-lb01']//lrm_resource[@id='IP-rsc_apache']/lrm_rsc_op[@id='IP-rsc_apache_monitor_20000'
> and @call-id='8'] (origin=local/crmd/73, version=0.124.27): ok (rc=0)
> May 29 18:07:40 test-lb01 crmd: [3240]: info: te_rsc_command: Initiating
> action 3: cancel Nginx-rsc_monitor_20000 on test-lb01 (local)
> May 29 18:07:40 test-lb01 crmd: [3240]: ERROR: lrm_get_rsc(666): failed
> to send a getrsc message to lrmd via ch_cmd channel.
> May 29 18:07:40 test-lb01 crmd: [3240]: ERROR: lrm_get_rsc(666): failed
> to send a getrsc message to lrmd via ch_cmd channel.
> May 29 18:07:40 test-lb01 crmd: [3240]: ERROR: lrm_add_rsc(870): failed
> to send a addrsc message to lrmd via ch_cmd channel.
> May 29 18:07:40 test-lb01 crmd: [3240]: ERROR: lrm_get_rsc(666): failed
> to send a getrsc message to lrmd via ch_cmd channel.
> May 29 18:07:40 test-lb01 crmd: [3240]: ERROR: get_lrm_resource: Could
> not add resource Nginx-rsc to LRM
> May 29 18:07:40 test-lb01 crmd: [3240]: ERROR: do_lrm_invoke: Invalid
> resource definition
> 
> 
> And this message repeated for every resource I have:
> 
> May 29 18:07:40 test-lb01 crmd: [3240]: WARN: do_lrm_invoke: bad input
> <create_request_adv origin="te_rsc_command" t="crmd" version="3.0.6"
> subt="request" reference="lrm_invoke-tengine-1432915660-45"
> crm_task="lrm_invoke" crm_sys_to="lrmd" crm_sys_from="tengine"
> crm_host_to="test-lb01" >
> May 29 18:07:40 test-lb01 crmd: [3240]: WARN: do_lrm_invoke: bad input  
> <crm_xml >
> May 29 18:07:40 test-lb01 crmd: [3240]: WARN: do_lrm_invoke: bad
> input     <rsc_op id="3" operation="cancel"
> operation_key="Nginx-rsc_monitor_20000" on_node="test-lb01"
> on_node_uuid="52ac429f-2b78-4630-bbd8-fb73a8152ab3"
> transition-key="3:8:0:3edaee69-5093-4538-8d12-90e0db0658ba" >
> May 29 18:07:40 test-lb01 crmd: [3240]: WARN: do_lrm_invoke: bad
> input       <primitive id="Nginx-rsc" long-id="Nginx-rsc" class="ocf"
> provider="heartbeat" type="nginx" />
> May 29 18:07:40 test-lb01 crmd: [3240]: WARN: do_lrm_invoke: bad
> input       <attributes CRM_meta_call_id="10" CRM_meta_interval="20000"
> CRM_meta_name="monitor" CRM_meta_on_fail="standby"
> CRM_meta_operation="monitor" CRM_meta_timeout="20000"
> crm_feature_set="3.0.6" />
> May 29 18:07:40 test-lb01 crmd: [3240]: WARN: do_lrm_invoke: bad
> input     </rsc_op>
> May 29 18:07:40 test-lb01 crmd: [3240]: WARN: do_lrm_invoke: bad input  
> </crm_xml>
> May 29 18:07:40 test-lb01 crmd: [3240]: WARN: do_lrm_invoke: bad input
> </create_request_adv>
> May 29 18:07:40 test-lb01 crmd: [3240]: info: te_rsc_command: Initiating
> action 4: cancel ClusterMon:0_monitor_60000 on test-lb01 (local)
> May 29 18:07:40 test-lb01 crmd: [3240]: ERROR: lrm_get_rsc(666): failed
> to send a getrsc message to lrmd via ch_cmd channel.
> May 29 18:07:40 test-lb01 crmd: [3240]: ERROR: lrm_get_rsc(666): failed
> to send a getrsc message to lrmd via ch_cmd channel.
> May 29 18:07:40 test-lb01 crmd: [3240]: ERROR: lrm_add_rsc(870): failed
> to send a addrsc message to lrmd via ch_cmd channel.
> May 29 18:07:40 test-lb01 crmd: [3240]: ERROR: lrm_get_rsc(666): failed
> to send a getrsc message to lrmd via ch_cmd channel.
> 
> And then the real crash:
> 
> May 29 18:07:40 test-lb01 crmd: [3240]: ERROR: crm_abort:
> abort_transition_graph: Triggered assert at te_utils.c:339 :
> transition_graph != NULL
> May 29 18:07:40 test-lb01 heartbeat: [3155]: WARN: Managed
> /usr/lib/heartbeat/crmd process 3240 killed by signal 11 [SIGSEGV -
> Segmentation violation].
> May 29 18:07:40 test-lb01 heartbeat: [3155]: ERROR: Managed
> /usr/lib/heartbeat/crmd process 3240 dumped core
> May 29 18:07:40 test-lb01 heartbeat: [3155]: EMERG: Rebooting system. 
> Reason: /usr/lib/heartbeat/crmd
> 
> 
> 
> At the same time, I found that in the other node:
> 
> 
> May 29 18:07:41 test-lb02 crmd: [3226]: CRIT: lrm_connection_destroy:
> LRM Connection failed
> May 29 18:07:41 test-lb02 crmd: [3226]: info: lrm_connection_destroy:
> LRM Connection disconnected
> May 29 18:07:41 test-lb02 crmd: [3226]: ERROR: do_log: FSA: Input
> I_ERROR from lrm_connection_destroy() received in state S_ELECTION
> May 29 18:07:41 test-lb02 crmd: [3226]: notice: do_state_transition:
> State transition S_ELECTION -> S_RECOVERY [ input=I_ERROR
> cause=C_FSA_INTERNAL origin=lrm_connection_destroy ]
> May 29 18:07:41 test-lb02 crmd: [3226]: ERROR: do_recover: Action
> A_RECOVER (0000000001000000) not supported
> May 29 18:07:41 test-lb02 crmd: [3226]: info: do_dc_release: DC role
> released
> May 29 18:07:41 test-lb02 crmd: [3226]: info: do_te_control:
> Transitioner is now inactive
> May 29 18:07:41 test-lb02 crmd: [3226]: ERROR: do_log: FSA: Input
> I_TERMINATE from do_recover() received in state S_RECOVERY
> 
> 
> May 29 18:07:41 test-lb02 heartbeat: [3148]: WARN: Managed
> /usr/lib/heartbeat/crmd process 3226 exited with return code 2.
> May 29 18:07:41 test-lb02 heartbeat: [3148]: EMERG: Rebooting system. 
> Reason: /usr/lib/heartbeat/crmd
> 
> 
> 
> For some reason, Pacemaker found something it didn't like and crashed,
> but I don't understand what.
> Could someone give me some hints about this?
> 
> 
> Some technical specifications:
> 
> * Debian wheezy
> * Pacemaker: 1.1.7-1
> * Heartbeat: 1:3.0.5-3

That is an extremely old stack. Many bug fixes have gone into Pacemaker
since then. Also, heartbeat has long been deprecated in favour of corosync.
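
When you do move to corosync, the config for a two-node cluster is
small. A minimal sketch of /etc/corosync/corosync.conf might look like
this (assuming corosync 2.x with unicast transport; fill in the real
node addresses from your ha.cf):

totem {
        version: 2
        cluster_name: test-lb
        transport: udpu
}

nodelist {
        node {
                ring0_addr: test-lb01
                nodeid: 1
        }
        node {
                ring0_addr: test-lb02
                nodeid: 2
        }
}

quorum {
        provider: corosync_votequorum
        two_node: 1
}

logging {
        to_syslog: yes
}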

Can you upgrade and see if the issue persists?
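
If you want to dig into the crash on the current stack first, the crmd
core dump is the place to start. Something like this (the core file
path is a guess; point gdb at wherever your system writes cores, and
install the matching debug symbols if your distro ships them):

    gdb /usr/lib/heartbeat/crmd /path/to/core
    (gdb) bt full

A backtrace showing how transition_graph ended up NULL would make it
much easier to match this to a known, and likely already-fixed, bug.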

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?



