[Pacemaker] Pacemaker crash on node unstandby/standby.

Justin Burnham jburnham25 at gmail.com
Fri Oct 18 21:45:10 EDT 2013


Hi,

I am having an issue with Pacemaker on the cman stack where I can reliably
cause Pacemaker to crash and dump core when I put the node into standby or
take it out of standby.
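
Both directions trigger the crash; the commands I run are (node name taken
from `uname -n`):

  pcs cluster standby `uname -n`
  pcs cluster unstandby `uname -n`

Here is my messages log from the unstandby case: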

Oct 18 18:33:11 localhost crmd[19258]:   notice: do_state_transition: State
transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL
origin=abort_transition_graph ]
Oct 18 18:33:11 localhost cib[19253]:   notice: cib:diff: Diff: --- 0.45.4
Oct 18 18:33:11 localhost cib[19253]:   notice: cib:diff: Diff: +++ 0.46.1
01d8e730566358dcb36a7bc5b9ac587c
Oct 18 18:33:11 localhost cib[19253]:   notice: cib:diff: --
<nvpair id="nodes-node01-standby" name="standby" value="on" />
Oct 18 18:33:11 localhost cib[19253]:   notice: cib:diff: ++ <cib
epoch="46" num_updates="1" admin_epoch="0" validate-with="pacemaker-1.2"
crm_feature_set="3.0.7" update-origin="node01"
update-client="crm_attribute" cib-last-
written="Fri Oct 18 18:32:57 2013" have-quorum="0" dc-uuid="node01" />
Oct 18 18:33:11 localhost pengine[19257]:   notice: unpack_config: On loss
of CCM Quorum: Ignore
Oct 18 18:33:11 localhost pengine[19257]:   notice: LogActions: Start
reader_vip_2#011(node01)
Oct 18 18:33:11 localhost pengine[19257]:   notice: LogActions: Start
writer_vip#011(node01)
Oct 18 18:33:11 localhost pengine[19257]:   notice: LogActions: Start
reader_vip_1#011(node01)
Oct 18 18:33:11 localhost cib[19253]:    error: cib_ipc_closed: Client
not found in the hashtable
Oct 18 18:33:11 localhost cib[19253]:    error: qb_ipcs_connection_unref:
ref:0 state:3 (19253-20281-21)
Oct 18 18:33:11 localhost pengine[19257]:   notice: process_pe_message:
Calculated Transition 4: /var/lib/pacemaker/pengine/pe-input-77.bz2
Oct 18 18:33:12 localhost lrmd[19255]:   notice: operation_finished:
reader_vip_2_start_0:20284 [ 2013/10/18_18:33:12 INFO: ip -f inet addr add
10.10.10.9/32 brd 10.10.10.9 dev eth1 ]
Oct 18 18:33:12 localhost lrmd[19255]:   notice: operation_finished:
reader_vip_2_start_0:20284 [ 2013/10/18_18:33:12 INFO: ip link set eth1 up ]
Oct 18 18:33:12 localhost lrmd[19255]:   notice: operation_finished:
reader_vip_2_start_0:20284 [ 2013/10/18_18:33:12 INFO:
/usr/lib64/heartbeat/send_arp -i 200 -r 5 -p
/var/run/heartbeat/rsctmp/send_arp-10.10.10.9 eth1 10.10.10.9
 auto not_used not_used ]
Oct 18 18:33:13 localhost abrtd: Directory 'ccpp-2013-10-18-18:33:11-19253'
creation detected
Oct 18 18:33:13 localhost lrmd[19255]:   notice: operation_finished:
reader_vip_1_start_0:20286 [ 2013/10/18_18:33:12 INFO: ip -f inet addr add
10.10.10.8/32 brd 10.10.10.8 dev eth1 ]
Oct 18 18:33:13 localhost lrmd[19255]:   notice: operation_finished:
reader_vip_1_start_0:20286 [ 2013/10/18_18:33:12 INFO: ip link set eth1 up ]
Oct 18 18:33:13 localhost lrmd[19255]:   notice: operation_finished:
reader_vip_1_start_0:20286 [ 2013/10/18_18:33:12 INFO:
/usr/lib64/heartbeat/send_arp -i 200 -r 5 -p
/var/run/heartbeat/rsctmp/send_arp-10.10.10.8 eth1 10.10.10.8
 auto not_used not_used ]
Oct 18 18:33:13 localhost abrt[20283]: Saved core dump of pid 19253
(/usr/libexec/pacemaker/cib) to
/var/spool/abrt/ccpp-2013-10-18-18:33:11-19253 (10153984 bytes)
Oct 18 18:33:13 localhost rsyslogd-2177: imuxsock lost 75 messages from pid
18983 due to rate-limiting
Oct 18 18:33:13 localhost pacemakerd[19247]:   notice: pcmk_child_exit:
Child process cib terminated with signal 6 (pid=19253, core=128)
Oct 18 18:33:13 localhost pacemakerd[19247]:   notice: pcmk_process_exit:
Respawning failed child process: cib
Oct 18 18:33:13 localhost cib[20433]:   notice: crm_add_logfile: Additional
logging available in /var/log/cluster/corosync.log
Oct 18 18:33:13 localhost lrmd[19255]:   notice: operation_finished:
writer_vip_start_0:20285 [ 2013/10/18_18:33:12 INFO: ip -f inet addr add
10.10.10.10/32 brd 10.10.10.10 dev eth1 ]
Oct 18 18:33:13 localhost lrmd[19255]:   notice: operation_finished:
writer_vip_start_0:20285 [ 2013/10/18_18:33:12 INFO: ip link set eth1 up ]
Oct 18 18:33:13 localhost lrmd[19255]:   notice: operation_finished:
writer_vip_start_0:20285 [ 2013/10/18_18:33:13 INFO:
/usr/lib64/heartbeat/send_arp -i 200 -r 5 -p
/var/run/heartbeat/rsctmp/send_arp-10.10.10.10 eth1 10.10.10.10
 auto not_used not_used ]
Oct 18 18:33:13 localhost cib[20433]:   notice: crm_cluster_connect:
Connecting to cluster infrastructure: cman
Oct 18 18:33:13 localhost crmd[19258]:   notice: crm_ipc_send: Connection
to cib_shm closed
Oct 18 18:33:13 localhost crmd[19258]:    error:
cib_native_perform_op_delegate: Couldn't perform cib_modify operation
(timeout=120s): -107: Transport endpoint is not connected (107)
Oct 18 18:33:13 localhost crmd[19258]:    error:
cib_native_perform_op_delegate: CIB disconnected
Oct 18 18:33:13 localhost crmd[19258]:  warning: cib_rsc_callback: Resource
update -70 failed: (rc=-70) Communication error on send
Oct 18 18:33:13 localhost crmd[19258]:   notice: process_lrm_event: LRM
operation reader_vip_2_start_0 (call=60, rc=0, cib-update=-70,
confirmed=true) ok
Oct 18 18:33:13 localhost crmd[19258]:  warning: cib_rsc_callback: Resource
update -107 failed: (rc=-107) Transport endpoint is not connected
Oct 18 18:33:13 localhost crmd[19258]:   notice: process_lrm_event: LRM
operation reader_vip_1_start_0 (call=64, rc=0, cib-update=-107,
confirmed=true) ok
Oct 18 18:33:13 localhost crmd[19258]:  warning: cib_rsc_callback: Resource
update -107 failed: (rc=-107) Transport endpoint is not connected
Oct 18 18:33:13 localhost crmd[19258]:   notice: process_lrm_event: LRM
operation writer_vip_start_0 (call=62, rc=0, cib-update=-107,
confirmed=true) ok
Oct 18 18:33:13 localhost crmd[19258]:    error: crm_ipc_read: Connection
to cib_shm failed
Oct 18 18:33:13 localhost crmd[19258]:    error: mainloop_gio_callback:
Connection to cib_shm[0x26c7080] closed (I/O condition=17)
Oct 18 18:33:13 localhost crmd[19258]:    error:
crmd_cib_connection_destroy: Connection to the CIB terminated...
Oct 18 18:33:13 localhost crmd[19258]:    error: do_log: FSA: Input I_ERROR
from crmd_cib_connection_destroy() received in state S_TRANSITION_ENGINE
Oct 18 18:33:13 localhost crmd[19258]:  warning: do_state_transition: State
transition S_TRANSITION_ENGINE -> S_RECOVERY [ input=I_ERROR
cause=C_FSA_INTERNAL origin=crmd_cib_connection_destroy ]
Oct 18 18:33:13 localhost crmd[19258]:    error: do_recover: Action
A_RECOVER (0000000001000000) not supported
Oct 18 18:33:13 localhost crmd[19258]:  warning: do_election_vote: Not
voting in election, we're in state S_RECOVERY
Oct 18 18:33:13 localhost crmd[19258]:  warning: destroy_action: Cancelling
timer for action 4 (src=71)
Oct 18 18:33:13 localhost crmd[19258]:  warning: destroy_action: Cancelling
timer for action 6 (src=72)
Oct 18 18:33:13 localhost crmd[19258]:  warning: destroy_action: Cancelling
timer for action 8 (src=73)
Oct 18 18:33:13 localhost crmd[19258]:    error: do_log: FSA: Input
I_TERMINATE from do_recover() received in state S_RECOVERY
Oct 18 18:33:13 localhost crmd[19258]:    error: verify_stopped: Resource
reader_vip_1 was active at shutdown.  You may ignore this error if it is
unmanaged.
Oct 18 18:33:13 localhost crmd[19258]:    error: verify_stopped: Resource
reader_vip_2 was active at shutdown.  You may ignore this error if it is
unmanaged.
Oct 18 18:33:13 localhost crmd[19258]:    error: verify_stopped: Resource
writer_vip was active at shutdown.  You may ignore this error if it is
unmanaged.
Oct 18 18:33:13 localhost crmd[19258]:   notice: terminate_cs_connection:
Disconnecting from Corosync
Oct 18 18:33:13 localhost crmd[19258]:    error: do_exit: Could not recover
from internal error
Oct 18 18:33:13 localhost pacemakerd[19247]:    error:
qb_ipcs_connection_unref: ref:0 state:3 (19247-19258-19)
Oct 18 18:33:14 localhost abrtd: Directory 'ccpp-2013-10-18-18:33:14-19247'
creation detected
Oct 18 18:33:14 localhost abrt[20454]: Saved core dump of pid 19247
(/usr/sbin/pacemakerd) to /var/spool/abrt/ccpp-2013-10-18-18:33:14-19247
(9646080 bytes)
Oct 18 18:33:24 localhost abrtd: Sending an email...
Oct 18 18:33:24 localhost abrtd: Email was sent to: root at localhost
Oct 18 18:33:25 localhost abrtd: New problem directory
/var/spool/abrt/ccpp-2013-10-18-18:33:11-19253, processing

I get "[abrt] full crash report" emails as well.

Some basic info from the emails:

Subject: [abrt] full crash report
User-Agent: Heirloom mailx 12.4 7/29/08
Content-Type: text/plain; charset=us-ascii
Status: RO

abrt_version:   2.0.8
cgroup:
cmdline:        /usr/libexec/pacemaker/cib
executable:     /usr/libexec/pacemaker/cib
kernel:         2.6.32-358.18.1.el6.x86_64
pid:            19253
pwd:            /var/lib/pacemaker/cores
time:           Fri 18 Oct 2013 06:33:11 PM PDT
uid:            495
username:       hacluster

sosreport.tar.xz: Binary file, 401160 bytes

and

Subject: [abrt] full crash report
User-Agent: Heirloom mailx 12.4 7/29/08
Content-Type: text/plain; charset=us-ascii
Status: R

abrt_version:   2.0.8
cgroup:
cmdline:        pacemakerd
executable:     /usr/sbin/pacemakerd
kernel:         2.6.32-358.18.1.el6.x86_64
pid:            19247
pwd:            /var/lib/pacemaker/cores/root
time:           Fri 18 Oct 2013 06:33:14 PM PDT
uid:            0
username:       root

sosreport.tar.xz: Binary file, 401964 bytes
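
If a backtrace from the saved cores would help, I can pull one with
something like the following (assuming abrt kept the core in a file named
"coredump" inside the problem directory and that the pacemaker debuginfo
packages are installed):

  gdb /usr/libexec/pacemaker/cib \
      /var/spool/abrt/ccpp-2013-10-18-18:33:11-19253/coredump
  (gdb) thread apply all bt full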


This is CentOS 6.4 running the following versions:

corosync-1.4.1-15.el6_4.1.x86_64
pacemaker-libs-1.1.8-7.el6.x86_64
corosynclib-1.4.1-15.el6_4.1.x86_64
pacemaker-cli-1.1.8-7.el6.x86_64
pacemaker-cluster-libs-1.1.8-7.el6.x86_64
pacemaker-1.1.8-7.el6.x86_64

Here is my corosync.conf file:

compatibility: whitetank

totem {
  version: 2

  rrp_mode: none
  netmtu: 1500
  secauth: on
  threads: 0
  transport: udp
  token: 1000
  token_retransmit: 238
  hold: 100
  join: 50
  consensus: 1200
  merge: 200
  downcheck: 1000
  fail_recv_const: 2500
  seqno_unchanged_const: 30
  heartbeat_failures_allowed: 0
  max_network_delay: 50
  window_size: 50
  max_messages: 17
  miss_count_const: 5
  rrp_problem_count_timeout: 2000
  rrp_problem_count_threshold: 10
  rrp_problem_count_mcast_threshold:
  rrp_token_expired_timeout: 47
  rrp_autorecovery_check_timeout: 1000
  interface {
    ringnumber: 0
    bindnetaddr: 10.10.10.0
    broadcast: no
    mcastaddr: 239.0.0.111
    mcastport: 4000
    ttl: 1

  }
}

logging {
  timestamp: off
  fileline: off
  to_stderr: no
  to_logfile: yes
  to_syslog: yes

  logfile: /var/log/corosync.log

  logfile_priority: info
  syslog_facility: daemon
  syslog_priority: info
  debug: off

}

amf {
  mode: disabled
}

and here is my cluster.conf:

<?xml version="1.0"?>
<cluster config_version="1" name="nodes">
  <logging debug="on"/>
  <clusternodes>
    <clusternode name="node01" nodeid="1"/>
    <clusternode name="node02" nodeid="2"/>
  </clusternodes>
</cluster>

If I'm doing something obviously wrong, please let me know. If you would
like the full contents of the core dumps or the abrt emails, I can provide
those as well.
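
I can also bundle the logs and PE inputs up with crm_report if that is
easier to dig through, e.g. something like:

  crm_report -f "2013-10-18 18:30" -t "2013-10-18 18:40" pcmk-crash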

Thanks for your time,

Justin