[Pacemaker] pacemaker error after a couple week or month

Fri Dec 19 14:21:59 EST 2014

----- Original Message -----
> Hello,
> 
> I have two active-passive failover systems built with corosync and DRBD.
> One system uses two Debian servers and the other uses two Ubuntu servers.
> The Debian servers provide web server failover and the Ubuntu servers
> provide database server failover.
> 
> I applied the same Pacemaker configuration to both systems. Everything works
> fine: failover and file system synchronization both behave as expected. On
> the Ubuntu servers, however, an error always appears after a couple of weeks
> or months. Pacemaker on ubuntu1 then had a different status from ubuntu2:
> ubuntu1 assumed that ubuntu2 was down, while ubuntu2 assumed that something
> had happened to ubuntu1 but that it was still alive, and took over the
> resources. As a result, the DRBD resource could not be taken over, so no
> failover happened, and we had to restart the server manually because
> restarting pacemaker and corosync did not help. I have changed the Pacemaker
> configuration a couple of times, but the problem still exists.
> 
> Has anyone experienced this? I use Ubuntu 14.04.1 LTS.
> 
> I got this error in apport.log
> 
> ERROR: apport (pid 20361) Fri Dec 19 02:43:52 2014: executable:
> /usr/lib/pacemaker/lrmd (command line "/usr/lib/pacemaker/lrmd")

Wow, it looks like the lrmd is crashing on you. I haven't seen this happen
in the wild before. Without a backtrace, it will be nearly impossible to
determine what is happening.
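
For what it's worth, here is a rough sketch of one way to pull a backtrace out
of the apport report. It assumes apport-unpack and gdb are installed and that
debug symbols for pacemaker are available; without symbols the trace will be
mostly unreadable. The function name lrmd_backtrace and the default output
directory /tmp/lrmd-crash are made up for illustration.

```shell
#!/bin/sh
# Sketch only: extract a backtrace from an apport crash report.
# Assumptions: apport-unpack and gdb are installed, and the crashing binary
# is /usr/lib/pacemaker/lrmd (as reported in apport.log).
# lrmd_backtrace and /tmp/lrmd-crash are hypothetical names.
lrmd_backtrace() {
    crash_file="$1"
    out_dir="${2:-/tmp/lrmd-crash}"

    # Bail out early if the report is missing or unreadable.
    if [ ! -r "$crash_file" ]; then
        echo "cannot read crash report: $crash_file" >&2
        return 1
    fi

    # Unpack the report into one file per field; the core image is
    # written to $out_dir/CoreDump.
    apport-unpack "$crash_file" "$out_dir" || return 2

    # Print a full backtrace of every thread without an interactive session.
    gdb --batch -ex "thread apply all bt full" \
        /usr/lib/pacemaker/lrmd "$out_dir/CoreDump"
}
```

Something like "lrmd_backtrace /var/crash/_usr_lib_pacemaker_lrmd.0.crash > lrmd-bt.txt",
run as root, should then give output that can be attached to a reply or a bug
report.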

Do you have the ability to upgrade pacemaker to a newer version?

-- Vossel

> ERROR: apport (pid 20361) Fri Dec 19 02:43:52 2014: is_closing_session(): no
> DBUS_SESSION_BUS_ADDRESS in environment
> ERROR: apport (pid 20361) Fri Dec 19 02:43:52 2014: wrote report
> /var/crash/_usr_lib_pacemaker_lrmd.0.crash
> 
> my pacemaker configuration:
> 
> node $id="1" db \
> attributes standby="off"
> node $id="2" db2 \
> attributes standby="off"
> primitive ClusterIP ocf:heartbeat:IPaddr2 \
> params ip="192.168.0.100" cidr_netmask="24" \
> op monitor interval="30s"
> primitive DBase ocf:heartbeat:mysql \
> meta target-role="Started" \
> op start timeout="120s" interval="0" \
> op stop timeout="120s" interval="0" \
> op monitor interval="20s" timeout="30s"
> primitive DbFS ocf:heartbeat:Filesystem \
> params device="/dev/drbd0" directory="/sync" fstype="ext4" \
> op start timeout="60s" interval="0" \
> op stop timeout="180s" interval="0" \
> op monitor interval="60s" timeout="60s"
> primitive Links lsb:drbdlinks
> primitive r0 ocf:linbit:drbd \
> params drbd_resource="r0" \
> op monitor interval="29s" role="Master" \
> op start timeout="240s" interval="0" \
> op stop timeout="180s" interval="0" \
> op promote timeout="180s" interval="0" \
> op demote timeout="180s" interval="0" \
> op monitor interval="30s" role="Slave"
> group DbServer ClusterIP DbFS Links DBase
> ms ms_r0 r0 \
> meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" \
> notify="true" target-role="Master"
> location prefer-db DbServer 50: db
> colocation DbServer-with-ms_ro inf: DbServer ms_r0:Master
> order DbServer-after-ms_ro inf: ms_r0:promote DbServer:start
> property $id="cib-bootstrap-options" \
> dc-version="1.1.10-42f2063" \
> cluster-infrastructure="corosync" \
> expected-quorum-votes="2" \
> stonith-enabled="false" \
> no-quorum-policy="ignore" \
> last-lrm-refresh="1363370585"
> 
> my corosync config:
> 
> totem {
> version: 2
> token: 3000
> token_retransmits_before_loss_const: 10
> join: 60
> consensus: 3600
> vsftype: none
> max_messages: 20
> clear_node_high_bit: yes
> secauth: off
> threads: 0
> rrp_mode: none
> transport: udpu
> cluster_name: Dbcluster
> }
> 
> nodelist {
> node {
> ring0_addr: db
> nodeid: 1
> }
> node {
> ring0_addr: db2
> nodeid: 2
> }
> }
> 
> quorum {
> provider: corosync_votequorum
> }
> 
> amf {
> mode: disabled
> }
> 
> service {
> ver: 0
> name: pacemaker
> }
> 
> aisexec {
> user: root
> group: root
> }
> 
> logging {
> fileline: off
> to_stderr: yes
> to_logfile: yes
> logfile: /var/log/corosync/corosync.log
> to_syslog: no
> syslog_facility: daemon
> debug: off
> timestamp: on
> logger_subsys {
> subsys: AMF
> debug: off
> tags: enter|leave|trace1|trace2|trace3|trace4|trace6
> }
> }
> 
> my drbd.conf:
> 
> global {
> usage-count no;
> }
> 
> common {
> protocol C;
> 
> handlers {
> pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh;
> /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ;
> reboot -f";
> pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh;
> /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ;
> reboot -f";
> local-io-error "/usr/lib/drbd/notify-io-error.sh;
> /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ;
> halt -f";
> }
> 
> startup {
> degr-wfc-timeout 120;
> }
> 
> disk {
> on-io-error detach;
> }
> 
> syncer {
> rate 100M;
> al-extents 257;
> }
> }
> 
> resource r0 {
> protocol C;
> flexible-meta-disk internal;
> 
> on db2 {
> address 192.168.0.10:7801 ;
> device /dev/drbd0 minor 0;
> disk /dev/sdb1;
> }
> on db {
> device /dev/drbd0 minor 0;
> disk /dev/db/sync;
> address 192.168.0.20:7801 ;
> }
> handlers {
> split-brain "/usr/lib/drbd/notify-split-brain.sh root";
> }
> net {
> after-sb-0pri discard-younger-primary; #discard-zero-changes;
> after-sb-1pri discard-secondary;
> after-sb-2pri call-pri-lost-after-sb;
> }
> }
> 
> I have no idea how to solve this problem. Maybe someone can help me.
> 
> best regards,
> 
> ariee
> 