[Pacemaker] cibadmin set node offline?

James Gibbard thisbodydrop at gmail.com
Wed Aug 6 13:22:40 UTC 2014


Hi,

I have set up a 2-node cluster using the following packages:

pacemaker                           1.1.10+git20130802-1ubuntu2
corosync                            2.3.3-1ubuntu1

My cluster config is as follows:

node $id="12303" ldb03
node $id="12304" ldb04
primitive p_fence_ldb03 stonith:external/vcenter \
        params VI_SERVER="10.17.248.10" VI_CREDSTORE="/root/.vmware/credstore/vicredentials.xml" HOSTLIST="ldb03=ldb03" RESETPOWERON="0" pcmk_host_check="static-list" pcmk_host_list="ldb03" \
        op start interval="0" timeout="500s"
primitive p_fence_ldb04 stonith:external/vcenter \
        params VI_SERVER="10.17.248.10" VI_CREDSTORE="/root/.vmware/credstore/vicredentials.xml" HOSTLIST="ldb04=ldb04" RESETPOWERON="0" pcmk_host_check="static-list" pcmk_host_list="ldb04" \
        op start interval="0" timeout="500s"
primitive p_fs_mysql ocf:heartbeat:Filesystem \
        params device="nfsserver:/LDB_Cluster1" directory="/var/lib/mysql" fstype="nfs" options="relatime,rw,hard,nointr,rsize=32768,wsize=32768,bg,vers=3,proto=tcp" \
        op start interval="0" timeout="60s" \
        op stop interval="0" timeout="120s" \
        op monitor interval="60s" timeout="60s" \
        meta is-managed="true"
primitive p_ip_1 ocf:heartbeat:IPaddr2 \
        params ip="10.10.10.11" cidr_netmask="25" \
        op monitor interval="30s" \
        meta target-role="Started" is-managed="true"
primitive p_ip_2 ocf:heartbeat:IPaddr2 \
        params ip="10.10.10.12" cidr_netmask="25" \
        op monitor interval="30s" \
        meta target-role="Started" is-managed="true"
primitive p_ip_3 ocf:heartbeat:IPaddr2 \
        params ip="10.10.10.13" cidr_netmask="25" \
        op monitor interval="30s" \
        meta target-role="Started" is-managed="true"
primitive p_mysql ocf:heartbeat:mysql \
        params datadir="/var/lib/mysql" binary="/usr/bin/mysqld_safe" socket="/var/run/mysqld/mysqld.sock" \
        op start interval="0" timeout="120" \
        op stop interval="0" timeout="120" \
        op monitor interval="20" timeout="30" \
        meta target-role="Started" is-managed="true"
group g_mysql p_fs_mysql p_mysql p_ip_1 p_ip_2 p_ip_3
location l_fence_ldb03 p_fence_ldb03 -inf: ldb03
location l_fence_ldb04 p_fence_ldb04 -inf: ldb04
property $id="cib-bootstrap-options" \
        dc-version="1.1.10-42f2063" \
        cluster-infrastructure="corosync" \
        no-quorum-policy="ignore" \
        stonith-enabled="true" \
        stop-all-resources="false" \
        expected-quorum-votes="2" \
        last-lrm-refresh="1407325251"


This exact configuration has worked during the setup, but I have
encountered a problem with my inactive node ldb03. Corosync shows this node
as up:

root@ldb03:~# corosync-cmapctl | grep members
runtime.totem.pg.mrp.srp.members.12303.config_version (u64) = 0
runtime.totem.pg.mrp.srp.members.12303.ip (str) = r(0) ip(10.10.10.8)
runtime.totem.pg.mrp.srp.members.12303.join_count (u32) = 1
runtime.totem.pg.mrp.srp.members.12303.status (str) = joined
runtime.totem.pg.mrp.srp.members.12304.config_version (u64) = 0
runtime.totem.pg.mrp.srp.members.12304.ip (str) = r(0) ip(10.10.10.9)
runtime.totem.pg.mrp.srp.members.12304.join_count (u32) = 1
runtime.totem.pg.mrp.srp.members.12304.status (str) = joined

and both crm status and crm node status show it as online:

Last updated: Wed Aug  6 14:16:24 2014
Last change: Wed Aug  6 14:02:00 2014 via crm_resource on ldb04
Stack: corosync
Current DC: ldb04 (12304) - partition with quorum
Version: 1.1.10-42f2063
2 Nodes configured
7 Resources configured
Online: [ ldb03 ldb04 ]

root@ldb03:~# crm node status
<nodes>
  <node id="12304" uname="ldb04"/>
  <node id="12303" uname="ldb03"/>
</nodes>


But... after seeing this entry in my logs:
Aug  6 13:26:23 ldb03 cibadmin[2140]:   notice: crm_log_args: Invoked:
cibadmin -M -c -o status --xml-text <node_state id="ldb03" uname="ldb03"
ha="active" in_ccm="false" crmd="offline" join="member" expected="down"
crm-debug-origin="manual_clear" shutdown="0"/>

I noticed that crm node show now reports it as normal(offline):
root@ldb03:~# crm node show
ldb04(12304): normal
ldb03(12303): normal(offline)

The offline state is not present anywhere except this view: it is not in
cib.xml, corosync-quorumtool reports nothing wrong, and a tcpdump shows
multicast traffic from both hosts.
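In case it helps, these are the commands I've been using to compare the two
views (the interface name in the tcpdump line is specific to my hosts):

```shell
# Dump just the status section of the live CIB -- this is where the
# stray node_state entry appears:
cibadmin --query -o status

# Membership and quorum as corosync sees it:
corosync-quorumtool -s

# Confirm multicast traffic is flowing from both hosts
# (interface name is an assumption, adjust for your NIC):
tcpdump -n -i eth0 multicast
```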

I tried (hesitantly) to delete the entry using cibadmin, but I couldn't
quite get the syntax right. Any tips on how to get this node to show as
online and subsequently be able to run resources? Currently, when I run
crm resource move, it has no effect: no errors, and nothing noticeable in
the logfiles either.
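For reference, this is roughly what I was attempting (a sketch only; I'm
not sure that matching on the id alone is the right approach):

```shell
# Remove the manually injected node_state entry from the status
# section, matching on the same id used when it was created:
cibadmin --delete -o status --xml-text '<node_state id="ldb03"/>'
```

Or would something like crm node clearstate ldb03 be the more appropriate
way to reset the node's state?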

Sorry for the long post... I can attach more logs/config if necessary.

Thanks,

Jamie.

