[ClusterLabs] Resources not starting sometimes after node reboot

Ken Gaillot kgaillot at redhat.com
Fri Oct 30 17:46:52 EDT 2015


On 10/29/2015 12:42 PM, Pritam Kharat wrote:
> Hi All,
> 
> I have a single node with 5 resources running on it. When I rebooted the
> node, I sometimes saw the resources in the stopped state even though the
> node came online.
> 
> When I looked into the logs, one difference between the success and failure
> cases is that when
> *Election Trigger (I_DC_TIMEOUT) just popped (20000ms)* occurred, the LRM
> did not start the resources but jumped straight to the monitor action, and
> from then on it did not start the resources at all.
> 
> But in the success case this election timeout did not occur; the first
> action taken by the LRM was to start the resource and then to monitor it,
> so all the resources started properly.
> 
> I have attached both the success and failure logs. Could someone please
> explain the reason for this issue and how to solve it?
> 
> 
> My CRM configuration is -
> 
> root at sc-node-2:~# crm configure show
> node $id="2" sc-node-2
> primitive oc-fw-agent upstart:oc-fw-agent \
> meta allow-migrate="true" migration-threshold="5" failure-timeout="120s" \
> op monitor interval="15s" timeout="60s"
> primitive oc-lb-agent upstart:oc-lb-agent \
> meta allow-migrate="true" migration-threshold="5" failure-timeout="120s" \
> op monitor interval="15s" timeout="60s"
> primitive oc-service-manager upstart:oc-service-manager \
> meta allow-migrate="true" migration-threshold="5" failure-timeout="120s" \
> op monitor interval="15s" timeout="60s"
> primitive oc-vpn-agent upstart:oc-vpn-agent \
> meta allow-migrate="true" migration-threshold="5" failure-timeout="120s" \
> op monitor interval="15s" timeout="60s"
> primitive sc_vip ocf:heartbeat:IPaddr2 \
> params ip="200.10.10.188" cidr_netmask="24" nic="eth1" \
> op monitor interval="15s"
> group sc-resources sc_vip oc-service-manager oc-fw-agent oc-lb-agent oc-vpn-agent
> property $id="cib-bootstrap-options" \
> dc-version="1.1.10-42f2063" \
> cluster-infrastructure="corosync" \
> stonith-enabled="false" \
> cluster-recheck-interval="3min" \
> default-action-timeout="180s"

The attached logs don't go far enough to be sure what happened; all they
show at that point is that in both cases, the cluster correctly probed
all the resources to be sure they weren't already running.
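
If you can reproduce the failure, a fuller capture would help. As a rough
sketch (assuming crm_report is available on your nodes; adjust the times to
cover the reboot), something like:

    crm_report -f "2015-10-29 12:00" -t "2015-10-29 13:00" /tmp/failure-report

would bundle the pacemaker/corosync logs and the CIB for that window.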

The behavior shouldn't be different depending on the election trigger,
but it's hard to say for sure from this info.

With a single-node cluster, you should also set no-quorum-policy=ignore.
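
For example, with the crm shell you're already using, something like:

    crm configure property no-quorum-policy=ignore

should take care of it (you can confirm with "crm configure show").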



