[ClusterLabs] Singleton resource not being migrated
Ken Gaillot
kgaillot at redhat.com
Tue Aug 16 20:29:42 UTC 2016
On 08/05/2016 05:12 PM, Nikita Koshikov wrote:
> Thanks, Ken,
>
> On Fri, Aug 5, 2016 at 7:21 AM, Ken Gaillot <kgaillot at redhat.com> wrote:
>
> On 08/05/2016 03:48 AM, Andreas Kurz wrote:
> > Hi,
> >
> > On Fri, Aug 5, 2016 at 2:08 AM, Nikita Koshikov <koshikov at gmail.com> wrote:
> >
> > Hello list,
> >
> > Can you please help me debug one resource that is not being started
> > after a node failover?
> >
> > Here is the configuration I'm testing:
> > a 3-node cluster (KVM VMs) that has:
> >
> > node 10: aic-controller-58055.test.domain.local
> > node 6: aic-controller-50186.test.domain.local
> > node 9: aic-controller-12993.test.domain.local
> > primitive cmha cmha \
> >     params conffile="/etc/cmha/cmha.conf" daemon="/usr/bin/cmhad" \
> >       pidfile="/var/run/cmha/cmha.pid" user=cmha \
> >     meta failure-timeout=30 resource-stickiness=1 \
> >       target-role=Started migration-threshold=3 \
> >     op monitor interval=10 on-fail=restart timeout=20 \
> >     op start interval=0 on-fail=restart timeout=60 \
> >     op stop interval=0 on-fail=block timeout=90
> >
> >
> > What is the output of crm_mon -1frA once a node is down ... any failed
> > actions?
> >
> >
> > primitive sysinfo_aic-controller-12993.test.domain.local ocf:pacemaker:SysInfo \
> >     params disk_unit=M disks="/ /var/log" min_disk_free=512M \
> >     op monitor interval=15s
> > primitive sysinfo_aic-controller-50186.test.domain.local ocf:pacemaker:SysInfo \
> >     params disk_unit=M disks="/ /var/log" min_disk_free=512M \
> >     op monitor interval=15s
> > primitive sysinfo_aic-controller-58055.test.domain.local ocf:pacemaker:SysInfo \
> >     params disk_unit=M disks="/ /var/log" min_disk_free=512M \
> >     op monitor interval=15s
> >
> >
> > You can use a clone for this sysinfo resource and a symmetric cluster
> > for a more compact configuration ... then you can skip all these
> > location constraints.
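
For example, a minimal sketch of that approach (untested; parameter
values copied from the config above):

    primitive sysinfo ocf:pacemaker:SysInfo \
        params disk_unit=M disks="/ /var/log" min_disk_free=512M \
        op monitor interval=15s
    clone sysinfo-clone sysinfo
    property cib-bootstrap-options: symmetric-cluster=true

A clone instance then runs on every node, and the per-node primitives
and location constraints below become unnecessary.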
> >
> >
> > location cmha-on-aic-controller-12993.test.domain.local cmha \
> >     100: aic-controller-12993.test.domain.local
> > location cmha-on-aic-controller-50186.test.domain.local cmha \
> >     100: aic-controller-50186.test.domain.local
> > location cmha-on-aic-controller-58055.test.domain.local cmha \
> >     100: aic-controller-58055.test.domain.local
> > location sysinfo-on-aic-controller-12993.test.domain.local \
> >     sysinfo_aic-controller-12993.test.domain.local inf: \
> >     aic-controller-12993.test.domain.local
> > location sysinfo-on-aic-controller-50186.test.domain.local \
> >     sysinfo_aic-controller-50186.test.domain.local inf: \
> >     aic-controller-50186.test.domain.local
> > location sysinfo-on-aic-controller-58055.test.domain.local \
> >     sysinfo_aic-controller-58055.test.domain.local inf: \
> >     aic-controller-58055.test.domain.local
> > property cib-bootstrap-options: \
> >     have-watchdog=false \
> >     dc-version=1.1.14-70404b0 \
> >     cluster-infrastructure=corosync \
> >     cluster-recheck-interval=15s \
> >
> >
> > Never tried such a low cluster-recheck-interval ... wouldn't do that.
> > I have seen setups with low intervals burning a lot of CPU cycles in
> > bigger cluster setups, and side-effects from aborted transitions. If
> > you do this to "clean up" the cluster state because you see
> > resource-agent errors, you should rather fix the resource agent.
>
> Strongly agree -- your recheck interval is lower than the various action
> timeouts. The only reason recheck interval should ever be set less than
> about 5 minutes is if you have time-based rules that you want to trigger
> with a finer granularity.
>
> Your issue does not appear to be coming from the recheck interval;
> otherwise it would have gone away after the recheck interval passed.
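
For reference, the property can be set back to something saner with the
crm shell; 15min below is the Pacemaker default:

    crm configure property cluster-recheck-interval=15min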
>
>
> As for the small cluster-recheck-interval - it was only for testing.
>
>
> > Regards,
> > Andreas
> >
> >
> > no-quorum-policy=stop \
> > stonith-enabled=false \
> > start-failure-is-fatal=false \
> > symmetric-cluster=false \
> > node-health-strategy=migrate-on-red \
> > last-lrm-refresh=1470334410
> >
> > When all 3 nodes are online, everything seemed OK; this is the output
> > of showscores.sh:
> > Resource  Score      Node                                    Stickiness  #Fail  Migration-Threshold
> > cmha      -INFINITY  aic-controller-12993.test.domain.local  1           0
> > cmha      101        aic-controller-50186.test.domain.local  1           0
> > cmha      -INFINITY  aic-controller-58055.test.domain.local  1           0
>
> Everything is not OK; cmha has -INFINITY scores on two nodes, meaning it
> won't be allowed to run on them. This is why it won't start after the
> one allowed node goes down, and why cleanup gets it working again
> (cleanup removes bans caused by resource failures).
>
> It's likely the resource previously failed the maximum allowed times
> (migration-threshold=3) on those two nodes.
>
> The next step would be to figure out why the resource is failing. The
> pacemaker logs will show any output from the resource agent.
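
To confirm that, checking the fail counts is a good start; for example
(the node name here is just one taken from your config):

    crm_mon -1frA
    crm resource failcount cmha show aic-controller-50186.test.domain.local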
>
>
> The resource was never started on these nodes. Maybe the problem is in
> the flow? We deploy:
>
> 1) 1 node with all 3 IPs in corosync.conf
> 2) set no-quorum-policy=ignore
> 3) add 2 nodes to the corosync cluster
> 4) create the resource + 1 location constraint
> 5) add the 2 additional location constraints
> 6) set no-quorum-policy=stop
>
> The time between steps 4 and 5 is about 1 minute, and it's clear why 2
> nodes were -INFINITY during that period. But why don't the scores update
> when we add the 2 additional constraints, and can this be changed?
The resource may have never started on those nodes, but are you sure a
start wasn't attempted and failed? If the start failed, the -INFINITY
score would come from the failure, rather than only the cluster being
asymmetric.
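
One way to check is to look for start operations and fail counts in the
live CIB and in the logs, e.g. (the log path varies by distribution):

    cibadmin -Q | grep -e 'fail-count-cmha' -e 'operation="start"'
    grep -e 'cmha.*start' /var/log/syslog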
>
>
>
> > sysinfo_aic-controller-12993.test.domain.local  INFINITY   aic-controller-12993.test.domain.local  0  0
> > sysinfo_aic-controller-50186.test.domain.local  -INFINITY  aic-controller-50186.test.domain.local  0  0
> > sysinfo_aic-controller-58055.test.domain.local  INFINITY   aic-controller-58055.test.domain.local  0  0
> >
> > The problem starts when 1 node goes offline (aic-controller-50186).
> > The resource cmha is stuck in the stopped state.
> > Here is the showscores output:
> > Resource  Score      Node                                    Stickiness  #Fail  Migration-Threshold
> > cmha      -INFINITY  aic-controller-12993.test.domain.local  1           0
> > cmha      -INFINITY  aic-controller-50186.test.domain.local  1           0
> > cmha      -INFINITY  aic-controller-58055.test.domain.local  1           0
> >
> > Even though it has target-role=Started, pacemaker is skipping this
> > resource. And in the logs I see:
> > pengine: info: native_print: cmha (ocf::heartbeat:cmha): Stopped
> > pengine: info: native_color: Resource cmha cannot run anywhere
> > pengine: info: LogActions: Leave cmha (Stopped)
> >
> > To recover the cmha resource I need to run either:
> > 1) crm resource cleanup cmha
> > 2) crm resource reprobe
> >
> > After either of these commands, the resource is picked up by
> > pacemaker and I see valid scores:
> > Resource  Score      Node                                    Stickiness  #Fail  Migration-Threshold
> > cmha      100        aic-controller-58055.test.domain.local  1           0      3
> > cmha      101        aic-controller-12993.test.domain.local  1           0      3
> > cmha      -INFINITY  aic-controller-50186.test.domain.local  1           0      3
> >
> > So the questions here are: why doesn't the cluster recheck work, and
> > should it do reprobing?
> > How do I make migration work, or what did I miss in the
> > configuration that prevents migration?
> >
> > corosync 2.3.4
> > pacemaker 1.1.14