<div dir="ltr">Thanks, Ken,<div><br></div><div class="gmail_extra"><div class="gmail_quote">On Fri, Aug 5, 2016 at 7:21 AM, Ken Gaillot <span dir="ltr"><<a href="mailto:kgaillot@redhat.com" target="_blank">kgaillot@redhat.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><span class="gmail-">On 08/05/2016 03:48 AM, Andreas Kurz wrote:<br>

> Hi,<br>

><br>

> On Fri, Aug 5, 2016 at 2:08 AM, Nikita Koshikov <<a href="mailto:koshikov@gmail.com">koshikov@gmail.com</a><br>

</span><div><div class="gmail-h5">> <mailto:<a href="mailto:koshikov@gmail.com">koshikov@gmail.com</a>>> wrote:<br>

><br>

>     Hello list,<br>

><br>

>     Can you, please, help me in debugging 1 resource not being started<br>

>     after node failover ?<br>

><br>

>     Here is configuration that I'm testing:<br>

>     3 nodes(kvm VM) cluster, that have:<br>

><br>

>     node 10: aic-controller-58055.test.<wbr>domain.local<br>

>     node 6: aic-controller-50186.test.<wbr>domain.local<br>

>     node 9: aic-controller-12993.test.<wbr>domain.local<br>

>     primitive cmha cmha \<br>

>             params conffile="/etc/cmha/cmha.conf"<br>

>     daemon="/usr/bin/cmhad" pidfile="/var/run/cmha/cmha.<wbr>pid" user=cmha \<br>

>             meta failure-timeout=30 resource-stickiness=1<br>

>     target-role=Started migration-threshold=3 \<br>

>             op monitor interval=10 on-fail=restart timeout=20 \<br>

>             op start interval=0 on-fail=restart timeout=60 \<br>

>             op stop interval=0 on-fail=block timeout=90<br>

><br>

><br>

> What is the output of crm_mon -1frA once a node is down ... any failed<br>

> actions?<br>

><br>

><br>

>     primitive sysinfo_aic-controller-12993.<wbr>test.domain.local<br>

>     ocf:pacemaker:SysInfo \<br>

>             params disk_unit=M disks="/ /var/log" min_disk_free=512M \<br>

>             op monitor interval=15s<br>

>     primitive sysinfo_aic-controller-50186.<wbr>test.domain.local<br>

>     ocf:pacemaker:SysInfo \<br>

>             params disk_unit=M disks="/ /var/log" min_disk_free=512M \<br>

>             op monitor interval=15s<br>

>     primitive sysinfo_aic-controller-58055.<wbr>test.domain.local<br>

>     ocf:pacemaker:SysInfo \<br>

>             params disk_unit=M disks="/ /var/log" min_disk_free=512M \<br>

>             op monitor interval=15s<br>

><br>

><br>

> You can use a clone for this sysinfo resource and a symmetric cluster<br>

> for a more compact configuration .... then you can skip all these<br>

> location constraints.<br>

><br>

><br>

>     location cmha-on-aic-controller-12993.<wbr>test.domain.local cmha 100:<br>

>     aic-controller-12993.test.<wbr>domain.local<br>

>     location cmha-on-aic-controller-50186.<wbr>test.domain.local cmha 100:<br>

>     aic-controller-50186.test.<wbr>domain.local<br>

>     location cmha-on-aic-controller-58055.<wbr>test.domain.local cmha 100:<br>

>     aic-controller-58055.test.<wbr>domain.local<br>

>     location sysinfo-on-aic-controller-<wbr>12993.test.domain.local<br>

>     sysinfo_aic-controller-12993.<wbr>test.domain.local inf:<br>

>     aic-controller-12993.test.<wbr>domain.local<br>

>     location sysinfo-on-aic-controller-<wbr>50186.test.domain.local<br>

>     sysinfo_aic-controller-50186.<wbr>test.domain.local inf:<br>

>     aic-controller-50186.test.<wbr>domain.local<br>

>     location sysinfo-on-aic-controller-<wbr>58055.test.domain.local<br>

>     sysinfo_aic-controller-58055.<wbr>test.domain.local inf:<br>

>     aic-controller-58055.test.<wbr>domain.local<br>

>     property cib-bootstrap-options: \<br>

>             have-watchdog=false \<br>

>             dc-version=1.1.14-70404b0 \<br>

>             cluster-infrastructure=<wbr>corosync \<br>

>             cluster-recheck-interval=15s \<br>

><br>

><br>

> Never tried such a low cluster-recheck-interval ... wouldn't do that. I<br>

> saw setups with low intervals burning a lot of cpu cycles in bigger<br>

> cluster setups and side-effects from aborted transitions. If you do this<br>

> to "clean up" the cluster state because you see resource-agent errors,<br>

> you would do better to fix the resource agent itself.<br>

<br>

</div></div>Strongly agree -- your recheck interval is lower than the various action<br>

timeouts. The only reason recheck interval should ever be set less than<br>

about 5 minutes is if you have time-based rules that you want to trigger<br>

with a finer granularity.<br>

<br>

Your issue does not appear to be coming from recheck interval, otherwise<br>

it would go away after the recheck interval passed.<br>
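<br>
To make the comparison concrete, here is a minimal Python sketch. It assumes all values are in seconds, as in the configuration quoted above. The helper name ops_exceeding_recheck is hypothetical and is not part of any Pacemaker API; it only lists the operations whose timeout is longer than the recheck interval.<br>
<pre>
```python
def ops_exceeding_recheck(recheck_interval, op_timeouts):
    """Return the names of operations whose timeout exceeds the recheck interval."""
    return sorted(name for name, timeout in op_timeouts.items()
                  if timeout > recheck_interval)


# Values copied from the cmha primitive and cluster-recheck-interval=15s above.
cmha_timeouts = {"monitor": 20, "start": 60, "stop": 90}
print(ops_exceeding_recheck(15, cmha_timeouts))
```
</pre>
With the 5-minute value mentioned above (300 seconds), the same call returns an empty list for these timeouts.<br>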

<span class="gmail-"><br></span></blockquote><div><br></div><div>As for the small cluster-recheck-interval: it was set that low only for testing.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><span class="gmail-">

> Regards,<br>

> Andreas<br>

><br>

><br>

>             no-quorum-policy=stop \<br>

>             stonith-enabled=false \<br>

>             start-failure-is-fatal=false \<br>

>             symmetric-cluster=false \<br>

>             node-health-strategy=migrate-<wbr>on-red \<br>

>             last-lrm-refresh=1470334410<br>

><br>

>     When all 3 nodes are online, everything seems OK; this is the output of<br>

>     scoreshow.sh:<br>

>     Resource                                                Score<br>

>     Node                                   Stickiness #Fail<br>

>      Migration-Threshold<br>

>     cmha                                                    -INFINITY<br>

>     aic-controller-12993.test.<wbr>domain.local 1          0<br>

>     cmha<br>

>      101 aic-controller-50186.test.<wbr>domain.local 1          0<br>

>     cmha                                                    -INFINITY<br>

<br>

</span>Everything is not OK; cmha has -INFINITY scores on two nodes, meaning it<br>

won't be allowed to run on them. This is why it won't start after the<br>

one allowed node goes down, and why cleanup gets it working again<br>

(cleanup removes bans caused by resource failures).<br>

<br>

It's likely the resource previously failed the maximum allowed times<br>

(migration-threshold=3) on those two nodes.<br>
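<br>
As a rough illustration of how a node can end up at -INFINITY, here is a deliberately simplified Python sketch. It is an assumption-laden model, not Pacemaker's actual scoring code: the function name allowed_score is hypothetical, math.inf merely stands in for Pacemaker's INFINITY score, and stickiness, colocation and node health are ignored.<br>
<pre>
```python
import math

INFINITY = math.inf  # stand-in for Pacemaker's INFINITY score (assumption)


def allowed_score(location_score, fail_count, migration_threshold,
                  symmetric_cluster):
    """Simplified model of a resource's score on one node.

    location_score is None when no location constraint names the node.
    """
    # A migration-threshold of 0 means the threshold is disabled.
    if migration_threshold and fail_count >= migration_threshold:
        return -INFINITY  # banned after too many failures on this node
    if location_score is None:
        # Opt-in cluster (symmetric-cluster=false): nodes without a
        # location constraint are not allowed to run the resource.
        return 0 if symmetric_cluster else -INFINITY
    return location_score


# migration-threshold=3 and symmetric-cluster=false, as in the quoted config.
print(allowed_score(100, 0, 3, False))   # constraint present, no failures
print(allowed_score(100, 3, 3, False))   # threshold reached
print(allowed_score(None, 0, 3, False))  # no constraint on an opt-in cluster
```
</pre>
In this model the last two calls both return -INFINITY, so a ban caused by failures and a missing location constraint look identical in the score output; the fail count column is what tells them apart.<br>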

<br>

The next step would be to figure out why the resource is failing. The<br>

pacemaker logs will show any output from the resource agent.<br></blockquote><div><br></div><div>The resource was never started on these nodes. Maybe the problem is in our deployment flow? We deploy as follows:</div><div><br></div><div>1) 1 node with all 3 IPs in corosync.conf</div><div>2) set no-quorum-policy=ignore</div><div>3) add 2 nodes to the corosync cluster</div><div>4) create the resource + 1 location constraint</div><div>5) add 2 additional constraints</div><div>6) set no-quorum-policy=stop</div><div><br></div><div>The time between steps 4 and 5 is about 1 minute, so it's clear why 2 nodes had -INFINITY scores during that period. But why are the scores not updated when we add the 2 additional constraints, and can this behavior be changed?</div><div><br></div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

<div class="gmail-HOEnZb"><div class="gmail-h5"><br>

>     aic-controller-58055.test.<wbr>domain.local 1          0<br>

>     sysinfo_aic-controller-12993.<wbr>test.domain.local          INFINITY<br>

>      aic-controller-12993.test.<wbr>domain.local 0          0<br>

>     sysinfo_aic-controller-50186.<wbr>test.domain.local          -INFINITY<br>

>     aic-controller-50186.test.<wbr>domain.local 0          0<br>

>     sysinfo_aic-controller-58055.<wbr>test.domain.local          INFINITY<br>

>      aic-controller-58055.test.<wbr>domain.local 0          0<br>

><br>

>     The problem starts when 1 node goes offline (aic-controller-50186).<br>

>     The resource cmha is stuck in the stopped state.<br>

>     Here is the showscores:<br>

>     Resource                                                Score<br>

>     Node                                   Stickiness #Fail<br>

>      Migration-Threshold<br>

>     cmha                                                    -INFINITY<br>

>     aic-controller-12993.test.<wbr>domain.local 1          0<br>

>     cmha                                                    -INFINITY<br>

>     aic-controller-50186.test.<wbr>domain.local 1          0<br>

>     cmha                                                    -INFINITY<br>

>     aic-controller-58055.test.<wbr>domain.local 1          0<br>

><br>

>     Even though it has target-role=Started, pacemaker skips this resource.<br>

>     And in logs I see:<br>

>     pengine:     info: native_print:      cmha    (ocf::heartbeat:cmha):<br>

>      Stopped<br>

>     pengine:     info: native_color:      Resource cmha cannot run anywhere<br>

>     pengine:     info: LogActions:        Leave   cmha    (Stopped)<br>

><br>

>     To recover cmha resource I need to run either:<br>

>     1) crm resource cleanup cmha<br>

>     2) crm resource reprobe<br>

><br>

>     After either of the above commands, the resource is picked up by<br>

>     pacemaker and I see valid scores:<br>

>     Resource                                                Score<br>

>     Node                                   Stickiness #Fail<br>

>      Migration-Threshold<br>

>     cmha                                                    100<br>

>     aic-controller-58055.test.<wbr>domain.local 1          0        3<br>

>     cmha                                                    101<br>

>     aic-controller-12993.test.<wbr>domain.local 1          0        3<br>

>     cmha                                                    -INFINITY<br>

>     aic-controller-50186.test.<wbr>domain.local 1          0        3<br>

><br>

>     So the questions here - why cluster-recheck doesn't work, and should<br>

>     it do reprobing ?<br>

>     How to make migration work or what I missed in configuration that<br>

>     prevents migration?<br>

><br>

>     corosync  2.3.4<br>

>     pacemaker 1.1.14<br>

<br>

______________________________<wbr>_________________<br>

Users mailing list: <a href="mailto:Users@clusterlabs.org">Users@clusterlabs.org</a><br>

<a href="http://clusterlabs.org/mailman/listinfo/users" rel="noreferrer" target="_blank">http://clusterlabs.org/<wbr>mailman/listinfo/users</a><br>

<br>

Project Home: <a href="http://www.clusterlabs.org" rel="noreferrer" target="_blank">http://www.clusterlabs.org</a><br>

Getting started: <a href="http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf" rel="noreferrer" target="_blank">http://www.clusterlabs.org/<wbr>doc/Cluster_from_Scratch.pdf</a><br>

Bugs: <a href="http://bugs.clusterlabs.org" rel="noreferrer" target="_blank">http://bugs.clusterlabs.org</a><br>

</div></div></blockquote></div><br></div></div>