<div dir="ltr">Hi,<div class="gmail_extra"><br><div class="gmail_quote">On Fri, Aug 5, 2016 at 2:08 AM, Nikita Koshikov <span dir="ltr"><<a href="mailto:koshikov@gmail.com" target="_blank">koshikov@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><span style="font-size:13px">Hello list,</span><div style="font-size:13px"><br></div><div style="font-size:13px">Can you please help me debug one resource that is not being started after node failover?</div><div style="font-size:13px"><br></div><div style="font-size:13px">Here is the configuration I'm testing:</div><div style="font-size:13px">a 3-node (KVM VM) cluster with:</div><div style="font-size:13px"><br></div><div style="font-size:13px"><div><div>node 10: aic-controller-58055.test.domain.local</div><div>node 6: aic-controller-50186.test.domain.local</div><div>node 9: aic-controller-12993.test.domain.local</div><div>primitive cmha cmha \</div><div>        params conffile="/etc/cmha/cmha.conf" daemon="/usr/bin/cmhad" pidfile="/var/run/cmha/cmha.pid" user=cmha \</div><div>        meta failure-timeout=30 resource-stickiness=1 target-role=Started migration-threshold=3 \</div><div>        op monitor interval=10 on-fail=restart timeout=20 \</div><div>        op start interval=0 on-fail=restart timeout=60 \</div><div>        op stop interval=0 on-fail=block timeout=90</div></div></div></div></blockquote><div><br></div><div>What is the output of crm_mon -1frA once a node is down ... 
any failed actions?</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div style="font-size:13px"><div><div>primitive sysinfo_aic-controller-12993.test.domain.local ocf:pacemaker:SysInfo \</div><div>        params disk_unit=M disks="/ /var/log" min_disk_free=512M \</div><div>        op monitor interval=15s</div><div>primitive sysinfo_aic-controller-50186.test.domain.local ocf:pacemaker:SysInfo \</div><div>        params disk_unit=M disks="/ /var/log" min_disk_free=512M \</div><div>        op monitor interval=15s</div><div>primitive sysinfo_aic-controller-58055.test.domain.local ocf:pacemaker:SysInfo \</div><div>        params disk_unit=M disks="/ /var/log" min_disk_free=512M \</div><div>        op monitor interval=15s</div></div></div></div></blockquote><div><br></div><div>For a more compact configuration you can use a single clone of this sysinfo resource together with a symmetric cluster ... then you can skip all these location constraints.</div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div style="font-size:13px"><div><br></div><div>location cmha-on-aic-controller-12993.test.domain.local cmha 100: aic-controller-12993.test.domain.local</div><div>location cmha-on-aic-controller-50186.test.domain.local cmha 100: aic-controller-50186.test.domain.local</div><div>location cmha-on-aic-controller-58055.test.domain.local cmha 100: aic-controller-58055.test.domain.local</div><div>location sysinfo-on-aic-controller-12993.test.domain.local sysinfo_aic-controller-12993.test.domain.local inf: aic-controller-12993.test.domain.local</div><div>location sysinfo-on-aic-controller-50186.test.domain.local sysinfo_aic-controller-50186.test.domain.local inf: aic-controller-50186.test.domain.local</div><div>location
sysinfo-on-aic-controller-58055.test.domain.local sysinfo_aic-controller-58055.test.domain.local inf: aic-controller-58055.test.domain.local</div><div>property cib-bootstrap-options: \</div><div>        have-watchdog=false \</div><div>        dc-version=1.1.14-70404b0 \</div><div>        cluster-infrastructure=corosync \</div><div>        cluster-recheck-interval=15s \</div></div></div></blockquote><div><br></div><div>I have never tried such a low cluster-recheck-interval and wouldn't do that. I have seen setups with low intervals burn a lot of CPU cycles in bigger clusters, with side effects from aborted transitions. If you are doing this to "clean up" the cluster state because you see resource-agent errors, you should rather fix the resource agent.</div><div><br></div><div>Regards,</div><div>Andreas</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div style="font-size:13px"><div>        no-quorum-policy=stop \</div><div>        stonith-enabled=false \</div><div>        start-failure-is-fatal=false \</div><div>        symmetric-cluster=false \</div><div>        node-health-strategy=migrate-on-red \</div><div>        last-lrm-refresh=1470334410</div></div><div style="font-size:13px"><br></div><div style="font-size:13px">When all 3 nodes are online, everything looks OK; this is the output of scoreshow.sh:</div><div style="font-size:13px"><div>Resource                                                Score     Node                                   Stickiness #Fail    Migration-Threshold</div><div>cmha                                                    -INFINITY aic-controller-12993.test.domain.local 1          0</div><div>cmha                                                    101       aic-controller-50186.test.domain.local 1          0</div><div>cmha                                                    -INFINITY aic-controller-58055.test.domain.local 1
0</div></div><div style="font-size:13px"><div>sysinfo_aic-controller-12993.test.domain.local          INFINITY  aic-controller-12993.test.domain.local 0          0</div><div>sysinfo_aic-controller-50186.test.domain.local          -INFINITY aic-controller-50186.test.domain.local 0          0</div><div>sysinfo_aic-controller-58055.test.domain.local          INFINITY  aic-controller-58055.test.domain.local 0          0</div></div><div style="font-size:13px"><br></div><div style="font-size:13px">The problem starts when one node (aic-controller-50186) goes offline: the cmha resource is stuck in Stopped state.</div><div style="font-size:13px">Here is the showscores output:</div><div style="font-size:13px"><div>Resource                                                Score     Node                                   Stickiness #Fail    Migration-Threshold</div><div>cmha                                                    -INFINITY aic-controller-12993.test.domain.local 1          0</div><div>cmha                                                    -INFINITY aic-controller-50186.test.domain.local 1          0</div><div>cmha                                                    -INFINITY aic-controller-58055.test.domain.local 1          0</div></div><div style="font-size:13px"><br></div><div style="font-size:13px">Even though it has target-role=Started, pacemaker is skipping this resource. 
And in the logs I see:</div><div style="font-size:13px">pengine:     info: native_print:      cmha    (ocf::heartbeat:cmha):  Stopped<br></div><div style="font-size:13px">pengine:     info: native_color:      Resource cmha cannot run anywhere<br></div><div style="font-size:13px">pengine:     info: LogActions:        Leave   cmha    (Stopped)<br></div><div style="font-size:13px"><br></div><div style="font-size:13px">To recover the cmha resource I need to run either:</div><div style="font-size:13px">1) crm resource cleanup cmha</div><div style="font-size:13px">2) crm resource reprobe</div><div style="font-size:13px"><br></div><div style="font-size:13px">After either of the above commands, the resource is picked up by pacemaker again and I see valid scores:</div><div style="font-size:13px"><div>Resource                                                Score     Node                                   Stickiness #Fail    Migration-Threshold</div><div>cmha                                                    100       aic-controller-58055.test.domain.local 1          0        3</div><div>cmha                                                    101       aic-controller-12993.test.domain.local 1          0        3</div><div>cmha                                                    -INFINITY aic-controller-50186.test.domain.local 1          0        3</div></div><div style="font-size:13px"><br></div><div style="font-size:13px">So the questions here: why doesn't the cluster-recheck work, and should it do reprobing?</div><div style="font-size:13px">How can I make migration work, or what did I miss in the configuration that prevents migration? </div><div style="font-size:13px"><br></div><div style="font-size:13px">corosync  2.3.4<br></div><div style="font-size:13px">pacemaker 1.1.14</div></div>
<br>_______________________________________________<br>
Users mailing list: <a href="mailto:Users@clusterlabs.org">Users@clusterlabs.org</a><br>
<a href="http://clusterlabs.org/mailman/listinfo/users" rel="noreferrer" target="_blank">http://clusterlabs.org/mailman/listinfo/users</a><br>
<br>
Project Home: <a href="http://www.clusterlabs.org" rel="noreferrer" target="_blank">http://www.clusterlabs.org</a><br>
Getting started: <a href="http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf" rel="noreferrer" target="_blank">http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf</a><br>
Bugs: <a href="http://bugs.clusterlabs.org" rel="noreferrer" target="_blank">http://bugs.clusterlabs.org</a><br>
<br></blockquote></div><br></div></div>
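The clone suggestion above could look roughly like this in crm shell syntax. This is an untested sketch, not a drop-in change: the resource and clone names are illustrative, and switching to symmetric-cluster=true also allows other resources (such as cmha) to run anywhere unless you add constraints for them.

```
# Untested sketch: replace the three per-node sysinfo primitives and
# their inf: location constraints with a single cloned resource.
primitive sysinfo ocf:pacemaker:SysInfo \
        params disk_unit=M disks="/ /var/log" min_disk_free=512M \
        op monitor interval=15s
clone cl-sysinfo sysinfo
# In a symmetric cluster the clone may run an instance on every online
# node, so no per-node location constraints are needed.
property cib-bootstrap-options: symmetric-cluster=true
```

With this layout, adding or removing a node no longer requires touching the sysinfo configuration at all.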