<div dir="ltr">Thanks for the reply, Andreas.<div><br></div><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Aug 5, 2016 at 1:48 AM, Andreas Kurz <span dir="ltr"><<a href="mailto:andreas.kurz@gmail.com" target="_blank">andreas.kurz@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div dir="ltr">Hi,<div class="gmail_extra"><br><div class="gmail_quote"><span class="gmail-">On Fri, Aug 5, 2016 at 2:08 AM, Nikita Koshikov <span dir="ltr"><<a href="mailto:koshikov@gmail.com" target="_blank">koshikov@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div dir="ltr"><span style="font-size:13px">Hello list,</span><div style="font-size:13px"><br></div><div style="font-size:13px">Can you please help me debug one resource that is not started after a node failover?</div><div style="font-size:13px"><br></div><div style="font-size:13px">Here is the configuration I'm testing:</div><div style="font-size:13px">a 3-node (KVM VM) cluster with:</div><div style="font-size:13px"><br></div><div style="font-size:13px"><div><div>node 10: aic-controller-58055.test.domain.local</div><div>node 6: aic-controller-50186.test.domain.local</div><div>node 9: aic-controller-12993.test.domain.local</div><div>primitive cmha cmha \</div><div>        params conffile="/etc/cmha/cmha.conf" daemon="/usr/bin/cmhad" pidfile="/var/run/cmha/cmha.pid" user=cmha \</div><div>        meta failure-timeout=30 resource-stickiness=1 target-role=Started migration-threshold=3 \</div><div>        op monitor interval=10 on-fail=restart timeout=20 \</div><div>        op start interval=0 on-fail=restart timeout=60 \</div><div>        op stop interval=0 on-fail=block 
timeout=90</div></div></div></div></blockquote><div><br></div></span><div>What is the output of crm_mon -1frA once a node is down ... any failed actions?</div></div></div></div></blockquote><div><br></div><div>No errors or failed actions. This is a slightly different lab (the names changed), but it shows the same effect:</div><div><br></div><div><div>root@aic-controller-57150:~# crm_mon -1frA</div><div>Last updated: Fri Aug  5 20:14:05 2016          Last change: Fri Aug  5 19:38:34 2016 by root via crm_attribute on aic-controller-44151.test.domain.local</div><div>Stack: corosync</div><div>Current DC: aic-controller-57150.test.domain.local (version 1.1.14-70404b0) - partition with quorum</div><div>3 nodes and 7 resources configured</div><div><br></div><div>Online: [ aic-controller-57150.test.domain.local aic-controller-58381.test.domain.local ]</div><div>OFFLINE: [ aic-controller-44151.test.domain.local ]</div><div><br></div><div>Full list of resources:</div><div><br></div><div> sysinfo_aic-controller-44151.test.domain.local (ocf::pacemaker:SysInfo):       Stopped</div><div> sysinfo_aic-controller-57150.test.domain.local (ocf::pacemaker:SysInfo):       Started aic-controller-57150.test.domain.local</div><div> sysinfo_aic-controller-58381.test.domain.local (ocf::pacemaker:SysInfo):       Started aic-controller-58381.test.domain.local</div><div> Clone Set: clone_p_heat-engine [p_heat-engine]</div><div>     Started: [ aic-controller-57150.test.domain.local aic-controller-58381.test.domain.local ]</div><div> cmha   (ocf::heartbeat:cmha):  Stopped</div><div><br></div><div>Node Attributes:</div><div>* Node aic-controller-57150.test.domain.local:</div><div>    + arch                              : x86_64</div><div>    + cpu_cores                         : 3</div><div>    + cpu_info                          : Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz</div><div>    + cpu_load                          : 1.04</div><div>    + cpu_speed                         : 4994.21</div><div>    + 
free_swap                         : 5150</div><div>    + os                                : Linux-3.13.0-85-generic</div><div>    + ram_free                          : 750</div><div>    + ram_total                         : 5000</div><div>    + root_free                         : 45932</div><div>    + var_log_free                      : 431543</div><div>* Node aic-controller-58381.test.domain.local:</div><div>    + arch                              : x86_64</div><div>    + cpu_cores                         : 3</div><div>    + cpu_info                          : Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz</div><div>    + cpu_load                          : 1.16</div><div>    + cpu_speed                         : 4994.21</div><div>    + free_swap                         : 5150</div><div>    + os                                : Linux-3.13.0-85-generic</div><div>    + ram_free                          : 750</div><div>    + ram_total                         : 5000</div><div>    + root_free                         : 45932</div><div>    + var_log_free                      : 431542</div><div><br></div><div>Migration Summary:</div><div>* Node aic-controller-57150.test.domain.local:</div><div>* Node aic-controller-58381.test.domain.local:</div></div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><span class="gmail-"><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div dir="ltr"><div style="font-size:13px"><div><div>primitive sysinfo_aic-controller-12993.t<wbr>est.domain.local ocf:pacemaker:SysInfo \</div><div>        params disk_unit=M disks="/ /var/log" min_disk_free=512M \</div><div>        op monitor interval=15s</div><div>primitive 
sysinfo_aic-controller-50186.test.domain.local ocf:pacemaker:SysInfo \</div><div>        params disk_unit=M disks="/ /var/log" min_disk_free=512M \</div><div>        op monitor interval=15s</div><div>primitive sysinfo_aic-controller-58055.test.domain.local ocf:pacemaker:SysInfo \</div><div>        params disk_unit=M disks="/ /var/log" min_disk_free=512M \</div><div>        op monitor interval=15s</div></div></div></div></blockquote><div><br></div></span><div>You can use a clone for this sysinfo resource and a symmetric cluster for a more compact configuration; then you can skip all these location constraints.</div><span class="gmail-"><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div dir="ltr"><div style="font-size:13px"><div><br></div><div>location cmha-on-aic-controller-12993.test.domain.local cmha 100: aic-controller-12993.test.domain.local</div><div>location cmha-on-aic-controller-50186.test.domain.local cmha 100: aic-controller-50186.test.domain.local</div><div>location cmha-on-aic-controller-58055.test.domain.local cmha 100: aic-controller-58055.test.domain.local</div><div>location sysinfo-on-aic-controller-12993.test.domain.local sysinfo_aic-controller-12993.test.domain.local inf: aic-controller-12993.test.domain.local</div><div>location sysinfo-on-aic-controller-50186.test.domain.local sysinfo_aic-controller-50186.test.domain.local inf: aic-controller-50186.test.domain.local</div><div>location sysinfo-on-aic-controller-58055.test.domain.local sysinfo_aic-controller-58055.test.domain.local inf: aic-controller-58055.test.domain.local</div><div>property cib-bootstrap-options: \</div><div>        have-watchdog=false \</div><div>        dc-version=1.1.14-70404b0 \</div><div>        cluster-infrastructure=corosync \</div><div>        
cluster-recheck-interval=15s \</div></div></div></blockquote><div><br></div></span><div>I've never tried such a low cluster-recheck-interval and wouldn't do that. I have seen setups with low intervals burning a lot of CPU cycles in bigger clusters, with side effects from aborted transitions. If you do this to "clean up" the cluster state because you see resource-agent errors, you should rather fix the resource agent.</div></div></div></div></blockquote><div><br></div><div>This small interval is a result of debugging the cmha resource issue. In general the cluster uses 190s, and since 15s didn't help, it will be rolled back.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><div><br></div><div>Regards,</div><div>Andreas</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div><div class="gmail-h5"><div dir="ltr"><div style="font-size:13px"><div>        no-quorum-policy=stop \</div><div>        stonith-enabled=false \</div><div>        start-failure-is-fatal=false \</div><div>        symmetric-cluster=false \</div><div>        node-health-strategy=migrate-on-red \</div><div>        last-lrm-refresh=1470334410</div></div><div style="font-size:13px"><br></div><div style="font-size:13px">When all 3 nodes are online, everything seems OK; this is the output of scoreshow.sh:</div><div style="font-size:13px"><div>Resource                                                Score     Node                                   Stickiness #Fail    Migration-Threshold</div><div>cmha                                                    -INFINITY aic-controller-12993.test.domain.local 1          0</div><div>cmha                                                    101       
aic-controller-50186.test.domain.local 1          0</div><div>cmha                                                    -INFINITY aic-controller-58055.test.domain.local 1          0</div></div><div style="font-size:13px"><div>sysinfo_aic-controller-12993.test.domain.local          INFINITY  aic-controller-12993.test.domain.local 0          0</div><div>sysinfo_aic-controller-50186.test.domain.local          -INFINITY aic-controller-50186.test.domain.local 0          0</div><div>sysinfo_aic-controller-58055.test.domain.local          INFINITY  aic-controller-58055.test.domain.local 0          0</div></div><div style="font-size:13px"><br></div><div style="font-size:13px">The problem starts when one node goes offline (aic-controller-50186). The cmha resource is stuck in Stopped state.</div><div style="font-size:13px">Here is the showscores:</div><div style="font-size:13px"><div>Resource                                                Score     Node                                   Stickiness #Fail    Migration-Threshold</div><div>cmha                                                    -INFINITY aic-controller-12993.test.domain.local 1          0</div><div>cmha                                                    -INFINITY aic-controller-50186.test.domain.local 1          0</div><div>cmha                                                    -INFINITY aic-controller-58055.test.domain.local 1          0</div></div><div style="font-size:13px"><br></div><div style="font-size:13px">Even though it has target-role=Started, pacemaker skips this resource. 
And in the logs I see:</div><div style="font-size:13px">pengine:     info: native_print:      cmha    (ocf::heartbeat:cmha):  Stopped<br></div><div style="font-size:13px">pengine:     info: native_color:      Resource cmha cannot run anywhere<br></div><div style="font-size:13px">pengine:     info: LogActions:        Leave   cmha    (Stopped)<br></div><div style="font-size:13px"><br></div><div style="font-size:13px">To recover the cmha resource I need to run one of:</div><div style="font-size:13px">1) crm resource cleanup cmha</div><div style="font-size:13px">2) crm resource reprobe</div><div style="font-size:13px"><br></div><div style="font-size:13px">After either of the above commands, the resource is picked up by pacemaker again and I see valid scores:</div><div style="font-size:13px"><div>Resource                                                Score     Node                                   Stickiness #Fail    Migration-Threshold</div><div>cmha                                                    100       aic-controller-58055.test.domain.local 1          0        3</div><div>cmha                                                    101       aic-controller-12993.test.domain.local 1          0        3</div><div>cmha                                                    -INFINITY aic-controller-50186.test.domain.local 1          0        3</div></div><div style="font-size:13px"><br></div><div style="font-size:13px">So the questions here are: why doesn't cluster-recheck work, and should it do reprobing?</div><div style="font-size:13px">How can I make migration work, or what did I miss in the configuration that prevents it?</div><div style="font-size:13px"><br></div><div style="font-size:13px">corosync  2.3.4<br></div><div style="font-size:13px">pacemaker 1.1.14</div></div>
<br></div></div>______________________________<wbr>_________________<br>
Users mailing list: <a href="mailto:Users@clusterlabs.org" target="_blank">Users@clusterlabs.org</a><br>
<a href="http://clusterlabs.org/mailman/listinfo/users" rel="noreferrer" target="_blank">http://clusterlabs.org/mailman<wbr>/listinfo/users</a><br>
<br>
Project Home: <a href="http://www.clusterlabs.org" rel="noreferrer" target="_blank">http://www.clusterlabs.org</a><br>
Getting started: <a href="http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf" rel="noreferrer" target="_blank">http://www.clusterlabs.org/doc<wbr>/Cluster_from_Scratch.pdf</a><br>
Bugs: <a href="http://bugs.clusterlabs.org" rel="noreferrer" target="_blank">http://bugs.clusterlabs.org</a><br>
<br></blockquote></div><br></div></div>
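<div><br></div><div>[Editor's note] Andreas's clone suggestion in the quoted message could look roughly like this in crmsh. This is a hypothetical, untested sketch: the names p_sysinfo and clone_sysinfo are illustrative, and the params are copied from the per-node primitives above.</div>

```
# Hypothetical sketch: one cloned SysInfo resource on a symmetric cluster
# replaces the three per-node primitives and their inf: location constraints.
primitive p_sysinfo ocf:pacemaker:SysInfo \
        params disk_unit=M disks="/ /var/log" min_disk_free=512M \
        op monitor interval=15s
clone clone_sysinfo p_sysinfo
# With symmetric-cluster=true, resources may run on every node by default,
# so no location constraints are needed for the clone.
property cib-bootstrap-options: symmetric-cluster=true
```

<div>Note that the cmha location constraints (score 100) would still be needed if cmha should prefer particular nodes.</div>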
<br></blockquote></div><br></div></div>
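<div><br></div><div>[Editor's note] For reference, the recovery commands Nikita lists above, plus a read-only way to inspect the allocation scores. The crm_simulate options are standard in Pacemaker 1.1, but treat this as a sketch to be verified against your installed versions.</div>

```
# Clear cmha's operation history and fail counts, or re-probe all resources:
crm resource cleanup cmha
crm resource reprobe

# Show the allocation scores the policy engine computes, without changing
# anything (-s = show allocation scores, -L = use the live cluster state):
crm_simulate -sL
```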