<div dir="ltr">Just to add a possibly helpful observation: either the "cib" or the "pengine" process goes to ~100% CPU when these remote node errors happen.<br></div><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Sep 27, 2016 at 2:36 PM, Radoslaw Garbacz <span dir="ltr"><<a href="mailto:radoslaw.garbacz@xtremedatainc.com" target="_blank">radoslaw.garbacz@xtremedatainc.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div><div><div>Hi,<br><br></div>I encountered the same problem with pacemaker built from GitHub around August 22.<br><br></div>Remote nodes occasionally go offline and stay offline; their logs show the same errors. The cluster is on AWS EC2 instances; the network works and is an unlikely cause.<br><br></div><div>Have there been any commits on GitHub recently (after August 22) addressing this issue?<br><br><br></div>Logs:<br>[...]<br>Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:    error: crm_abort:        crm_remote_header: Triggered assert at remote.c:119 : endian == ENDIAN_LOCAL<br>Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:    error: crm_remote_header:        Invalid message detected, endian mismatch: badadbbd is neither 63646330 nor the swab'd 30636463<br>Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:    error: crm_abort:        crm_remote_header: Triggered assert at remote.c:119 : endian == ENDIAN_LOCAL<br>Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:    error: crm_remote_header:        Invalid message detected, endian mismatch: badadbbd is neither 63646330 nor the swab'd 30636463<br>Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:    error: crm_abort:        crm_remote_header: Triggered assert at remote.c:119 : endian == ENDIAN_LOCAL<br>Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:    error: crm_remote_header:        Invalid message detected, endian mismatch: badadbbd is 
neither 63646330 nor the swab'd 30636463<br>Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:     info: lrmd_remote_client_msg:   Client disconnect detected in tls msg dispatcher.<br>Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:     info: ipc_proxy_remove_provider:    <wbr>    ipc proxy connection for client ca8df213-6da7-4c42-8cb3-<wbr>b8bc0887f2ce pid 21815 destroyed because cluster node disconnected.<br>Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:     info: cancel_recurring_action:  Cancelling ocf operation monitor_all_monitor_191000<br>Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:    error: crm_send_tls:     Connection terminated rc = -53<br>Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:    error: crm_send_tls:     Connection terminated rc = -10<br>Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:    error: crm_remote_send:  Failed to send remote msg, rc = -10<br>Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:    error: lrmd_tls_send_msg:        Failed to send remote lrmd tls msg, rc = -10<br>Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:  warning: send_client_notify:       Notification of client remote-lrmd-ip-10-237-223-67:<wbr>3121/b6034d3a-e296-492f-b296-<wbr>725735d17e22 failed<br>Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:   notice: lrmd_remote_client_destroy:   <wbr>    LRMD client disconnecting remote client - name: remote-lrmd-ip-10-237-223-67:<wbr>3121 id: b6034d3a-e296-492f-b296-<wbr>725735d17e22<br>Sep 27 17:19:35 [19626] ip-10-237-223-67 pacemaker_remoted:    error: ipc_proxy_accept: No ipc providers available for uid 0 gid 0<br>Sep 27 17:19:35 [19626] ip-10-237-223-67 pacemaker_remoted:    error: handle_new_connection:    Error in connection setup (19626-21815-14): Remote I/O error (121)<br>Sep 27 17:19:50 [19626] ip-10-237-223-67 pacemaker_remoted:    error: ipc_proxy_accept: No ipc providers available for uid 0 gid 0<br>Sep 27 
17:19:50 [19626] ip-10-237-223-67 pacemaker_remoted:    error: handle_new_connection:    Error in connection setup (19626-21815-14): Remote I/O error (121)<br>[...]<br><div><br><br><br></div></div><div class="gmail_extra"><div><div class="h5"><br><div class="gmail_quote">On Thu, Jun 9, 2016 at 12:24 AM, Narayanamoorthy Srinivasan <span dir="ltr"><<a href="mailto:narayanamoorthys@gmail.com" target="_blank">narayanamoorthys@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Don't see any issues in network traffic.<div><br></div><div>Some more logs where the XML tags are incomplete:</div><div><br></div><div><div>2016-06-09T03:06:03.096449+05:<wbr>30 d18-fb-7b-18-f1-8e pacemaker_remoted[6153]:    error: Partial                       <lrm_rsc_op id="fs-postgresql_last_0" operation_key="fs-postgresql_s<wbr>top_0" operation="stop" crm-debug-origin="do_update_re<wbr>source" crm_feature_set="3.0.10" transition-key="225:116:0:8fbf<wbr>83fd-241b-4623-8bbe-31d92e4dfc<wbr>e1" transition-magic="0:0;225:116:<wbr>0:8fbf83fd-241b-4623-8bbe-31d9<wbr>2e4dfce1" on_node="d00-50-56-94-24-dd" call-id="489" rc-code="0" op-status="0" interval="0" last-run="1459491026" last-rc-change="1459491026" exec-time="158" queue-time="0" op-digest="dfb0c861</div><div>2016-06-09T03:06:03.097136+05:<wbr>30 d18-fb-7b-18-f1-8e pacemaker_remoted[6153]:    error: Partial                       <lrm_rsc_op id="fs-postgresql_last_failure<wbr>_0" operation_key="fs-postgresql_m<wbr>onitor_0" operation="monitor" crm-debug-origin="do_update_re<wbr>source" crm_feature_set="3.0.10" transition-key="41:4:7:8fbf83f<wbr>d-241b-4623-8bbe-31d92e4dfce1" transition-magic="0:0;41:4:7:8<wbr>fbf83fd-241b-4623-8bbe-31d92e4<wbr>dfce1" on_node="d00-50-56-94-24-dd" call-id="5" rc-code="0" op-status="0" interval="0" last-run="1459429072" last-rc-change="1459429072" exec-time="315" queue-time="0" 
op-digest="df</div><div>2016-06-09T03:06:03.097361+05:<wbr>30 d18-fb-7b-18-f1-8e pacemaker_remoted[6153]:    error: Partial                       <lrm_rsc_op id="fs-postgresql_monitor_1000<wbr>0" operation_key="fs-postgresql_m<wbr>onitor_10000" operation="monitor" crm-debug-origin="do_update_re<wbr>source" crm_feature_set="3.0.10" transition-key="224:107:0:8fbf<wbr>83fd-241b-4623-8bbe-31d92e4dfc<wbr>e1" transition-magic="0:0;224:107:<wbr>0:8fbf83fd-241b-4623-8bbe-31d9<wbr>2e4dfce1" on_node="d00-50-56-94-24-dd" call-id="365" rc-code="0" op-status="0" interval="10000" last-rc-change="1459490849" exec-time="185" queue-time="0" op-digest="cd8d3642c</div><div>2016-06-09T03:06:03.097582+05:<wbr>30 d18-fb-7b-18-f1-8e pacemaker_remoted[6153]:    error: Partial                     </lrm_resource></div><div>2016-06-09T03:06:03.097690+05:<wbr>30 d18-fb-7b-18-f1-8e pacemaker_remoted[6153]:    error: Partial                     <lrm_resource id="vip-admin-database-default<wbr>-proposal-controller" type="IPaddr2" class="ocf" provider="heartbeat"></div><div>2016-06-09T03:06:03.097797+05:<wbr>30 d18-fb-7b-18-f1-8e pacemaker_remoted[6153]:    error: Partial                       <lrm_rsc_op id="vip-admin-database-default<wbr>-proposal-controller_last_0" operation_key="vip-admin-datab<wbr>ase-default-proposal-controlle<wbr>r_stop_0" operation="stop" crm-debug-origin="do_update_re<wbr>source" crm_feature_set="3.0.10" transition-key="228:116:0:8fbf<wbr>83fd-241b-4623-8bbe-31d92e4dfc<wbr>e1" transition-magic="0:0;228:116:<wbr>0:8fbf83fd-241b-4623-8bbe-31d9<wbr>2e4dfce1" on_node="d00-50-56-94-24-dd" call-id="487" rc-code="0" op-status="0" interval="0" last-run="1459491026" last-rc-chan</div><div>2016-06-09T03:06:03.098013+05:<wbr>30 d18-fb-7b-18-f1-8e pacemaker_remoted[6153]:    error: Partial                       <lrm_rsc_op id="vip-admin-database-default<wbr>-proposal-controller_monitor_<wbr>10000" operation_key="vip-admin-datab<wbr>ase-default-proposal-controlle<wbr>r_monitor_10000" 
operation="monitor" crm-debug-origin="do_update_re<wbr>source" crm_feature_set="3.0.10" transition-key="227:107:0:8fbf<wbr>83fd-241b-4623-8bbe-31d92e4dfc<wbr>e1" transition-magic="0:0;227:107:<wbr>0:8fbf83fd-241b-4623-8bbe-31d9<wbr>2e4dfce1" on_node="d00-50-56-94-24-dd" call-id="369" rc-code="0" op-status="0" interval="10000" last-rc-chang</div><div>2016-06-09T03:06:03.098230+05:<wbr>30 d18-fb-7b-18-f1-8e pacemaker_remoted[6153]:    error: Partial                     </lrm_resource></div><div>2016-06-09T03:06:03.098337+05:<wbr>30 d18-fb-7b-18-f1-8e pacemaker_remoted[6153]:    error: Partial                     <lrm_resource id="postgresql" type="pgsql" class="ocf" provider="heartbeat"></div><div>2016-06-09T03:06:03.098468+05:<wbr>30 d18-fb-7b-18-f1-8e pacemaker_remoted[6153]:    error: Partial                       <lrm_rsc_op id="postgresql_last_0" operation_key="postgresql_stop<wbr>_0" operation="stop" crm-debug-origin="do_update_re<wbr>source" crm_feature_set="3.0.10" transition-key="231:116:0:8fbf<wbr>83fd-241b-4623-8bbe-31d92e4dfc<wbr>e1" transition-magic="0:0;231:116:<wbr>0:8fbf83fd-241b-4623-8bbe-31d9<wbr>2e4dfce1" on_node="d00-50-56-94-24-dd" call-id="481" rc-code="0" op-status="0" interval="0" last-run="1459491025" last-rc-change="1459491025" exec-time="1334" queue-time="0" op-digest="f2317cad3d54c</div><div>2016-06-09T03:06:03.099061+05:<wbr>30 d18-fb-7b-18-f1-8e pacemaker_remoted[6153]:    error: Partial                       <lrm_rsc_op id="postgresql_monitor_10000" operation_key="postgresql_moni<wbr>tor_10000" operation="monitor" crm-debug-origin="do_update_re<wbr>source" crm_feature_set="3.0.10" transition-key="230:107:0:8fbf<wbr>83fd-241b-4623-8bbe-31d92e4dfc<wbr>e1" transition-magic="0:0;230:107:<wbr>0:8fbf83fd-241b-4623-8bbe-31d9<wbr>2e4dfce1" on_node="d00-50-56-94-24-dd" call-id="372" rc-code="0" op-status="0" interval="10000" last-rc-change="1459490852" exec-time="424" queue-time="0" 
op-digest="873ed4f07792aa8</div></div><div><br></div></div><div><div><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Jun 8, 2016 at 10:28 PM, Narayanamoorthy Srinivasan <span dir="ltr"><<a href="mailto:narayanamoorthys@gmail.com" target="_blank">narayanamoorthys@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div><div>No recent network changes. I will check for abnormal traffic using Wireshark.<br><br></div>I also notice that the XML lines in the logs are partial (no ending '>', no closing ", and sometimes partial words). Any line longer than 472 characters is truncated to 472 characters. I wonder whether this is due to some other limitation.<br><br></div>I can post some lines tomorrow when I am back at work.<br><br></div><div><div><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Jun 8, 2016 at 8:00 PM, Ken Gaillot <span dir="ltr"><<a href="mailto:kgaillot@redhat.com" target="_blank">kgaillot@redhat.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div><div>On 06/08/2016 06:14 AM, Narayanamoorthy Srinivasan wrote:<br>
> I have a pacemaker cluster with two pacemaker remote nodes. Recently the<br>
> remote nodes started throwing the errors below and SBD started self-fencing.<br>
> I would appreciate it if someone could shed light on the issue and the fix.<br>
><br>
> OS - SLES 12 SP1<br>
> Pacemaker Remote version - pacemaker-remote-1.1.13-14.7.x<wbr>86_64<br>
><br>
> 2016-06-08T14:11:46.009073+05:<wbr>30 d18-fb-7b-18-f1-8e<br>
> pacemaker_remoted[6190]:    error: XML Error: Entity: line 1: parser<br>
> error : AttValue: ' expected<br>
> 2016-06-08T14:11:46.009314+05:<wbr>30 d18-fb-7b-18-f1-8e<br>
> pacemaker_remoted[6190]:    error: XML Error:<br>
> key="neutron-ha-tool_monitor_0<wbr>" operation="monitor"<br>
> crm-debug-origin="do_update_<br>
> 2016-06-08T14:11:46.009443+05:<wbr>30 d18-fb-7b-18-f1-8e<br>
> pacemaker_remoted[6190]:    error: XML Error:<br>
>                                                      ^<br>
> 2016-06-08T14:11:46.009567+05:<wbr>30 d18-fb-7b-18-f1-8e<br>
> pacemaker_remoted[6190]:    error: XML Error: Entity: line 1: parser<br>
> error : attributes construct error<br>
> 2016-06-08T14:11:46.009697+05:<wbr>30 d18-fb-7b-18-f1-8e<br>
> pacemaker_remoted[6190]:    error: XML Error:<br>
> key="neutron-ha-tool_monitor_0<wbr>" operation="monitor"<br>
> crm-debug-origin="do_update_<br>
> 2016-06-08T14:11:46.009824+05:<wbr>30 d18-fb-7b-18-f1-8e<br>
> pacemaker_remoted[6190]:    error: XML Error:<br>
>                                                      ^<br>
> 2016-06-08T14:11:46.009948+05:<wbr>30 d18-fb-7b-18-f1-8e<br>
> pacemaker_remoted[6190]:    error: XML Error: Entity: line 1: parser<br>
> error : Couldn't find end of Start Tag lrm_rsc_op line 1<br>
> 2016-06-08T14:11:46.010070+05:<wbr>30 d18-fb-7b-18-f1-8e<br>
> pacemaker_remoted[6190]:    error: XML Error:<br>
> key="neutron-ha-tool_monitor_0<wbr>" operation="monitor"<br>
> crm-debug-origin="do_update_<br>
> 2016-06-08T14:11:46.010191+05:<wbr>30 d18-fb-7b-18-f1-8e<br>
> pacemaker_remoted[6190]:    error: XML Error:<br>
>                                                      ^<br>
> 2016-06-08T14:11:46.010460+05:<wbr>30 d18-fb-7b-18-f1-8e<br>
> pacemaker_remoted[6190]:    error: XML Error: Entity: line 1: parser<br>
> error : Premature end of data in tag lrm_resource line 1<br>
> 2016-06-08T14:11:46.010718+05:<wbr>30 d18-fb-7b-18-f1-8e<br>
> pacemaker_remoted[6190]:    error: XML Error:<br>
> key="neutron-ha-tool_monitor_0<wbr>" operation="monitor"<br>
> crm-debug-origin="do_update_<br>
> 2016-06-08T14:11:46.010977+05:<wbr>30 d18-fb-7b-18-f1-8e<br>
> pacemaker_remoted[6190]:    error: XML Error:<br>
>                                                      ^<br>
> 2016-06-08T14:11:46.011234+05:<wbr>30 d18-fb-7b-18-f1-8e<br>
> pacemaker_remoted[6190]:    error: XML Error: Entity: line 1: parser<br>
> error : Premature end of data in tag lrm_resources line 1<br>
><br>
><br>
> --<br>
> Thanks & Regards<br>
> Moorthy<br>
<br>
</div></div>This sounds like the network traffic between the cluster nodes and the<br>
remote nodes is being corrupted. Have there been any network changes<br>
lately? Switch/firewall/etc. equipment/settings? MTU?<br>
<br>
You could try using a packet sniffer such as wireshark to see if the<br>
traffic looks abnormal in some way. The payload is XML so it should be<br>
more or less readable.<br>
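<br>
As a rough illustration of checking whether a captured payload is intact XML, here is a minimal Python sketch. The function name and the two sample strings are hypothetical, shortened from the log excerpts in this thread; it is not how pacemaker_remoted itself validates messages.<br>

```python
import xml.etree.ElementTree as ET


def is_well_formed(payload: str) -> bool:
    """Return True if the payload parses as a complete XML document."""
    try:
        ET.fromstring(payload)
    except ET.ParseError:
        return False
    return True


# Hypothetical samples: one complete element, and one cut off mid-attribute
# in the same way as the "Partial" lines reported in the logs.
complete = '<lrm_rsc_op id="fs-postgresql_last_0" operation="stop" rc-code="0"/>'
truncated = '<lrm_rsc_op id="fs-postgresql_last_0" operation="stop" op-digest="dfb0c861'

for name, sample in (("complete", complete), ("truncated", truncated)):
    print(name, is_well_formed(sample))
```

A payload that fails this check in a capture taken on the sending side would point away from the network; one that fails only on the receiving side would point toward it.<br>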
<br>
<br>
______________________________<wbr>_________________<br>
Users mailing list: <a href="mailto:Users@clusterlabs.org" target="_blank">Users@clusterlabs.org</a><br>
<a href="http://clusterlabs.org/mailman/listinfo/users" rel="noreferrer" target="_blank">http://clusterlabs.org/mailman<wbr>/listinfo/users</a><br>
<br>
Project Home: <a href="http://www.clusterlabs.org" rel="noreferrer" target="_blank">http://www.clusterlabs.org</a><br>
Getting started: <a href="http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf" rel="noreferrer" target="_blank">http://www.clusterlabs.org/doc<wbr>/Cluster_from_Scratch.pdf</a><br>
Bugs: <a href="http://bugs.clusterlabs.org" rel="noreferrer" target="_blank">http://bugs.clusterlabs.org</a><br>
</blockquote></div><br><br clear="all"><br>-- <br><div data-smartmail="gmail_signature">Thanks & Regards<br>Moorthy</div>
</div>
</div></div></blockquote></div><br><br clear="all"><div><br></div>-- <br><div data-smartmail="gmail_signature">Thanks & Regards<br>Moorthy</div>
</div>
</div></div><br>
<br></blockquote></div><br><br clear="all"><br>-- <br></div></div><div data-smartmail="gmail_signature"><div dir="ltr"><div>Best Regards,<br><br>Radoslaw Garbacz<br></div>XtremeData Incorporation<br></div></div>
</div>
</blockquote></div><br><br clear="all"><br>-- <br><div class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr"><div>Best Regards,<br><br>Radoslaw Garbacz<br></div>XtremeData Incorporation<br></div></div>
</div>
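<br>
For readers puzzling over the "endian mismatch" lines in the logs above: the message reports that the marker read from the header (badadbbd) matches neither the expected value (63646330) nor its byte-swapped form (30636463). The Python sketch below reproduces only that comparison. The function names are hypothetical, and this is not the code in remote.c; the only values taken from the thread are the three hexadecimal numbers.<br>

```python
import struct

# Expected marker, as printed in the log message.
ENDIAN_LOCAL = 0x63646330


def swab32(value: int) -> int:
    """Reverse the byte order of a 32-bit unsigned integer."""
    return struct.unpack("<I", struct.pack(">I", value))[0]


def classify_endian_marker(marker: int) -> str:
    """Classify a header marker as local, byte-swapped, or invalid."""
    if marker == ENDIAN_LOCAL:
        return "local"
    if marker == swab32(ENDIAN_LOCAL):
        return "swapped"
    return "invalid"


print(hex(swab32(ENDIAN_LOCAL)))
print(classify_endian_marker(0xBADADBBD))
```

A marker that is neither value suggests the receiver was not reading a real message header at that point, which is consistent with a corrupted or misaligned stream rather than a genuine byte-order difference between hosts.<br>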