[ClusterLabs] pacemaker_remoted XML parse error

Radoslaw Garbacz radoslaw.garbacz at xtremedatainc.com
Wed Sep 28 17:30:19 CEST 2016


Just to add maybe a helpful observation: either "cib" or "pengine" process
goes to ~100% CPU when this remote nodes errors happen.

On Tue, Sep 27, 2016 at 2:36 PM, Radoslaw Garbacz <
radoslaw.garbacz at xtremedatainc.com> wrote:

> Hi,
>
> I encountered the same problem with pacemaker built from github at around
> August 22.
>
> Remote nodes go offline occasionally and stay so, their logs show same
> errors. The cluster is on AWS ec2 instances, the network works and is an
> unlikely reason.
>
> Have there be any commits on github recently (after August 22) addressing
> this issue?
>
>
> Logs:
> [...]
> Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:    error:
> crm_abort:        crm_remote_header: Triggered assert at remote.c:119 :
> endian == ENDIAN_LOCAL
> Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:    error:
> crm_remote_header:        Invalid message detected, endian mismatch:
> badadbbd is neither 63646330 nor the swab'd 30636463
> Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:    error:
> crm_abort:        crm_remote_header: Triggered assert at remote.c:119 :
> endian == ENDIAN_LOCAL
> Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:    error:
> crm_remote_header:        Invalid message detected, endian mismatch:
> badadbbd is neither 63646330 nor the swab'd 30636463
> Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:    error:
> crm_abort:        crm_remote_header: Triggered assert at remote.c:119 :
> endian == ENDIAN_LOCAL
> Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:    error:
> crm_remote_header:        Invalid message detected, endian mismatch:
> badadbbd is neither 63646330 nor the swab'd 30636463
> Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:     info:
> lrmd_remote_client_msg:   Client disconnect detected in tls msg dispatcher.
> Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:     info:
> ipc_proxy_remove_provider:        ipc proxy connection for client
> ca8df213-6da7-4c42-8cb3-b8bc0887f2ce pid 21815 destroyed because cluster
> node disconnected.
> Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:     info:
> cancel_recurring_action:  Cancelling ocf operation
> monitor_all_monitor_191000
> Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:    error:
> crm_send_tls:     Connection terminated rc = -53
> Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:    error:
> crm_send_tls:     Connection terminated rc = -10
> Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:    error:
> crm_remote_send:  Failed to send remote msg, rc = -10
> Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:    error:
> lrmd_tls_send_msg:        Failed to send remote lrmd tls msg, rc = -10
> Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:  warning:
> send_client_notify:       Notification of client
> remote-lrmd-ip-10-237-223-67:3121/b6034d3a-e296-492f-b296-725735d17e22
> failed
> Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:   notice:
> lrmd_remote_client_destroy:       LRMD client disconnecting remote client
> - name: remote-lrmd-ip-10-237-223-67:3121 id: b6034d3a-e296-492f-b296-
> 725735d17e22
> Sep 27 17:19:35 [19626] ip-10-237-223-67 pacemaker_remoted:    error:
> ipc_proxy_accept: No ipc providers available for uid 0 gid 0
> Sep 27 17:19:35 [19626] ip-10-237-223-67 pacemaker_remoted:    error:
> handle_new_connection:    Error in connection setup (19626-21815-14):
> Remote I/O error (121)
> Sep 27 17:19:50 [19626] ip-10-237-223-67 pacemaker_remoted:    error:
> ipc_proxy_accept: No ipc providers available for uid 0 gid 0
> Sep 27 17:19:50 [19626] ip-10-237-223-67 pacemaker_remoted:    error:
> handle_new_connection:    Error in connection setup (19626-21815-14):
> Remote I/O error (121)
> [...]
>
>
>
>
> On Thu, Jun 9, 2016 at 12:24 AM, Narayanamoorthy Srinivasan <
> narayanamoorthys at gmail.com> wrote:
>
>> Don't see any issues in network traffic.
>>
>> Some more logs where the XML tags are incomplete:
>>
>> 2016-06-09T03:06:03.096449+05:30 d18-fb-7b-18-f1-8e
>> pacemaker_remoted[6153]:    error: Partial
>> <lrm_rsc_op id="fs-postgresql_last_0" operation_key="fs-postgresql_stop_0"
>> operation="stop" crm-debug-origin="do_update_resource"
>> crm_feature_set="3.0.10" transition-key="225:116:0:8fbf
>> 83fd-241b-4623-8bbe-31d92e4dfce1" transition-magic="0:0;225:116:
>> 0:8fbf83fd-241b-4623-8bbe-31d92e4dfce1" on_node="d00-50-56-94-24-dd"
>> call-id="489" rc-code="0" op-status="0" interval="0" last-run="1459491026"
>> last-rc-change="1459491026" exec-time="158" queue-time="0"
>> op-digest="dfb0c861
>> 2016-06-09T03:06:03.097136+05:30 d18-fb-7b-18-f1-8e
>> pacemaker_remoted[6153]:    error: Partial
>> <lrm_rsc_op id="fs-postgresql_last_failure_0"
>> operation_key="fs-postgresql_monitor_0" operation="monitor"
>> crm-debug-origin="do_update_resource" crm_feature_set="3.0.10"
>> transition-key="41:4:7:8fbf83fd-241b-4623-8bbe-31d92e4dfce1"
>> transition-magic="0:0;41:4:7:8fbf83fd-241b-4623-8bbe-31d92e4dfce1"
>> on_node="d00-50-56-94-24-dd" call-id="5" rc-code="0" op-status="0"
>> interval="0" last-run="1459429072" last-rc-change="1459429072"
>> exec-time="315" queue-time="0" op-digest="df
>> 2016-06-09T03:06:03.097361+05:30 d18-fb-7b-18-f1-8e
>> pacemaker_remoted[6153]:    error: Partial
>> <lrm_rsc_op id="fs-postgresql_monitor_10000"
>> operation_key="fs-postgresql_monitor_10000" operation="monitor"
>> crm-debug-origin="do_update_resource" crm_feature_set="3.0.10"
>> transition-key="224:107:0:8fbf83fd-241b-4623-8bbe-31d92e4dfce1"
>> transition-magic="0:0;224:107:0:8fbf83fd-241b-4623-8bbe-31d92e4dfce1"
>> on_node="d00-50-56-94-24-dd" call-id="365" rc-code="0" op-status="0"
>> interval="10000" last-rc-change="1459490849" exec-time="185" queue-time="0"
>> op-digest="cd8d3642c
>> 2016-06-09T03:06:03.097582+05:30 d18-fb-7b-18-f1-8e
>> pacemaker_remoted[6153]:    error: Partial
>> </lrm_resource>
>> 2016-06-09T03:06:03.097690+05:30 d18-fb-7b-18-f1-8e
>> pacemaker_remoted[6153]:    error: Partial
>> <lrm_resource id="vip-admin-database-default-proposal-controller"
>> type="IPaddr2" class="ocf" provider="heartbeat">
>> 2016-06-09T03:06:03.097797+05:30 d18-fb-7b-18-f1-8e
>> pacemaker_remoted[6153]:    error: Partial
>> <lrm_rsc_op id="vip-admin-database-default-proposal-controller_last_0"
>> operation_key="vip-admin-database-default-proposal-controller_stop_0"
>> operation="stop" crm-debug-origin="do_update_resource"
>> crm_feature_set="3.0.10" transition-key="228:116:0:8fbf
>> 83fd-241b-4623-8bbe-31d92e4dfce1" transition-magic="0:0;228:116:
>> 0:8fbf83fd-241b-4623-8bbe-31d92e4dfce1" on_node="d00-50-56-94-24-dd"
>> call-id="487" rc-code="0" op-status="0" interval="0" last-run="1459491026"
>> last-rc-chan
>> 2016-06-09T03:06:03.098013+05:30 d18-fb-7b-18-f1-8e
>> pacemaker_remoted[6153]:    error: Partial
>> <lrm_rsc_op id="vip-admin-database-default-proposal-controller_monitor_10000"
>> operation_key="vip-admin-database-default-proposal-controller_monitor_10000"
>> operation="monitor" crm-debug-origin="do_update_resource"
>> crm_feature_set="3.0.10" transition-key="227:107:0:8fbf
>> 83fd-241b-4623-8bbe-31d92e4dfce1" transition-magic="0:0;227:107:
>> 0:8fbf83fd-241b-4623-8bbe-31d92e4dfce1" on_node="d00-50-56-94-24-dd"
>> call-id="369" rc-code="0" op-status="0" interval="10000" last-rc-chang
>> 2016-06-09T03:06:03.098230+05:30 d18-fb-7b-18-f1-8e
>> pacemaker_remoted[6153]:    error: Partial
>> </lrm_resource>
>> 2016-06-09T03:06:03.098337+05:30 d18-fb-7b-18-f1-8e
>> pacemaker_remoted[6153]:    error: Partial
>> <lrm_resource id="postgresql" type="pgsql" class="ocf" provider="heartbeat">
>> 2016-06-09T03:06:03.098468+05:30 d18-fb-7b-18-f1-8e
>> pacemaker_remoted[6153]:    error: Partial
>> <lrm_rsc_op id="postgresql_last_0" operation_key="postgresql_stop_0"
>> operation="stop" crm-debug-origin="do_update_resource"
>> crm_feature_set="3.0.10" transition-key="231:116:0:8fbf
>> 83fd-241b-4623-8bbe-31d92e4dfce1" transition-magic="0:0;231:116:
>> 0:8fbf83fd-241b-4623-8bbe-31d92e4dfce1" on_node="d00-50-56-94-24-dd"
>> call-id="481" rc-code="0" op-status="0" interval="0" last-run="1459491025"
>> last-rc-change="1459491025" exec-time="1334" queue-time="0"
>> op-digest="f2317cad3d54c
>> 2016-06-09T03:06:03.099061+05:30 d18-fb-7b-18-f1-8e
>> pacemaker_remoted[6153]:    error: Partial
>> <lrm_rsc_op id="postgresql_monitor_10000" operation_key="postgresql_monitor_10000"
>> operation="monitor" crm-debug-origin="do_update_resource"
>> crm_feature_set="3.0.10" transition-key="230:107:0:8fbf
>> 83fd-241b-4623-8bbe-31d92e4dfce1" transition-magic="0:0;230:107:
>> 0:8fbf83fd-241b-4623-8bbe-31d92e4dfce1" on_node="d00-50-56-94-24-dd"
>> call-id="372" rc-code="0" op-status="0" interval="10000"
>> last-rc-change="1459490852" exec-time="424" queue-time="0"
>> op-digest="873ed4f07792aa8
>>
>>
>> On Wed, Jun 8, 2016 at 10:28 PM, Narayanamoorthy Srinivasan <
>> narayanamoorthys at gmail.com> wrote:
>>
>>> No recent network changes. Will check for abnormal traffic using
>>> wireshark.
>>>
>>> I also notice that the XML lines are partial (no ending '>', closing "
>>> and sometimes partial words) in logs. Any lines > 472 characters are
>>> truncated to 472 characters. Wondering is it due to anyother limitations.
>>>
>>> I can post some line tomorrow when i am back to work.
>>>
>>>
>>> On Wed, Jun 8, 2016 at 8:00 PM, Ken Gaillot <kgaillot at redhat.com> wrote:
>>>
>>>> On 06/08/2016 06:14 AM, Narayanamoorthy Srinivasan wrote:
>>>> > I have a pacemaker cluster with two pacemaker remote nodes. Recently
>>>> the
>>>> > remote nodes started throwing below errors and SDB started
>>>> self-fencing.
>>>> > Appreciate if someone throws light on what could be the issue and the
>>>> fix.
>>>> >
>>>> > OS - SLES 12 SP1
>>>> > Pacemaker Remote version - pacemaker-remote-1.1.13-14.7.x86_64
>>>> >
>>>> > 2016-06-08T14:11:46.009073+05:30 d18-fb-7b-18-f1-8e
>>>> > pacemaker_remoted[6190]:    error: XML Error: Entity: line 1: parser
>>>> > error : AttValue: ' expected
>>>> > 2016-06-08T14:11:46.009314+05:30 d18-fb-7b-18-f1-8e
>>>> > pacemaker_remoted[6190]:    error: XML Error:
>>>> > key="neutron-ha-tool_monitor_0" operation="monitor"
>>>> > crm-debug-origin="do_update_
>>>> > 2016-06-08T14:11:46.009443+05:30 d18-fb-7b-18-f1-8e
>>>> > pacemaker_remoted[6190]:    error: XML Error:
>>>> >                                                      ^
>>>> > 2016-06-08T14:11:46.009567+05:30 d18-fb-7b-18-f1-8e
>>>> > pacemaker_remoted[6190]:    error: XML Error: Entity: line 1: parser
>>>> > error : attributes construct error
>>>> > 2016-06-08T14:11:46.009697+05:30 d18-fb-7b-18-f1-8e
>>>> > pacemaker_remoted[6190]:    error: XML Error:
>>>> > key="neutron-ha-tool_monitor_0" operation="monitor"
>>>> > crm-debug-origin="do_update_
>>>> > 2016-06-08T14:11:46.009824+05:30 d18-fb-7b-18-f1-8e
>>>> > pacemaker_remoted[6190]:    error: XML Error:
>>>> >                                                      ^
>>>> > 2016-06-08T14:11:46.009948+05:30 d18-fb-7b-18-f1-8e
>>>> > pacemaker_remoted[6190]:    error: XML Error: Entity: line 1: parser
>>>> > error : Couldn't find end of Start Tag lrm_rsc_op line 1
>>>> > 2016-06-08T14:11:46.010070+05:30 d18-fb-7b-18-f1-8e
>>>> > pacemaker_remoted[6190]:    error: XML Error:
>>>> > key="neutron-ha-tool_monitor_0" operation="monitor"
>>>> > crm-debug-origin="do_update_
>>>> > 2016-06-08T14:11:46.010191+05:30 d18-fb-7b-18-f1-8e
>>>> > pacemaker_remoted[6190]:    error: XML Error:
>>>> >                                                      ^
>>>> > 2016-06-08T14:11:46.010460+05:30 d18-fb-7b-18-f1-8e
>>>> > pacemaker_remoted[6190]:    error: XML Error: Entity: line 1: parser
>>>> > error : Premature end of data in tag lrm_resource line 1
>>>> > 2016-06-08T14:11:46.010718+05:30 d18-fb-7b-18-f1-8e
>>>> > pacemaker_remoted[6190]:    error: XML Error:
>>>> > key="neutron-ha-tool_monitor_0" operation="monitor"
>>>> > crm-debug-origin="do_update_
>>>> > 2016-06-08T14:11:46.010977+05:30 d18-fb-7b-18-f1-8e
>>>> > pacemaker_remoted[6190]:    error: XML Error:
>>>> >                                                      ^
>>>> > 2016-06-08T14:11:46.011234+05:30 d18-fb-7b-18-f1-8e
>>>> > pacemaker_remoted[6190]:    error: XML Error: Entity: line 1: parser
>>>> > error : Premature end of data in tag lrm_resources line 1
>>>> >
>>>> >
>>>> > --
>>>> > Thanks & Regards
>>>> > Moorthy
>>>>
>>>> This sounds like the network traffic between the cluster nodes and the
>>>> remote nodes is being corrupted. Have there been any network changes
>>>> lately? Switch/firewall/etc. equipment/settings? MTU?
>>>>
>>>> You could try using a packet sniffer such as wireshark to see if the
>>>> traffic looks abnormal in some way. The payload is XML so it should be
>>>> more or less readable.
>>>>
>>>>
>>>> _______________________________________________
>>>> Users mailing list: Users at clusterlabs.org
>>>> http://clusterlabs.org/mailman/listinfo/users
>>>>
>>>> Project Home: http://www.clusterlabs.org
>>>> Getting started: http://www.clusterlabs.org/doc
>>>> /Cluster_from_Scratch.pdf
>>>> Bugs: http://bugs.clusterlabs.org
>>>>
>>>
>>>
>>>
>>> --
>>> Thanks & Regards
>>> Moorthy
>>>
>>
>>
>>
>> --
>> Thanks & Regards
>> Moorthy
>>
>> _______________________________________________
>> Users mailing list: Users at clusterlabs.org
>> http://clusterlabs.org/mailman/listinfo/users
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>>
>>
>
>
> --
> Best Regards,
>
> Radoslaw Garbacz
> XtremeData Incorporation
>



-- 
Best Regards,

Radoslaw Garbacz
XtremeData Incorporation
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://clusterlabs.org/pipermail/users/attachments/20160928/7d8883eb/attachment-0001.html>


More information about the Users mailing list