[Pacemaker] Corosync 1.4.7: zombie (defunct)

Sergey Arlashin sergeyarl.maillist at gmail.com
Wed Jan 7 12:36:19 EST 2015


Sorry, my fault. I forgot to include /usr/lib/lcrso/pacemaker.lcrso in my deb package.
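
For anyone hitting the same "Service failed to load 'pacemaker'" message: a quick sanity check (the .deb filename below is illustrative) is to confirm the plugin is both in the package and installed on disk:

  # dpkg -c pacemaker_1.1.12*.deb | grep lcrso    # should list ./usr/lib/lcrso/pacemaker.lcrso
  # ls -l /usr/lib/lcrso/pacemaker.lcrso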

--
Best regards,
Sergey Arlashin


On Jan 7, 2015, at 2:18 PM, Sergey Arlashin <sergeyarl.maillist at gmail.com> wrote:

> After installing 1.1.12 on one of the nodes in my staging environment, I see the following error in corosync.log:
> 
> Jan  7 10:05:30 lb-node1 corosync[17022]:   [SERV  ] Service failed to load 'pacemaker'.
> 
> and I also cannot get crm_mon to show any info:
> 
> # crm_mon -1
> Connection to cluster failed: Transport endpoint is not connected
> 
> # crm status
> ERROR: status: crm_mon exited with code 107. Output: 'Connection to cluster failed: Transport endpoint is not connected'
> 
> The same thing happened with 1.1.11 (I rebuilt the 1.1.11 package from Ubuntu 14.04 for the 12.04 release we're using).
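> 
> (In case it helps anyone else debugging this error: with the plugin-based corosync 1.x stack you can check at runtime whether the pacemaker service engine was loaded at all. corosync-objctl dumps corosync's runtime object database, so something like the following should show pacemaker service entries if the plugin loaded; the grep pattern is just illustrative:)
> 
>     # corosync-objctl | grep -i pacemaker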
> 
> 
> --
> Best regards,
> Sergey Arlashin
> 
> On Jan 7, 2015, at 5:22 AM, Andrew Beekhof <andrew at beekhof.net> wrote:
> 
>> 
>>> On 7 Jan 2015, at 7:58 am, Sergey Arlashin <sergeyarl.maillist at gmail.com> wrote:
>>> 
>>> And one more question: can pacemaker 1.1.12 be used together with corosync 1.4.7?
>> 
>> It can be; it depends entirely on which version of corosync it was built against.
>> 
>>> Or do I need to install corosync 2.x?
>> 
>> Wouldn't be a bad idea while you're at it
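>> 
>> (A quick way to check what a given build supports: pacemakerd --features prints the feature set the binaries were compiled with. The flag is present in recent 1.1.x releases and the exact feature names vary by build, but an entry such as corosync-plugin indicates a build against the corosync 1.x plugin API.)
>> 
>>     # pacemakerd --features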
>> 
>>> 
>>> --
>>> Best regards,
>>> Sergey Arlashin
>>> 
>>> 
>>> On Jan 6, 2015, at 11:04 AM, Sergey Arlashin <sergeyarl.maillist at gmail.com> wrote:
>>> 
>>>> Thank you!
>>>> I'll try 1.1.12. 
>>>> 
>>>> --
>>>> Best regards,
>>>> Sergey Arlashin
>>>> 
>>>> 
>>>> On Jan 6, 2015, at 3:23 AM, Andrew Beekhof <andrew at beekhof.net> wrote:
>>>> 
>>>>> Yeah, I can imagine 1.1.6 behaving like this.
>>>>> I'd highly recommend 1.1.12
>>>>> 
>>>>>> On 5 Jan 2015, at 5:14 pm, Sergey Arlashin <sergeyarl.maillist at gmail.com> wrote:
>>>>>> 
>>>>>> Pacemaker 1.1.6
>>>>>> 
>>>>>> It runs on Ubuntu 12.04 LTS, 64-bit.
>>>>>> 
>>>>>> Linux lb-node1 3.11.0-23-generic #40~precise1-Ubuntu SMP Wed Jun 4 22:06:36 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
>>>>>> 
>>>>>> --
>>>>>> Best regards,
>>>>>> Sergey Arlashin
>>>>>> 
>>>>>> 
>>>>>> On Jan 5, 2015, at 7:59 AM, Andrew Beekhof <andrew at beekhof.net> wrote:
>>>>>> 
>>>>>>> Pacemaker version? It looks familiar, but it depends on the version number.
>>>>>>> 
>>>>>>>> On 29 Dec 2014, at 10:24 pm, Sergey Arlashin <sergeyarl.maillist at gmail.com> wrote:
>>>>>>>> 
>>>>>>>> Hi!
>>>>>>>> Recently I noticed that one of my nodes was shown as OFFLINE in the 'crm status' output, but it actually was not: I could ssh to the node and run 'crm status' from its console. After some time it came back online. This has happened several times, with other nodes as well, without any obvious reason.
>>>>>>>> 
>>>>>>>> Still, there are no error or fatal messages in the logs. The only warning messages I could find in corosync.log were the following:
>>>>>>>> 
>>>>>>>> Dec 29 10:56:34 lb-node2 cib: [2238]: WARN: cib_process_diff: Diff 0.233.1346 -> 0.233.1347 not applied to 0.233.1354: current "num_updates" is greater than required
>>>>>>>> Dec 29 10:56:34 lb-node2 cib: [2238]: WARN: cib_process_diff: Diff 0.233.1347 -> 0.233.1348 not applied to 0.233.1354: current "num_updates" is greater than required
>>>>>>>> Dec 29 10:56:34 lb-node2 cib: [2238]: WARN: cib_process_diff: Diff 0.233.1348 -> 0.233.1349 not applied to 0.233.1354: current "num_updates" is greater than required
>>>>>>>> Dec 29 10:56:34 lb-node2 cib: [2238]: WARN: cib_process_diff: Diff 0.233.1349 -> 0.233.1350 not applied to 0.233.1354: current "num_updates" is greater than required
>>>>>>>> Dec 29 10:56:34 lb-node2 cib: [2238]: WARN: cib_process_diff: Diff 0.233.1350 -> 0.233.1351 not applied to 0.233.1354: current "num_updates" is greater than required
>>>>>>>> Dec 29 10:56:34 lb-node2 cib: [2238]: WARN: cib_process_diff: Diff 0.233.1351 -> 0.233.1352 not applied to 0.233.1354: current "num_updates" is greater than required
>>>>>>>> Dec 29 10:56:34 lb-node2 cib: [2238]: WARN: cib_process_diff: Diff 0.233.1352 -> 0.233.1353 not applied to 0.233.1354: current "num_updates" is greater than required
>>>>>>>> Dec 29 10:56:34 lb-node2 cib: [2238]: WARN: cib_process_diff: Diff 0.233.1353 -> 0.233.1354 not applied to 0.233.1354: current "num_updates" is greater than required
>>>>>>>> Dec 29 10:56:34 lb-node2 attrd: [2240]: WARN: attrd_cib_callback: Update 491 for last-failure-Cachier=1419729443 failed: Application of an update diff failed
>>>>>>>> Dec 29 10:56:34 lb-node2 attrd: [2240]: WARN: attrd_cib_callback: Update 494 for fail-count-Cachier=1 failed: Application of an update diff failed
>>>>>>>> Dec 29 10:56:34 lb-node2 attrd: [2240]: WARN: attrd_cib_callback: Update 497 for probe_complete=true failed: Application of an update diff failed
>>>>>>>> Dec 29 10:56:34 lb-node2 attrd: [2240]: WARN: attrd_cib_callback: Update 500 for last-failure-Cachier=1419729443 failed: Application of an update diff failed
>>>>>>>> Dec 29 10:56:34 lb-node2 attrd: [2240]: WARN: attrd_cib_callback: Update 503 for fail-count-Cachier=1 failed: Application of an update diff failed
>>>>>>>> Dec 29 10:56:37 lb-node2 cib: [2238]: WARN: cib_process_diff: Diff 0.233.1338 -> 0.233.1339 not applied to 0.233.1382: current "num_updates" is greater than required
>>>>>>>> Dec 29 10:56:37 lb-node2 cib: [2238]: WARN: cib_process_diff: Diff 0.233.1339 -> 0.233.1340 not applied to 0.233.1382: current "num_updates" is greater than required
>>>>>>>> Dec 29 10:56:37 lb-node2 cib: [2238]: WARN: cib_process_diff: Diff 0.233.1340 -> 0.233.1341 not applied to 0.233.1382: current "num_updates" is greater than required
>>>>>>>> Dec 29 10:56:37 lb-node2 cib: [2238]: WARN: cib_process_diff: Diff 0.233.1341 -> 0.233.1342 not applied to 0.233.1382: current "num_updates" is greater than required
>>>>>>>> Dec 29 10:56:37 lb-node2 cib: [2238]: WARN: cib_process_diff: Diff 0.233.1342 -> 0.233.1343 not applied to 0.233.1382: current "num_updates" is greater than required
>>>>>>>> 
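>>>>>>>> (Aside: the version triple in those messages is admin_epoch.epoch.num_updates, so 0.233.1354 means num_updates=1354. Comparing the attributes of the top-level <cib> element on each node shows whether the copies have diverged; cibadmin ships with every pacemaker install:)
>>>>>>>> 
>>>>>>>>     # cibadmin -Q | head -n 1
>>>>>>>> 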
>>>>>>>> After examining the corosync processes with ps, I found that all my nodes have zombie corosync processes like these:
>>>>>>>> 
>>>>>>>> root     13892  0.0  0.0      0     0 ?        Z    Dec26   0:04 [corosync] <defunct>
>>>>>>>> root     21793  0.0  0.0      0     0 ?        Z    Dec26   0:00 [corosync] <defunct>
>>>>>>>> root     27009  1.3  1.0 714292 10784 ?        Ssl  Dec18 223:38 /usr/sbin/corosync
>>>>>>>> 
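>>>>>>>> (For what it's worth, the parent PID shows which process spawned the zombies and is failing to reap them; the ps/awk options below are standard procps:)
>>>>>>>> 
>>>>>>>>     # ps -eo pid,ppid,stat,comm | awk '$3 ~ /^Z/'
>>>>>>>> 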
>>>>>>>> Is it OK to have zombie corosync processes on the nodes, or does it suggest that something is going wrong?
>>>>>>>> 
>>>>>>>> Thanks in advance
>>>>>>>> 
>>>>>>>> --
>>>>>>>> Best regards,
>>>>>>>> Sergey Arlashin
>>>>>>>> 




