[ClusterLabs] Sudden stop of pacemaker functions

Wed Feb 17 13:15:01 UTC 2016

Hi Jan,
Here is the output from your command:

attrd: 609413
cib: 609409
corosync: 608778
crmd: 609415
lrmd: 609412
pengine: 609414
pacemakerd: 609407
stonithd: 609411
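
So all eight daemons are up. For anyone repeating this check, here is a slightly stricter variant as a rough sketch only: `check_daemons` is a hypothetical helper name (not part of Pacemaker), and it assumes a `pgrep` that supports `-x` for exact name matching. Without `-x`, a pattern such as `cib` would also match `cibadmin`.

```shell
#!/bin/sh
# Sketch: print the PIDs of each named daemon and flag the ones that are missing.
# check_daemons is a hypothetical helper; daemon names are passed as arguments.
# Usage: check_daemons attrd cib corosync crmd lrmd pengine pacemakerd stonithd
check_daemons() {
    missing=0
    for name in "$@"; do
        # -x: match the process name exactly; join multiple PIDs onto one line
        pids=$(pgrep -x "$name" | tr '\n' ' ')
        if [ -n "$pids" ]; then
            echo "${name}: ${pids% }"
        else
            echo "${name}: NOT RUNNING"
            missing=$((missing + 1))
        fi
    done
    # Exit status is the number of daemons that were not found (0 = all running)
    return "$missing"
}
```

The non-zero return value makes it usable from a monitoring script, although a running process does not prove the daemon is healthy, as this case shows.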

Regarding using a newer version, that's what I've been thinking about,
but I've been using this combination of corosync/pacemaker for many
years on different hardware and never had a similar problem.
The main difference is that I have stonith enabled only on the problematic
cluster, but I also suspect that the node that causes this problem may
have some hardware issues.

BTW, my last few tests with the newest corosync/pacemaker showed a very
annoying delay when committing configuration changes (maybe it's a known
problem?).

Best regards,
Klecho

On 17.02.2016 14:59, Jan Pokorný wrote:
> On 17/02/16 14:10 +0200, Klechomir wrote:
>> I've been having a strange issue lately.
>> I have a two-node cluster with some cloned resources on it.
>> One of my nodes suddenly starts reporting all its resources as down (some of
>> them are actually running), stops logging, and remains in this state
>> forever, while still responding to crm commands.
>>
>> The curious thing is that restarting corosync/pacemaker doesn't change
>> anything.
>>
>> Here are the last lines in the log after restart:
>>
>> [...]
>> Feb 17 12:55:19 [609409] CLUSTER-1        cib:     info:
>> cib_process_replace:   Replaced 0.238.40 with 0.238.40 from CLUSTER-2
>> Feb 17 12:55:21 [609413] CLUSTER-1      attrd:  warning: attrd_cib_callback:
>> Update shutdown=(null) failed: No such device or address
>> Feb 17 12:55:22 [609413] CLUSTER-1      attrd:  warning: attrd_cib_callback:
>> Update terminate=(null) failed: No such device or address
>> Feb 17 12:55:25 [609413] CLUSTER-1      attrd:  warning: attrd_cib_callback:
>> Update pingd=(null) failed: No such device or address
>> Feb 17 12:55:26 [609413] CLUSTER-1      attrd:  warning: attrd_cib_callback:
>> Update fail-count-p_Samba_Server=(null) failed: No such device or address
>> Feb 17 12:55:26 [609413] CLUSTER-1      attrd:  warning: attrd_cib_callback:
>> Update master-p_Device_drbddrv1=(null) failed: No such device or address
>> Feb 17 12:55:27 [609413] CLUSTER-1      attrd:  warning: attrd_cib_callback:
>> Update last-failure-p_Samba_Server=(null) failed: No such device or address
>> Feb 17 12:55:27 [609413] CLUSTER-1      attrd:  warning: attrd_cib_callback:
>> Update probe_complete=(null) failed: No such device or address
>>
>> After that the logging on the problematic node stops.
> Not sure I follow; what does the following command produce:
>
>      for i in attrd cib corosync crmd lrmd pengine pacemakerd stonithd; do \
>      echo "${i}: $(pgrep ${i})"; done
>
> ?
>
>> Corosync is v2.1.0.26; Pacemaker v1.1.8
> Definitely try the most recent version of Pacemaker; what you are using
> is 3.5 years old and plenty of fixes have landed since then.
