[ClusterLabs] Sudden stop of pacemaker functions

Klechomir klecho at gmail.com
Wed Feb 17 07:10:07 EST 2016


Hi List,
I've been having a strange issue lately.
I have a two-node cluster with some cloned resources on it.
One of my nodes suddenly starts reporting all of its resources as down
(some of them are actually still running), stops logging, and remains in
this state forever, while still responding to crm commands.
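
For what it's worth, this is roughly how I'm checking the state on the
stuck node (just the standard Pacemaker/crmsh status tools, nothing exotic):

    # one-shot snapshot of nodes and resources, as pacemaker sees them
    crm_mon -1
    # the same view through the crm shell
    crm status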

The curious thing is that restarting corosync/pacemaker doesn't change 
anything.
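
In case it matters, the restart I'm doing is nothing special, roughly this
sequence (exact service names depend on the distro/init system):

    # take pacemaker down first, then corosync
    service pacemaker stop
    service corosync stop
    # bring the stack back up in reverse order
    service corosync start
    service pacemaker start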

Here are the last lines in the log after restart:

Feb 17 12:55:17 [609415] CLUSTER-1       crmd:   notice: do_started:    The local CRM is operational
Feb 17 12:55:17 [609415] CLUSTER-1       crmd:     info: do_state_transition:   State transition S_STARTING -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_started ]
Feb 17 12:55:17 [609409] CLUSTER-1        cib:     info: cib_process_replace:   Digest matched on replace from CLUSTER-2: f7cb10ecaff6cfd1661ca7ec779192b3
Feb 17 12:55:17 [609409] CLUSTER-1        cib:     info: cib_process_replace:   Replaced 0.238.1 with 0.238.40 from CLUSTER-2
Feb 17 12:55:17 [609409] CLUSTER-1        cib:     info: cib_replace_notify:    Replaced: 0.238.1 -> 0.238.40 from CLUSTER-2
Feb 17 12:55:18 [609415] CLUSTER-1       crmd:     info: update_dc:     Set DC to CLUSTER-2 (3.0.6)
Feb 17 12:55:19 [609411] CLUSTER-1 stonith-ng:     info: stonith_command:       Processed register from crmd.609415: OK (0)
Feb 17 12:55:19 [609411] CLUSTER-1 stonith-ng:     info: stonith_command:       Processed st_notify from crmd.609415: OK (0)
Feb 17 12:55:19 [609411] CLUSTER-1 stonith-ng:     info: stonith_command:       Processed st_notify from crmd.609415: OK (0)
Feb 17 12:55:19 [609415] CLUSTER-1       crmd:     info: erase_status_tag:      Deleting xpath: //node_state[@uname='CLUSTER-1']/transient_attributes
Feb 17 12:55:19 [609415] CLUSTER-1       crmd:     info: update_attrd:  Connecting to attrd... 5 retries remaining
Feb 17 12:55:19 [609415] CLUSTER-1       crmd:   notice: do_state_transition:   State transition S_PENDING -> S_NOT_DC [ input=I_NOT_DC cause=C_HA_MESSAGE origin=do_cl_join_finalize_respond ]
Feb 17 12:55:19 [609413] CLUSTER-1      attrd:   notice: attrd_local_callback:  Sending full refresh (origin=crmd)
Feb 17 12:55:19 [609409] CLUSTER-1        cib:     info: cib_process_replace:   Digest matched on replace from CLUSTER-2: f7cb10ecaff6cfd1661ca7ec779192b3
Feb 17 12:55:19 [609409] CLUSTER-1        cib:     info: cib_process_replace:   Replaced 0.238.40 with 0.238.40 from CLUSTER-2
Feb 17 12:55:21 [609413] CLUSTER-1      attrd:  warning: attrd_cib_callback:    Update shutdown=(null) failed: No such device or address
Feb 17 12:55:22 [609413] CLUSTER-1      attrd:  warning: attrd_cib_callback:    Update terminate=(null) failed: No such device or address
Feb 17 12:55:25 [609413] CLUSTER-1      attrd:  warning: attrd_cib_callback:    Update pingd=(null) failed: No such device or address
Feb 17 12:55:26 [609413] CLUSTER-1      attrd:  warning: attrd_cib_callback:    Update fail-count-p_Samba_Server=(null) failed: No such device or address
Feb 17 12:55:26 [609413] CLUSTER-1      attrd:  warning: attrd_cib_callback:    Update master-p_Device_drbddrv1=(null) failed: No such device or address
Feb 17 12:55:27 [609413] CLUSTER-1      attrd:  warning: attrd_cib_callback:    Update last-failure-p_Samba_Server=(null) failed: No such device or address
Feb 17 12:55:27 [609413] CLUSTER-1      attrd:  warning: attrd_cib_callback:    Update probe_complete=(null) failed: No such device or address

After that, logging on the problematic node stops.

Corosync is v2.1.0.26; Pacemaker is v1.1.8.

Regards,
Klecho



