[ClusterLabs] Pacemaker startup retries

c.hernandez c.hernandez at medlabmg.com
Tue Sep 4 04:31:55 EDT 2018


 

Hi 

>Check the pacemaker logs on both bodes around the time it happens.

This scenario happens when one node is starting while the other does not have
the corosync and pacemaker services running, so there is only one node whose
logs I can check.
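
In case it helps, this is roughly how I am pulling the logs on the node that does
start (assuming corosync writes to /var/log/cluster/corosync.log here; the path
depends on the logging section of corosync.conf):

# Pacemaker/corosync messages around the time of the fencing attempt
grep -E 'stonith-ng|pengine|crmd' /var/log/cluster/corosync.log

# On a systemd-based system, the journal for the same window works too
journalctl -u corosync -u pacemaker --since "2018-08-31 10:58" --until "2018-08-31 11:00"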

>One of the nodes will be the DC, and will have "pengine:" logs with
>"saving inputs".

No "saving inputs" message on logs on startup

>The first thing I'd look for is who requested fencing. The DC will have
>stonith logs with "Client ... wants to fence ...". The client will
>either be crmd (i.e. the cluster itself) or some external program.

It's crmd

Aug 31 10:59:20 [30612] node1 stonith-ng: notice: handle_request: Client crmd.30616.aa4a8de3 wants to fence (reboot) 'node2' with device '(any)'
Aug 31 10:59:37 [30612] node1 stonith-ng: notice: handle_request: Client crmd.30616.aa4a8de3 wants to fence (reboot) 'node2' with device '(any)'
Aug 31 10:59:53 [30612] node1 stonith-ng: notice: handle_request: Client crmd.30616.aa4a8de3 wants to fence (reboot) 'node2' with device '(any)'
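
If it is useful, I understand the fencing history can also be queried directly
from stonith-ng; something along these lines (I have not pasted its output here):

# Show the fencing operations recorded for node2 on this node
stonith_admin --history node2 --verbose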

>If it's the cluster, I'd look at the "pengine:" logs on the DC before
>that, to see if there are any hints (node unclean, etc.). Then keep
>going backward until the ultimate cause is found.

The following are the pengine logs preceding the first fencing:

Aug 31 10:58:58 [30615] node1 pengine: info: crm_log_init: Changed active directory to /var/lib/pacemaker/cores/hacluster
Aug 31 10:58:58 [30615] node1 pengine: info: qb_ipcs_us_publish: server name: pengine
Aug 31 10:58:58 [30615] node1 pengine: info: main: Starting pengine
Aug 31 10:59:20 [30615] node1 pengine: notice: unpack_config: On loss of CCM Quorum: Ignore
Aug 31 10:59:20 [30615] node1 pengine: info: determine_online_status_fencing: Node node1 is active
Aug 31 10:59:20 [30615] node1 pengine: info: determine_online_status: Node node1 is online
Aug 31 10:59:20 [30615] node1 pengine: info: clone_print: Clone Set: fencing [st-fence_propio]
Aug 31 10:59:20 [30615] node1 pengine: info: short_print: Stopped: [ node1 node2 ]
Aug 31 10:59:20 [30615] node1 pengine: info: clone_print: Master/Slave Set: ms_drbd_databasestorage [p_drbd_databasestorage]
Aug 31 10:59:20 [30615] node1 pengine: info: short_print: Stopped: [ node1 node2 ]
Aug 31 10:59:20 [30615] node1 pengine: info: clone_print: Master/Slave Set: ms_drbd_datoswebstorage [p_drbd_datoswebstorage]
Aug 31 10:59:20 [30615] node1 pengine: info: short_print: Stopped: [ node1 node2 ]
Aug 31 10:59:20 [30615] node1 pengine: info: group_print: Resource Group: rg_database
Aug 31 10:59:20 [30615] node1 pengine: info: native_print: p_fs_database (ocf::heartbeat:Filesystem): Stopped
Aug 31 10:59:20 [30615] node1 pengine: info: native_print: p_ip_databasestorageip (ocf::heartbeat:IPaddr): Stopped
Aug 31 10:59:20 [30615] node1 pengine: info: native_print: p_ip_pub_database (ocf::heartbeat:IPaddr): Stopped
Aug 31 10:59:20 [30615] node1 pengine: info: native_print: p_moverip_database (lsb:moverip_database): Stopped
Aug 31 10:59:20 [30615] node1 pengine: info: native_print: servicio_enviamailpacemakerdatabase (lsb:enviamailpacemakerdatabase): Stopped
Aug 31 10:59:20 [30615] node1 pengine: info: group_print: Resource Group: rg_datosweb
Aug 31 10:59:20 [30615] node1 pengine: info: native_print: p_fs_datosweb (ocf::heartbeat:Filesystem): Stopped
Aug 31 10:59:20 [30615] node1 pengine: info: native_print: p_ip_datoswebstorageip (ocf::heartbeat:IPaddr): Stopped
Aug 31 10:59:20 [30615] node1 pengine: info: native_print: p_ip_pub_datosweb (ocf::heartbeat:IPaddr): Stopped
Aug 31 10:59:20 [30615] node1 pengine: info: native_print: p_moverip_datosweb (lsb:moverip_datosweb): Stopped
Aug 31 10:59:20 [30615] node1 pengine: info: native_print: servicio_enviamailpacemakerdatosweb (lsb:enviamailpacemakerdatosweb): Stopped
Aug 31 10:59:20 [30615] node1 pengine: info: native_color: Resource st-fence_propio:1 cannot run anywhere
Aug 31 10:59:20 [30615] node1 pengine: info: native_color: Resource p_drbd_databasestorage:1 cannot run anywhere
Aug 31 10:59:20 [30615] node1 pengine: info: master_color: ms_drbd_databasestorage: Promoted 0 instances of a possible 1 to master
Aug 31 10:59:20 [30615] node1 pengine: info: native_color: Resource p_drbd_datoswebstorage:1 cannot run anywhere
Aug 31 10:59:20 [30615] node1 pengine: info: master_color: ms_drbd_datoswebstorage: Promoted 0 instances of a possible 1 to master
Aug 31 10:59:20 [30615] node1 pengine: info: rsc_merge_weights: p_fs_database: Rolling back scores from p_ip_databasestorageip
Aug 31 10:59:20 [30615] node1 pengine: info: native_color: Resource p_fs_database cannot run anywhere
Aug 31 10:59:20 [30615] node1 pengine: info: rsc_merge_weights: p_ip_databasestorageip: Rolling back scores from p_ip_pub_database
Aug 31 10:59:20 [30615] node1 pengine: info: native_color: Resource p_ip_databasestorageip cannot run anywhere
Aug 31 10:59:20 [30615] node1 pengine: info: rsc_merge_weights: p_ip_pub_database: Rolling back scores from p_moverip_database
Aug 31 10:59:20 [30615] node1 pengine: info: native_color: Resource p_ip_pub_database cannot run anywhere
Aug 31 10:59:20 [30615] node1 pengine: info: rsc_merge_weights: p_moverip_database: Rolling back scores from servicio_enviamailpacemakerdatabase
Aug 31 10:59:20 [30615] node1 pengine: info: native_color: Resource p_moverip_database cannot run anywhere
Aug 31 10:59:20 [30615] node1 pengine: info: native_color: Resource servicio_enviamailpacemakerdatabase cannot run anywhere
Aug 31 10:59:20 [30615] node1 pengine: info: rsc_merge_weights: p_fs_datosweb: Rolling back scores from p_ip_datoswebstorageip
Aug 31 10:59:20 [30615] node1 pengine: info: native_color: Resource p_fs_datosweb cannot run anywhere
Aug 31 10:59:20 [30615] node1 pengine: info: rsc_merge_weights: p_ip_datoswebstorageip: Rolling back scores from p_ip_pub_datosweb
Aug 31 10:59:20 [30615] node1 pengine: info: native_color: Resource p_ip_datoswebstorageip cannot run anywhere
Aug 31 10:59:20 [30615] node1 pengine: info: rsc_merge_weights: p_ip_pub_datosweb: Rolling back scores from p_moverip_datosweb
Aug 31 10:59:20 [30615] node1 pengine: info: native_color: Resource p_ip_pub_datosweb cannot run anywhere
Aug 31 10:59:20 [30615] node1 pengine: info: rsc_merge_weights: p_moverip_datosweb: Rolling back scores from servicio_enviamailpacemakerdatosweb
Aug 31 10:59:20 [30615] node1 pengine: info: native_color: Resource p_moverip_datosweb cannot run anywhere
Aug 31 10:59:20 [30615] node1 pengine: info: native_color: Resource servicio_enviamailpacemakerdatosweb cannot run anywhere
Aug 31 10:59:20 [30615] node1 pengine: info: RecurringOp: Start recurring monitor (31s) for p_drbd_databasestorage:0 on node1
Aug 31 10:59:20 [30615] node1 pengine: info: RecurringOp: Start recurring monitor (31s) for p_drbd_databasestorage:0 on node1
Aug 31 10:59:20 [30615] node1 pengine: info: RecurringOp: Start recurring monitor (31s) for p_drbd_datoswebstorage:0 on node1
Aug 31 10:59:20 [30615] node1 pengine: info: RecurringOp: Start recurring monitor (31s) for p_drbd_datoswebstorage:0 on node1
Aug 31 10:59:20 [30615] node1 pengine: warning: stage6: Scheduling Node node2 for STONITH
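
If it would help, I can also re-run the scheduler outside the cluster to
reproduce that "Scheduling Node node2 for STONITH" decision; my understanding
is that crm_simulate can do this against the live CIB or against a saved
pe-input file (the file name below is just an example):

# Re-run the scheduler against the live cluster state and show what it would do
crm_simulate --simulate --live-check

# The same against a saved transition input, if one had been written
# crm_simulate --simulate --xml-file /var/lib/pacemaker/pengine/pe-input-0.bz2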

Any more clues?
Thanks
Cesar

 