[ClusterLabs] Pacemaker startup retries
Ken Gaillot
kgaillot at redhat.com
Tue Sep 4 12:35:45 EDT 2018
On Tue, 2018-09-04 at 10:31 +0200, c.hernandez wrote:
> Hi
> >Check the pacemaker logs on both nodes around the time it happens.
>
> This scenario happens when one node is starting and the other doesn't
> have the corosync and pacemaker services started, so there is only
> one node whose logs I can check
>
>
> >One of the nodes will be the DC, and will have "pengine:" logs with
> >"saving inputs".
>
> No "saving inputs" message appears in the logs at startup
>
>
> >The first thing I'd look for is who requested fencing. The DC will
> >have stonith logs with "Client ... wants to fence ...". The client
> >will either be crmd (i.e. the cluster itself) or some external
> >program.
>
> It's crmd
>
> Aug 31 10:59:20 [30612] node1 stonith-ng: notice: handle_request: Client crmd.30616.aa4a8de3 wants to fence (reboot) 'node2' with device '(any)'
> Aug 31 10:59:37 [30612] node1 stonith-ng: notice: handle_request: Client crmd.30616.aa4a8de3 wants to fence (reboot) 'node2' with device '(any)'
> Aug 31 10:59:53 [30612] node1 stonith-ng: notice: handle_request: Client crmd.30616.aa4a8de3 wants to fence (reboot) 'node2' with device '(any)'
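[Editor's note: three identical requests roughly 17 seconds apart usually mean the earlier attempts failed or timed out and crmd retried. A minimal sketch of counting such requests in a saved log; the temp file below just reproduces the stonith-ng lines quoted above, and on a live system the real log path varies by distribution (e.g. /var/log/pacemaker.log):]

```shell
# Reproduce the stonith-ng lines quoted above in a temp file, then count
# how many fencing requests crmd issued.
log=$(mktemp)
cat > "$log" <<'EOF'
Aug 31 10:59:20 [30612] node1 stonith-ng: notice: handle_request: Client crmd.30616.aa4a8de3 wants to fence (reboot) 'node2' with device '(any)'
Aug 31 10:59:37 [30612] node1 stonith-ng: notice: handle_request: Client crmd.30616.aa4a8de3 wants to fence (reboot) 'node2' with device '(any)'
Aug 31 10:59:53 [30612] node1 stonith-ng: notice: handle_request: Client crmd.30616.aa4a8de3 wants to fence (reboot) 'node2' with device '(any)'
EOF

# Three requests in ~30 seconds points at retries, i.e. the earlier
# attempts likely did not complete successfully.
count=$(grep -c "wants to fence" "$log")
echo "$count"   # prints 3
rm -f "$log"
```

[Searching the surrounding lines for each attempt's operation result would then show whether and why each one failed.]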
>
>
> >If it's the cluster, I'd look at the "pengine:" logs on the DC
> >before that, to see if there are any hints (node unclean, etc.).
> >Then keep going backward until the ultimate cause is found.
>
> The following are the pengine logs preceding the first fencing:
>
> Aug 31 10:58:58 [30615] node1 pengine: info: crm_log_init: Changed active directory to /var/lib/pacemaker/cores/hacluster
> Aug 31 10:58:58 [30615] node1 pengine: info: qb_ipcs_us_publish: server name: pengine
> Aug 31 10:58:58 [30615] node1 pengine: info: main: Starting pengine
> Aug 31 10:59:20 [30615] node1 pengine: notice: unpack_config: On loss of CCM Quorum: Ignore
> Aug 31 10:59:20 [30615] node1 pengine: info: determine_online_status_fencing: Node node1 is active
> Aug 31 10:59:20 [30615] node1 pengine: info: determine_online_status: Node node1 is online
> Aug 31 10:59:20 [30615] node1 pengine: info: clone_print: Clone Set: fencing [st-fence_propio]
> Aug 31 10:59:20 [30615] node1 pengine: info: short_print: Stopped: [ node1 node2 ]
> Aug 31 10:59:20 [30615] node1 pengine: info: clone_print: Master/Slave Set: ms_drbd_databasestorage [p_drbd_databasestorage]
> Aug 31 10:59:20 [30615] node1 pengine: info: short_print: Stopped: [ node1 node2 ]
> Aug 31 10:59:20 [30615] node1 pengine: info: clone_print: Master/Slave Set: ms_drbd_datoswebstorage [p_drbd_datoswebstorage]
> Aug 31 10:59:20 [30615] node1 pengine: info: short_print: Stopped: [ node1 node2 ]
> Aug 31 10:59:20 [30615] node1 pengine: info: group_print: Resource Group: rg_database
> Aug 31 10:59:20 [30615] node1 pengine: info: native_print: p_fs_database (ocf::heartbeat:Filesystem): Stopped
> Aug 31 10:59:20 [30615] node1 pengine: info: native_print: p_ip_databasestorageip (ocf::heartbeat:IPaddr): Stopped
> Aug 31 10:59:20 [30615] node1 pengine: info: native_print: p_ip_pub_database (ocf::heartbeat:IPaddr): Stopped
> Aug 31 10:59:20 [30615] node1 pengine: info: native_print: p_moverip_database (lsb:moverip_database): Stopped
> Aug 31 10:59:20 [30615] node1 pengine: info: native_print: servicio_enviamailpacemakerdatabase (lsb:enviamailpacemakerdatabase): Stopped
> Aug 31 10:59:20 [30615] node1 pengine: info: group_print: Resource Group: rg_datosweb
> Aug 31 10:59:20 [30615] node1 pengine: info: native_print: p_fs_datosweb (ocf::heartbeat:Filesystem): Stopped
> Aug 31 10:59:20 [30615] node1 pengine: info: native_print: p_ip_datoswebstorageip (ocf::heartbeat:IPaddr): Stopped
> Aug 31 10:59:20 [30615] node1 pengine: info: native_print: p_ip_pub_datosweb (ocf::heartbeat:IPaddr): Stopped
> Aug 31 10:59:20 [30615] node1 pengine: info: native_print: p_moverip_datosweb (lsb:moverip_datosweb): Stopped
> Aug 31 10:59:20 [30615] node1 pengine: info: native_print: servicio_enviamailpacemakerdatosweb (lsb:enviamailpacemakerdatosweb): Stopped
> Aug 31 10:59:20 [30615] node1 pengine: info: native_color: Resource st-fence_propio:1 cannot run anywhere
> Aug 31 10:59:20 [30615] node1 pengine: info: native_color: Resource p_drbd_databasestorage:1 cannot run anywhere
> Aug 31 10:59:20 [30615] node1 pengine: info: master_color: ms_drbd_databasestorage: Promoted 0 instances of a possible 1 to master
> Aug 31 10:59:20 [30615] node1 pengine: info: native_color: Resource p_drbd_datoswebstorage:1 cannot run anywhere
> Aug 31 10:59:20 [30615] node1 pengine: info: master_color: ms_drbd_datoswebstorage: Promoted 0 instances of a possible 1 to master
> Aug 31 10:59:20 [30615] node1 pengine: info: rsc_merge_weights: p_fs_database: Rolling back scores from p_ip_databasestorageip
> Aug 31 10:59:20 [30615] node1 pengine: info: native_color: Resource p_fs_database cannot run anywhere
> Aug 31 10:59:20 [30615] node1 pengine: info: rsc_merge_weights: p_ip_databasestorageip: Rolling back scores from p_ip_pub_database
> Aug 31 10:59:20 [30615] node1 pengine: info: native_color: Resource p_ip_databasestorageip cannot run anywhere
> Aug 31 10:59:20 [30615] node1 pengine: info: rsc_merge_weights: p_ip_pub_database: Rolling back scores from p_moverip_database
> Aug 31 10:59:20 [30615] node1 pengine: info: native_color: Resource p_ip_pub_database cannot run anywhere
> Aug 31 10:59:20 [30615] node1 pengine: info: rsc_merge_weights: p_moverip_database: Rolling back scores from servicio_enviamailpacemakerdatabase
> Aug 31 10:59:20 [30615] node1 pengine: info: native_color: Resource p_moverip_database cannot run anywhere
> Aug 31 10:59:20 [30615] node1 pengine: info: native_color: Resource servicio_enviamailpacemakerdatabase cannot run anywhere
> Aug 31 10:59:20 [30615] node1 pengine: info: rsc_merge_weights: p_fs_datosweb: Rolling back scores from p_ip_datoswebstorageip
> Aug 31 10:59:20 [30615] node1 pengine: info: native_color: Resource p_fs_datosweb cannot run anywhere
> Aug 31 10:59:20 [30615] node1 pengine: info: rsc_merge_weights: p_ip_datoswebstorageip: Rolling back scores from p_ip_pub_datosweb
> Aug 31 10:59:20 [30615] node1 pengine: info: native_color: Resource p_ip_datoswebstorageip cannot run anywhere
> Aug 31 10:59:20 [30615] node1 pengine: info: rsc_merge_weights: p_ip_pub_datosweb: Rolling back scores from p_moverip_datosweb
> Aug 31 10:59:20 [30615] node1 pengine: info: native_color: Resource p_ip_pub_datosweb cannot run anywhere
> Aug 31 10:59:20 [30615] node1 pengine: info: rsc_merge_weights: p_moverip_datosweb: Rolling back scores from servicio_enviamailpacemakerdatosweb
> Aug 31 10:59:20 [30615] node1 pengine: info: native_color: Resource p_moverip_datosweb cannot run anywhere
> Aug 31 10:59:20 [30615] node1 pengine: info: native_color: Resource servicio_enviamailpacemakerdatosweb cannot run anywhere
> Aug 31 10:59:20 [30615] node1 pengine: info: RecurringOp: Start recurring monitor (31s) for p_drbd_databasestorage:0 on node1
> Aug 31 10:59:20 [30615] node1 pengine: info: RecurringOp: Start recurring monitor (31s) for p_drbd_databasestorage:0 on node1
> Aug 31 10:59:20 [30615] node1 pengine: info: RecurringOp: Start recurring monitor (31s) for p_drbd_datoswebstorage:0 on node1
> Aug 31 10:59:20 [30615] node1 pengine: info: RecurringOp: Start recurring monitor (31s) for p_drbd_datoswebstorage:0 on node1
> Aug 31 10:59:20 [30615] node1 pengine: warning: stage6: Scheduling Node node2 for STONITH
>
>
>
> Any more clues?
The first fencing is legitimate -- the node hasn't been seen at start-
up, and so needs to be fenced. The second fencing will be the one of
interest. Also, look for the result of the first fencing.
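[Editor's note: the "hasn't been seen at start-up" behaviour Ken describes is governed by the startup-fencing cluster option, which defaults to true. As an illustrative sketch, the property lives as an nvpair under crm_config in the CIB (the id below is hypothetical):]

```xml
<!-- Illustrative CIB fragment; the nvpair id is hypothetical. Keeping
     startup-fencing at its default of "true" is recommended: disabling
     it is unsafe unless an unseen node is known with certainty to be
     powered off. -->
<nvpair id="cib-bootstrap-options-startup-fencing"
        name="startup-fencing" value="true"/>
```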
> Thanks
> Cesar
--
Ken Gaillot <kgaillot at redhat.com>