[ClusterLabs] Pacemaker startup retries
Ken Gaillot
kgaillot at redhat.com
Tue Sep 4 12:35:45 EDT 2018
On Tue, 2018-09-04 at 10:31 +0200, c.hernandez wrote:
> Hi
> >Check the pacemaker logs on both nodes around the time it happens.
>
> This scenario happens when one node is starting and the other doesn't
> have the corosync and pacemaker services started, so there is only
> one node whose logs I can check
>
>
> >One of the nodes will be the DC, and will have "pengine:" logs with
> >"saving inputs".
>
> No "saving inputs" message appears in the logs at startup
>
>
> >The first thing I'd look for is who requested fencing. The DC will
> >have stonith logs with "Client ... wants to fence ...". The client
> >will either be crmd (i.e. the cluster itself) or some external
> >program.
>
> It's crmd
>
> Aug 31 10:59:20 [30612] node1 stonith-ng: notice: handle_request: Client crmd.30616.aa4a8de3 wants to fence (reboot) 'node2' with device '(any)'
> Aug 31 10:59:37 [30612] node1 stonith-ng: notice: handle_request: Client crmd.30616.aa4a8de3 wants to fence (reboot) 'node2' with device '(any)'
> Aug 31 10:59:53 [30612] node1 stonith-ng: notice: handle_request: Client crmd.30616.aa4a8de3 wants to fence (reboot) 'node2' with device '(any)'
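[Editor's note: three identical requests roughly 17 seconds apart usually mean the earlier attempts failed or timed out and crmd retried. A minimal sketch of counting such requests in a saved log; the temp file below just reproduces the stonith-ng lines quoted above, and on a live system the real log path varies by distribution (e.g. /var/log/pacemaker.log):]

```shell
# Reproduce the stonith-ng lines quoted above in a temp file, then count
# how many fencing requests crmd issued.
log=$(mktemp)
cat > "$log" <<'EOF'
Aug 31 10:59:20 [30612] node1 stonith-ng: notice: handle_request: Client crmd.30616.aa4a8de3 wants to fence (reboot) 'node2' with device '(any)'
Aug 31 10:59:37 [30612] node1 stonith-ng: notice: handle_request: Client crmd.30616.aa4a8de3 wants to fence (reboot) 'node2' with device '(any)'
Aug 31 10:59:53 [30612] node1 stonith-ng: notice: handle_request: Client crmd.30616.aa4a8de3 wants to fence (reboot) 'node2' with device '(any)'
EOF

# Three requests in ~30 seconds points at retries, i.e. the earlier
# attempts likely did not complete successfully.
count=$(grep -c "wants to fence" "$log")
echo "$count"   # prints 3
rm -f "$log"
```

[Searching the surrounding lines for each attempt's operation result would then show whether and why each one failed.]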
>
>
> >If it's the cluster, I'd look at the "pengine:" logs on the DC
> >before that, to see if there are any hints (node unclean, etc.).
> >Then keep going backward until the ultimate cause is found.
>
> The following are the pengine logs preceding the first fencing:
>
> Aug 31 10:58:58 [30615] node1 pengine: info: crm_log_init: Changed active directory to /var/lib/pacemaker/cores/hacluster
> Aug 31 10:58:58 [30615] node1 pengine: info: qb_ipcs_us_publish: server name: pengine
> Aug 31 10:58:58 [30615] node1 pengine: info: main: Starting pengine
> Aug 31 10:59:20 [30615] node1 pengine: notice: unpack_config: On loss of CCM Quorum: Ignore
> Aug 31 10:59:20 [30615] node1 pengine: info: determine_online_status_fencing: Node node1 is active
> Aug 31 10:59:20 [30615] node1 pengine: info: determine_online_status: Node node1 is online
> Aug 31 10:59:20 [30615] node1 pengine: info: clone_print: Clone Set: fencing [st-fence_propio]
> Aug 31 10:59:20 [30615] node1 pengine: info: short_print: Stopped: [ node1 node2 ]
> Aug 31 10:59:20 [30615] node1 pengine: info: clone_print: Master/Slave Set: ms_drbd_databasestorage [p_drbd_databasestorage]
> Aug 31 10:59:20 [30615] node1 pengine: info: short_print: Stopped: [ node1 node2 ]
> Aug 31 10:59:20 [30615] node1 pengine: info: clone_print: Master/Slave Set: ms_drbd_datoswebstorage [p_drbd_datoswebstorage]
> Aug 31 10:59:20 [30615] node1 pengine: info: short_print: Stopped: [ node1 node2 ]
> Aug 31 10:59:20 [30615] node1 pengine: info: group_print: Resource Group: rg_database
> Aug 31 10:59:20 [30615] node1 pengine: info: native_print: p_fs_database (ocf::heartbeat:Filesystem): Stopped
> Aug 31 10:59:20 [30615] node1 pengine: info: native_print: p_ip_databasestorageip (ocf::heartbeat:IPaddr): Stopped
> Aug 31 10:59:20 [30615] node1 pengine: info: native_print: p_ip_pub_database (ocf::heartbeat:IPaddr): Stopped
> Aug 31 10:59:20 [30615] node1 pengine: info: native_print: p_moverip_database (lsb:moverip_database): Stopped
> Aug 31 10:59:20 [30615] node1 pengine: info: native_print: servicio_enviamailpacemakerdatabase (lsb:enviamailpacemakerdatabase): Stopped
> Aug 31 10:59:20 [30615] node1 pengine: info: group_print: Resource Group: rg_datosweb
> Aug 31 10:59:20 [30615] node1 pengine: info: native_print: p_fs_datosweb (ocf::heartbeat:Filesystem): Stopped
> Aug 31 10:59:20 [30615] node1 pengine: info: native_print: p_ip_datoswebstorageip (ocf::heartbeat:IPaddr): Stopped
> Aug 31 10:59:20 [30615] node1 pengine: info: native_print: p_ip_pub_datosweb (ocf::heartbeat:IPaddr): Stopped
> Aug 31 10:59:20 [30615] node1 pengine: info: native_print: p_moverip_datosweb (lsb:moverip_datosweb): Stopped
> Aug 31 10:59:20 [30615] node1 pengine: info: native_print: servicio_enviamailpacemakerdatosweb (lsb:enviamailpacemakerdatosweb): Stopped
> Aug 31 10:59:20 [30615] node1 pengine: info: native_color: Resource st-fence_propio:1 cannot run anywhere
> Aug 31 10:59:20 [30615] node1 pengine: info: native_color: Resource p_drbd_databasestorage:1 cannot run anywhere
> Aug 31 10:59:20 [30615] node1 pengine: info: master_color: ms_drbd_databasestorage: Promoted 0 instances of a possible 1 to master
> Aug 31 10:59:20 [30615] node1 pengine: info: native_color: Resource p_drbd_datoswebstorage:1 cannot run anywhere
> Aug 31 10:59:20 [30615] node1 pengine: info: master_color: ms_drbd_datoswebstorage: Promoted 0 instances of a possible 1 to master
> Aug 31 10:59:20 [30615] node1 pengine: info: rsc_merge_weights: p_fs_database: Rolling back scores from p_ip_databasestorageip
> Aug 31 10:59:20 [30615] node1 pengine: info: native_color: Resource p_fs_database cannot run anywhere
> Aug 31 10:59:20 [30615] node1 pengine: info: rsc_merge_weights: p_ip_databasestorageip: Rolling back scores from p_ip_pub_database
> Aug 31 10:59:20 [30615] node1 pengine: info: native_color: Resource p_ip_databasestorageip cannot run anywhere
> Aug 31 10:59:20 [30615] node1 pengine: info: rsc_merge_weights: p_ip_pub_database: Rolling back scores from p_moverip_database
> Aug 31 10:59:20 [30615] node1 pengine: info: native_color: Resource p_ip_pub_database cannot run anywhere
> Aug 31 10:59:20 [30615] node1 pengine: info: rsc_merge_weights: p_moverip_database: Rolling back scores from servicio_enviamailpacemakerdatabase
> Aug 31 10:59:20 [30615] node1 pengine: info: native_color: Resource p_moverip_database cannot run anywhere
> Aug 31 10:59:20 [30615] node1 pengine: info: native_color: Resource servicio_enviamailpacemakerdatabase cannot run anywhere
> Aug 31 10:59:20 [30615] node1 pengine: info: rsc_merge_weights: p_fs_datosweb: Rolling back scores from p_ip_datoswebstorageip
> Aug 31 10:59:20 [30615] node1 pengine: info: native_color: Resource p_fs_datosweb cannot run anywhere
> Aug 31 10:59:20 [30615] node1 pengine: info: rsc_merge_weights: p_ip_datoswebstorageip: Rolling back scores from p_ip_pub_datosweb
> Aug 31 10:59:20 [30615] node1 pengine: info: native_color: Resource p_ip_datoswebstorageip cannot run anywhere
> Aug 31 10:59:20 [30615] node1 pengine: info: rsc_merge_weights: p_ip_pub_datosweb: Rolling back scores from p_moverip_datosweb
> Aug 31 10:59:20 [30615] node1 pengine: info: native_color: Resource p_ip_pub_datosweb cannot run anywhere
> Aug 31 10:59:20 [30615] node1 pengine: info: rsc_merge_weights: p_moverip_datosweb: Rolling back scores from servicio_enviamailpacemakerdatosweb
> Aug 31 10:59:20 [30615] node1 pengine: info: native_color: Resource p_moverip_datosweb cannot run anywhere
> Aug 31 10:59:20 [30615] node1 pengine: info: native_color: Resource servicio_enviamailpacemakerdatosweb cannot run anywhere
> Aug 31 10:59:20 [30615] node1 pengine: info: RecurringOp: Start recurring monitor (31s) for p_drbd_databasestorage:0 on node1
> Aug 31 10:59:20 [30615] node1 pengine: info: RecurringOp: Start recurring monitor (31s) for p_drbd_databasestorage:0 on node1
> Aug 31 10:59:20 [30615] node1 pengine: info: RecurringOp: Start recurring monitor (31s) for p_drbd_datoswebstorage:0 on node1
> Aug 31 10:59:20 [30615] node1 pengine: info: RecurringOp: Start recurring monitor (31s) for p_drbd_datoswebstorage:0 on node1
> Aug 31 10:59:20 [30615] node1 pengine: warning: stage6: Scheduling Node node2 for STONITH
>
>
>
> Any more clues?
The first fencing is legitimate -- the node hasn't been seen at start-
up, and so needs to be fenced. The second fencing will be the one of
interest. Also, look for the result of the first fencing.
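[Editor's note: the "hasn't been seen at start-up" behaviour Ken describes is governed by the startup-fencing cluster option, which defaults to true. As an illustrative sketch, the property lives as an nvpair under crm_config in the CIB (the id below is hypothetical):]

```xml
<!-- Illustrative CIB fragment; the nvpair id is hypothetical. Keeping
     startup-fencing at its default of "true" is recommended: disabling
     it is unsafe unless an unseen node is known with certainty to be
     powered off. -->
<nvpair id="cib-bootstrap-options-startup-fencing"
        name="startup-fencing" value="true"/>
```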
> Thanks
> Cesar
--
Ken Gaillot <kgaillot at redhat.com>