[ClusterLabs] Pacemaker startup retries

Ken Gaillot kgaillot at redhat.com
Wed Sep 5 10:17:44 EDT 2018


On Wed, 2018-09-05 at 13:31 +0200, Cesar Hernandez wrote:
> Hi
> 
> > 
> > The first fencing is legitimate -- the node hasn't been seen at
> > start-
> > up, and so needs to be fenced. The second fencing will be the one
> > of
> > interest. Also, look for the result of the first fencing.
> 
> The first fencing has finished with OK, as well as the other two
> fencing operations.
> 
> Aug 31 10:59:31 [30612] node1 stonith-ng:   notice:
> log_operation:     Operation 'reboot' [31075] (call 2 from
> crmd.30616) for host 'node2' with device 'st-fence_propio:0'
> returned: 0 (OK)
> 
> And the next log entries are:
> 
> Aug 31 10:59:31 [30612] node1 stonith-ng:   notice:
> remote_op_done:    Operation reboot of node2 by node1 for crmd.30616@
> node1.2d7857e7: OK
> Aug 31 10:59:31 [30616] node1       crmd:   notice:
> tengine_stonith_callback:  Stonith operation 2/81:0:0:c64efa2b-9366-
> 4c07-b5f1-6a2dbee79fe7: OK (0)
> Aug 31 10:59:31 [30616] node1       crmd:     info:
> crm_get_peer:      Created entry 48db3347-5bbe-4cd4-b7ba-
> db4c697c3146/0x55adbeb587f0 for node node2/0 (2 total)
> Aug 31 10:59:31 [30616] node1       crmd:     info:
> peer_update_callback:      node2 is now in unknown state
> Aug 31 10:59:31 [30616] node1       crmd:     info:
> crm_get_peer:      Node 0 has uuid node2
> Aug 31 10:59:31 [30616] node1       crmd:   notice:
> crm_update_peer_state_iter:        crmd_peer_down: Node node2[0] -
> state is now lost (was (null))
> Aug 31 10:59:31 [30616] node1       crmd:     info:
> peer_update_callback:      node2 is now lost (was in unknown state)
> Aug 31 10:59:31 [30616] node1       crmd:     info:
> crm_update_peer_proc:      crmd_peer_down: Node node2[0] - all
> processes are now offline
> Aug 31 10:59:31 [30616] node1       crmd:     info:
> peer_update_callback:      Client node2/peer now has status [offline]
> (DC=true, changed=     1)
> Aug 31 10:59:31 [30616] node1       crmd:     info:
> crm_update_peer_expected:  crmd_peer_down: Node node2[0] - expected
> state is now down (was (null))
> Aug 31 10:59:31 [30616] node1       crmd:     info:
> erase_status_tag:  Deleting xpath: //node_state[@uname='node2']/lrm
> Aug 31 10:59:31 [30616] node1       crmd:     info:
> erase_status_tag:  Deleting xpath:
> //node_state[@uname='node2']/transient_attributes
> Aug 31 10:59:31 [30616] node1       crmd:   notice:
> tengine_stonith_notify:    Peer node2 was terminated (reboot) by
> node1 for node1: OK (ref=2d7857e7-7e88-482a-812f-b343218974dc) by
> client crmd.30616
> 
> After some other entries I see:
> 
> Aug 31 10:59:37 [30615] node1    pengine:  warning:
> pe_fence_node:     Node node2 will be fenced because the peer has not
> been seen by the cluster

Ah, this rings a bell. Despite having fenced the node, the cluster
still considers the node unseen. That was a regression in 1.1.14 that
was fixed in 1.1.15. :-(
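A quick way to check whether a node falls in the affected range (regression introduced in 1.1.14, fixed in 1.1.15) is to parse the Pacemaker version. This is just a sketch: in practice you would feed it the output of `pacemakerd --version`; here a sample string is hard-coded so the check itself is visible.

```shell
# Sample version string; in practice: version=$(pacemakerd --version | head -n1)
version="Pacemaker 1.1.14"

# Extract the 1.1.x minor number
minor=$(echo "$version" | sed -n 's/^Pacemaker 1\.1\.\([0-9]*\).*/\1/p')

# The fence-loop regression affects 1.1.14; 1.1.15 and later carry the fix
if [ -n "$minor" ] && [ "$minor" -lt 15 ]; then
    echo "affected: upgrade to 1.1.15 or later"
else
    echo "not affected by this regression"
fi
```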

> Aug 31 10:59:37 [30615] node1    pengine:  warning:
> determine_online_status:   Node node2 is unclean
> 
> The server takes approximately 2 minutes to reboot, so it's normal
> that it hasn't been seen after just 6 seconds. But I don't know why
> the server is rebooted three times:
> 
> Aug 31 10:59:31 [30616] node1       crmd:   notice:
> tengine_stonith_notify:	Peer node2 was terminated (reboot) by
> node1 for node1: OK (ref=2d7857e7-7e88-482a-812f-b343218974dc) by
> client crmd.30616
> ....
> Aug 31 10:59:53 [30616] node1       crmd:   notice:
> tengine_stonith_notify:	Peer node2 was terminated (reboot) by
> node1 for node1: OK (ref=2835cb08-362d-4d39-9133-3a7dcefb913c) by
> client crmd.30616
> ....
> Aug 31 11:00:05 [30616] node1       crmd:   notice:
> tengine_stonith_notify:	Peer node2 was terminated (reboot) by
> node1 for node1: OK (ref=17931f5b-76ea-4e3a-a792-535cea50afca) by
> client crmd.30616
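Each fencing operation carries a unique ref= UUID in the tengine_stonith_notify message, so counting distinct refs in the log tells you how many separate reboots were actually requested. A minimal sketch, demonstrated on the three log lines above written to a temp file (substitute your real cluster log path in practice):

```shell
# Write the three notification lines from this thread to a temp file
log=$(mktemp)
cat > "$log" <<'EOF'
tengine_stonith_notify: Peer node2 was terminated (reboot) by node1 for node1: OK (ref=2d7857e7-7e88-482a-812f-b343218974dc) by client crmd.30616
tengine_stonith_notify: Peer node2 was terminated (reboot) by node1 for node1: OK (ref=2835cb08-362d-4d39-9133-3a7dcefb913c) by client crmd.30616
tengine_stonith_notify: Peer node2 was terminated (reboot) by node1 for node1: OK (ref=17931f5b-76ea-4e3a-a792-535cea50afca) by client crmd.30616
EOF

# Extract the ref= UUIDs, deduplicate, and count: three distinct refs
# means three separate fencing operations, not one repeated notification
grep -o 'ref=[0-9a-f-]*' "$log" | sort -u | wc -l

rm -f "$log"
```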
> 
> 
> After the last message, I see only that it stops fencing and starts
> resources:
> 
> 
> Aug 31 11:00:05 [30616] node1       crmd:   notice:
> tengine_stonith_notify:    Peer node2 was terminated (reboot) by
> node1 for node1: OK (ref=17931f5b-76ea-4e3a-a792-535cea50afca) by
> client crmd.30616
> Aug 31 11:00:05 [30611] node1        cib:     info:
> cib_process_request:       Completed cib_modify operation for section
> status: OK (rc=0, origin=local/crmd/68, version=0.60.30)
> Aug 31 11:00:05 [30616] node1       crmd:     info:
> erase_status_tag:  Deleting xpath: //node_state[@uname='node2']/lrm
> Aug 31 11:00:05 [30616] node1       crmd:     info:
> erase_status_tag:  Deleting xpath:
> //node_state[@uname='node2']/transient_attributes
> Aug 31 11:00:05 [30616] node1       crmd:     info:
> cib_fencing_updated:       Fencing update 68 for node2: complete
> Aug 31 11:00:05 [30616] node1       crmd:   notice:
> te_rsc_command:    Initiating action 66: start p_fs_database_start_0
> on node1 (local)
> Aug 31 11:00:05 [30616] node1       crmd:     info:
> do_lrm_rsc_op:     Performing key=66:2:0:c64efa2b-9366-4c07-b5f1-
> 6a2dbee79fe7 op=p_fs_database_start_0
> Aug 31 11:00:05 [30613] node1       lrmd:     info:
> log_execute:       executing - rsc:p_fs_database action:start
> call_id:70
> Aug 31 11:00:05 [30616] node1       crmd:   notice:
> te_rsc_command:    Initiating action 77: start p_fs_datosweb_start_0
> on node1 (local)
> Aug 31 11:00:05 [30616] node1       crmd:     info:
> do_lrm_rsc_op:     Performing key=77:2:0:c64efa2b-9366-4c07-b5f1-
> 6a2dbee79fe7 op=p_fs_datosweb_start_0
> Aug 31 11:00:05 [30613] node1       lrmd:     info:
> log_execute:       executing - rsc:p_fs_datosweb action:start
> call_id:71
> Aug 31 11:00:05 [30611] node1        cib:     info:
> cib_process_request:       Completed cib_delete operation for section
> //node_state[@uname='node2']/lrm: OK (rc=0, origin=local/crmd/69,
> version=0.60.30)
> Aug 31 11:00:05 [30611] node1        cib:     info:
> cib_process_request:       Completed cib_delete operation for section
> //node_state[@uname='node2']/transient_attributes: OK (rc=0,
> origin=local/crmd/70, version=0.60.30)
> Aug 31 11:00:05 [30611] node1        cib:     info:
> cib_process_request:       Completed cib_modify operation for section
> status: OK (rc=0, origin=local/crmd/71, version=0.60.30)
> Aug 31 11:00:05 [30616] node1       crmd:     info:
> cib_fencing_updated:       Fencing update 71 for node2: complete
> Aug 31 11:00:05 [30611] node1        cib:     info:
> cib_process_request:       Completed cib_delete operation for section
> //node_state[@uname='node2']/lrm: OK (rc=0, origin=local/crmd/72,
> version=0.60.30)
> Aug 31 11:00:05 [30611] node1        cib:     info:
> cib_process_request:       Completed cib_delete operation for section
> //node_state[@uname='node2']/transient_attributes: OK (rc=0,
> origin=local/crmd/73, version=0.60.30)
> Filesystem(p_fs_datosweb)[962]: 2018/08/31_11:00:05 INFO: Running
> start for /dev/drbd/by-res/datoswebstorage on /mnt/datosweb
> Filesystem(p_fs_database)[961]: 2018/08/31_11:00:05 INFO: Running
> start for /dev/drbd/by-res/databasestorage on /mnt/database
> 
> 
> ......
> 
> 
> Can you give me more clues?
> 
> Thanks a lot
> 
> Cesar
> 
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.
> pdf
> Bugs: http://bugs.clusterlabs.org
-- 
Ken Gaillot <kgaillot at redhat.com>
