[ClusterLabs] clone resource not get restarted on fail
Ken Gaillot
kgaillot at redhat.com
Mon Feb 13 10:02:44 EST 2017
On 02/13/2017 07:57 AM, he.hailong5 at zte.com.cn wrote:
> Pacemaker 1.1.10
>
> Corosync 2.3.3
>
>
> This is a 3-node cluster configured with 3 clone resources, each
> associated with a VIP resource of type IPaddr2:
>
>
> >crm status
>
>
> Online: [ paas-controller-1 paas-controller-2 paas-controller-3 ]
>
>
> router_vip (ocf::heartbeat:IPaddr2): Started paas-controller-1
>
> sdclient_vip (ocf::heartbeat:IPaddr2): Started paas-controller-3
>
> apigateway_vip (ocf::heartbeat:IPaddr2): Started paas-controller-2
>
> Clone Set: sdclient_rep [sdclient]
>
> Started: [ paas-controller-1 paas-controller-2 paas-controller-3 ]
>
> Clone Set: router_rep [router]
>
> Started: [ paas-controller-1 paas-controller-2 paas-controller-3 ]
>
> Clone Set: apigateway_rep [apigateway]
>
> Started: [ paas-controller-1 paas-controller-2 paas-controller-3 ]
>
>
> We observe that a clone resource sometimes gets stuck in monitoring
> when the service fails:
>
>
> router_vip (ocf::heartbeat:IPaddr2): Started paas-controller-1
>
> sdclient_vip (ocf::heartbeat:IPaddr2): Started paas-controller-2
>
> apigateway_vip (ocf::heartbeat:IPaddr2): Started paas-controller-3
>
> Clone Set: sdclient_rep [sdclient]
>
> Started: [ paas-controller-1 paas-controller-2 ]
>
> Stopped: [ paas-controller-3 ]
>
> Clone Set: router_rep [router]
>
> router (ocf::heartbeat:router): Started paas-controller-3 FAILED
>
> Started: [ paas-controller-1 paas-controller-2 ]
>
> Clone Set: apigateway_rep [apigateway]
>
> apigateway (ocf::heartbeat:apigateway): Started paas-controller-3 FAILED
>
> Started: [ paas-controller-1 paas-controller-2 ]
>
>
> In the example above, sdclient_rep gets restarted on node 3, while the
> other two stay stuck in monitoring on node 3. Here are the OCF logs:
>
>
> abnormal (apigateway_rep):
>
> 2017-02-13 18:27:53 [23586]===print_log test_monitor run_func main===
> Starting health check.
>
> 2017-02-13 18:27:53 [23586]===print_log test_monitor run_func main===
> health check succeed.
>
> 2017-02-13 18:27:55 [24010]===print_log test_monitor run_func main===
> Starting health check.
>
> 2017-02-13 18:27:55 [24010]===print_log test_monitor run_func main===
> Failed: docker daemon is not running.
>
> 2017-02-13 18:27:57 [24095]===print_log test_monitor run_func main===
> Starting health check.
>
> 2017-02-13 18:27:57 [24095]===print_log test_monitor run_func main===
> Failed: docker daemon is not running.
>
> 2017-02-13 18:27:59 [24159]===print_log test_monitor run_func main===
> Starting health check.
>
> 2017-02-13 18:27:59 [24159]===print_log test_monitor run_func main===
> Failed: docker daemon is not running.
>
>
> normal (sdclient_rep):
>
> 2017-02-13 18:27:52 [23507]===print_log sdclient_monitor run_func
> main=== health check succeed.
>
> 2017-02-13 18:27:54 [23630]===print_log sdclient_monitor run_func
> main=== Starting health check.
>
> 2017-02-13 18:27:54 [23630]===print_log sdclient_monitor run_func
> main=== Failed: docker daemon is not running.
>
> 2017-02-13 18:27:55 [23710]===print_log sdclient_stop run_func main===
> Starting stop the container.
>
> 2017-02-13 18:27:55 [23710]===print_log sdclient_stop run_func main===
> docker daemon lost, pretend stop succeed.
>
> 2017-02-13 18:27:55 [23763]===print_log sdclient_start run_func main===
> Starting run the container.
>
> 2017-02-13 18:27:55 [23763]===print_log sdclient_start run_func main===
> docker daemon lost, try again in 5 secs.
>
> 2017-02-13 18:28:00 [23763]===print_log sdclient_start run_func main===
> docker daemon lost, try again in 5 secs.
>
> 2017-02-13 18:28:05 [23763]===print_log sdclient_start run_func main===
> docker daemon lost, try again in 5 secs.
>
>
> If I disable two of the clone resources, the switchover test for the
> remaining clone resource works as expected: fail the service -> monitor
> fails -> stop -> start
>
>
> Online: [ paas-controller-1 paas-controller-2 paas-controller-3 ]
>
>
> sdclient_vip (ocf::heartbeat:IPaddr2): Started paas-controller-2
>
> Clone Set: sdclient_rep [sdclient]
>
> Started: [ paas-controller-1 paas-controller-2 ]
>
> Stopped: [ paas-controller-3 ]
>
>
> What is the reason behind this?
Can you show the configuration of the three clones, their operations,
and any constraints?
Normally, the response is controlled by the monitor operation's on-fail
attribute (which defaults to restart).
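
To illustrate the point about on-fail, here is a rough Python sketch of
how a failed recurring monitor maps to a recovery plan. It is a simplified
model under stated assumptions, not Pacemaker's actual code: the function
name recovery_plan and the step strings are hypothetical, and only the
on-fail value names and the OCF return codes are taken from the
documentation.

```python
# Simplified model only; Pacemaker's real policy engine is far more involved.

OCF_SUCCESS = 0        # OCF return code: resource is running
OCF_ERR_GENERIC = 1    # OCF return code: generic failure
OCF_NOT_RUNNING = 7    # OCF return code: resource is cleanly stopped

# Documented on-fail values, mapped to illustrative (hypothetical) step names.
ON_FAIL_STEPS = {
    "ignore": [],                         # pretend the failure did not happen
    "block": ["block"],                   # perform no further operations
    "stop": ["stop"],                     # stop and do not start elsewhere
    "restart": ["stop", "start"],         # the default for monitor operations
    "fence": ["fence-node"],              # fence the node hosting the resource
    "standby": ["move-resources-away"],   # move all resources off the node
}


def recovery_plan(monitor_rc, on_fail="restart"):
    """Return the steps taken after a recurring monitor on a resource
    that is expected to be running."""
    if on_fail not in ON_FAIL_STEPS:
        raise ValueError("unknown on-fail value: %r" % (on_fail,))
    if monitor_rc == OCF_SUCCESS:
        return []                         # healthy, nothing to recover
    return list(ON_FAIL_STEPS[on_fail])


if __name__ == "__main__":
    # With the default on-fail, a failed monitor leads to stop then start,
    # which matches the sequence seen in the sdclient_rep log.
    print(recovery_plan(OCF_NOT_RUNNING))
    print(recovery_plan(OCF_ERR_GENERIC, on_fail="ignore"))
```

If a resource only ever logs further monitor runs after a failure, as in
the apigateway_rep log, the operation definitions and constraints are the
first place to check whether something other than the default applies.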