[Pacemaker] Network outage debugging

Wed Nov 13 00:22:31 UTC 2013

> On Nov 12, 2013, at 6:01 PM, Andrew Beekhof <andrew at beekhof.net> wrote:
> 
> 
>> On 13 Nov 2013, at 6:10 am, Sean Lutner <sean at rentul.net> wrote:
>> 
>> The folks testing the cluster I've been building ran a script that blocks all traffic except SSH on one node of the cluster for 15 seconds to mimic a network failure. While the network is "down", pacemaker behaves oddly and ends up dying on that node.
>> 
>> The cluster has two nodes running four custom resources on EC2 instances. The OS is CentOS 6.4, and the config is below:
>> 
>> I've attached the /var/log/messages and /var/log/cluster/corosync.log from the time period during the test. I'm having some difficulty piecing together what happened and am hoping someone can shed some light on the problem. Any indication why pacemaker is dying on that node?
> 
> Because corosync is dying underneath it:
> 
> Nov 09 14:51:49 [942] ip-10-50-3-251        cib:    error: send_ais_text:    Sending message 28 via cpg: FAILED (rc=2): Library error: Connection timed out (110)
> Nov 09 14:51:49 [942] ip-10-50-3-251        cib:    error: pcmk_cpg_dispatch:    Connection to the CPG API failed: 2
> Nov 09 14:51:49 [942] ip-10-50-3-251        cib:    error: cib_ais_destroy:    Corosync connection lost!  Exiting.
> Nov 09 14:51:49 [942] ip-10-50-3-251        cib:     info: terminate_cib:    cib_ais_destroy: Exiting fast...

Is that the expected behavior? Is it because the DC was the other node?

I did notice that there was an attempted fence operation but it didn't look successful. 
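
To help piece the timeline together from a longer corosync.log, here is a minimal sketch that pulls out only the error-level entries. It assumes every line follows the layout of the four lines quoted above (timestamp, [pid], host, daemon, level, function, message); the names LOG_LINE, LogEntry, parse_line and errors_only are made up for illustration and are not part of pacemaker or corosync.

```python
import re
from typing import NamedTuple, Optional

# Assumed layout, inferred only from the four log lines quoted in this thread:
#   "<Mon DD HH:MM:SS> [<pid>] <host> <daemon>: <level>: <function>: <message>"
LOG_LINE = re.compile(
    r"^(?P<timestamp>\w{3} \d{2} \d{2}:\d{2}:\d{2}) "
    r"\[(?P<pid>\d+)\] "
    r"(?P<host>\S+)\s+"
    r"(?P<daemon>\S+):\s+"
    r"(?P<level>\w+):\s+"
    r"(?P<function>\S+):\s+"
    r"(?P<message>.*)$"
)


class LogEntry(NamedTuple):
    timestamp: str
    pid: int
    host: str
    daemon: str
    level: str
    function: str
    message: str


def parse_line(line: str) -> Optional[LogEntry]:
    """Return a LogEntry, or None if the line does not match the assumed layout."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["pid"] = int(fields["pid"])
    return LogEntry(**fields)


def errors_only(lines):
    """Keep only entries logged at 'error' level, in their original order."""
    entries = (parse_line(line) for line in lines)
    return [e for e in entries if e is not None and e.level == "error"]


# The four lines quoted earlier in this message, without the "> " prefix.
SAMPLE = [
    "Nov 09 14:51:49 [942] ip-10-50-3-251        cib:    error: send_ais_text:    Sending message 28 via cpg: FAILED (rc=2): Library error: Connection timed out (110)",
    "Nov 09 14:51:49 [942] ip-10-50-3-251        cib:    error: pcmk_cpg_dispatch:    Connection to the CPG API failed: 2",
    "Nov 09 14:51:49 [942] ip-10-50-3-251        cib:    error: cib_ais_destroy:    Corosync connection lost!  Exiting.",
    "Nov 09 14:51:49 [942] ip-10-50-3-251        cib:     info: terminate_cib:    cib_ais_destroy: Exiting fast...",
]

if __name__ == "__main__":
    for entry in errors_only(SAMPLE):
        print(entry.timestamp, entry.daemon, entry.function, entry.message)
```

Lines that do not fit the assumed layout are skipped rather than guessed at, so anything in /var/log/messages with a different prefix would need its own pattern.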

>> [root@ip-10-50-3-122 ~]# pcs config
>> Corosync Nodes:
>> 
>> Pacemaker Nodes:
>> ip-10-50-3-122 ip-10-50-3-251 
>> 
>> Resources: 
>> Resource: ClusterEIP_54.215.143.166 (provider=pacemaker type=EIP class=ocf)
>> Attributes: first_network_interface_id=eni-e4e0b68c second_network_interface_id=eni-35f9af5d first_private_ip=10.50.3.191 second_private_ip=10.50.3.91 eip=54.215.143.166 alloc_id=eipalloc-376c3c5f interval=5s 
>> Operations: monitor interval=5s
>> Clone: EIP-AND-VARNISH-clone
>> Group: EIP-AND-VARNISH
>>  Resource: Varnish (provider=redhat type=varnish.sh class=ocf)
>>   Operations: monitor interval=5s
>>  Resource: Varnishlog (provider=redhat type=varnishlog.sh class=ocf)
>>   Operations: monitor interval=5s
>>  Resource: Varnishncsa (provider=redhat type=varnishncsa.sh class=ocf)
>>   Operations: monitor interval=5s
>> Resource: ec2-fencing (type=fence_ec2 class=stonith)
>> Attributes: ec2-home=/opt/ec2-api-tools pcmk_host_check=static-list pcmk_host_list=HA01 HA02 
>> Operations: monitor start-delay=30s interval=0 timeout=150s
>> 
>> Location Constraints:
>> Ordering Constraints:
>> ClusterEIP_54.215.143.166 then Varnish
>> Varnish then Varnishlog
>> Varnishlog then Varnishncsa
>> Colocation Constraints:
>> Varnish with ClusterEIP_54.215.143.166
>> Varnishlog with Varnish
>> Varnishncsa with Varnishlog
>> 
>> Cluster Properties:
>> dc-version: 1.1.8-7.el6-394e906
>> cluster-infrastructure: cman
>> last-lrm-refresh: 1384196963
>> no-quorum-policy: ignore
>> stonith-enabled: true
>> 
>> <net-failure-messages-110913.out><net-failure-corosync-110913.out>
>> _______________________________________________
>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>> 
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
> 