[Pacemaker] Getting split brain after all reboot of a cluster node

Anne Nicolas ennael1 at gmail.com
Thu Mar 6 16:11:21 UTC 2014


On 06/03/2014 10:12, Gianluca Cecchi wrote:
> On Wed, Mar 5, 2014 at 9:28 AM, Anne Nicolas  wrote:
>> Hi
>>
>> I'm having trouble setting up a very simple two-node cluster. After
>> every reboot of a node I end up in split brain, which I then have to
>> resolve by hand. I'm looking for a way to avoid this...
>>
>> Both nodes have 4 network interfaces. We use 3 of them: one for a
>> cluster IP, one for a bridge for a VM, and the last one for the
>> private network of the cluster.
>>
>> I'm using
>> drbd : 8.3.9
>> drbd-utils: 8.3.9
>>
>> DRBD configuration:
>> ============
>> $ cat global_common.conf
>> global {
>>         usage-count no;
>>         disable-ip-verification;
>>  }
>> common { syncer { rate 500M; } }
>>
>> cat server.res
>> resource server {
>>         protocol C;
>>         net {
>>                  cram-hmac-alg sha1;
>>                  shared-secret "eafcupps";
>>             }
>>  on dzacupsvr {
>>     device     /dev/drbd0;
>>     disk       /dev/vg0/server;
>>     address    172.16.1.1:7788;
>>     flexible-meta-disk  internal;
>>   }
>>   on dzacupsvr2 {
>>     device     /dev/drbd0;
>>     disk       /dev/vg0/server;
>>     address    172.16.1.2:7788;
>>     flexible-meta-disk  internal;
>>   }
>> }
>>
> 
> [snip]
> 
>>
>> After looking for more information, I've added fences in drbd configuration
>>
>> handlers {
>>     fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
>>     after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
>>   }
>> but still without any success...
>>
>> Any help appreciated
>>
>> Cheers
>>
>> --
>> Anne
> 
> Hello Anne,
> first of all, follow the STONITH advice from Digimer and Emmanuel.
> As a starting point, I think you can add this section, which seems to
> be missing at the moment, to your resource definition:
> 
> resource <resource> {
>   disk {
>     fencing resource-only;
>     ...
>   }
> }
> 
> This should handle a clean shutdown of the cluster's nodes, and some
> failure scenarios, without problems.
> However, it does not completely protect you from data corruption in
> all cases (for example, if the intercommunication network suddenly
> goes down and comes back up while both nodes are active, both could
> be primary for some moments).
> 
> At least this worked for me during initial tests, before configuring
> STONITH, on:
> SLES 11 SP2 (corosync/pacemaker)
> CentOS 6.5 (cman/pacemaker)
> 

Thanks a lot for all this information. I'm setting it all up now to make
it work properly. Some parts were indeed missing and not obvious to me.
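For the record, here is roughly what my server.res looks like now with
the disk fencing and the handlers added (an untested sketch -- I have
simply merged your suggestion into my existing config):

```
resource server {
        protocol C;
        net {
                cram-hmac-alg sha1;
                shared-secret "eafcupps";
        }
        disk {
                # block promotion of the peer while it is outdated
                fencing resource-only;
        }
        handlers {
                # place/remove a Pacemaker constraint on fencing events
                fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
                after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
        }
        on dzacupsvr {
                device     /dev/drbd0;
                disk       /dev/vg0/server;
                address    172.16.1.1:7788;
                flexible-meta-disk  internal;
        }
        on dzacupsvr2 {
                device     /dev/drbd0;
                disk       /dev/vg0/server;
                address    172.16.1.2:7788;
                flexible-meta-disk  internal;
        }
}
```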

One more question about something I find hard to understand: what should
happen, and how should it be handled, when the private link between the
nodes goes down for some reason? When that happens, crm status shows
both nodes as master. Is there a way to manage this kind of failure?
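For what it's worth, this is the manual recovery I have been doing so
far when split brain occurs (DRBD 8.3 command syntax; "server" is my
resource name -- double-check which node's data you want to keep before
running this):

```
# On the node whose changes should be DISCARDED (the split-brain victim):
drbdadm secondary server
drbdadm -- --discard-my-data connect server

# On the node whose data should be KEPT, if it is in StandAlone state,
# tell it to reconnect:
drbdadm connect server
```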

Thanks in advance

-- 
Anne
http://mageia.org



