<html>

<head>

<style><!--

.hmmessage P

{

margin:0px;

padding:0px

}

body.hmmessage

{

font-size: 12pt;

font-family:Calibri

}

--></style></head>

<body class='hmmessage'><div dir='ltr'>> >  > > }<br><div>> >  > > handlers {<br>> >  > > fence-peer "/usr/lib/drbd/rhcs_fence";<br>> >  > > }<br>> >  > > }<br>> >  > ><br>> >  > ><br>> >  > rhcs_fence is wrong fence-peer utility. You should use<br>> >  > /usr/lib/drbd/crm-fence-peer.sh and<br>> >  > /usr/lib/drbd/crm-unfence-peer.sh instead.<br>> ><br>> > But my understanding (probably wrong) was that the fence-peer handler is<br>> > meant to be called for STONITH, not for "usual" promotions/demotions<br>> > to/from Primary/Secondary.<br>> ><br>> > If I use the aforementioned pair of handlers (crm-*.sh) for<br>> > fence/unfence, do I still get STONITH behavior for "split brain cases"?<br>> ><br>> <br>> Correct. The 'rhcs_fence' handler passes fence calls on to cman, which <br>> you have set to redirect on to pacemaker. This isn't what it was <br>> designed for, and hasn't been tested. It was meant to be an updated <br>> replacement for obliterate-peer.sh in cman+rgmanager clusters directly <br>> (no pacemaker).<br><br>Well, since this is a CMAN cluster after all, and rhcs_fence relies only on cman_tool and fence_node (besides /proc/drbd), both of which should be working correctly, I thought it would be the right choice of fence script. I will of course follow your suggestion and use the crm-* scripts instead.<br><br>Anyway, I'm afraid the real problem lies elsewhere: as I stated before, I would not expect a simple master/slave promotion or demotion to lead to fencing.<br><br>As suggested by <a class="t_atc ICName" id="ReadMessageContacticTmReadMessageContact2_senderName">Nikita Staroverov</a>, I have pasted what I hope are the relevant log excerpts from the first node (the one that survived the STONITH), taken at the time of one "stonith fest" :) just after I committed a CIB update with new resources.<br><br>http://pastebin.com/0eQqsftb<br><br>I recall that, seconds before being shot, the second node "lost contact" 
with the cluster (I was running "pcs status" and "crm_mon -Arf1" from an SSH session when it suddenly reported "cluster not connected" or something like that).<br><br>Apart from the improper use of rhcs_fence mentioned above, there may be issues with some timeout settings on cluster/DRBD operations, and the nodes almost certainly have clock problems (I am still looking for a reasonable, reachable NTP source), but I do not know whether these are relevant.<br><br>Many thanks again for your suggestions.<br><br>Regards,<br>Giuseppe<br></div></div></body>

</html>