[ClusterLabs] big trouble with a DRBD resource

Thu Aug 10 12:11:31 UTC 2017

On Wed, Aug 09, 2017 at 06:48:01PM +0200, Lentes, Bernd wrote:
> 
> 
> ----- On 8 Aug 2017 at 15:36, Lars Ellenberg lars.ellenberg at linbit.com wrote:
>  
> > crm shell in "auto-commit"?
> > never seen that.
> 
> I googled for "crmsh autocommit pacemaker" and found this: https://github.com/ClusterLabs/crmsh/blob/master/ChangeLog
> See line 650. I don't know what that means.
> > 
> > You are sure you did not forget this necessary piece?
> > ms WebDataClone WebData \
> >    meta master-max="1" master-node-max="1" clone-max="2" \
> >    clone-node-max="1" notify="true"
> 
> I didn't get that far. I followed this guide (http://clusterlabs.org/doc/en-US/Pacemaker/1.1-plugin/html-single/Clusters_from_Scratch/index.html#_configure_the_cluster_for_drbd),
> but didn't use the shadow cib.

If you use crmsh "interactively",
crmsh implicitly uses a shadow cib,
and will only commit changes once you "commit";
see "crm configure help commit".

At least that's my experience with crmsh for the last nine years or so.
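
For illustration, an interactive session along those lines might look
like the sketch below. The resource names WebData and WebDataClone are
the ones quoted above; the agent "ocf:linbit:drbd", the DRBD resource
name "wwwdata" and the monitor interval are assumptions borrowed from
the guide, not from the actual configuration in this thread. The
primitive and its ms wrapper are entered together, and nothing should
reach the live cluster before the "commit":

```
# crm configure
crm(live)configure# primitive WebData ocf:linbit:drbd \
        params drbd_resource=wwwdata \
        op monitor interval=60s
crm(live)configure# ms WebDataClone WebData \
        meta master-max=1 master-node-max=1 clone-max=2 \
        clone-node-max=1 notify=true
crm(live)configure# verify
crm(live)configure# commit
crm(live)configure# quit
```

Running "verify" before "commit" gives crmsh a chance to complain
about the pending changes while they are still only staged.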

> The cluster is in testing, not in production, so I thought "nothing
> severe can happen". Misjudged. My error.
> After configuring the primitive without the ms clone, my resource
> ClusterMon reacted promptly and sent 20000 SNMP traps to my management
> station in 193 seconds, which triggered 20000 e-mails ...
> I understand now that the cluster missed the ms clone configuration.
> But so many traps in such a short period? Is that intended, or a bug?

If you configure a resource to fail immediately,
but in a way that pacemaker thinks can be "recovered" from
by stopping and restarting, then pacemaker will do so.
If that results in 20000 "actions" within 193 seconds,
that's roughly 100 actions per second. That seems "quick",
but it is not a bug per se.
If every single such action triggers a trap,
because you configured the system to send traps for every action,
that's yet a different thing.
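
As a quick sanity check of that rate, using only the figures quoted
above (20000 traps, 193 seconds); the variable names are made up for
this sketch:

```shell
#!/bin/sh
# Figures as reported earlier in this thread.
traps=20000
seconds=193

# Shell arithmetic is integer-only, so the fractional part is dropped.
rate=$(( traps / seconds ))

echo "about ${rate} traps per second"
```

So the "100 per second" is a round figure; the integer division comes
out slightly above it.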

So what now?
Where exactly is the "big trouble with DRBD"?
Someone was "almost" following some tutorial, and got in trouble.

How could we keep that from happening to the next person?
Any suggestions which component or behavior we should improve, and how?

-- 
: Lars Ellenberg
: LINBIT | Keeping the Digital World Running
: DRBD -- Heartbeat -- Corosync -- Pacemaker
: R&D, Integration, Ops, Consulting, Support

DRBD® and LINBIT® are registered trademarks of LINBIT