[Pacemaker] starting resources with failed stonith resource

Tue Jan 7 10:41:18 EST 2014

Hi list,

I recently had some trouble with a dual-node MySQL cluster, which runs
in master-slave mode with Percona resource manager. While analyzing
what happened to the cluster, I found the following in syslog. Some
context: there was network trouble, and the cluster lost disk/iSCSI
access on both nodes. This excerpt is from the former master trying to
start up again after connectivity was restored:

Jan  6 07:26:49 infante pengine: [3839]: notice: get_failcount:
Failcount for MasterSlave_mysql on infante has expired (limit was 60s)
Jan  6 07:26:49 infante pengine: [3839]: notice: get_failcount:
Failcount for MasterSlave_mysql on infante has expired (limit was 60s)
Jan  6 07:26:49 infante pengine: [3839]: WARN:
common_apply_stickiness: Forcing p-stonith-ingstad away from infante
after 1000000 failures (max=1000000)
Jan  6 07:26:49 infante pengine: [3839]: notice: LogActions: Start
prim_mysql:0#011(infante)
Jan  6 07:26:49 infante pengine: [3839]: notice: LogActions: Start
prim_mysql:1#011(ingstad)

I don't understand this: if it means that the stonith device has
failed a million times, why is Pacemaker trying to start the mysql
resource? It's against Pacemaker policy to start resources on a cluster
without working stonith devices, isn't it?

-- 
Frank Van Damme
Make everything as simple as possible, but not simpler. - Albert Einstein