[ClusterLabs] cloned pingd resource problem

Wed Mar 30 10:46:59 EDT 2016

On 03/30/2016 08:38 AM, fatcharly at gmx.de wrote:
> Hi,
> 
> I'm running a two-node cluster on a fully updated CentOS 7 (pacemaker-1.1.13-10.el7_2.2.x86_64, pcs-0.9.143-15.el7.x86_64). On one of our nodes I see a lot of this in the log files:
> 
> Mar 30 12:32:13 localhost crmd[12986]:  notice: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
> Mar 30 12:32:13 localhost pengine[12985]:  notice: On loss of CCM Quorum: Ignore
> Mar 30 12:32:13 localhost pengine[12985]: warning: Processing failed op monitor for ping_fw:0 on kathie2: unknown error (1)
> Mar 30 12:32:13 localhost pengine[12985]: warning: Processing failed op start for ping_fw:1 on stacy2: unknown error (1)
> Mar 30 12:32:13 localhost pengine[12985]: warning: Forcing ping_fw-clone away from stacy2 after 1000000 failures (max=1000000)
> Mar 30 12:32:13 localhost pengine[12985]: warning: Forcing ping_fw-clone away from stacy2 after 1000000 failures (max=1000000)

Pacemaker monitors the resource by calling its resource agent's monitor
action every 45 seconds (the monitor interval in your configuration).
The first warning above indicates that the resource agent returned a
generic error code on kathie2, which in this case (ocf:pacemaker:ping)
means that the specified IP (192.168.16.1) did not respond to ping.

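The second warning indicates that the instance on stacy2 failed to
start, which again in this case means that the IP did not respond to a
ping from that node. The last two warnings indicate that pacemaker has
given up on running the clone on stacy2: by default a failed start sets
the fail count to INFINITY (shown as 1000000), so pacemaker will not
try to start the resource on that node again until the failure is
cleared.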
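If it helps to see at a glance which operation failed on which node, the
warnings can be summarized with a small script. This is only a sketch:
the regular expression is based solely on the two "Processing failed op"
lines quoted above, and the function name failed_ops is made up for
illustration.

```python
import re
from collections import defaultdict

# Matches pengine warnings of the form seen in the quoted log, e.g.
#   warning: Processing failed op monitor for ping_fw:0 on kathie2: unknown error (1)
# The pattern is an assumption derived from those lines only.
FAILED_OP = re.compile(
    r"Processing failed op (?P<op>\S+) for (?P<rsc>\S+) on (?P<node>[^:\s]+): "
    r"(?P<reason>.+) \((?P<rc>\d+)\)"
)

def failed_ops(lines):
    """Return {node: set of (resource instance, operation, return code)}."""
    result = defaultdict(set)
    for line in lines:
        m = FAILED_OP.search(line)
        if m:
            result[m["node"]].add((m["rsc"], m["op"], int(m["rc"])))
    return dict(result)

sample = [
    "Mar 30 12:32:13 localhost pengine[12985]: warning: Processing failed op monitor for ping_fw:0 on kathie2: unknown error (1)",
    "Mar 30 12:32:13 localhost pengine[12985]: warning: Processing failed op start for ping_fw:1 on stacy2: unknown error (1)",
]
summary = failed_ops(sample)
for node, ops in sorted(summary.items()):
    print(node, sorted(ops))
```

Because the result is a set per node, the same warning repeated on every
policy engine run is only counted once.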

> Mar 30 12:32:13 localhost pengine[12985]:  notice: Calculated Transition 1823: /var/lib/pacemaker/pengine/pe-input-355.bz2
> Mar 30 12:32:13 localhost crmd[12986]:  notice: Transition 1823 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-355.bz2): Complete
> Mar 30 12:32:13 localhost crmd[12986]:  notice: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
> 
> 
> The configuration looks like this:
> 
> Clone: ping_fw-clone
>   Resource: ping_fw (class=ocf provider=pacemaker type=ping)
>    Attributes: dampen=5s multiplier=1000 host_list=192.168.16.1 timeout=60
>    Operations: start interval=0s timeout=60 (ping_fw-start-interval-0s)
>                stop interval=0s timeout=20 (ping_fw-stop-interval-0s)
>                monitor interval=45 (ping_fw-monitor-interval-45)
> 
> 
> What can I do to resolve the problem?

The problem is that ping from the nodes to 192.168.16.1 does not always
work. This could be expected in your environment, or could indicate a
networking issue. But it's outside pacemaker's control; pacemaker is
simply monitoring it and reporting when there's a problem.
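To check this independently of pacemaker, you could run something like
the following on each node and see whether the result differs between
kathie2 and stacy2. It is a rough sketch, not the agent's actual code:
it assumes the Linux iputils ping command (-c for the number of
requests, -W for the reply timeout in seconds), and the function names
are hypothetical. As I understand the agent, it counts the hosts in
host_list that respond and multiplies that count by multiplier, which is
what ping_score imitates.

```python
import subprocess

def host_responds(host, count=3, timeout_s=2):
    """Return True if ping exits with 0 for `host`, i.e. a reply was received.

    Assumes the Linux iputils `ping` binary is on PATH.
    """
    cmd = ["ping", "-c", str(count), "-W", str(timeout_s), host]
    done = subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return done.returncode == 0

def ping_score(host_list, multiplier, probe=host_responds):
    """Count the hosts that respond and scale the count by `multiplier`.

    `probe` can be swapped out so the arithmetic can be tried without a network.
    """
    active = sum(1 for host in host_list if probe(host))
    return active * multiplier

# Example with the values from the configuration above (run on each node):
#   print(ping_score(["192.168.16.1"], 1000))
```

A result of 0 on a node means the address did not answer from there
during that run, which is the situation in which the monitor and start
operations fail.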

> Any suggestions are welcome
> 
> Kind regards
> 
> fatcharly