[Pacemaker] Failing to move around IPaddr2 resource

Thu Feb 23 19:46:07 EST 2012

On Mon, Feb 20, 2012 at 12:44 PM, Anlu Wang <anlu at mixpanel.com> wrote:
> I have three servers that I'm trying to create IP failover on with
> heartbeat. I have three IPs, one for each machine, and I want an IP to be
> assigned to a different machine when it goes down. This is all working
> splendidly.
>
> But in addition, I also want an IP to be assigned to a different machine
> when either the internal OR external network interface goes down. To do
> this, I have a ping resource on each machine that pings the other two
> machines' internal and external IPs (so four IPs in total are pinged from
> each machine).
> This is where I'm having problems.
>
> When I take down a network interface manually with ifdown, sometimes it
> fails to stop IP resources on the machines. This is what crm_mon outputs:
>
> ============
> Last updated: Sun Feb 19 19:29:53 2012
> Stack: Heartbeat
> Current DC: anlutest2 (32769730-5e5e-40d6-baa0-9748131232da) - partition
> with quorum
> Version: 1.0.8-042548a451fce8400660f6031f4da6f0223dd5dd
> 3 Nodes configured, unknown expected votes
> 6 Resources configured.
> ============
>
> Online: [ anlutest1 anlutest3 anlutest2 ]
>
> address01       (ocf::heartbeat:IPaddr2):       Started anlutest2
> (unmanaged) FAILED
> address02       (ocf::heartbeat:IPaddr2):       Started anlutest3
> address03       (ocf::heartbeat:IPaddr2):       Started anlutest1
> (unmanaged) FAILED
> ping01  (ocf::pacemaker:ping):  Started anlutest1
> ping02  (ocf::pacemaker:ping):  Started anlutest2
> ping03  (ocf::pacemaker:ping):  Started anlutest3
>
> Failed actions:
>     address01_stop_0 (node=anlutest2, call=454, rc=1, status=complete):
> unknown error
>     address03_stop_0 (node=anlutest1, call=104, rc=1, status=complete):
> unknown error
>
> The reason for this seems to be detailed in the syslog:
>
> Feb 19 19:25:06 anlutest1 lrmd: [27108]: info: rsc:address03:104: stop
> Feb 19 19:25:06 anlutest1 crmd: [27111]: info: process_lrm_event: LRM
> operation address01_monitor_5000 (call=100, status=1, cib-update=0,
> confirmed=true) Cancelled
> Feb 19 19:25:06 anlutest1 crmd: [27111]: info: process_lrm_event: LRM
> operation address03_monitor_5000 (call=102, status=1, cib-update=0,
> confirmed=true) Cancelled
> Feb 19 19:25:06 anlutest1 IPaddr2[32290]: [32350]: INFO: IP status = ok,
> IP_CIP=
> Feb 19 19:25:06 anlutest1 IPaddr2[32291]: [32351]: INFO: IP status = ok,
> IP_CIP=
> Feb 19 19:25:06 anlutest1 IPaddr2[32290]: [32354]: INFO: ip -f inet addr
> delete 50.97.234.170/29 dev eth1
> Feb 19 19:25:06 anlutest1 IPaddr2[32291]: [32355]: INFO: ip -f inet addr
> delete 50.97.234.172/29 dev eth1
> Feb 19 19:25:06 anlutest1 crmd: [27111]: info: process_lrm_event: LRM
> operation address01_stop_0 (call=103, rc=0, cib-update=135, confirmed=true)
> ok
> Feb 19 19:25:06 anlutest1 lrmd: [27108]: info: RA output:
> (address03:stop:stderr) RTNETLINK answers: Cannot assign requested address
> Feb 19 19:25:06 anlutest1 crmd: [27111]: info: process_lrm_event: LRM
> operation address03_stop_0 (call=104, rc=1, cib-update=136, confirmed=true)
> unknown error
> Feb 19 19:25:07 anlutest1 attrd: [27110]: info: attrd_ha_callback: flush
> message from anlutest2
> Feb 19 19:25:08 anlutest1 attrd: [27110]: info: attrd_ha_callback: Update
> relayed from anlutest2
> Feb 19 19:25:08 anlutest1 attrd: [27110]: info: attrd_trigger_update:
> Sending flush op to all hosts for: fail-count-address03 (INFINITY)
> Feb 19 19:25:08 anlutest1 attrd: [27110]: info: attrd_perform_update: Sent
> update 377: fail-count-address03=INFINITY
> Feb 19 19:25:08 anlutest1 attrd: [27110]: info: attrd_ha_callback: Update
> relayed from anlutest2
> Feb 19 19:25:08 anlutest1 attrd: [27110]: info: attrd_trigger_update:
> Sending flush op to all hosts for: last-failure-address03 (1329701107)
> Feb 19 19:25:08 anlutest1 attrd: [27110]: info: attrd_perform_update: Sent
> update 379: last-failure-address03=1329701107
> Feb 19 19:25:08 anlutest1 attrd: [27110]: info: attrd_ha_callback: flush
> message from anlutest2
> Feb 19 19:25:08 anlutest1 attrd: [27110]: info: attrd_ha_callback: flush
> message from anlutest2
>
> But I have no idea what the RTNETLINK error is. Googling around seems to
> show some issues about Ubuntu wireless drivers, but these interfaces are all
> wired. Does anyone have any idea what is going on? I suspect some odd IP
> assignment is happening, perhaps because the ping resources do not all
> report their scores at the same time?

It shouldn't be.
The question is: why would we be /assigning/ an IP during a /stop/ action?
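
One thing that can be checked from the log alone: the "RTNETLINK answers:
Cannot assign requested address" text is the generic error string that
"ip addr delete" prints when the address it was asked to remove is not
present on the device, so it does not by itself mean anything was being
assigned. A possible explanation, offered as an assumption and not confirmed
anywhere in this thread, is that the two stop operations raced: both deletes
ran in the same second on eth1, and if both addresses sit in the same subnet,
removing the primary one can take the secondary with it (the default Linux
behaviour unless promote_secondaries is enabled), leaving nothing for the
second delete to remove. The sketch below only verifies the "same subnet"
part, using the two addresses from the syslog excerpt; the helper name
shared_subnets is made up for illustration.

```python
import ipaddress

# Addresses copied from the two "ip -f inet addr delete" lines in the syslog.
deleted = ["50.97.234.170/29", "50.97.234.172/29"]


def shared_subnets(cidrs):
    """Group interface addresses by the network each one belongs to."""
    groups = {}
    for cidr in cidrs:
        iface = ipaddress.ip_interface(cidr)
        groups.setdefault(iface.network, []).append(str(iface.ip))
    return groups


groups = shared_subnets(deleted)
for network, addrs in groups.items():
    # More than one address under a single network means they share a subnet.
    print(network, addrs)
```

If both addresses land in one group, the race described above is at least
plausible and worth ruling out, for example by checking the output of
"ip addr show dev eth1" on the node right after the ifdown.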

>
> When I manually go and clean up the failed resources, they get properly
> assigned to the nodes that aren't down, so if we can't resolve the
> underlying issue, is there a way to automatically attempt to clean up
> failed resources a limited number of times?

I don't think you want to start the IP somewhere else if it's still
active on the original node.
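
For reference, a bounded retry around the manual cleanup could look roughly
like the sketch below. It is only an illustration under assumptions: the
function name, the attempt count and the delay are invented here, the
crm_resource options should be checked against "crm_resource --help" on the
installed version (older releases may also want the node to be named), and a
zero exit code from the cleanup does not prove the resource recovered. It
also does nothing about the concern above, so it should only be run once the
address is confirmed gone from the original node.

```python
import subprocess
import time


def cleanup_with_limit(resource, max_attempts=3, delay=5.0,
                       run=subprocess.call, sleep=time.sleep):
    """Run a resource cleanup at most max_attempts times.

    Returns the attempt number that exited with 0, or None if every
    attempt returned a non-zero exit code.
    """
    for attempt in range(1, max_attempts + 1):
        # Assumed invocation; verify the options for your Pacemaker version.
        rc = run(["crm_resource", "--cleanup", "--resource", resource])
        if rc == 0:
            return attempt
        # Wait before trying again so the cluster has time to settle.
        sleep(delay)
    return None


# Example call (not executed here): cleanup_with_limit("address03")
```

The run and sleep parameters exist only so the loop can be exercised without
a cluster; in real use the defaults call crm_resource directly.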

>
> My configuration is here, in case there's anything wrong with it.

Looks like you forgot to attach it.

>
> Anlu
>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>