[Pacemaker] Race condition in pacemaker/lrmd cooperation right after live migration

Mon Jul 4 21:44:23 EDT 2011

Looks like the VirtualDomain RA isn't implementing stop correctly.
Stopping an undefined domain shouldn't produce an error.
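
To illustrate the point, here is a minimal sketch of an idempotent stop. It is not the VirtualDomain RA's actual code (the real agent is a shell script); it is a Python model in which a plain dict stands in for the hypervisor's view of domains. The helper names probe_domain and stop_domain and the state strings are hypothetical. Only the numeric OCF return codes are taken from the OCF resource agent API.

```python
# OCF return codes as defined by the OCF resource agent API.
OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1
OCF_NOT_RUNNING = 7


def probe_domain(domains, name):
    """Hypothetical status probe; `domains` maps domain name -> state string."""
    state = domains.get(name)
    if state is None or state == "shut off":
        # An undefined domain (for example, one that was just migrated away
        # and undefined on this node) is simply not running here.
        return OCF_NOT_RUNNING
    if state == "running":
        return OCF_SUCCESS
    # Any state this sketch does not model is treated as an error.
    return OCF_ERR_GENERIC


def stop_domain(domains, name):
    """Idempotent stop: stopping something that is not running succeeds."""
    status = probe_domain(domains, name)
    if status == OCF_NOT_RUNNING:
        # Nothing to do, and nothing to report as an error.
        return OCF_SUCCESS
    if status == OCF_SUCCESS:
        domains[name] = "shut off"  # stand-in for the real shutdown call
        return OCF_SUCCESS
    return OCF_ERR_GENERIC
```

With this shape, a stop issued on the source node after a successful migrate_to finds no domain and returns success quietly instead of logging an ERROR.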

On Mon, Jul 4, 2011 at 9:51 PM, Vladislav Bogdanov <bubble at hoster-ok.com> wrote:
> Hi all,
>
> I have a feeling that a race condition is possible during live migration
> of resources.
>
> I put one node into standby mode, which made all resources migrate to
> the other one.
> The virtual machines were live-migrated successfully, but were then
> marked as FAILED almost immediately.
> The logs show some interesting details:
> =========
> Jul  4 10:21:48 s01-1 VirtualDomain[22988]: INFO:
> mgmt01.c01.ttc.prague.cz.vds-ok.com: live migration to s01-0 succeeded.
> Jul  4 10:21:48 s01-1 lrmd: [7741]: info: RA output:
> (mgmt01.c01.ttc.prague.cz.vds-ok.com-vm:migrate_to:stdout) Domain
> mgmt01.c01.ttc.prague.cz.vds-ok.com has been undefined
> Jul  4 10:21:48 s01-0 VirtualDomain[4641]: INFO:
> mgmt01.c01.ttc.prague.cz.vds-ok.com: live migration from s01-1 succeeded.
> Jul  4 10:21:49 s01-0 lrmd: [1927]: info: RA output:
> (mgmt01.c01.ttc.prague.cz.vds-ok.com-vm:migrate_from:stderr)
> mgmt01.c01.ttc.prague.cz.vds-ok.com-vm is active on more than one node,
> returning the default value for <null>
> Jul  4 10:21:49 s01-1 crmd: [7744]: info: do_lrm_rsc_op: Performing
> key=110:695:0:7ae65826-5d35-41c0-945a-8336ecb0bc3c
> op=mgmt01.c01.ttc.prague.cz.vds-ok.com-vm_stop_0 )
> Jul  4 10:21:49 s01-1 lrmd: [7741]: info:
> rsc:mgmt01.c01.ttc.prague.cz.vds-ok.com-vm:1006: stop
> Jul  4 10:21:49 s01-1 VirtualDomain[24062]: ERROR: Virtual domain
> mgmt01.c01.ttc.prague.cz.vds-ok.com has no state during stop operation,
> bailing out.
> Jul  4 10:21:49 s01-1 crmd: [7744]: info: process_lrm_event: LRM
> operation mgmt01.c01.ttc.prague.cz.vds-ok.com-vm_stop_0 (call=1006,
> rc=0, cib-update=1031, confirmed=true) ok
> =========
> Note that the line with "is active on more than one node" immediately
> follows "migration from s01-1 succeeded" in syslog (in both the local and
> remote files), so it was put into the syslog queue right after the former.
>
> From what I understand, lrmd decided to fail the resource just because
> the 'stop' operation had not yet been run on the other node.
>
> If my feeling is wrong, what else could it be?
>
> The pacemaker version is 'almost' the 1.1-devel tip.
> cluster-glue is 1.0.7.
> I use my own version of the VirtualDomain RA, but it has the same
> migration logic as the stock one.
>
> Best,
> Vladislav