[Pacemaker] Race condition in pacemaker/lrmd cooperation right after live migration
Vladislav Bogdanov bubble at hoster-ok.com
Fri Jul  8 06:50:19 UTC 2011

05.07.2011 10:05, Andrew Beekhof wrote:
> On Tue, Jul 5, 2011 at 2:37 PM, Vladislav Bogdanov <bubble at hoster-ok.com> wrote:
>> 05.07.2011 04:44, Andrew Beekhof wrote:
>>> Looks like the VirtualDomain RA isn't correctly implementing stop.
>>> Stop of an undefined domain shouldn't produce an error.
>>
>> Nope, it just complains in the logs, nothing more.
> 
> Hmmm.  Better log a bug then and include a crm_report.
> I've almost caught up on emails now, so I should be able to start
> looking at tarballs and bugs soon.

Filed as
http://developerbugs.linux-foundation.org/show_bug.cgi?id=2615
It includes an hb_report rather than a crm_report, because of
http://developerbugs.linux-foundation.org/show_bug.cgi?id=2614

Best,
Vladislav
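
P.S. For anyone who wants to check the ordering in the quoted log excerpt
below, here is a rough Python sketch. It assumes the syslog lines from both
nodes have been merged into one list with wrapped lines re-joined, and that
lines sharing the same one-second timestamp keep their file order. The
function names (parse, position, stop_after_verdict) are made up for
illustration and are not part of pacemaker or cluster-glue.

```python
import re

MONTHS = {m: i for i, m in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}

# Syslog prefix: "Jul  4 10:21:48 s01-1 <rest of message>"
LINE_RE = re.compile(r"^(\w{3})\s+(\d+)\s+(\d\d:\d\d:\d\d)\s+(\S+)\s+(.*)$")


def parse(lines):
    """Return [(timestamp_key, host, message)] ordered by timestamp.

    The sort is stable, so lines with the same one-second timestamp
    stay in the order they had in the input.
    """
    events = []
    for line in lines:
        m = LINE_RE.match(line)
        if m is None or m.group(1) not in MONTHS:
            continue  # not a syslog line we understand
        mon, day, clock, host, msg = m.groups()
        events.append(((MONTHS[mon], int(day), clock), host, msg))
    events.sort(key=lambda e: e[0])
    return events


def position(events, host, needle):
    """Index of the first event on `host` whose message contains `needle`."""
    for pos, (_, h, msg) in enumerate(events):
        if h == host and needle in msg:
            return pos
    return None


def stop_after_verdict(lines, source, target):
    """True if 'stop' on `source` is logged after the 'active on more than
    one node' message on `target`; None if either line is missing."""
    events = parse(lines)
    verdict = position(events, target, "is active on more than one node")
    stop = position(events, source, ": stop")
    if verdict is None or stop is None:
        return None
    return stop > verdict


# Lines taken from the quoted excerpt, with the mail wrapping undone.
LOG = [
    "Jul  4 10:21:48 s01-1 VirtualDomain[22988]: INFO: "
    "mgmt01.c01.ttc.prague.cz.vds-ok.com: live migration to s01-0 succeeded.",
    "Jul  4 10:21:48 s01-0 VirtualDomain[4641]: INFO: "
    "mgmt01.c01.ttc.prague.cz.vds-ok.com: live migration from s01-1 succeeded.",
    "Jul  4 10:21:49 s01-0 lrmd: [1927]: info: RA output: "
    "(mgmt01.c01.ttc.prague.cz.vds-ok.com-vm:migrate_from:stderr) "
    "mgmt01.c01.ttc.prague.cz.vds-ok.com-vm is active on more than one node, "
    "returning the default value for <null>",
    "Jul  4 10:21:49 s01-1 lrmd: [7741]: info: "
    "rsc:mgmt01.c01.ttc.prague.cz.vds-ok.com-vm:1006: stop",
]

if __name__ == "__main__":
    print(stop_after_verdict(LOG, source="s01-1", target="s01-0"))
```

Both lines of interest carry the same 10:21:49 timestamp, so the result
depends entirely on the order in which they appear in the merged file. It is
only as trustworthy as that ordering.
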
> 
>>
>> process_lrm_event: LRM operation
>> mgmt01.c01.ttc.prague.cz.vds-ok.com-vm_stop_0 (call=1006, rc=0,
>> cib-update=1031, confirmed=true) ok
>>
>> And that stop operation is fired a little after lrmd has made its verdict.
>>
>>>
>>> On Mon, Jul 4, 2011 at 9:51 PM, Vladislav Bogdanov <bubble at hoster-ok.com> wrote:
>>>> Hi all,
>>>>
>>>> I have a feeling that a race condition is possible during live migration
>>>> of resources.
>>>>
>>>> I put one node into standby mode, which made all resources migrate to
>>>> the other one.
>>>> The virtual machines were live-migrated successfully, but were then
>>>> marked as FAILED almost immediately.
>>>> The logs show some interesting details:
>>>> =========
>>>> Jul  4 10:21:48 s01-1 VirtualDomain[22988]: INFO:
>>>> mgmt01.c01.ttc.prague.cz.vds-ok.com: live migration to s01-0 succeeded.
>>>> Jul  4 10:21:48 s01-1 lrmd: [7741]: info: RA output:
>>>> (mgmt01.c01.ttc.prague.cz.vds-ok.com-vm:migrate_to:stdout) Domain
>>>> mgmt01.c01.ttc.prague.cz.vds-ok.com has been undefined
>>>> Jul  4 10:21:48 s01-0 VirtualDomain[4641]: INFO:
>>>> mgmt01.c01.ttc.prague.cz.vds-ok.com: live migration from s01-1 succeeded.
>>>> Jul  4 10:21:49 s01-0 lrmd: [1927]: info: RA output:
>>>> (mgmt01.c01.ttc.prague.cz.vds-ok.com-vm:migrate_from:stderr)
>>>> mgmt01.c01.ttc.prague.cz.vds-ok.com-vm is active on more than one node,
>>>> returning the default value for <null>
>>>> Jul  4 10:21:49 s01-1 crmd: [7744]: info: do_lrm_rsc_op: Performing
>>>> key=110:695:0:7ae65826-5d35-41c0-945a-8336ecb0bc3c
>>>> op=mgmt01.c01.ttc.prague.cz.vds-ok.com-vm_stop_0 )
>>>> Jul  4 10:21:49 s01-1 lrmd: [7741]: info:
>>>> rsc:mgmt01.c01.ttc.prague.cz.vds-ok.com-vm:1006: stop
>>>> Jul  4 10:21:49 s01-1 VirtualDomain[24062]: ERROR: Virtual domain
>>>> mgmt01.c01.ttc.prague.cz.vds-ok.com has no state during stop operation,
>>>> bailing out.
>>>> Jul  4 10:21:49 s01-1 crmd: [7744]: info: process_lrm_event: LRM
>>>> operation mgmt01.c01.ttc.prague.cz.vds-ok.com-vm_stop_0 (call=1006,
>>>> rc=0, cib-update=1031, confirmed=true) ok
>>>> =========
>>>> Note that the line with "is active on more than one node" immediately
>>>> follows "migration from s01-1 succeeded" in syslog (in both the local and
>>>> remote files), so it was put into the syslog queue right after the former.
>>>>
>>>> From what I understand, lrmd decided to fail the resource just because
>>>> the 'stop' operation had not yet been run on the other node.
>>>>
>>>> What else could it be if my feeling is wrong?
>>>>
>>>> The Pacemaker version is 'almost' the 1.1-devel tip.
>>>> cluster-glue is 1.0.7.
>>>> I use my own version of the VirtualDomain RA, but it has the same
>>>> migration logic as the stock one.
>>>>
>>>> Best,
>>>> Vladislav
>>>>
>>>> _______________________________________________
>>>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>
>>>> Project Home: http://www.clusterlabs.org
>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>> Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
>>>>
    
    