[Pacemaker] [Question and Problem] In a vSphere 5.1 environment, pengine blocks on I/O for a long time when the shared disk fails.

renayama19661014 at ybb.ne.jp renayama19661014 at ybb.ne.jp
Wed May 15 01:11:40 EDT 2013


Hi Andrew,

Thank you for comments.

> > The guests are placed on the shared disk.
> 
> What is on the shared disk?  The whole OS or app-specific data (i.e. nothing pacemaker needs directly)?

The shared disk holds the whole OS and all of the data.
The shared-disk layout is the same as in our KVM environment, where the problem does not occur.

 * We understand that the behavior differs depending on the hypervisor.
 * However, it seems necessary to work around this problem in order to use Pacemaker in a vSphere 5.1 environment.

Best Regards,
Hideo Yamauchi.


--- On Wed, 2013/5/15, Andrew Beekhof <andrew at beekhof.net> wrote:

> 
> On 13/05/2013, at 4:14 PM, renayama19661014 at ybb.ne.jp wrote:
> 
> > Hi All,
> > 
> > We built a simple cluster in a vSphere 5.1 environment.
> > 
> > It consists of two ESXi servers and a shared disk.
> > 
> > The guests are placed on the shared disk.
> 
> What is on the shared disk?  The whole OS or app-specific data (i.e. nothing pacemaker needs directly)?
> 
> > 
> > 
> > Step 1) Build the cluster. (The DC node is the active node.)
> > 
> > ============
> > Last updated: Mon May 13 14:16:09 2013
> > Stack: Heartbeat
> > Current DC: pgsr01 (85a81130-4fed-4932-ab4c-21ac2320186f) - partition with quorum
> > Version: 1.0.13-30bb726
> > 2 Nodes configured, unknown expected votes
> > 2 Resources configured.
> > ============
> > 
> > Online: [ pgsr01 pgsr02 ]
> > 
> > Resource Group: test-group
> >     Dummy1     (ocf::pacemaker:Dummy): Started pgsr01
> >     Dummy2     (ocf::pacemaker:Dummy): Started pgsr01
> > Clone Set: clnPingd
> >     Started: [ pgsr01 pgsr02 ]
> > 
> > Node Attributes:
> > * Node pgsr01:
> >    + default_ping_set                  : 100       
> > * Node pgsr02:
> >    + default_ping_set                  : 100       
> > 
> > Migration summary:
> > * Node pgsr01: 
> > * Node pgsr02: 
> > 
> > 
> > Step 2) Attach strace to the pengine process on the DC node.
> > 
> > [root at pgsr01 ~]# ps -ef |grep heartbeat
> > root      2072     1  0 13:56 ?        00:00:00 heartbeat: master control process
> > root      2075  2072  0 13:56 ?        00:00:00 heartbeat: FIFO reader        
> > root      2076  2072  0 13:56 ?        00:00:00 heartbeat: write: bcast eth1  
> > root      2077  2072  0 13:56 ?        00:00:00 heartbeat: read: bcast eth1   
> > root      2078  2072  0 13:56 ?        00:00:00 heartbeat: write: bcast eth2  
> > root      2079  2072  0 13:56 ?        00:00:00 heartbeat: read: bcast eth2   
> > 496       2082  2072  0 13:57 ?        00:00:00 /usr/lib64/heartbeat/ccm
> > 496       2083  2072  0 13:57 ?        00:00:00 /usr/lib64/heartbeat/cib
> > root      2084  2072  0 13:57 ?        00:00:00 /usr/lib64/heartbeat/lrmd -r
> > root      2085  2072  0 13:57 ?        00:00:00 /usr/lib64/heartbeat/stonithd
> > 496       2086  2072  0 13:57 ?        00:00:00 /usr/lib64/heartbeat/attrd
> > 496       2087  2072  0 13:57 ?        00:00:00 /usr/lib64/heartbeat/crmd
> > 496       2089  2087  0 13:57 ?        00:00:00 /usr/lib64/heartbeat/pengine
> > root      2182     1  0 14:15 ?        00:00:00 /usr/lib64/heartbeat/pingd -D -p /var/run//pingd-default_ping_set -a default_ping_set -d 5s -m 100 -i 1 -h 192.168.101.254
> > root      2287  1973  0 14:16 pts/0    00:00:00 grep heartbea
> > 
> > [root at pgsr01 ~]# strace -p 2089
> > Process 2089 attached - interrupt to quit
> > restart_syscall(<... resuming interrupted call ...>) = 0
> > times({tms_utime=5, tms_stime=6, tms_cutime=0, tms_cstime=0}) = 429527557
> > recvfrom(5, 0xa93ff7, 953, 64, 0, 0)    = -1 EAGAIN (Resource temporarily unavailable)
> > poll([{fd=5, events=0}], 1, 0)          = 0 (Timeout)
> > recvfrom(5, 0xa93ff7, 953, 64, 0, 0)    = -1 EAGAIN (Resource temporarily unavailable)
> > poll([{fd=5, events=0}], 1, 0)          = 0 (Timeout)
> > (snip)
> > 
> > 
> > Step 3) Disconnect the shared disk on which the active node is placed.
> > 
> > Step 4) Cut off pingd connectivity on the standby node.
> >        The pingd score is updated correctly, but pengine processing remains blocked.
> > 
> > ~ # esxcfg-vswitch -N vmnic1 -p "ap-db" vSwitch1
> > ~ # esxcfg-vswitch -N vmnic2 -p "ap-db" vSwitch1
> > 
> > 
> > (snip)
> > brk(0xd05000)                           = 0xd05000
> > brk(0xeed000)                           = 0xeed000
> > brk(0xf2d000)                           = 0xf2d000
> > fstat(6, {st_mode=S_IFREG|0600, st_size=0, ...}) = 0
> > mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f86a255a000
> > write(6, "BZh51AY&SY\327\373\370\203\0\t(_\200UPX\3\377\377%cT \277\377\377"..., 2243) = 2243
> > brk(0xb1d000)                           = 0xb1d000
> > fsync(6                                ------------------------------> BLOCKED
> > (snip)
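> > 
> > As a minimal sketch (the directory and file name here are illustrative assumptions; the directory is assumed to reside on the disconnected shared disk), the write-plus-fsync pattern shown above might be reproduced outside Pacemaker like this. With the backing datastore gone, the fsync() call is the point expected to hang:
> > 
> > #include <fcntl.h>
> > #include <stdio.h>
> > #include <string.h>
> > #include <unistd.h>
> > 
> > int main(void)
> > {
> >     /* Assumption: this directory sits on the shared disk that was disconnected. */
> >     const char *path = "/var/lib/pengine/fsync-test";
> >     const char buf[] = "dummy transition data\n";
> >     int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
> > 
> >     if (fd < 0) {
> >         perror("open");
> >         return 1;
> >     }
> >     if (write(fd, buf, strlen(buf)) < 0) {
> >         perror("write");
> >     }
> >     /* With the datastore unreachable, this call typically hangs in
> >      * uninterruptible sleep (D state), like pengine's fsync(6) above. */
> >     if (fsync(fd) < 0) {
> >         perror("fsync");
> >     }
> >     close(fd);
> >     return 0;
> > }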
> > 
> > 
> > ============
> > Last updated: Mon May 13 14:19:15 2013
> > Stack: Heartbeat
> > Current DC: pgsr01 (85a81130-4fed-4932-ab4c-21ac2320186f) - partition with quorum
> > Version: 1.0.13-30bb726
> > 2 Nodes configured, unknown expected votes
> > 2 Resources configured.
> > ============
> > 
> > Online: [ pgsr01 pgsr02 ]
> > 
> > Resource Group: test-group
> >     Dummy1     (ocf::pacemaker:Dummy): Started pgsr01
> >     Dummy2     (ocf::pacemaker:Dummy): Started pgsr01
> > Clone Set: clnPingd
> >     Started: [ pgsr01 pgsr02 ]
> > 
> > Node Attributes:
> > * Node pgsr01:
> >    + default_ping_set                  : 100       
> > * Node pgsr02:
> >    + default_ping_set                  : 0             : Connectivity is lost
> > 
> > Migration summary:
> > * Node pgsr01: 
> > * Node pgsr02: 
> > 
> > 
> > Step 5) Reconnect pingd connectivity on the standby node.
> >        The pingd score is updated correctly, but pengine processing remains blocked.
> > 
> > 
> > ~ # esxcfg-vswitch -M vmnic1 -p "ap-db" vSwitch1
> > ~ # esxcfg-vswitch -M vmnic2 -p "ap-db" vSwitch1
> > 
> > ============
> > Last updated: Mon May 13 14:19:40 2013
> > Stack: Heartbeat
> > Current DC: pgsr01 (85a81130-4fed-4932-ab4c-21ac2320186f) - partition with quorum
> > Version: 1.0.13-30bb726
> > 2 Nodes configured, unknown expected votes
> > 2 Resources configured.
> > ============
> > 
> > Online: [ pgsr01 pgsr02 ]
> > 
> > Resource Group: test-group
> >     Dummy1     (ocf::pacemaker:Dummy): Started pgsr01
> >     Dummy2     (ocf::pacemaker:Dummy): Started pgsr01
> > Clone Set: clnPingd
> >     Started: [ pgsr01 pgsr02 ]
> > 
> > Node Attributes:
> > * Node pgsr01:
> >    + default_ping_set                  : 100       
> > * Node pgsr02:
> >    + default_ping_set                  : 100       
> > 
> > Migration summary:
> > * Node pgsr01: 
> > * Node pgsr02: 
> > 
> > 
> > --------- The blocked state of pengine continues -----
> > 
> > Step 6) Cut off pingd connectivity on the active node.
> >        The pingd score is updated correctly, but pengine processing remains blocked.
> > 
> > 
> > ~ # esxcfg-vswitch -N vmnic1 -p "ap-db" vSwitch1
> > ~ # esxcfg-vswitch -N vmnic2 -p "ap-db" vSwitch1
> > 
> > 
> > ============
> > Last updated: Mon May 13 14:20:32 2013
> > Stack: Heartbeat
> > Current DC: pgsr01 (85a81130-4fed-4932-ab4c-21ac2320186f) - partition with quorum
> > Version: 1.0.13-30bb726
> > 2 Nodes configured, unknown expected votes
> > 2 Resources configured.
> > ============
> > 
> > Online: [ pgsr01 pgsr02 ]
> > 
> > Resource Group: test-group
> >     Dummy1     (ocf::pacemaker:Dummy): Started pgsr01
> >     Dummy2     (ocf::pacemaker:Dummy): Started pgsr01
> > Clone Set: clnPingd
> >     Started: [ pgsr01 pgsr02 ]
> > 
> > Node Attributes:
> > * Node pgsr01:
> >    + default_ping_set                  : 0             : Connectivity is lost
> > * Node pgsr02:
> >    + default_ping_set                  : 100       
> > 
> > Migration summary:
> > * Node pgsr01: 
> > * Node pgsr02: 
> > 
> > --------- The blocked state of pengine continues -----
> > 
> > 
> > After that, the resources do not move to the standby node, because no transition can be computed while pengine remains blocked.
> > In the vSphere environment, the block is only released after a considerable time, and only then is a transition generated.
> > * The I/O blocking of pengine seems to occur repeatedly.
> > * Other processes may be blocked, too (a rough way to check this is sketched below).
> > * It took more than one hour from the failure until failover (FO) completed.
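> > 
> > A rough, illustrative check for other blocked processes is to look for anything in uninterruptible sleep (state 'D'), which is how a stuck fsync typically shows up. A minimal sketch, assuming a standard Linux /proc layout:
> > 
> > #include <ctype.h>
> > #include <dirent.h>
> > #include <stdio.h>
> > 
> > int main(void)
> > {
> >     DIR *proc = opendir("/proc");
> >     struct dirent *de;
> > 
> >     if (proc == NULL) {
> >         perror("opendir /proc");
> >         return 1;
> >     }
> >     while ((de = readdir(proc)) != NULL) {
> >         char path[64], comm[64], state;
> >         int pid;
> >         FILE *f;
> > 
> >         if (!isdigit((unsigned char)de->d_name[0])) {
> >             continue;                 /* not a PID directory */
> >         }
> >         snprintf(path, sizeof(path), "/proc/%s/stat", de->d_name);
> >         f = fopen(path, "r");
> >         if (f == NULL) {
> >             continue;                 /* process already exited */
> >         }
> >         /* /proc/<pid>/stat begins with: pid (comm) state ... */
> >         if (fscanf(f, "%d (%63[^)]) %c", &pid, comm, &state) == 3 && state == 'D') {
> >             printf("%d\t%s\tuninterruptible sleep\n", pid, comm);
> >         }
> >         fclose(f);
> >     }
> >     closedir(proc);
> >     return 0;
> > }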
> > 
> > This problem shows that resources may fail to move after a disk failure in a vSphere environment.
> > 
> > Because our users want to run Pacemaker in vSphere environments, a solution to this problem is necessary.
> > 
> > Do you know of any case where a similar problem was solved on vSphere?
> > 
> > If there is no known solution, we think it is necessary to avoid blocking pengine in the first place.
> > 
> > For example...
> > 1. crmd supervises its requests to pengine with a timer...
> > 2. pengine performs its writes under a timer and monitors their progress (a rough sketch of this idea follows below)...
> > ...etc.
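> > 
> > A minimal sketch of idea 2 (not actual Pacemaker code; the path, file name, and timeout are illustrative assumptions): the potentially blocking write+fsync is done in a child process, and the caller gives up after a deadline so its own main loop is never stalled by disk I/O:
> > 
> > #include <fcntl.h>
> > #include <signal.h>
> > #include <stdio.h>
> > #include <string.h>
> > #include <sys/wait.h>
> > #include <unistd.h>
> > 
> > /* Returns 0 if the data was written and synced within timeout_sec, -1 otherwise. */
> > static int timed_write(const char *path, const char *data, int timeout_sec)
> > {
> >     pid_t pid = fork();
> > 
> >     if (pid < 0) {
> >         return -1;
> >     }
> >     if (pid == 0) {                   /* child: do the blocking I/O */
> >         int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
> > 
> >         if (fd < 0 || write(fd, data, strlen(data)) < 0) {
> >             _exit(1);
> >         }
> >         fsync(fd);                    /* may hang if the disk is gone */
> >         close(fd);
> >         _exit(0);
> >     }
> >     for (int waited = 0; waited < timeout_sec; waited++) {   /* parent */
> >         int status;
> > 
> >         if (waitpid(pid, &status, WNOHANG) == pid) {
> >             return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
> >         }
> >         sleep(1);
> >     }
> >     /* Timed out: the child is probably stuck in uninterruptible sleep, so the
> >      * SIGKILL may only take effect once the I/O completes, but the parent can
> >      * report the failure now and keep its main loop running. */
> >     kill(pid, SIGKILL);
> >     return -1;
> > }
> > 
> > int main(void)
> > {
> >     if (timed_write("/var/lib/pengine/pe-test.bz2", "dummy\n", 10) != 0) {
> >         fprintf(stderr, "write did not complete within 10 seconds\n");
> >         return 1;
> >     }
> >     return 0;
> > }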
> > 
> > * This problem does not seem to occur on KVM.
> > * The difference may come from the hypervisor.
> > * In addition, the problem did not occur on a physical Linux machine.
> > 
> > 
> > Best Regards,
> > Hideo Yamauchi.
> > 
> > 
> > 
> > _______________________________________________
> > Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> > 
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
> 
> 



