[Pacemaker] [Question and Problem] In a vSphere 5.1 environment, pengine I/O blocks for a long time when the shared disk fails.

Wed May 15 01:03:50 EDT 2013

On 13/05/2013, at 4:14 PM, renayama19661014 at ybb.ne.jp wrote:

> Hi All,
> 
> We built a simple cluster in a vSphere 5.1 environment.
> 
> It consists of two ESXi servers and a shared disk.
> 
> The guests are located on the shared disk.

What is on the shared disk?  The whole OS or app-specific data (i.e. nothing pacemaker needs directly)?

> 
> 
> Step 1) Set up the cluster. (The DC node is the active node.)
> 
> ============
> Last updated: Mon May 13 14:16:09 2013
> Stack: Heartbeat
> Current DC: pgsr01 (85a81130-4fed-4932-ab4c-21ac2320186f) - partition with quorum
> Version: 1.0.13-30bb726
> 2 Nodes configured, unknown expected votes
> 2 Resources configured.
> ============
> 
> Online: [ pgsr01 pgsr02 ]
> 
> Resource Group: test-group
>     Dummy1     (ocf::pacemaker:Dummy): Started pgsr01
>     Dummy2     (ocf::pacemaker:Dummy): Started pgsr01
> Clone Set: clnPingd
>     Started: [ pgsr01 pgsr02 ]
> 
> Node Attributes:
> * Node pgsr01:
>    + default_ping_set                  : 100       
> * Node pgsr02:
>    + default_ping_set                  : 100       
> 
> Migration summary:
> * Node pgsr01: 
> * Node pgsr02: 
> 
> 
> Step 2) Attach strace to the pengine process on the DC node.
> 
> [root@pgsr01 ~]# ps -ef |grep heartbeat
> root      2072     1  0 13:56 ?        00:00:00 heartbeat: master control process
> root      2075  2072  0 13:56 ?        00:00:00 heartbeat: FIFO reader        
> root      2076  2072  0 13:56 ?        00:00:00 heartbeat: write: bcast eth1  
> root      2077  2072  0 13:56 ?        00:00:00 heartbeat: read: bcast eth1   
> root      2078  2072  0 13:56 ?        00:00:00 heartbeat: write: bcast eth2  
> root      2079  2072  0 13:56 ?        00:00:00 heartbeat: read: bcast eth2   
> 496       2082  2072  0 13:57 ?        00:00:00 /usr/lib64/heartbeat/ccm
> 496       2083  2072  0 13:57 ?        00:00:00 /usr/lib64/heartbeat/cib
> root      2084  2072  0 13:57 ?        00:00:00 /usr/lib64/heartbeat/lrmd -r
> root      2085  2072  0 13:57 ?        00:00:00 /usr/lib64/heartbeat/stonithd
> 496       2086  2072  0 13:57 ?        00:00:00 /usr/lib64/heartbeat/attrd
> 496       2087  2072  0 13:57 ?        00:00:00 /usr/lib64/heartbeat/crmd
> 496       2089  2087  0 13:57 ?        00:00:00 /usr/lib64/heartbeat/pengine
> root      2182     1  0 14:15 ?        00:00:00 /usr/lib64/heartbeat/pingd -D -p /var/run//pingd-default_ping_set -a default_ping_set -d 5s -m 100 -i 1 -h 192.168.101.254
> root      2287  1973  0 14:16 pts/0    00:00:00 grep heartbea
> 
> [root@pgsr01 ~]# strace -p 2089
> Process 2089 attached - interrupt to quit
> restart_syscall(<... resuming interrupted call ...>) = 0
> times({tms_utime=5, tms_stime=6, tms_cutime=0, tms_cstime=0}) = 429527557
> recvfrom(5, 0xa93ff7, 953, 64, 0, 0)    = -1 EAGAIN (Resource temporarily unavailable)
> poll([{fd=5, events=0}], 1, 0)          = 0 (Timeout)
> recvfrom(5, 0xa93ff7, 953, 64, 0, 0)    = -1 EAGAIN (Resource temporarily unavailable)
> poll([{fd=5, events=0}], 1, 0)          = 0 (Timeout)
> (snip)
> 
> 
> Step 3) Disconnect the shared disk on which the active node is placed.
> 
> Step 4) Cut off pingd communication on the standby node.
>        The pingd score is updated correctly, but pengine processing blocks.
> 
> ~ # esxcfg-vswitch -N vmnic1 -p "ap-db" vSwitch1
> ~ # esxcfg-vswitch -N vmnic2 -p "ap-db" vSwitch1
> 
> 
> (snip)
> brk(0xd05000)                           = 0xd05000
> brk(0xeed000)                           = 0xeed000
> brk(0xf2d000)                           = 0xf2d000
> fstat(6, {st_mode=S_IFREG|0600, st_size=0, ...}) = 0
> mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f86a255a000
> write(6, "BZh51AY&SY\327\373\370\203\0\t(_\200UPX\3\377\377%cT \277\377\377"..., 2243) = 2243
> brk(0xb1d000)                           = 0xb1d000
> fsync(6                                ------------------------------> BLOCKED
> (snip)
> 
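
The trace above shows pengine writing a bzip2-compressed buffer (the "BZh" magic) and then stalling in fsync(). A minimal sketch for timing fsync() on a chosen directory, to check whether the same stall is visible outside pengine. The function name and the default payload size (taken from the 2243-byte write in the trace) are illustrative assumptions, not part of Pacemaker.

```python
import os
import tempfile
import time


def time_fsync(directory, payload=b"x" * 2243):
    """Write payload to a temp file in `directory`; return seconds spent in fsync().

    Hypothetical helper for diagnosis only. If the underlying disk is
    blocked, this call will hang just like pengine did.
    """
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        os.write(fd, payload)
        start = time.monotonic()
        os.fsync(fd)  # the system call that blocked in the strace output
        return time.monotonic() - start
    finally:
        os.close(fd)
        os.unlink(path)


if __name__ == "__main__":
    # Assumption: replace the temp directory with the directory under test.
    print("fsync took %.3f s" % time_fsync(tempfile.gettempdir()))
```

Running it against the directory where pengine stores its files, before and after disconnecting the disk, would show whether the delay comes from the storage layer rather than from pengine itself.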
> 
> ============
> Last updated: Mon May 13 14:19:15 2013
> Stack: Heartbeat
> Current DC: pgsr01 (85a81130-4fed-4932-ab4c-21ac2320186f) - partition with quorum
> Version: 1.0.13-30bb726
> 2 Nodes configured, unknown expected votes
> 2 Resources configured.
> ============
> 
> Online: [ pgsr01 pgsr02 ]
> 
> Resource Group: test-group
>     Dummy1     (ocf::pacemaker:Dummy): Started pgsr01
>     Dummy2     (ocf::pacemaker:Dummy): Started pgsr01
> Clone Set: clnPingd
>     Started: [ pgsr01 pgsr02 ]
> 
> Node Attributes:
> * Node pgsr01:
>    + default_ping_set                  : 100       
> * Node pgsr02:
>    + default_ping_set                  : 0             : Connectivity is lost
> 
> Migration summary:
> * Node pgsr01: 
> * Node pgsr02: 
> 
> 
> Step 5) Restore pingd communication on the standby node.
>        The pingd score is updated correctly, but pengine remains blocked.
> 
> 
> ~ # esxcfg-vswitch -M vmnic1 -p "ap-db" vSwitch1
> ~ # esxcfg-vswitch -M vmnic2 -p "ap-db" vSwitch1
> 
> ============
> Last updated: Mon May 13 14:19:40 2013
> Stack: Heartbeat
> Current DC: pgsr01 (85a81130-4fed-4932-ab4c-21ac2320186f) - partition with quorum
> Version: 1.0.13-30bb726
> 2 Nodes configured, unknown expected votes
> 2 Resources configured.
> ============
> 
> Online: [ pgsr01 pgsr02 ]
> 
> Resource Group: test-group
>     Dummy1     (ocf::pacemaker:Dummy): Started pgsr01
>     Dummy2     (ocf::pacemaker:Dummy): Started pgsr01
> Clone Set: clnPingd
>     Started: [ pgsr01 pgsr02 ]
> 
> Node Attributes:
> * Node pgsr01:
>    + default_ping_set                  : 100       
> * Node pgsr02:
>    + default_ping_set                  : 100       
> 
> Migration summary:
> * Node pgsr01: 
> * Node pgsr02: 
> 
> 
> --------- pengine remains blocked -----
> 
> Step 6) Cut off pingd communication on the active node.
>        The pingd score is updated correctly, but pengine remains blocked.
> 
> 
> ~ # esxcfg-vswitch -N vmnic1 -p "ap-db" vSwitch1
> ~ # esxcfg-vswitch -N vmnic2 -p "ap-db" vSwitch1
> 
> 
> ============
> Last updated: Mon May 13 14:20:32 2013
> Stack: Heartbeat
> Current DC: pgsr01 (85a81130-4fed-4932-ab4c-21ac2320186f) - partition with quorum
> Version: 1.0.13-30bb726
> 2 Nodes configured, unknown expected votes
> 2 Resources configured.
> ============
> 
> Online: [ pgsr01 pgsr02 ]
> 
> Resource Group: test-group
>     Dummy1     (ocf::pacemaker:Dummy): Started pgsr01
>     Dummy2     (ocf::pacemaker:Dummy): Started pgsr01
> Clone Set: clnPingd
>     Started: [ pgsr01 pgsr02 ]
> 
> Node Attributes:
> * Node pgsr01:
>    + default_ping_set                  : 0             : Connectivity is lost
> * Node pgsr02:
>    + default_ping_set                  : 100       
> 
> Migration summary:
> * Node pgsr01: 
> * Node pgsr02: 
> 
> --------- pengine remains blocked -----
> 
> 
> After that, the resources do not move to the standby node: no transition is computed because pengine remains blocked.
> In the vSphere environment, the blocking is released only after a considerable time, and then a transition is generated.
> * The I/O blocking of pengine seems to occur repeatedly.
> * Other processes may be blocked, too.
> * It took more than one hour from the disk failure to completion of failover (FO).
> 
> This problem shows that resources may fail to move after a disk failure in a vSphere environment.
> 
> Because our users want to use Pacemaker in a vSphere environment, we need a solution to this problem.
> 
> Do you know of any case where a similar problem was solved on vSphere?
> 
> If there is no known solution, we think it is necessary to avoid the blocking of pengine.
> 
> For example...
> 1. crmd monitors its requests to pengine with a timer...
> 2. pengine performs its writes under a timer and monitors the processing...
> ..etc...
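
The timer idea in the two items above can be illustrated with a small sketch: do the write and fsync() in a helper thread and stop waiting after a deadline. This is only an illustration in Python under stated assumptions, not how Pacemaker (which is written in C) implements anything; `write_with_deadline` and its parameters are hypothetical names.

```python
import os
import threading


def write_with_deadline(path, data, timeout_s):
    """Write and fsync `data` in a helper thread.

    Returns True if the write completed within timeout_s seconds,
    False otherwise (including when the helper thread raised an error).
    Hypothetical sketch: a thread stuck in fsync() cannot be cancelled,
    so on timeout the caller simply stops waiting and carries on.
    """
    done = threading.Event()

    def worker():
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
        try:
            os.write(fd, data)
            os.fsync(fd)  # may block for a long time on a failed disk
        finally:
            os.close(fd)
        done.set()  # reached only if write and fsync both succeeded

    threading.Thread(target=worker, daemon=True).start()
    return done.wait(timeout_s)
```

A caller could treat a False result as "storage is unhealthy" and continue computing the transition without waiting for the file to reach disk. Whether that trade-off is acceptable for the policy engine's saved inputs is a design decision for the Pacemaker developers.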
> 
> * This problem does not seem to occur on KVM.
> * The difference may be due to the hypervisor.
> * In addition, the problem did not occur on a physical Linux machine.
> 
> 
> Best Regards,
> Hideo Yamauchi.
> 
> 
> 
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org