[Pacemaker] cLVM stuck

Andreas Kurz andreas at hastexo.com
Thu Feb 9 10:50:08 EST 2012


Hello,

On 02/09/2012 03:29 PM, Karl Rößmann wrote:
> Hi all,
> 
> we run a three Node HA Cluster using cLVM and Xen.
> 
> After installing some online updates node by node

Was the cluster in maintenance-mode, or was the cluster stack shut down
on the node that received the updates?

> the cLVM is stuck, and the last (updated) node does not want to join
> the cLVM.
> Two nodes are still running.
> The Xen VMs are running (they have their disks on cLVs),
> but commands like 'lvdisplay' do not work.
> 
> Is there a way to recover the cLVM without restarting the whole cluster?

Any log entries about the controld on orion1? What is the output of
"crm_mon -1fr"? It looks like there is a problem starting
dlm_controld.pcmk ... and I wonder why orion1 was not fenced on those
stop errors, or did that happen?

Did you inspect the output of "dlm_tool ls" and "dlm_tool dump" on all
nodes where the controld is running?

Your crm_mon output shows orion1 offline ... that does not seem to
match the time frame of your logs?
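For reference, a minimal diagnostics pass over the nodes might look
like the sketch below. The hostnames are taken from your crm_mon
output; the ssh loop is only an assumption about how you reach the
nodes, adjust as needed:

```shell
# Sketch: gather DLM state from the nodes still running the controld.
# Assumes passwordless ssh as root; adapt to your environment.
for node in orion2 orion7; do
    echo "=== $node ==="
    ssh root@"$node" 'dlm_tool ls'        # lockspaces and their current state
    ssh root@"$node" 'dlm_tool dump'      # dlm_controld debug buffer
done

# One-shot cluster view including fail counts (-f) and inactive resources (-r):
crm_mon -1fr

# Check whether a fencing attempt against orion1 was logged (SLES logs
# to /var/log/messages by default):
grep -i stonith /var/log/messages | grep orion1
```

A lockspace stuck in a "wait" state in the "dlm_tool ls" output would
point at the DLM layer rather than at clvmd itself.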

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now


> 
> we have the latest
> SUSE SLES SP1 and the HA Extension, including
> pacemaker-1.1.5-5.9.11.1
> corosync-1.3.3-0.3.1
> 
> Some ERROR messages:
> Feb  9 12:06:42 orion1 crmd: [6462]: ERROR: process_lrm_event: LRM
> operation clvm:0_start_0 (15) Timed Out (timeout=240000ms)
> Feb  9 12:13:41 orion1 crmd: [6462]: ERROR: process_lrm_event: LRM
> operation cluvg1:2_start_0 (19) Timed Out (timeout=240000ms)
> Feb  9 12:16:21 orion1 crmd: [6462]: ERROR: process_lrm_event: LRM
> operation cluvg1:2_stop_0 (20) Timed Out (timeout=100000ms)
> Feb  9 13:39:10 orion1 crmd: [14350]: ERROR: process_lrm_event: LRM
> operation clvm:0_start_0 (15) Timed Out (timeout=240000ms)
> Feb  9 13:53:38 orion1 crmd: [14350]: ERROR: process_lrm_event: LRM
> operation cluvg1:2_start_0 (19) Timed Out (timeout=240000ms)
> Feb  9 13:56:18 orion1 crmd: [14350]: ERROR: process_lrm_event: LRM
> operation cluvg1:2_stop_0 (20) Timed Out (timeout=100000ms)
> 
> 
> 
> Feb  9 12:11:55 orion2 crm_resource: [13025]: ERROR:
> resource_ipc_timeout: No messages received in 60 seconds
> Feb  9 12:13:41 orion2 crmd: [5882]: ERROR: send_msg_via_ipc: Unknown
> Sub-system (13025_crm_resource)... discarding message.
> Feb  9 12:14:41 orion2 crmd: [5882]: ERROR: print_elem: Aborting
> transition, action lost: [Action 35]: In-flight (id: cluvg1:2_start_0,
> loc: orion1, priority: 0)
> Feb  9 13:54:38 orion2 crmd: [5882]: ERROR: print_elem: Aborting
> transition, action lost: [Action 35]: In-flight (id: cluvg1:2_start_0,
> loc: orion1, priority: 0)
> 
> 
> 
> Some additional information:
> 
> crm_mon -1:
> ============
> Last updated: Thu Feb  9 15:10:34 2012
> Stack: openais
> Current DC: orion2 - partition with quorum
> Version: 1.1.5-5bd2b9154d7d9f86d7f56fe0a74072a5a6590c60
> 3 Nodes configured, 3 expected votes
> 17 Resources configured.
> ============
> 
> Online: [ orion2 orion7 ]
> OFFLINE: [ orion1 ]
> 
>  Clone Set: dlm_clone [dlm]
>      Started: [ orion2 orion7 ]
>      Stopped: [ dlm:0 ]
>  Clone Set: clvm_clone [clvm]
>      Started: [ orion2 orion7 ]
>      Stopped: [ clvm:0 ]
>  sbd_stonith    (stonith:external/sbd): Started orion2
>  Clone Set: cluvg1_clone [cluvg1]
>      Started: [ orion2 orion7 ]
>      Stopped: [ cluvg1:2 ]
>  styx   (ocf::heartbeat:Xen):   Started orion7
>  shib   (ocf::heartbeat:Xen):   Started orion7
>  wiki   (ocf::heartbeat:Xen):   Started orion2
>  horde  (ocf::heartbeat:Xen):   Started orion7
>  www    (ocf::heartbeat:Xen):   Started orion7
>  enventory      (ocf::heartbeat:Xen):   Started orion2
>  mailrelay      (ocf::heartbeat:Xen):   Started orion2
> 
> 
> 
> crm configure show
> node orion1 \
>         attributes standby="off"
> node orion2 \
>         attributes standby="off"
> node orion7 \
>         attributes standby="off"
> primitive cluvg1 ocf:heartbeat:LVM \
>         params volgrpname="cluvg1" \
>         op start interval="0" timeout="240s" \
>         op stop interval="0" timeout="100s" \
>         meta target-role="Started"
> primitive clvm ocf:lvm2:clvmd \
>         params daemon_timeout="30" \
>         op start interval="0" timeout="240s" \
>         op stop interval="0" timeout="100s" \
>         meta target-role="Started"
> primitive dlm ocf:pacemaker:controld \
>         op monitor interval="120s" \
>         op start interval="0" timeout="240s" \
>         op stop interval="0" timeout="100s" \
>         meta target-role="Started"
> primitive enventory ocf:heartbeat:Xen \
>         meta target-role="Started" allow-migrate="true" \
>         operations $id="enventory-operations" \
>         op monitor interval="10" timeout="30" \
>         op migrate_from interval="0" timeout="600" \
>         op migrate_to interval="0" timeout="600" \
>         params xmfile="/etc/xen/vm/enventory" shutdown_timeout="60"
> primitive horde ocf:heartbeat:Xen \
>         meta target-role="Started" is-managed="true" allow-migrate="true" \
>         operations $id="horde-operations" \
>         op monitor interval="10" timeout="30" \
>         op migrate_from interval="0" timeout="600" \
>         op migrate_to interval="0" timeout="600" \
>         params xmfile="/etc/xen/vm/horde" shutdown_timeout="120"
> primitive sbd_stonith stonith:external/sbd \
>         params sbd_device="/dev/disk/by-id/scsi-360080e50001c150e0000019e4df6d4d5-part1" \
>         meta target-role="started"
> ...
> ...
> ...
> clone cluvg1_clone cluvg1 \
>         meta interleave="true" target-role="started" is-managed="true"
> clone clvm_clone clvm \
>         meta globally-unique="false" interleave="true" target-role="started"
> clone dlm_clone dlm \
>         meta globally-unique="false" interleave="true" target-role="started"
> colocation cluvg1_with_clvm inf: cluvg1_clone clvm_clone
> colocation clvm_with_dlm inf: clvm_clone dlm_clone
> colocation enventory_with_cluvg1 inf: enventory cluvg1_clone
> colocation horde_with_cluvg1 inf: horde cluvg1_clone
> ...
> ... more Xen VMs
> ...
> order cluvg1_before_enventory inf: cluvg1_clone enventory
> order cluvg1_before_horde inf: cluvg1_clone horde
> order clvm_before_cluvg1 inf: clvm_clone cluvg1_clone
> order dlm_before_clvm inf: dlm_clone clvm_clone
> property $id="cib-bootstrap-options" \
>         dc-version="1.1.5-5bd2b9154d7d9f86d7f56fe0a74072a5a6590c60" \
>         cluster-infrastructure="openais" \
>         expected-quorum-votes="3" \
>         stonith-timeout="420s" \
>         last-lrm-refresh="1328792018"
> rsc_defaults $id="rsc_defaults-options" \
>         resource-stickiness="10"
> op_defaults $id="op_defaults-options" \
>         record-pending="false"
> 
> 

